Near-optimal approximate duplicate-detection in data streams over sliding windows for the uniform query frequency or membership likelihood

Abstract:

Approximate duplicate detection (or membership query) in data streams answers the question of whether an element from a large universe U (a query element) is present in a small subsequence of a data stream. It is an important query with many Internet applications, such as web crawling and social networks. Existing approximate duplicate-detection methods in the sliding window model are not memory-efficient, because they do not incorporate information on the query frequencies and membership likelihoods of the elements of U into their data structure design, even though this information can be obtained with well-developed techniques. In this paper, assuming that either the query frequency or the membership likelihood is uniform for all elements in U, we adopt a block-wise updating strategy to design a memory-efficient data structure, called the cell Bloom filter (CEBF), and an approximate duplicate-detection algorithm based on CEBF. Suppose that the average false positive rate is ε and the sliding window size is n; then the number of bits used by our method is 2 log2(e) · n · (log2(1/ε) + 1), which is much less than that of other existing algorithms. Experimental results on synthetic data verify the effectiveness of our method.
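
The bit count stated in the abstract can be evaluated numerically. The following is a minimal sketch, assuming that the symbol dropped from the extracted text is the average false positive rate ε and that the garbled term reads log2(1/ε). The function name `cebf_bits` is hypothetical and does not come from the paper; the code only evaluates the stated formula and is not an implementation of CEBF.

```python
import math


def cebf_bits(n: int, eps: float) -> float:
    """Evaluate 2 * log2(e) * n * (log2(1/eps) + 1).

    n   -- sliding window size (number of elements)
    eps -- average false positive rate, assumed to lie in (0, 1)
    """
    if n <= 0:
        raise ValueError("window size n must be positive")
    if not 0.0 < eps < 1.0:
        raise ValueError("false positive rate eps must be in (0, 1)")
    return 2 * math.log2(math.e) * n * (math.log2(1 / eps) + 1)


if __name__ == "__main__":
    # Illustrative parameters only; they are not taken from the paper's experiments.
    for eps in (0.1, 0.01, 0.001):
        bits = cebf_bits(1_000_000, eps)
        print(f"eps={eps}: {bits / 8 / 1024 / 1024:.2f} MiB")
```

Under this reading, the cost is linear in n and grows by a fixed 2 log2(e) bits per element each time ε is halved.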

Authors: Xiujun Wang, Xiao Zheng, Zhe Dang, Xuangou Wu, Baohua Zhao

Affiliations: Anhui University of Technology, Maanshan, Anhui 243032, China; University of Science and Technology of C Anhui University of Technology, Maanshan, Anhui 243032, China; Washington State University, Pullman, WA 99164; University of Science and Technology of China, Hefei, Anhui 230027, China

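Conference type: International conference

Conference name: 2014 2nd International Conference on Advanced Cloud and Big Data (CBD 2014)

Conference location: Huangshan, Anhui

Conference language: English

Pages: 122-127

Online publication date: 2014-11-20 (date first made available on the Wanfang platform; not the paper's publication date)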
