Efficient Similarity Joins for Near Duplicate Detection

Abstract:

With the increasing amount of data and the need to integrate data from multiple data sources, a challenging issue is to find near duplicate records efficiently. In this paper, we focus on efficient algorithms to find pairs of records such that their similarities are above a given threshold. Several existing algorithms rely on the prefix filtering principle to avoid computing similarity values for all possible pairs of records. We propose new filtering techniques that exploit the ordering information; they are integrated into the existing methods and drastically reduce the candidate sizes, and hence improve the efficiency. Experimental results show that our proposed algorithms can achieve up to 2.6x–5x speed-up over previous algorithms on several real datasets and provide alternative solutions to the near duplicate Web page detection problem.
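
The prefix filtering principle mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm and does not include their new ordering-based filters; it assumes Jaccard similarity over token sets, a global token order by ascending document frequency, and a hypothetical function name `prefix_filter_join`. The idea: if two records have Jaccard similarity at least t, then their first |x| - ceil(t * |x|) + 1 tokens under the shared order must have at least one token in common, so only records sharing a prefix token need to be verified.

```python
import math
from collections import Counter, defaultdict


def jaccard(a, b):
    """Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


def prefix_filter_join(records, threshold):
    """Return sorted (i, j, sim) with i < j and Jaccard(records[i], records[j]) >= threshold.

    Illustrative sketch only: the ordering, the index layout and the
    function name are assumptions, not taken from the paper.
    """
    # Global order: rare tokens first, ties broken by the token itself.
    df = Counter(tok for rec in records for tok in set(rec))
    ordered = [sorted(set(rec), key=lambda tok: (df[tok], tok)) for rec in records]

    index = defaultdict(list)  # prefix token -> ids of earlier records
    results = []
    for rid, toks in enumerate(ordered):
        if not toks:
            continue  # empty records are skipped
        # Small epsilon guards against floating-point overshoot in threshold * len.
        min_overlap = math.ceil(threshold * len(toks) - 1e-9)
        prefix_len = len(toks) - min_overlap + 1

        # Candidate generation: earlier records sharing any prefix token.
        candidates = set()
        for tok in toks[:prefix_len]:
            candidates.update(index[tok])
            index[tok].append(rid)

        # Verification: compute the exact similarity for candidates only.
        current = set(toks)
        for cid in candidates:
            sim = jaccard(set(ordered[cid]), current)
            if sim >= threshold:
                results.append((cid, rid, sim))
    return sorted(results)


if __name__ == "__main__":
    docs = [
        ["a", "b", "c", "d"],
        ["a", "b", "c", "e"],
        ["x", "y", "z"],
        ["a", "b", "c", "d"],
    ]
    for i, j, sim in prefix_filter_join(docs, 0.5):
        print(i, j, round(sim, 2))
```

In this sketch the third record never becomes a candidate for any other record, because none of its prefix tokens appears in another record's prefix; that skipped verification is the saving that prefix filtering provides, and the abstract's proposed techniques aim to shrink the candidate set further.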

Keywords: similarity join; near duplicate detection

Authors: Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu

Affiliations: School of Computer Science and Engineering, University of New South Wales, Australia; Department of Systems Engineering and Engineering Management, Chinese University of Hong Kong, Hong Ko

Conference type: International conference

Conference name: The 17th International World Wide Web Conference (WWW08)

Conference location: Beijing

Conference language: English

Online publication date: 2008-04-21 (date first made available online on the Wanfang platform; not the paper's publication date)
