Efficient Duplicate Record Detection Based on Similarity Estimation

Abstract:

In information integration systems, duplicate records cause problems in data processing and analysis. To represent the similarity between two records from different data sources with different schemas, optimal bipartite graph matching is applied to their attributes, and the similarity is measured as the weight of that matching. However, this intuitive method has two shortcomings. In terms of efficiency, it must compare all records pairwise. In terms of effectiveness, a strict condition for judging duplicate records results in low recall. To make the method work in practice, this paper presents an efficient method based on similarity estimation. The basic idea is to estimate the range of the similarity between two records in O(1) time and to decide from that estimate whether they are duplicates. Theoretical analysis and experimental results show that the method is effective and efficient.
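The matching-based similarity described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the attribute similarity function `attr_sim`, the brute-force search over assignments, and the normalization by the longer record's attribute count are all assumptions made for illustration. The O(1) estimation step is not shown, because the abstract does not specify it.

```python
from difflib import SequenceMatcher
from itertools import permutations
from typing import Sequence


def attr_sim(a: str, b: str) -> float:
    """Hypothetical attribute similarity in [0, 1] (an assumption, not from the paper)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_similarity(r1: Sequence[str], r2: Sequence[str]) -> float:
    """Weight of the best one-to-one matching between the attribute values
    of two records, which may come from different schemas."""
    short, long_ = (r1, r2) if len(r1) <= len(r2) else (r2, r1)
    if not short:
        return 0.0
    best = 0.0
    # Brute force: try every injective assignment of the shorter record's
    # attributes to the longer record's attributes and keep the heaviest one.
    for perm in permutations(range(len(long_)), len(short)):
        weight = sum(attr_sim(short[i], long_[j]) for i, j in enumerate(perm))
        best = max(best, weight)
    # Normalizing by the longer record's size is an assumption of this sketch.
    return best / len(long_)


# Two records with the same values stored under differently ordered schemas.
same = record_similarity(["John Smith", "Boston"], ["Boston", "John Smith"])
partial = record_similarity(["abc"], ["xyz", "abc"])
```

The brute-force search is only practical for records with a handful of attributes; a polynomial-time assignment algorithm such as the Hungarian method would replace it for wider records. Even then, every record pair still has to be compared, which is the efficiency problem the abstract points out.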

Keywords: Heterogeneous Records; Duplicate Detection; Record Similarity; Similarity Estimation

Authors: Mohan Li, Hongzhi Wang, Jianzhong Li, Hong Gao

Affiliation: Harbin Institute of Technology, Harbin 150001

Conference type: International conference

Conference name: 11th International Conference, WAIM 2010 (11th International Conference on Web-Age Information Management)

Conference location: Jiuzhaigou

Conference language: English

Pages: 595-607

Online publication date: 2010-07-14 (the date the paper first went online on the Wanfang platform, not the paper's publication date)
