Efficient Duplicate Record Detection Based on Similarity Estimation
In information integration systems, duplicate records cause problems in data processing and analysis. To represent the similarity between two records from different data sources with different schemas, optimal bipartite graph matching is applied to their attributes, and the similarity is measured as the weight of that matching. However, this intuitive method has two shortcomings. In efficiency, it must compare all records pairwise. In effectiveness, a strict criterion for judging duplicate records results in low recall. To make the method work in practice, this paper presents an efficient method based on similarity estimation. The basic idea is to estimate the range of the similarity between two records in O(1) time and to decide whether they are duplicates according to that estimate. Theoretical analysis and experimental results show that the method is effective and efficient.
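As a rough illustration of the matching-based similarity the abstract describes, the sketch below scores two records by the weight of a maximum-weight bipartite matching between their attribute values. It is only a sketch under stated assumptions: the attribute-level measure `attr_sim` (built on `difflib.SequenceMatcher`), the function name `record_similarity`, and the brute-force search over permutations are all illustrative choices, not the paper's method, and the O(1) range estimation is not shown because the abstract does not specify it.

```python
from difflib import SequenceMatcher
from itertools import permutations


def attr_sim(a: str, b: str) -> float:
    """Hypothetical attribute-level similarity in [0, 1] (an assumption, not from the paper)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_similarity(r1: list[str], r2: list[str]) -> float:
    """Weight of a maximum-weight bipartite matching between attribute values.

    Every attribute of the shorter record is matched to a distinct attribute
    of the longer one. Brute force over permutations, so this is only
    practical for records with a handful of attributes.
    """
    short, long_ = (r1, r2) if len(r1) <= len(r2) else (r2, r1)
    best = 0.0
    for perm in permutations(range(len(long_)), len(short)):
        # perm[i] is the index in the longer record matched to short[i]
        weight = sum(attr_sim(short[i], long_[j]) for i, j in enumerate(perm))
        best = max(best, weight)
    return best


# Two records with different schemas (different attribute order and count).
a = ["Hongzhi Wang", "Harbin", "2010"]
b = ["2010", "Wang Hongzhi", "Harbin", "China"]
score = record_similarity(a, b)
```

Because the matching ignores attribute order and names, the two example records still score highly even though their schemas differ. Computing this exactly for every pair of records is the cost the abstract's estimation step is meant to avoid.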
Heterogeneous Records; Duplicate Detection; Record Similarity; Similarity Estimation
Mohan Li, Hongzhi Wang, Jianzhong Li, Hong Gao
Harbin Institute of Technology, Harbin 150001
International conference
11th International Conference, WAIM 2010 (11th International Conference on Web-Age Information Management)
Jiuzhaigou
English
595-607
2010-07-14 (date first posted online on the Wanfang platform; not the paper's publication date)