Duplicate Identification in Deep Web Data Integration

Abstract:

Duplicate identification is a critical step in deep web data integration, and the task generally has to be performed over multiple web databases. However, a matcher customized for one pair of web databases often does not work well for another pair because of differing presentations and schemas, and it is not practical to build and maintain C(n,2) = n(n-1)/2 pairwise matchers for n web databases. In this paper, we aim to build one universal matcher over multiple web databases in one domain. According to our observation, the similarity on one attribute depends on the similarities on some other attributes, a dependency that existing approaches ignore. Inspired by this, we propose a comprehensive solution to the duplicate identification problem over multiple web databases. Extensive experiments over real web databases in three domains show that the proposed solution is an effective way to address the problem.

Keywords: duplicate identification deep web data integration web database

Authors: Wei Liu, Xiaofeng Meng, Jianwu Yang, Jianguo Xiao

Affiliations: Institute of Computer Science & Technology, Peking University, Beijing, China; School of Information, Renmin University of China, Beijing, China

Conference type: International conference

Conference name: 11th International Conference on Web-Age Information Management (WAIM 2010)

Conference location: Jiuzhaigou

Conference language: English

Pages: 5-17

Online publication date: 2010-07-14 (date first posted on the Wanfang platform; not the paper's publication date)
