A Holistic Solution for Duplicate Entity Identification in Deep Web Data Integration

Abstract:

The proliferation of the deep Web offers users a great opportunity to search for high-quality information on the Web. As a necessary step in deep Web data integration, duplicate entity identification aims to discover duplicate records across the integrated Web databases for further applications (e.g., price-comparison services). However, most existing work addresses this issue only between two data sources, which is not practical for deep Web data integration systems: a duplicate entity matcher trained over two specific Web databases cannot be applied to other Web databases. In addition, the cost of preparing the training set for n Web databases is times higher than that for two Web databases. In this paper, we propose a holistic solution that addresses the new challenges posed by the deep Web by building one duplicate entity matcher over multiple Web databases. Extensive experiments on two domains show that the proposed solution is highly effective for deep Web data integration.

Authors: Wei Liu, Xiaofeng Meng

Affiliations: Institute of Computer Science and Technology, Peking University, Beijing 100871, China; Institute of School of Information, Renmin University of China, Beijing 100872, China

Conference type: International conference

Conference name: Sixth International Conference on Semantics, Knowledge and Grids (SKG 2010)

Conference location: Ningbo

Conference language: English

Pages: 267-274

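Online publication date: 2010-11-01 (the date the record first went online on the Wanfang platform; this is not the paper's publication date)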
