Semantic Keywords-Based Duplicated Web Pages Removing

Abstract:

Because many duplicated web pages exist on the web, search engines need to find and remove them, not only to save processing time and hardware resources, but also to ensure that users get results without many replicas. In this paper, we propose a method for finding and removing duplicated Chinese web pages for search engines. We first describe a scheme based on semantic keywords combined with sentence overlapping, and then present an implemented prototype, with experimental results suggesting that the prototype works well under a proper setting.

Keywords: duplicated web pages; semantic keywords; IR

Authors: Yunhe Weng, Lei Li, Yixin Zhong

Affiliation: School of Information Engineering, Beijing University of Posts and Telecommunications, Beijing, China

Conference type: International conference

Conference name: The 2008 IEEE International Conference on Natural Language Processing and Knowledge Engineering (IEEE NLP-KE 2008)

Conference location: Beijing

Conference language: English

Online publication date: 2008-10-19 (the date the record first went online on the Wanfang platform; not the paper's publication date)
