An Improvement Method of Duplicate Webpage Detection
Because the Internet makes it easy to diffuse and share resources, the volume of duplicated pages on the Web is very large. The search engine, as an index tool for Internet resources, faces a serious duplicate-detection problem: its crawler encounters a large number of links pointing to duplicate content. If all of these links are added to the download queue, performance drops severely and the user experience suffers. In this paper, we adopt an improved duplicate detection method that combines a BloomFilter with a fuzzy Hamming distance. This not only meets the requirements of duplicate-content detection, but also meets the needs of users.
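A minimal sketch of the general idea described in the abstract, not the paper's actual implementation: a Bloom filter gives a fast membership test over URLs already seen by the crawler, and a Hamming-distance threshold over content fingerprints flags near-duplicate pages. The fingerprint width (64 bits), the `threshold` value, and the helper names `is_duplicate` and `hamming_distance` are illustrative assumptions.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter over a fixed-size bit array (assumed parameters)."""

    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: str):
        # Derive several bit positions from salted MD5 digests of the item.
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def contains(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two integer fingerprints."""
    return bin(a ^ b).count("1")


def is_duplicate(url: str, fingerprint: int, seen_urls: BloomFilter,
                 seen_fingerprints: list, threshold: int = 3) -> bool:
    """Treat a page as a duplicate if its URL has (probably) been seen,
    or its content fingerprint lies within `threshold` bits of a stored one."""
    if seen_urls.contains(url):
        return True
    if any(hamming_distance(fingerprint, fp) <= threshold
           for fp in seen_fingerprints):
        return True
    # New page: remember it for future comparisons.
    seen_urls.add(url)
    seen_fingerprints.append(fingerprint)
    return False
```

In a crawler, links whose pages are judged duplicates by such a check would simply not be added to the download queue, which is the performance concern the abstract raises.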
search engine; duplicate detection; BloomFilter; fuzzy Hamming distance
Chengqi Zhang, Wenqian Shang, Yafeng Li
Department of Computer Sciences, Communication University of China, China
International conference
Shenyang
English
27-30
2012-09-26 (date first posted on the Wanfang platform; does not reflect the paper's publication date)