Using Xpath to Discover Informative Content Blocks of Web Pages

Abstract:

Web pages usually contain various kinds of content, some relevant and some irrelevant to the main topic. We define relevant content as informative content blocks and irrelevant content as clutter. Clutter may be intended to mislead search engines or to trigger an artificially high link-based ranking for specific target pages, so cleaning Web pages before mining is critical for improving the performance of traditional information retrieval. Here, we propose a method to discover informative content blocks without supervision. First, using a set of sample pages, we apply a series of rules to distinguish informative content blocks from clutter. Then we generalize a common XPath for the informative content blocks or the clutter and apply it to similar pages. We have implemented our method on five different Web sites, producing simpler and more focused HTML files. Experimental results show that our method can accurately obtain the informative content blocks of Web pages. Another advantage of our approach is that it is completely automatic.
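
The two-step idea in the abstract (label blocks on sample pages, then reuse an XPath on similar pages) can be illustrated with a minimal sketch. This is not the authors' rule set, which the abstract does not spell out. It assumes one simple stand-in rule: a leaf block whose text is identical across all sample pages is clutter, and a block whose text varies is informative. The function names block_paths, informative_xpaths, and extract are hypothetical, and the pages are assumed to be well-formed XML so that Python's standard xml.etree.ElementTree can parse them.

```python
import xml.etree.ElementTree as ET


def block_paths(root):
    """Map a positional XPath of every leaf element to its stripped text.

    Text that sits directly inside an element with children is ignored;
    this keeps the sketch short.
    """
    paths = {}

    def walk(node, path):
        children = list(node)
        if not children:
            paths[path] = (node.text or "").strip()
            return
        counts = {}
        for child in children:
            # Positions are 1-based and counted per tag name, as in XPath.
            counts[child.tag] = counts.get(child.tag, 0) + 1
            walk(child, f"{path}/{child.tag}[{counts[child.tag]}]")

    walk(root, ".")
    return paths


def informative_xpaths(sample_pages):
    """Return XPaths whose text differs between sample pages.

    Assumed rule (not from the paper): identical text on every sample
    page marks clutter such as navigation bars and footers.
    """
    tables = [block_paths(ET.fromstring(page)) for page in sample_pages]
    shared = set(tables[0]).intersection(*tables[1:])
    return sorted(p for p in shared if len({t[p] for t in tables}) > 1)


def extract(page, xpaths):
    """Apply the learned XPaths to a similar page and collect the text."""
    root = ET.fromstring(page)
    found = []
    for xpath in xpaths:
        node = root.find(xpath)
        if node is not None and node.text:
            found.append(node.text.strip())
    return found
```

Given two sample pages from the same template, informative_xpaths returns the paths that vary, and extract applies them to a third page from that template. Real-world HTML is rarely well-formed XML, so a tolerant HTML parser would be needed in practice.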

Authors: Yan Fu, Dongqing Yang, Shiwei Tang, Tengjiao Wang, Jun Gao

Affiliation: Peking University, Beijing 100871, China

Conference type: International conference

Conference name: Third International Conference on Semantics, Knowledge, and Grid (SKG 2007), 2007

Conference location: Xi'an

Conference language: English

Online publication date: 2007-10-29 (date first made available on the Wanfang platform; not necessarily the paper's publication date)
