Conference Topic

Using Xpath to Discover Informative Content Blocks of Web Pages

Web pages usually contain a variety of content, some relevant to the main topic and some not. We call the relevant content informative content blocks and the irrelevant content clutter. Clutter can mislead search engines or trigger an artificially high link-based ranking for specific target pages, so cleaning Web pages before mining is critical for improving the performance of traditional information retrieval. Here, we propose a method to discover informative content blocks without supervision. First, using a set of sample pages, we apply a series of rules to distinguish informative content blocks from clutter. Then we generalize a common XPath for the informative content blocks or the clutter and apply it to similar pages. We applied our method to five different Web sites, producing simpler and more concentrated HTML files. Experimental results show that our method extracts the informative content blocks of Web pages accurately. A further advantage of our approach is that it is completely automatic.
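The generalization step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the rules used to label blocks, so the sketch assumes the XPaths of informative blocks in the sample pages are already known. The function names `generalize` and `extract_text`, the sample paths, and the sample page are all hypothetical, and the pages are assumed to be well-formed XHTML so that Python's standard `xml.etree.ElementTree` (which supports only a subset of XPath) can parse them.

```python
import xml.etree.ElementTree as ET


def generalize(paths):
    """Merge step-aligned XPaths into one common XPath.

    Assumption: all paths have the same depth and the same tag at each step.
    Where the samples disagree on a positional index, the index is dropped.
    """
    split = [p.split("/") for p in paths]
    if len({len(steps) for steps in split}) != 1:
        raise ValueError("paths must have the same depth")
    merged = []
    for steps in zip(*split):
        tags = {step.split("[")[0] for step in steps}
        if len(tags) != 1:
            raise ValueError("paths disagree on a tag name")
        # Keep the step unchanged if every sample agrees, else keep the bare tag.
        merged.append(steps[0] if len(set(steps)) == 1 else tags.pop())
    return "/".join(merged)


def extract_text(page, xpath):
    """Apply the common XPath to a similar page; return each block's text."""
    root = ET.fromstring(page)
    return [" ".join("".join(el.itertext()).split()) for el in root.findall(xpath)]


# Hypothetical XPaths of informative blocks found in two sample pages.
SAMPLE_PATHS = ["./body/div[2]/p[1]", "./body/div[2]/p[2]"]

# Hypothetical page from the same site: first div is clutter, second is content.
NEW_PAGE = """<html><body>
<div><a href="#">Home</a> <a href="#">Ads</a></div>
<div><p>First paragraph.</p><p>Second paragraph.</p></div>
</body></html>"""

COMMON = generalize(SAMPLE_PATHS)
BLOCKS = extract_text(NEW_PAGE, COMMON)
print(COMMON)
print(BLOCKS)
```

Dropping only the indexes on which the samples disagree keeps the common XPath as specific as the evidence allows; real-world HTML is rarely well-formed, so a tolerant HTML parser would be needed in practice.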

Yan Fu Dongqing Yang Shiwei Tang Tengjiao Wang Jun Gao

Peking University, Beijing 100871, China

International conference

Third International Conference on Semantics, Knowledge, and Grid (SKG 2007), 2007

Xi'an

English

2007-10-29 (date first made available online on the Wanfang platform; not the paper's publication date)