Conference topic

Automatically Extracting Flip Link for Focused Crawling

Flip links are widely used in web pages in several popular scenarios and are useful for many applications, such as focused crawling and query-result extraction. However, automatically extracting flip link blocks from web pages is not trivial, because both DOM tree structure and visual presentation vary widely across pages. In this paper, we propose a practical solution for automatically extracting the flip link block from any web page. The solution consists of three steps: first, segment each web page into a set of blocks based on visual information; second, identify the block(s) that contain the flip link block using a machine learning approach; and finally, extract the hyperlinks pointing to next pages from the identified block(s) by comparing the patterns in multiple blocks. Experimental results indicate that this technique is highly accurate.
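
The three steps in the abstract can be illustrated with a toy sketch. It is not the authors' method: grouping links by their enclosing `<div>` stands in for visual segmentation, a hand-written anchor-text heuristic stands in for the learned classifier, and the cross-block pattern comparison of the final step is not reproduced. All names (`BlockCollector`, `pager_score`, `extract_flip_links`, `PAGER_TEXT`) and the 0.5 threshold are hypothetical choices made for illustration.

```python
import re
from html.parser import HTMLParser

# Hypothetical cue labels; the abstract does not list the features actually used.
PAGER_TEXT = re.compile(r"^(\d+|next|prev|previous|first|last|>|>>|<|<<)$", re.IGNORECASE)


class BlockCollector(HTMLParser):
    """Group links by enclosing <div>: a crude stand-in for visual segmentation."""

    def __init__(self):
        super().__init__()
        self.blocks = []   # finished blocks, each a list of (anchor_text, href)
        self._stack = []   # link lists for the currently open <div> elements
        self._href = None  # href of the <a> being read, if any
        self._text = []    # text fragments of the <a> being read

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self._stack.append([])
        elif tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            if self._stack:  # links outside any <div> are ignored in this sketch
                self._stack[-1].append(("".join(self._text).strip(), self._href))
            self._href = None
        elif tag == "div" and self._stack:
            block = self._stack.pop()
            if block:
                self.blocks.append(block)


def pager_score(block):
    """Fraction of links with pager-like anchor text (heuristic, not a trained model)."""
    hits = sum(1 for text, _ in block if PAGER_TEXT.match(text))
    return hits / len(block)


def extract_flip_links(html, threshold=0.5):
    """Return hrefs of pager-like links from the highest-scoring block, or []."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    candidates = [b for b in collector.blocks if pager_score(b) >= threshold]
    if not candidates:
        return []
    best = max(candidates, key=pager_score)
    return [href for text, href in best if PAGER_TEXT.match(text)]
```

Given a page with one navigation `<div>` (links labelled "About us" and "Contact") and one pager `<div>` (links labelled "1", "2", "Next"), `extract_flip_links` returns only the hrefs from the pager block. A focused crawler could then enqueue those URLs to follow the result listing.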

Web data extraction; Flip link; Machine learning; Visual feature

Wei Liu

Information Source Center, Institute of Scientific and Technical Information of China, Beijing 100038

International conference

2011 3rd International Conference on Computer Engineering and Applications (ICCEA2011)

Haikou

English

103-107

2011-07-15 (date first posted online on the Wanfang platform; not the paper's publication date)