Automatically Extracting Flip Link for Focused Crawling

Abstract:

Flip links are widely used in web pages in several popular scenarios and are very useful for many applications, such as focused crawling and query-result extraction. However, automatically extracting flip link blocks from web pages is not a trivial problem, because both the DOM tree structure and the visual presentation vary widely. In this paper, we propose a practical solution for automatically extracting the flip link block from any web page. The solution consists of three steps: first, segment each web page into a set of blocks based on visual information; then, identify the block(s) that contain the flip link block using a machine learning approach; and finally, extract the hyperlinks pointing to the next pages from the identified block(s) by comparing the patterns in multiple blocks. Experimental results indicate that this technique is highly accurate.
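The three-step pipeline in the abstract can be illustrated with a minimal Python sketch. This is not the paper's implementation: it assumes the visual segmentation (step 1) has already produced blocks, replaces the machine-learned block classifier (step 2) with a simple hand-written score, and replaces the cross-block pattern comparison (step 3) with a per-anchor check. All names (`Block`, `flip_score`, `find_flip_blocks`, `extract_flip_links`, `NEXT_WORDS`) and the 0.6 threshold are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

Link = Tuple[str, str]  # (anchor text, href)


@dataclass
class Block:
    """A visual block, reduced here to the hyperlinks it contains.

    Assumption: step 1 (visual segmentation) has already been done.
    """
    links: List[Link]


# Hypothetical list of anchor texts that suggest a "next page" link.
NEXT_WORDS = {"next", "next page", ">", ">>", "»"}


def is_flip_anchor(text: str) -> bool:
    """True if the anchor text looks like a page number or a 'next' label."""
    t = text.strip().lower()
    return t.isdigit() or t in NEXT_WORDS


def flip_score(block: Block) -> float:
    """Stand-in for the learned classifier: fraction of flip-like anchors."""
    if not block.links:
        return 0.0
    hits = sum(is_flip_anchor(text) for text, _ in block.links)
    return hits / len(block.links)


def find_flip_blocks(blocks: List[Block], threshold: float = 0.6) -> List[Block]:
    """Step 2 (simplified): keep blocks whose score reaches the threshold."""
    return [b for b in blocks if flip_score(b) >= threshold]


def extract_flip_links(block: Block) -> List[str]:
    """Step 3 (simplified): return hrefs of the flip-like anchors."""
    return [href for text, href in block.links if is_flip_anchor(text)]


# Usage: a navigation block and a pagination block.
blocks = [
    Block([("Home", "/"), ("About", "/about"), ("Contact", "/contact")]),
    Block([("1", "/list?p=1"), ("2", "/list?p=2"),
           ("3", "/list?p=3"), ("Next", "/list?p=2")]),
]
found = find_flip_blocks(blocks)
flip_links = extract_flip_links(found[0]) if found else []
```

A real system along the lines of the abstract would derive the blocks from rendered layout information and train the classifier on labeled pages; the fixed score here only shows where each step fits.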

Keywords: Web data extraction; Flip link; Machine learning; Visual feature

Author: Wei Liu

Affiliation: Information Source Center, Institute of Scientific and Technical Information of China, Beijing 100038

Conference type: International conference

Conference name: 2011 3rd International Conference on Computer Engineering and Applications (ICCEA2011)

Conference location: Haikou

Conference language: English

Pages: 103-107

Online publication date: 2011-07-15 (date first posted on the Wanfang platform; not the paper's publication date)
