Automatically Extracting Flip Link for Focused Crawling
Flip links are widely used on web pages in several popular scenarios and are very useful for many applications, such as focused crawling and query-result extraction. However, automatically extracting flip link blocks from web pages is not a trivial problem because of the diversity of both DOM tree structures and visual presentations. In this paper, we propose a practical solution for automatically extracting the flip link block from any web page. The solution consists of three steps: first, it segments each web page into a set of blocks based on visual information; then, it identifies the block(s) that contain the flip links using a machine learning approach; finally, it extracts the hyperlinks pointing to next pages from the identified block(s) by comparing the patterns in multiple blocks. Experimental results indicate that this technique is highly accurate.
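The three steps in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the page is assumed to be already segmented into blocks of links (step 1, which in the paper relies on visual information, is skipped), a hand-written scoring heuristic stands in for the machine-learning classifier of step 2, and step 3 is reduced to matching anchor text rather than comparing patterns across blocks. All names (`Link`, `NEXT_WORDS`, `flip_score`, `find_flip_block`, `next_page_links`) and the sample URLs are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Link:
    text: str  # anchor text
    href: str  # target URL


# Hypothetical cue words; a real system would learn such features from data.
NEXT_WORDS = {"next", "next page", ">", ">>", "»"}


def _is_flip_like(link: Link) -> bool:
    """True if the anchor text is a page number or a "next"-style cue."""
    text = link.text.strip()
    return text.isdigit() or text.lower() in NEXT_WORDS


def flip_score(block: List[Link]) -> float:
    """Heuristic stand-in for the learned block classifier (step 2).

    Returns the fraction of links in the block that look like flip links.
    """
    if not block:
        return 0.0
    return sum(1 for link in block if _is_flip_like(link)) / len(block)


def find_flip_block(blocks: List[List[Link]],
                    threshold: float = 0.5) -> Optional[List[Link]]:
    """Return the highest-scoring block, or None if no block passes the threshold."""
    best = max(blocks, key=flip_score, default=None)
    if best is None or flip_score(best) < threshold:
        return None
    return best


def next_page_links(block: List[Link], current_page: int) -> List[str]:
    """Simplified step 3: keep URLs whose anchor is the next page number
    or a "next"-style cue, without duplicates and in document order."""
    wanted = str(current_page + 1)
    hrefs = [
        link.href for link in block
        if link.text.strip() == wanted or link.text.strip().lower() in NEXT_WORDS
    ]
    return list(dict.fromkeys(hrefs))


# Usage: two pre-segmented blocks, a navigation bar and a pagination bar.
blocks = [
    [Link("Home", "/"), Link("About", "/about"), Link("Contact", "/contact")],
    [Link("1", "/list?p=1"), Link("2", "/list?p=2"),
     Link("3", "/list?p=3"), Link("Next", "/list?p=2")],
]
flip_block = find_flip_block(blocks)
if flip_block is not None:
    print(next_page_links(flip_block, current_page=1))  # ['/list?p=2']
```

A focused crawler could feed the returned URLs back into its queue to walk a multi-page listing. The threshold of 0.5 is an arbitrary illustrative value, not one taken from the paper.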
Web data extraction; Flip link; Machine learning; Visual feature
Wei Liu
Information Source Center, Institute of Scientific and Technical Information of China, Beijing 100038
International conference
Haikou
English
103-107
2011-07-15 (date first made available online on the Wanfang platform; not the paper's publication date)