Improve the Performance of the Webpage Content Extraction using Webpage Segmentation Algorithm
In this paper, we present a method that uses a webpage segmentation algorithm to improve the performance of webpage content extraction. Traditional methods often parse the DOM tree of the webpage and judge each node to determine whether it is a text node. This kind of method has a potential problem: because its judgment strategy is local, it sometimes throws part of the content away. Our method, which is based on the VIPS (Vision-based Page Segmentation) algorithm, can solve this problem satisfactorily: it extracts the content according to the coordinate information of each block and helps the traditional method recall the lost part of the content.
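As a rough illustration of the coordinate-based recall idea in the abstract, the Python sketch below assumes that a segmentation step has already produced rectangular visual blocks and that each text node carries its rendered bounding box and a flag from a DOM-based extractor. The names `Rect`, `TextNode`, and `recall_lost_nodes` are hypothetical and do not come from the paper. Choosing the main block by the number of already-kept characters is likewise an assumption, and VIPS itself is not implemented here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle in page coordinates (hypothetical helper)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, other: "Rect") -> bool:
        # True when `other` lies entirely inside this rectangle.
        return (self.left <= other.left and self.top <= other.top
                and self.right >= other.right and self.bottom >= other.bottom)


@dataclass
class TextNode:
    """A DOM text node with its rendered box (hypothetical structure)."""
    text: str
    box: Rect
    kept: bool  # True if the DOM-based extractor classified it as content


def recall_lost_nodes(blocks: List[Rect], nodes: List[TextNode]) -> List[TextNode]:
    """Return kept nodes plus any dropped nodes inside the main visual block.

    Assumption: the main content block is the one that already holds the
    most characters kept by the DOM-based extractor.
    """
    def kept_chars(block: Rect) -> int:
        return sum(len(n.text) for n in nodes
                   if n.kept and block.contains(n.box))

    kept_only = [n for n in nodes if n.kept]
    if not blocks:
        return kept_only

    main_block = max(blocks, key=kept_chars)
    if kept_chars(main_block) == 0:
        # No block overlaps the extractor's output, so nothing is recalled.
        return kept_only

    # Preserve document order: keep what was kept, and recall dropped nodes
    # whose coordinates fall inside the main block.
    return [n for n in nodes if n.kept or main_block.contains(n.box)]
```

In this sketch a short node that the node-by-node judgment dropped (for example a one-line caption between two paragraphs) is recovered because its coordinates fall inside the same visual block as the paragraphs, while navigation text in a different block stays excluded.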
Webpage Segmentation; Webpage Content Extraction; DOM tree analysis; VIPS
Fu Lei; Meng Yao; Yu Hao
Fujitsu R&D Center CO., LTD, Beijing, China, 100025
International conference
Chongqing
English
323-325
2009-12-25 (date first posted online on the Wanfang platform; this is not the publication date of the paper)