Conference Topic

Improve the Performance of the Webpage Content Extraction using Webpage Segmentation Algorithm

In this paper, we present a method that uses a webpage segmentation algorithm to improve the performance of webpage content extraction. Traditional methods typically parse the DOM tree of the webpage and judge each node to determine which nodes are text nodes. This kind of method has a potential problem: because its judgement strategy is local, it sometimes throws part of the content away. Our method, which is based on the VIPS (Vision-based Page Segmentation) algorithm, solves this problem satisfactorily. It extracts the content according to the coordinate information of the blocks and helps the traditional method recall the lost part of the content.
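The coordinate-based recall idea can be illustrated with a minimal sketch. This is not the authors' implementation, and the record gives no algorithmic detail beyond the abstract. The `Block` type, the `recall_lost_blocks` function, the 0.8 overlap threshold, and the sample coordinates are all hypothetical. The sketch assumes a VIPS-style segmenter has already produced blocks with page coordinates, and that a DOM-based extractor has already marked some of them as content.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """A visual block; coordinates are assumed to come from a VIPS-style segmenter."""
    text: str
    x: float
    y: float
    width: float
    height: float


def bounding_region(blocks):
    """Smallest rectangle (left, top, right, bottom) enclosing all given blocks."""
    left = min(b.x for b in blocks)
    top = min(b.y for b in blocks)
    right = max(b.x + b.width for b in blocks)
    bottom = max(b.y + b.height for b in blocks)
    return left, top, right, bottom


def overlap_ratio(block, region):
    """Fraction of the block's area that lies inside the region."""
    left, top, right, bottom = region
    overlap_w = max(0.0, min(block.x + block.width, right) - max(block.x, left))
    overlap_h = max(0.0, min(block.y + block.height, bottom) - max(block.y, top))
    area = block.width * block.height
    return (overlap_w * overlap_h) / area if area > 0 else 0.0


def recall_lost_blocks(all_blocks, dom_selected, min_overlap=0.8):
    """Keep the DOM-selected blocks and recall any other block that lies
    mostly inside the rectangle they span (hypothetical heuristic)."""
    if not dom_selected:
        return []
    region = bounding_region(dom_selected)
    chosen = set(dom_selected)
    kept = [b for b in all_blocks
            if b in chosen or overlap_ratio(b, region) >= min_overlap]
    # Return blocks in reading order: top to bottom, then left to right.
    return sorted(kept, key=lambda b: (b.y, b.x))


# Hypothetical page layout: a short caption is missed by the DOM-based judgement.
nav = Block("Home | News", 0, 0, 800, 40)
p1 = Block("First paragraph", 100, 100, 500, 80)
caption = Block("Short caption", 100, 190, 500, 20)
p2 = Block("Second paragraph", 100, 220, 500, 80)
sidebar = Block("Ads", 650, 100, 150, 300)

result = recall_lost_blocks([nav, p1, caption, p2, sidebar], [p1, p2])
```

In this example the caption sits inside the rectangle spanned by the two paragraphs, so it is recalled, while the navigation bar and sidebar fall outside and stay excluded. A real system would need a more careful region estimate than a single bounding rectangle.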

Webpage Segmentation; Webpage Content Extraction; DOM tree analysis; VIPS

Fu Lei, Meng Yao, Yu Hao

Fujitsu R&D Center Co., Ltd., Beijing 100025, China

International conference

2009 International Forum on Computer Science-Technology and Applications (IFCSTA 2009)

Chongqing

English

323-325

2009-12-25 (date first made available online on the Wanfang platform; not the paper's publication date)