Improve the Performance of the Webpage Content Extraction using Webpage Segmentation Algorithm

Abstract:

In this paper, we present a method that uses a webpage segmentation algorithm to improve the performance of webpage content extraction. Traditional methods often parse the DOM tree of the webpage and judge each node to determine which nodes are text nodes. This approach has a potential problem: because its judgement strategy is local, it sometimes discards part of the content. Our method, which is based on the VIPS (Vision-based Page Segmentation) algorithm, can solve this problem satisfactorily: it extracts the content according to the coordinate information of each block and helps the traditional method recall the lost part of the content.
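
The recall idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Rect` and `TextNode` types, the `recall_lost_nodes` function, and the `min_ratio` threshold are hypothetical names and values chosen for illustration. The block rectangles are assumed to come from a visual segmentation step such as VIPS, and the `is_content` flag is assumed to be the verdict of a per-node DOM judgement.

```python
from dataclasses import dataclass


@dataclass
class Rect:
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, other: "Rect") -> bool:
        # True when `other` lies entirely within this rectangle.
        return (self.left <= other.left and self.top <= other.top
                and self.right >= other.right and self.bottom >= other.bottom)


@dataclass
class TextNode:
    text: str
    box: Rect          # rendered coordinates of the node (assumed available)
    is_content: bool   # verdict of the local, per-node DOM judgement


def recall_lost_nodes(nodes, blocks, min_ratio=0.5):
    """Return node texts in page order, re-admitting rejected nodes that
    sit inside a visual block dominated by accepted content.

    `min_ratio` is a hypothetical threshold, not a value from the paper.
    """
    recalled = set()
    for block in blocks:
        inside = [n for n in nodes if block.contains(n.box)]
        total = sum(len(n.text) for n in inside)
        accepted = sum(len(n.text) for n in inside if n.is_content)
        # A block counts as a content block when most of its text was
        # already accepted by the local judgement.
        if total > 0 and accepted / total >= min_ratio:
            recalled.update(id(n) for n in inside)
    return [n.text for n in nodes if n.is_content or id(n) in recalled]
```

In this sketch, a short line that the per-node judgement rejected is recovered when it lies inside a block whose text is mostly accepted content, while a navigation block with no accepted text stays excluded.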

Keywords: Webpage Segmentation; Webpage Content Extraction; DOM tree analysis; VIPS

Authors: Fu Lei, Meng Yao, Yu Hao

Affiliation: Fujitsu R&D Center CO., LTD, Beijing, China, 100025

Conference type: International conference

Conference name: 2009 International Forum on Computer Science-Technology and Applications (IFCSTA 2009)

Conference location: Chongqing

Conference language: English

Pages: 323-325

Online publication date: 2009-12-25 (date first made available online on the Wanfang platform; not the paper's publication date)
