A WEB PAGE CONTENT INFORMATION EXTRACTION METHOD BASED ON TAG WINDOW

Abstract:

This paper proposes a web page content extraction method based on a Tag Window that can handle certain special cases: pages where all of the content is placed in one or several td elements, and pages where the character count of the content is at most equal to that of the other information, such as navigation bars, advertisements, and copyright notices. In particular, it can extract content that is not laid out in a table format. Experiments showed that the method improves the accuracy of web content extraction and is widely applicable.

Keywords: Tag window; extraction; DOM

Authors: XIN-XIN ZHAO, HONG-GUANG SUO, YU-SHU LIU

Affiliation: Department of Computer Science & Engineering, Beijing Institute of Technology, Beijing, 100081, China

Conference type: International conference

Conference name: 2006 International Conference on Machine Learning and Cybernetics (5th IEEE Forum on Machine Learning and Cybernetics)

Conference location: Dalian

Conference language: English

Pages: 1598-1601

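Online publication date: 2006-08-13 (the date the record first went online on the Wanfang platform; this is not the paper's publication date)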
