A WEB PAGE CONTENT INFORMATION EXTRACTION METHOD BASED ON TAG WINDOW
This paper proposes a web content information extraction method based on the Tag Window that handles several special cases: pages in which all of the web content information is placed in one td or spread across several tds, and pages in which the web content information contains no more characters than the other information on the page, such as navigation bars, advertisements, and copyright notices. In particular, it can extract web content information that is not laid out in a table format. Experiments showed that the method improves the accuracy of web content information extraction and is widely applicable.
Tag window extraction DOM
XIN-XIN ZHAO HONG-GUANG SUO YU-SHU LIU
Department of Computer Science & Engineering, Beijing Institute of Technology, Beijing, 100081, China
International conference
2006 International Conference on Machine Learning and Cybernetics (IEEE 5th Forum on Machine Learning and Cybernetics)
Dalian
English
1598-1601
2006-08-13 (date first posted online on the Wanfang platform; does not represent the paper's publication date)