The Research of Applying Support Vector Machine into Web Page Information Extraction Algorithm Based on Visual Characteristics
With the development of the Internet, the amount of information in web pages is constantly increasing, and information density keeps growing. However, the theme of a web page is often unclear, which makes extracting thematic information very difficult. This paper presents a new web page information extraction algorithm. Based on the visual characteristics of theme web pages, it constructs a web page tag tree, analyzes the page, splits it into blocks, and eliminates noise nodes. Using the feature differences and semantic differences between index blocks and theme blocks, a trained Support Vector Machine classifies and identifies index blocks and theme blocks, after which the topic information of the page is extracted. The experimental results show that applying the support vector machine in the web page information extraction algorithm based on visual features identifies theme web pages effectively and accurately extracts the text information of theme-oriented web pages.
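The pipeline described in the abstract (split the page into blocks, drop noise, classify each block as index or theme) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption for illustration and is not taken from the paper: the set of block tags, the two features (link-text ratio and scaled text length), the hand-made training data, and the plain linear SVM trained by sub-gradient descent on the hinge loss. The names `BlockSplitter`, `features`, `train_linear_svm`, and `classify_page` are hypothetical.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"div", "ul", "table", "p"}  # assumed block-level tags, not from the paper


class BlockSplitter(HTMLParser):
    """Collect text and link-text lengths for each top-level block of a page."""

    def __init__(self):
        super().__init__()
        self.blocks, self.current = [], None
        self.depth = 0      # nesting depth inside block tags
        self.in_link = 0    # > 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            if self.depth == 0:
                self.current = {"text_len": 0, "link_len": 0}
            self.depth += 1
        elif tag == "a" and self.current is not None:
            self.in_link += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:          # a top-level block just closed
                self.blocks.append(self.current)
                self.current, self.in_link = None, 0
        elif tag == "a" and self.in_link > 0:
            self.in_link -= 1

    def handle_data(self, data):
        if self.current is not None:
            n = len(data.strip())
            self.current["text_len"] += n
            if self.in_link:
                self.current["link_len"] += n


def features(block):
    """Map a block to [link-text ratio, text length scaled to 0..1]."""
    total = max(block["text_len"], 1)
    return [block["link_len"] / total, min(block["text_len"] / 200.0, 1.0)]


def train_linear_svm(samples, lam=0.01, eta=0.1, epochs=100):
    """Sub-gradient descent on the L2-regularised hinge loss (primal linear SVM)."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in samples:
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            w = [wi * (1 - eta * lam) for wi in w]    # regularisation shrink
            if margin < 1:                             # hinge loss is active
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                b += eta * y
    return w, b


def classify_page(html, w, b):
    """Split a page, drop empty (noise) blocks, and label the remaining blocks."""
    parser = BlockSplitter()
    parser.feed(html)
    labels = []
    for block in parser.blocks:
        if block["text_len"] == 0:       # crude stand-in for noise removal
            continue
        x = features(block)
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        labels.append("theme" if score > 0 else "index")
    return labels


# Hand-made training data: +1 = theme block, -1 = index block (illustrative only).
TRAIN = [([0.05, 0.9], 1), ([0.95, 0.2], -1), ([0.10, 1.0], 1),
         ([0.90, 0.3], -1), ([0.00, 0.7], 1), ([1.00, 0.1], -1)]
W, B = train_linear_svm(TRAIN)
```

Calling `classify_page(html, W, B)` on a page returns one label per non-empty block, in document order. A real system would replace the two toy features with the visual and semantic features the paper refers to and would train on labelled blocks from actual pages.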
Support Vector Machine; visual characteristics; information extraction; page splitting
JianJing Li, ChunYing Zhang, Xiao Chen, ChunBo Li
Qianan College, Hebei United University, Qianan, Hebei, China; Zhongxin Bank, Tangshan, Hebei, China
International conference
Three Gorges
English
2025-2027
2012-05-18 (date first posted online on the Wanfang platform; not the publication date of the paper)