The Research of Applying Support Vector Machine into Web Page Information Extraction Algorithm Based on Visual Characteristics

Abstract:

With the development of the Internet, the amount of information in web pages is constantly increasing, and its density keeps growing. However, the theme of a web page is often unclear, which makes extracting thematic information very difficult. This paper presents a new web page information extraction algorithm. Based on the visual characteristics of theme web pages, it constructs a web page tag tree, analyzes the page, splits it into blocks, and eliminates noise nodes. Using the feature differences and semantic differences between index blocks and theme blocks, a trained Support Vector Machine classifies and identifies the two kinds of blocks, after which the topic information of the page is extracted. The experimental results show that applying a support vector machine in the web page information extraction algorithm based on visual features is effective at identifying theme web pages and accurately extracts the text information of theme-oriented web pages.
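The block-splitting and noise-removal steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the tag sets `BLOCK_TAGS` and `NOISE_TAGS`, the class `BlockSplitter`, the function `block_features`, and the two features (text length and link-text ratio) are all assumptions chosen for illustration, and the input is assumed to be well-formed HTML. The resulting feature vectors are the kind of input a trained Support Vector Machine could use to separate index blocks (mostly link text) from theme blocks (mostly plain text); the classifier itself is not shown.

```python
from html.parser import HTMLParser

# Hypothetical tag sets; the paper does not list which tags it uses.
BLOCK_TAGS = {"div", "td", "ul", "p"}   # tags treated as block boundaries
NOISE_TAGS = {"script", "style"}        # tags whose content is discarded as noise


class BlockSplitter(HTMLParser):
    """Splits a page into blocks and records simple per-block statistics."""

    def __init__(self):
        super().__init__()
        self.stack = []        # currently open blocks, innermost last
        self.blocks = []       # closed blocks that contained text
        self.noise_depth = 0   # > 0 while inside a noise tag
        self.link_depth = 0    # > 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1
        elif tag in BLOCK_TAGS:
            self.stack.append({"tag": tag, "text_len": 0, "link_len": 0})
        elif tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS:
            self.noise_depth = max(0, self.noise_depth - 1)
        elif tag in BLOCK_TAGS and self.stack and self.stack[-1]["tag"] == tag:
            block = self.stack.pop()
            if block["text_len"] > 0:   # drop blocks with no text of their own
                self.blocks.append(block)
        elif tag == "a":
            self.link_depth = max(0, self.link_depth - 1)

    def handle_data(self, data):
        n = len(data.strip())
        if n == 0 or self.noise_depth or not self.stack:
            return
        # Text is credited to the innermost open block only.
        self.stack[-1]["text_len"] += n
        if self.link_depth:
            self.stack[-1]["link_len"] += n


def block_features(html):
    """Return one (text_length, link_text_ratio) pair per text-bearing block."""
    parser = BlockSplitter()
    parser.feed(html)
    parser.close()
    return [(b["text_len"], b["link_len"] / b["text_len"]) for b in parser.blocks]


SAMPLE_HTML = (
    "<div>"
    "<ul><li><a href='/a'>News</a></li><li><a href='/b'>Sports</a></li></ul>"
    "<script>var x = 1;</script>"
    "<p>Support vector machines separate two classes with a maximum margin.</p>"
    "</div>"
)

features = block_features(SAMPLE_HTML)
for text_len, link_ratio in features:
    print(text_len, link_ratio)
```

In this sample, the list of links yields a block whose text is entirely link text, while the paragraph yields a block with no link text at all; a classifier trained on labeled blocks would learn where to draw the boundary between such cases.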

Keywords: Support Vector Machine; visual characteristics; information extraction; page splitting

Authors: JianJing Li, ChunYing Zhang, Xiao Chen, ChunBo Li

Affiliations: Qianan College, Hebei United University, Qianan, Hebei, China; Zhongxin Bank, Tangshan, Hebei, China

Conference type: International conference

Conference name: 2012 International Conference on Electric Technology and Civil Engineering (ICETCE 2012)

Conference location: Three Gorges (Sanxia)

Conference language: English

Pages: 2025-2027

Online publication date: 2012-05-18 (date first posted on the Wanfang platform; not the paper's publication date)
