An Approach of Web Page Information Extraction

Abstract:

The Web has become the largest information source, but noise content is an inevitable part of any web page. Noise content reduces the accuracy of search engines and increases server load. Information extraction technology has been developed to address this problem, and it is mostly based on page segmentation. Building on an analysis of existing page segmentation methods, an approach to web page information extraction is proposed. Block nodes are identified by analyzing the attributes of HTML tags. The algorithm is easy to implement, and experiments demonstrate its good performance.
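The abstract gives no implementation details, so the following is only a minimal sketch of what identifying block nodes from HTML tag attributes could look like. The set of block tags, the noise hints matched against `id` and `class` values, and the names `BlockCollector`, `is_noise`, and `extract_content` are all assumptions made for illustration and are not taken from the paper.

```python
from html.parser import HTMLParser

# Assumption: tags treated as block containers (the paper's actual list is not given).
BLOCK_TAGS = {"div", "table", "td", "ul", "section", "article"}
# Assumption: hypothetical substrings in id/class values that suggest noise.
NOISE_HINTS = ("nav", "footer", "sidebar", "advert")


class BlockCollector(HTMLParser):
    """Collects block nodes with their attributes and directly contained text."""

    def __init__(self):
        super().__init__()
        self._open = []   # stack of block nodes currently open
        self.blocks = []  # closed block nodes, in closing order

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._open.append({"tag": tag, "attrs": dict(attrs), "text": []})

    def handle_endtag(self, tag):
        # Close the innermost open block only when the tags match.
        if tag in BLOCK_TAGS and self._open and self._open[-1]["tag"] == tag:
            block = self._open.pop()
            block["text"] = " ".join(block["text"])
            self.blocks.append(block)

    def handle_data(self, data):
        # Text is credited to the innermost open block.
        if self._open and data.strip():
            self._open[-1]["text"].append(data.strip())


def is_noise(block):
    """Flags a block as noise when its id or class contains a noise hint."""
    attrs = block["attrs"]
    label = " ".join(filter(None, (attrs.get("id"), attrs.get("class")))).lower()
    return any(hint in label for hint in NOISE_HINTS)


def extract_content(html):
    """Returns the text of all non-empty blocks not flagged as noise."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    return [b["text"] for b in parser.blocks if b["text"] and not is_noise(b)]


page = (
    '<div id="nav">Home | About</div>'
    '<div class="main"><p>Real article text.</p></div>'
    '<div class="footer">Copyright</div>'
)
print(extract_content(page))
```

A real system would likely combine such attribute checks with other signals, for example link density or block position, which this sketch leaves out.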

Keywords: Information extraction; DOM; page segmentation; HTML tag

Authors: Yaohui Li, Lixia Wang, Jianxiong Wang, Jie Yue, Mingzhan Zhao

Author affiliation: Department of Computer, Hebei Institute of Architecture and Civil Engineering, Zhangjiakou City, China

Conference type: International conference

Conference name: 2013 2nd International Conference on Computer Science and Electronics Engineering (ICCSEE2013)

Conference location: Hangzhou

Conference language: English

Pages: 2218-2220

Online publication date: 2013-03-22 (date first posted on the Wanfang platform; does not represent the paper's publication date)
