A Generic Web News Extraction Approach

Abstract:

With the development of the Internet, the Web is becoming the largest data repository ever available in the history of humankind. Major efforts have been made to provide efficient access to relevant information within web pages. Most previous work relies on site-specific templates. When information such as news needs to be extracted from different sites, a template must be created for every site, which costs considerable time and effort. In this paper, we present a generic news extraction method that easily identifies news content based on a set of combined heuristics and extracts every part of a news item according to a predefined schema. Experimental results indicate that our approach is effective in extracting news across websites.

Authors: Yongquan Dong, Qingzhong Li, Zhongmin Yan, Yanhui Ding

Affiliation: School of Computer Science and Technology, Shandong University, Jinan, P.R. China

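Conference type: International conference

Conference name: 2008 IEEE International Conference on Information and Automation

Conference location: Zhangjiajie

Conference language: English

Pages: 179-183

Online publication date: 2008-06-20 (date first made available on the Wanfang platform; not the paper's publication date)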
