A Template-Level Approach for Automatic News Extraction

Abstract:

A basic goal of news search engines is to provide users with the most relevant news articles. To achieve this goal, the sections of a news page (title, author, publishing date, news body, etc.) must first be extracted before further processing. Current approaches mainly consider the extraction of the news title and the news body, and some template-independent wrappers have already yielded accurate results. However, they do not support the extraction of more fine-grained sections, such as the author and the publishing date. In this paper, we present a template-level method for automatic news extraction based on the DOM tree structure, which is able to extract fine-grained sections (author, publishing date, etc.). Moreover, it can also be used in other information extraction domains. To support our algorithm, we propose a convolution tree kernel called Template Similarity, which measures how likely two DOM trees are to share the same template. We tested our approach on 1000 pages crawled randomly from 10 world-famous news sites and obtained an accuracy of 97.8%.
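
The abstract does not give the definition of Template Similarity, so the following is only a minimal sketch of the general idea of a convolution tree kernel applied to DOM trees, not the authors' formula. It assumes a generic subset-tree kernel in the style commonly used for parse trees, normalized to the range [0, 1]. The names Node, production, common, tree_kernel and normalized_similarity, and the decay parameter, are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field
from math import sqrt


@dataclass
class Node:
    """A simplified DOM node: a tag name plus ordered child nodes (assumption)."""
    tag: str
    children: list = field(default_factory=list)


def production(n):
    # A node's "production" is its tag together with the tags of its children.
    return (n.tag, tuple(c.tag for c in n.children))


def nodes(n):
    # Pre-order traversal over every node in the tree.
    yield n
    for c in n.children:
        yield from nodes(c)


def common(n1, n2, decay):
    # Weighted count of shared tree fragments rooted at n1 and n2.
    # Nodes whose productions differ share no fragment.
    if production(n1) != production(n2):
        return 0.0
    score = decay
    # Productions match, so both child lists have the same length.
    for c1, c2 in zip(n1.children, n2.children):
        score *= 1.0 + common(c1, c2, decay)
    return score


def tree_kernel(t1, t2, decay=0.5):
    # Convolution kernel: sum the fragment counts over all node pairs.
    return sum(common(a, b, decay) for a in nodes(t1) for b in nodes(t2))


def normalized_similarity(t1, t2, decay=0.5):
    # Cosine-style normalization so that a tree compared with itself scores 1.
    denom = sqrt(tree_kernel(t1, t1, decay) * tree_kernel(t2, t2, decay))
    return tree_kernel(t1, t2, decay) / denom if denom else 0.0


# Three toy "pages": two with the same layout, one partly similar.
page_a = Node("div", [Node("h1"), Node("span"), Node("p")])
page_b = Node("div", [Node("h1"), Node("span"), Node("p")])
page_d = Node("div", [Node("h1"), Node("p")])

same_template = normalized_similarity(page_a, page_b)
partial_match = normalized_similarity(page_a, page_d)
```

In this sketch, pages with identical tag structure score close to 1 and pages with only a few shared elements score lower, so a threshold on the score could be used to group pages by template. The threshold and the decay value would need to be tuned and are not taken from the paper.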

Authors: Yujing Wang, Yunhai Tong

Affiliations: School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China; Key Laboratory of Machine Perception, Peking University, Beijing 100871, China

Conference type: International conference

Conference name: 2011 3rd International Conference on Computer and Network Technology (ICCNT 2011) (2011 3rd IEEE International Conference on Computer and Network Technology)

Conference location: Taiyuan

Conference language: English

Pages: 133-138

Online publication date: 2011-02-26 (the date the record first went online on the Wanfang platform, not the paper's publication date)
