A Template-Level Approach for Automatic News Extraction
The primary goal of a news search engine is to provide users with the most relevant news articles. To achieve this goal, the sections of a news page (title, author, publishing date, news body, etc.) must first be extracted before further processing. Current approaches mainly consider the extraction of the news title and the news body, and some template-independent wrappers have already yielded accurate results. However, they do not support the extraction of more fine-grained sections, such as the author and the publishing date. In this paper, we present a template-level method for automatic news extraction based on the DOM tree structure, which is able to extract fine-grained sections (author, publishing date, etc.). Moreover, it can also be used in other information extraction domains. To support our algorithm, we propose a convolution tree kernel called Template Similarity, which measures the likelihood that two DOM trees share the same template. We tested our approach on 1000 pages crawled randomly from 10 world-famous news sites and obtained an accuracy of 97.8%.
Yujing Wang, Yunhai Tong
School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China; Key Laboratory of Machine Perception, Peking University, Beijing 100871, China
International conference
Taiyuan
English
133-138
2011-02-26 (date first posted online on the Wanfang platform; not the paper's publication date)
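
The abstract names a convolution tree kernel, Template Similarity, but does not define it. As a rough illustration of the general idea only, the sketch below uses a generic subtree-counting kernel in the style of Collins and Duffy, normalized to the range [0, 1]. It is not the authors' formula. The tuple-based DOM representation, the function names (`node`, `tree_kernel`, `template_similarity`), and the `decay` parameter are all assumptions made for this example.

```python
import math

# Hypothetical DOM model: a node is (tag, children), children is a tuple of nodes.
def node(tag, *children):
    return (tag, tuple(children))

def all_nodes(tree):
    """Yield every node of the tree in pre-order."""
    yield tree
    for child in tree[1]:
        yield from all_nodes(child)

def production(n):
    """A node's tag together with the ordered tags of its children."""
    return (n[0], tuple(c[0] for c in n[1]))

def common_fragments(n1, n2, decay):
    """Weighted count of tree fragments rooted at both n1 and n2."""
    if production(n1) != production(n2):
        return 0.0
    score = decay
    # Productions match, so both nodes have the same number of children.
    for c1, c2 in zip(n1[1], n2[1]):
        score *= 1.0 + common_fragments(c1, c2, decay)
    return score

def tree_kernel(t1, t2, decay=0.5):
    """Convolution kernel: sum of shared fragments over all node pairs."""
    return sum(common_fragments(a, b, decay)
               for a in all_nodes(t1) for b in all_nodes(t2))

def template_similarity(t1, t2, decay=0.5):
    """Normalized kernel (an assumed stand-in, not the paper's definition)."""
    norm = math.sqrt(tree_kernel(t1, t1, decay) * tree_kernel(t2, t2, decay))
    return tree_kernel(t1, t2, decay) / norm

# Two pages with the same skeleton, one near variant, one unrelated layout.
page_a = node("div", node("h1"), node("span"), node("p"))
page_b = node("div", node("h1"), node("span"), node("p"))
page_c = node("table", node("tr", node("td")))
page_d = node("div", node("h1"), node("span"), node("p", node("a")))

print(template_similarity(page_a, page_b))
print(template_similarity(page_a, page_d))
print(template_similarity(page_a, page_c))
```

Under these assumptions, pages with an identical tag skeleton score highest, a page with one extra nested element scores lower, and a page with no shared tags scores zero. A template-level extractor could group pages by thresholding such a score, though the threshold and grouping procedure used in the paper are not stated in the abstract.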