CONTENT EXTRACTION FROM WEB PAGES BASED ON GAUSSIAN SMOOTHING

Abstract:

Web pages are a potential source for information retrieval and data mining, but most HTML documents on the Internet are cluttered with large amounts of less informative and typically unrelated material. Content extraction is the process of identifying the main content region and removing the other material. Exploiting the different properties of Tag and Text nodes, we propose a general, accurate, and efficient content extraction framework named Gaussian Smoothing Content Extractor (GSCE) to solve this problem. In addition, building on the identification of the main content, we also describe the extraction of the Title and Published Date. In an evaluation on a large data set, GSCE achieves high precision and recall for most Web pages.

Keywords: information retrieval; content extraction; Gaussian smoothing; DOM

Authors: Baohua Liao, Bo Cheng, Chuanchang Liu, Junliang Cheng, Gang Tan

Affiliations: State Key Laboratory of Networking and Switching Technology, Beijing University of Posts & Telecommuni; Guiyang Putian Logistics Technology Co. Ltd, Guiyang 550008, China

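Conference type: International conference

Conference name: 2010 3rd IEEE International Conference on Broadband Network & Multimedia Technology (IC-BNMT 2010)

Conference location: Beijing

Conference language: English

Pages: 42-47

Online publication date: 2010-10-26 (date first posted on the Wanfang platform; not the paper's publication date)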
