CONTENT EXTRACTION FROM WEB PAGES BASED ON GAUSSIAN SMOOTHING
Web pages are a potential source for information retrieval and data mining, but most HTML documents on the Internet are cluttered with large amounts of less informative and typically unrelated material. Content extraction is the process of identifying the main content region and removing the other material. Based on the differing properties of Tag and Text nodes, we propose a general, accurate, and efficient content extraction framework named Gaussian Smoothing Content Extractor (GSCE) to solve this problem. In addition, building on the identification of the main content, we describe the extraction of the Title and Published Date. In an evaluation on a large data set, GSCE achieves high precision and recall for most Web pages.
information retrieval; content extraction; Gaussian Smoothing; DOM
Baohua Liao, Bo Cheng, Chuanchang Liu, Junliang Cheng, Gang Tan
State Key Laboratory of Networking and Switching Technology, Beijing University of Posts & Telecommunications; Guiyang Putian Logistics Technology Co., Ltd, Guiyang 550008, China
International conference
Beijing
English
42-47
2010-10-26 (date first posted online on the Wanfang platform; not the paper's publication date)