Conference Topic

CONTENT EXTRACTION FROM WEB PAGES BASED ON GAUSSIAN SMOOTHING

Web pages are a potential source for information retrieval and data mining, but most HTML documents on the Internet are cluttered with large amounts of less informative and typically unrelated material. Content extraction is defined as the process of identifying the main content region and removing the other material. Based on the different properties of Tag and Text nodes, we propose a general, accurate and efficient content extraction framework named Gaussian Smoothing Content Extractor (GSCE) to solve this problem. In addition, building on the identification of the main content, we also describe the extraction of the Title and Published Date. According to an evaluation on a large data set, GSCE achieves high precision and recall for most Web pages.
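The abstract does not spell out how GSCE works, so the following is only a minimal sketch of the general idea of Gaussian smoothing applied to content extraction, not the authors' algorithm. It assumes a per-line text length (markup stripped with a simple regular expression) as the score, a mean-value threshold, and a region grown outward from the highest smoothed score. The function names, the `sigma` and `radius` parameters, and the threshold rule are all hypothetical choices made for illustration.

```python
import math
import re

# Assumption: a crude regex stands in for the Tag/Text node distinction;
# a real extractor would walk the DOM instead.
TAG_RE = re.compile(r"<[^>]*>")


def gaussian_kernel(sigma, radius):
    """Return normalized Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(values, sigma=1.0, radius=2):
    """Convolve values with a Gaussian kernel, clamping indices at the edges."""
    kernel = gaussian_kernel(sigma, radius)
    n = len(values)
    smoothed = []
    for i in range(n):
        acc = 0.0
        for offset, weight in zip(range(-radius, radius + 1), kernel):
            j = min(max(i + offset, 0), n - 1)
            acc += weight * values[j]
        smoothed.append(acc)
    return smoothed


def text_lengths(html_lines):
    """Length of the visible text on each line once tags are removed."""
    return [len(TAG_RE.sub("", line).strip()) for line in html_lines]


def main_content_span(html_lines, sigma=1.0, radius=2):
    """Return (start, end) line indices of the guessed main content, or None.

    Hypothetical rule: start at the line with the highest smoothed score and
    extend in both directions while neighbors stay above the mean score.
    """
    scores = smooth(text_lengths(html_lines), sigma, radius)
    if not scores:
        return None
    threshold = sum(scores) / len(scores)
    peak = max(range(len(scores)), key=scores.__getitem__)
    start = end = peak
    while start > 0 and scores[start - 1] > threshold:
        start -= 1
    while end < len(scores) - 1 and scores[end + 1] > threshold:
        end += 1
    return start, end


if __name__ == "__main__":
    page = [
        '<div id="nav"><a href="/">Home</a></div>',
        '<div id="nav"><a href="/news">News</a></div>',
        "<p>" + "a" * 80 + "</p>",
        "<p>Short line</p>",
        "<p>" + "b" * 90 + "</p>",
        '<div id="footer">Contact</div>',
        '<div id="footer">Terms</div>',
    ]
    print(main_content_span(page))
```

In this toy page the short middle paragraph has a raw text length below the page average, yet smoothing lifts its score because both neighbors are long, so it stays inside the extracted region instead of splitting it in two.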

information retrieval; content extraction; Gaussian smoothing; DOM

Baohua Liao, Bo Cheng, Chuanchang Liu, Junliang Cheng, Gang Tan

State Key Laboratory of Networking and Switching Technology, Beijing University of Posts & Telecommuni; Guiyang Putian Logistics Technology Co., Ltd, Guiyang 550008, China

International conference

2010 3rd IEEE International Conference on Broadband Network & Multimedia Technology (IC-BNMT 2010)

Beijing

English

42-47

2010-10-26 (date first posted online on the Wanfang platform; not the paper's publication date)