Conference Topic

A Template-based Method for Theme Information Extraction from Web Pages

Introducing web page templates and DOM technology makes it possible to extract simply structured information from web pages effectively. Building on previous research, this paper presents a new method for inducing web page templates. The induced templates can contain the various layout elements of a web page. The main research contents include an edit-distance-based method for judging the similarity of DOM documents, a clustering method focused on web page structure, a method for extracting web page templates, and the implementation of an information extraction engine.
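The abstract does not give the details of the similarity measure, so the following is only a minimal sketch of one common way to compare DOM documents by edit distance, not the authors' algorithm. It assumes that each page is reduced to the sequence of its opening tags in document order and that similarity is one minus the Levenshtein distance divided by the longer sequence length. The names `TagCollector`, `tag_sequence`, `edit_distance`, and `structural_similarity` are hypothetical.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects opening tag names in document order (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Attributes and text are ignored: only the structure is compared.
        self.tags.append(tag)


def tag_sequence(html):
    """Reduce an HTML string to its sequence of opening tags."""
    parser = TagCollector()
    parser.feed(html)
    parser.close()
    return parser.tags


def edit_distance(a, b):
    """Levenshtein distance between two sequences, using two rows."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def structural_similarity(html_a, html_b):
    """Assumed normalization: 1 - distance / length of the longer sequence."""
    a, b = tag_sequence(html_a), tag_sequence(html_b)
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


page_a = "<html><body><div><p>x</p></div></body></html>"
page_b = "<html><body><div><span>y</span></div></body></html>"
print(structural_similarity(page_a, page_b))
```

Under these assumptions, pages whose similarity exceeds a chosen threshold could be grouped into the same cluster before a template is induced from each cluster. The threshold and the clustering procedure are left open here because the abstract does not specify them.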

web extraction; template method; page similarity; web clustering

Gui-Sheng Yin, Guang-Dong Guo, Jing-Jing Sun

Department of Computer Science and Technology, Harbin Engineering University

International conference

The 2010 International Conference on Computer Application and System Modeling (ICCASM 2010)

Taiyuan

English

721-725

2010-10-22 (date first made available online on the Wanfang platform; this does not represent the paper's publication date)