Conference Topic

Data Extraction from Web Forums Based on Similarity of Page Layout

Web forums contain a wealth of information. Forum data can be used in areas such as Internet community mining, information retrieval, and public opinion analysis. This paper addresses two questions: what should be extracted from web forums, and how to extract it. To overcome the limitations of current methods for extracting data from web forums, an automated method is proposed for extracting metadata from web forum pages. The method proceeds in two steps. It first recognizes the topic block by exploiting the special layout of web forum pages, then extracts metadata from the topic block by exploiting the statistical regularity of the metadata. The whole process requires no manual work. Experimental results show that the method performs well in both adaptability and accuracy.

web forum; data extraction; similarity

Yun WANG, Bicheng LI, Chen LIN

Information Processing Dept., Information Technology Institute, Zhengzhou, China

International conference

International Conference on Natural Language Processing and Knowledge Engineering (IEEE NLP-KE 2009)

Dalian

English

1-5

2009-09-24 (date first posted online on the Wanfang platform; not the paper's publication date)