Data Extraction from Web Forums Based on Similarity of Page Layout
Web forums contain a wealth of information. Forum data can be widely used in areas such as Internet community mining, information retrieval, and public opinion analysis. This paper addresses two questions: what should be extracted from web forums, and how to extract it. To overcome the limitations of current methods for extracting data from web forums, an automated method is proposed to extract metadata from web forum pages. The method proceeds in two steps. It first recognizes the topic-block by exploiting the distinctive layout of web forum pages, then extracts metadata from the topic-block by exploiting the statistical regularity of the metadata. The whole process requires no manual work. Experimental results show that the method performs well in both adaptability and accuracy.
web forum; data extraction; similarity
Yun WANG Bicheng LI Chen LIN
Information Processing Dept., Information Technology Institute, Zhengzhou, China
International conference
Dalian
English
1-5
2009-09-24 (date first posted online on the Wanfang platform; not the paper's publication date)