Conference topic

Content extraction from Chinese web page based on title and content dependency tree

  Content extraction underlies many data mining technologies; it aims to extract the most valuable information from data-intensive web pages that are full of noise. Traditional statistics-based content extraction cannot handle documents with short content, tabular text, or long comment sections. By studying the positional relation between title and content, this paper proposes a new method for extracting web page content: it constructs a title and content dependency tree (TCDT), locates the content with the smallest dependency distance, and extracts the page content accurately by combining the dependency relation between title and content with statistical information about the page. Experiments on several websites show that the method not only compensates for the weaknesses of the statistical approach but also achieves better precision in content extraction.
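
As a rough illustration of locating content by its distance from the title, the following Python sketch may help. It is not the paper's TCDT algorithm, which the abstract does not specify. The `Node` class, the `tree_distance` and `locate_content` functions, the use of edge count in a DOM-like tree as the "dependency distance", and the text-length tie-break standing in for the page's statistical information are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    """Hypothetical DOM-like node; not a structure defined in the paper."""
    tag: str
    text: str = ""
    children: List["Node"] = field(default_factory=list, repr=False)
    parent: Optional["Node"] = field(default=None, repr=False)

    def add(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child


def ancestors(node: Optional[Node]) -> List[Node]:
    """Return the node followed by its ancestors up to the root."""
    path = []
    while node is not None:
        path.append(node)
        node = node.parent
    return path


def tree_distance(a: Node, b: Node) -> int:
    """Assumed 'dependency distance': edges between a and b in the tree."""
    steps_from_a = {id(n): i for i, n in enumerate(ancestors(a))}
    for j, n in enumerate(ancestors(b)):
        if id(n) in steps_from_a:
            return steps_from_a[id(n)] + j
    raise ValueError("nodes are not in the same tree")


def locate_content(root: Node, title: Node, min_chars: int = 1) -> Optional[Node]:
    """Pick the text node closest to the title; longer text breaks ties."""
    best, best_key = None, None
    stack = [root]
    while stack:
        node = stack.pop()
        stack.extend(node.children)
        if node is title or len(node.text) < min_chars:
            continue
        key = (tree_distance(title, node), -len(node.text))
        if best_key is None or key < best_key:
            best, best_key = node, key
    return best


# Toy page: a short article body and a much longer comment block.
root = Node("html")
body = root.add(Node("body"))
nav = body.add(Node("div", "Home News Sports"))
main = body.add(Node("div"))
title = main.add(Node("h1", "Example headline"))
article = main.add(Node("p", "A short article body."))
comments = body.add(Node("div", "A very long run of reader comments. " * 20))

chosen = locate_content(root, title)
print(chosen.tag, "|", chosen.text)
```

In this toy page the short paragraph next to the title is chosen over the much longer comment block, which is the kind of case the abstract says purely statistical methods handle poorly.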

Chinese information processing; content extraction; dependency distance; title extraction; link nodes

ZHANG Bin, WANG Xiao-fei

School of Information and Communication Engineering, Beijing University of Posts and Telecommunications, Beijing 100876, China

Domestic conference

The 6th China Conference on Wireless Sensor Networks (CWSN 2012)

Huangshan

English

147-151, 189

2012-10-25 (date first made available online on the Wanfang platform; not the paper's publication date)