Content Extraction from Web Pages Based on Chinese Punctuation Number

Abstract:

Extracting the main content from a web page is a preprocessing step for web information systems. Wrapper-based content extraction is limited to a single information source and depends heavily on web page structure, so it is seldom used in practice. This paper therefore proposes a new content extraction method that identifies web page content from the number of Chinese punctuation marks and the ratio of non-hyperlink characters to hyperlink characters. The method can eliminate noise and extract the main content blocks of a web page effectively. Experimental results show that the approach is accurate and suitable for most Chinese web sites.
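The two signals named in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the punctuation set, the choice of block-level tags, the thresholds `min_punct` and `min_ratio`, and all function and class names are assumptions made for illustration, since the abstract does not specify them.

```python
from html.parser import HTMLParser

# Assumed punctuation set; the abstract does not list the marks actually used.
CHINESE_PUNCT = set("，。；：？！、“”‘’（）《》")
# Assumed block-level tags that delimit candidate content blocks.
BLOCK_TAGS = {"div", "td", "p", "li"}
# Tags whose text is never page content.
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Splits a page into text blocks and records the text found inside <a> tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []       # list of (block_text, link_text) pairs
        self._text = []
        self._link_text = []
        self._in_link = 0
        self._skip = 0

    def flush(self):
        """Close the current block and start a new one."""
        text = "".join(self._text).strip()
        if text:
            self.blocks.append((text, "".join(self._link_text)))
        self._text, self._link_text = [], []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip:
            return
        self._text.append(data)
        if self._in_link:
            self._link_text.append(data)


def visible_length(text):
    """Number of non-whitespace characters."""
    return sum(not ch.isspace() for ch in text)


def extract_main_content(html, min_punct=3, min_ratio=1.0):
    """Keep blocks with enough Chinese punctuation and a high enough
    ratio of non-hyperlink characters to hyperlink characters.
    Both thresholds are illustrative guesses, not values from the paper."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser.flush()
    kept = []
    for text, link_text in parser.blocks:
        punct = sum(ch in CHINESE_PUNCT for ch in text)
        link_chars = visible_length(link_text)
        non_link_chars = visible_length(text) - link_chars
        ratio = non_link_chars / link_chars if link_chars else float("inf")
        if punct >= min_punct and ratio >= min_ratio:
            kept.append(text)
    return kept


SAMPLE_HTML = """<html><body>
<div><a href="/a">首页</a> <a href="/b">新闻</a> <a href="/c">体育</a></div>
<div><p>今天，大连的天气很好。我们去了海边，看到了很多船；大家都很高兴！</p></div>
<div>版权所有 <a href="/about">关于我们</a></div>
</body></html>"""

if __name__ == "__main__":
    for block in extract_main_content(SAMPLE_HTML):
        print(block)
```

On the sample page, the navigation bar and the footer contain no Chinese punctuation and are dominated by link text, so only the article paragraph passes both checks. Real pages would likely need thresholds tuned per site and a way to merge adjacent content blocks, neither of which the abstract describes.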

Keywords: content extraction; wrapper; HTML tree; web page noise

Author: Mingqiu Song, Institute of System Engineering

Affiliation: Institute of System Engineering, Dalian University of Technology, Dalian, China

Conference type: International conference

Conference name: The 3rd IEEE International Conference on Wireless Communications, Networking and Mobile Computing

Conference location: Shanghai

Conference language: English

Online publication date: 2007-09-21 (the date the record first went online on the Wanfang platform, not the paper's publication date)
