Finding and Using the Content Texts of HTML Pages

Abstract:

A novel algorithm for finding the content text in an HTML page is proposed, based on a number of features of the textual blocks in the page. Experiments show that the new algorithm outperforms known ones in terms of both the ratio of correctly removed noise blocks and the ratio of correctly found content blocks. The application of the algorithm to hidden web classification is also demonstrated.

Keywords: page cleaning; page segmentation; content extraction

Authors: Jun Ma, Zhumin Chen, Li Lian, Lianxia Li

Affiliation: The College of Computer Science and Technology, Shandong University, Jinan, China

Conference type: International conference

Conference name: 4th Asia Information Retrieval Symposium (AIRS 2008)

Conference location: Harbin

Conference language: English

Pages: 656-662

Online publication date: 2008-01-16 (the date the record first appeared on the Wanfang platform; this is not the paper's publication date)

Conference topic

Finding and Using the Content Texts of HTML Pages