Finding and Using the Content Texts of HTML Pages
A novel algorithm for finding the content text of an HTML page is proposed, based on a number of features of the textual blocks in the page. Experiments show that the new algorithm outperforms known ones on two ratios: correctly removed noise blocks and correctly found content blocks. The application of the algorithm to hidden web classification is also demonstrated.
page cleaning; page segmentation; content extraction
Jun Ma, Zhumin Chen, Li Lian, Lianxia Li
The College of Computer Science and Technology, Shandong University, Jinan, China
International conference
4th Asia Information Retrieval Symposium (AIRS 2008)
Harbin
English
656-662
2008-01-16 (date first posted online on the Wanfang platform; not the paper's publication date)