Web Informative Content Block Detecting Based on Entropy and Parent-Child Relationship in DOM
To increase the commercial value and accessibility of their pages, most sites publish them with redundant information such as navigation panels, advertisements, and copyright announcements. This redundant information appears in almost every page of a website, which inflates the index size of general search engines and causes page topics to drift. In this paper, we propose an informative content block detection system called WICBDPCR (Web Informative Content Block Detecting based on Parent-Child Relationship in the document object model), which applies information theory to the DOM tree in order to detect the informative structure. Experiments on several real commercial Web sites show high precision and recall rates, which validate WICBDPCR's practical applicability.
Yanhui Ding, Qingzhong Li, Zhongmin Yan, Yongquan Dong
School of Computer Science and Technology, Shandong University, Jinan, Shandong Province, P.R. China
International conference
2008 IEEE International Conference on Information and Automation
Zhangjiajie
English
175-178
2008-06-20 (date first posted online on the Wanfang platform; not the paper's publication date)