Web Page Sectioning Using Regex-based Template
This work presents a novel, site-specific web page segmentation and section importance detection algorithm that leverages structural, content, and visual information. The structural and content information is captured by a template, a generalized regular expression learned over a set of pages. The template, combined with visual information, yields high sectioning accuracy. Experimental results demonstrate the effectiveness of the approach.
Site-specific segmentation; Site-specific noise elimination; Tree-based reg-ex
Rupesh R. Mehta, Amit Madaan
Yahoo! R&D Bangalore, India
International conference
The 17th International World Wide Web Conference (WWW08)
Beijing
English
2008-04-21 (date first posted online on the Wanfang platform; does not represent the paper's publication date)