Crawling Result Pages for Data Extraction based on URL Classification
In Web database integration, crawling data pages is important for data extraction. Because the data are spread across multiple result pages, accessing them for integration is difficult. It is therefore necessary to crawl query result pages from Web databases accurately and automatically. To address this problem, we propose a novel approach based on URL classification that effectively identifies result pages. In our approach, we compute the similarity between the URLs of hyperlinks in result pages and classify them into four categories. Each category maps to a set of similar Web pages, which separates result pages from other pages. We then use a page probing method to verify the correctness of the classification and improve the accuracy of the crawled result pages. The experimental results demonstrate that our approach effectively identifies the collection of result pages in a Web database and can improve the quality and efficiency of data extraction.
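The idea of grouping hyperlinks by URL similarity can be illustrated with a minimal sketch. This record does not give the paper's actual similarity measure or its four categories, so everything below is an assumption made for illustration: the Jaccard measure over structural URL features, the 0.6 threshold, and the hypothetical names `url_features`, `url_similarity`, and `group_urls`.

```python
from urllib.parse import urlsplit, parse_qsl


def url_features(url):
    """Structural features of a URL: host, positioned path segments,
    and query parameter names (parameter values are ignored)."""
    parts = urlsplit(url)
    feats = {("host", parts.netloc.lower())}
    segments = [s for s in parts.path.split("/") if s]
    for i, seg in enumerate(segments):
        feats.add(("path", i, seg))
    for name, _ in parse_qsl(parts.query, keep_blank_values=True):
        feats.add(("param", name))
    return feats


def url_similarity(a, b):
    """Jaccard similarity of the two feature sets (an assumed measure)."""
    fa, fb = url_features(a), url_features(b)
    union = fa | fb
    return len(fa & fb) / len(union) if union else 1.0


def group_urls(urls, threshold=0.6):
    """Greedy grouping: a URL joins the first group whose first member
    is at least `threshold` similar to it, else it starts a new group."""
    groups = []
    for url in urls:
        for group in groups:
            if url_similarity(url, group[0]) >= threshold:
                group.append(url)
                break
        else:
            groups.append([url])
    return groups


links = [
    "http://example.com/search?q=db&page=2",
    "http://example.com/search?q=db&page=3",
    "http://example.com/item?id=17",
    "http://example.com/about",
]
for group in group_urls(links):
    print(group)
```

In this sketch the two paging links differ only in a parameter value, so they fall into one group, while the detail link and the static link each form their own. A real crawler would still need a verification step, such as the page probing the abstract mentions, before treating a group as result pages.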
data extraction; result pages; URL classification
Tiezheng Nie, Zhenhua Wang, Yue Kou, Rui Zhang
Key Laboratory of Medical Image Computing, Northeastern University, Ministry of Education, Shenyang, China; College of Information Science and Engineering, Northeastern University, Shenyang, China
International conference
2010 Seventh Web Information System and Applications Conference (Seventh National Conference on Web Information Systems and Applications)
Hohhot
English
79-84
2010-08-20 (date first made available online on the Wanfang platform; not the paper's publication date)