Crawling Result Pages for Data Extraction based on URL Classification

Abstract:

In Web database integration, crawling data pages is important for data extraction. Because the data are spread across multiple result pages, accessing them for integration becomes more difficult. It is therefore necessary to crawl query result pages from Web databases accurately and automatically. To address this problem, we propose a novel approach based on URL classification that effectively identifies result pages. In our approach, we compute the similarity between the URLs of hyperlinks in result pages and classify them into four categories. Each category maps to a set of similar Web pages, which separates result pages from other pages. We then use a page probing method to verify the correctness of the classification and improve the accuracy of the crawled result pages. The experimental results demonstrate that our approach is effective at identifying the collection of result pages in a Web database and can improve the quality and efficiency of data extraction.
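The abstract does not give the similarity measure or define the four categories, so the following is only a minimal sketch of the general idea of grouping hyperlinks by URL similarity. It is not the authors' method. The Jaccard measure over structural URL features, the greedy grouping, the 0.7 threshold, the function names, and the example.com URLs are all assumptions made for illustration; the page probing step is not covered.

```python
from urllib.parse import urlsplit, parse_qsl


def url_features(url):
    """Return structural features of a URL: host, positioned path
    segments, and query parameter names (parameter values are ignored)."""
    parts = urlsplit(url)
    feats = {("host", parts.netloc.lower())}
    segments = [s for s in parts.path.split("/") if s]
    for i, seg in enumerate(segments):
        feats.add(("path", i, seg))
    for name, _ in parse_qsl(parts.query, keep_blank_values=True):
        feats.add(("param", name))
    return frozenset(feats)


def similarity(a, b):
    """Jaccard similarity of the two URLs' feature sets (assumed measure)."""
    fa, fb = url_features(a), url_features(b)
    union = fa | fb
    return len(fa & fb) / len(union) if union else 1.0


def group_urls(urls, threshold=0.7):
    """Greedily place each URL in the first group whose first member is
    similar enough; otherwise start a new group."""
    groups = []
    for url in urls:
        for group in groups:
            if similarity(url, group[0]) >= threshold:
                group.append(url)
                break
        else:
            groups.append([url])
    return groups


if __name__ == "__main__":
    links = [
        "http://example.com/search?q=db&page=2",
        "http://example.com/search?q=db&page=3",
        "http://example.com/about",
    ]
    for group in group_urls(links):
        print(group)
```

Under these assumptions, pagination links that differ only in a parameter value fall into one group, while navigation links such as an "about" page end up elsewhere. A real crawler would still need a verification step, such as the page probing the abstract mentions, before trusting any group as result pages.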

Keywords: component; data extraction; result pages; URL classification

Authors: Tiezheng Nie, Zhenhua Wang, Yue Kou, Rui Zhang

Affiliations: Key Laboratory of Medical Image Computing, Northeastern University, Ministry of Education, Shenyang C; College of Information Science and Engineering, Northeastern University, Shenyang, China

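Conference type: International conference

Conference name: 2010 Seventh Web Information System and Applications Conference (第七届全国web信息系统及其应用学术会议)

Conference location: Hohhot

Conference language: English

Pages: 79-84

Online publication date: 2010-08-20 (date of first posting on the Wanfang platform; not the publication date of the paper)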
