Conference paper

Data Integration in Deep-Web Form

  Over the past decade, the amount of data on the web has grown exponentially, and most of it can be accessed only through search interfaces, that is, it is provided in deep-web form. Crawling deep-web data sources has attracted growing attention from academia and industry, because the data they hold is the foundation of many popular applications. For a long time, however, researchers and practitioners have emphasized the effectiveness of crawlers while ignoring their efficiency, i.e., the crawling time. Crawler performance has generally been measured as crawling coverage over communication cost, since bandwidth was expensive at the time. Nowadays, with improved network conditions, high throughput has become a basic requirement for crawlers, because people need crawlers to collect more timely and more comprehensive information. To address this issue, we propose an efficient incremental crawling method based on a cost model of coverage over time. The key to our method is to find, at each iteration, the proper number of high-quality queries that maximizes crawling coverage per unit time. An elaborate greedy set-covering algorithm and an interval search algorithm are used to generate multiple appropriate queries that approach the optimal result. Our method has been extensively tested on various data sources and compared with two state-of-the-art crawling methods. The empirical results show that our method is significantly superior to the two classic methods under the coverage-time model when the network is fast enough. In addition, our method also performs very well under the cost model of coverage over bandwidth consumption.
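The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea of greedy set covering under a coverage-over-time cost model, not the authors' implementation. It assumes that, for each candidate query, the set of matching document ids (for example, estimated from a local sample) and an estimated time cost are already known; the names `greedy_select_queries`, `matches`, `cost`, and `k` are hypothetical, and the interval search used to choose the number of queries is not modeled here.

```python
from typing import Dict, List, Set


def greedy_select_queries(
    matches: Dict[str, Set[int]],  # assumed: query -> ids of documents it returns
    cost: Dict[str, float],        # assumed: query -> estimated time to issue it
    k: int,                        # assumed: number of queries to pick this iteration
) -> List[str]:
    """Pick up to k queries, each time taking the query that adds the most
    not-yet-covered documents per unit of estimated time."""
    covered: Set[int] = set()
    chosen: List[str] = []
    remaining = set(matches)
    while remaining and len(chosen) < k:
        # Rate = newly covered documents / estimated time.
        # Iterating in sorted order makes ties resolve to the smallest name.
        best = max(
            sorted(remaining),
            key=lambda q: len(matches[q] - covered) / cost[q],
        )
        if not matches[best] - covered:
            break  # no remaining query would add new documents
        chosen.append(best)
        covered |= matches[best]
        remaining.remove(best)
    return chosen
```

In this sketch a cheap query that returns a few new documents can outrank an expensive query that returns many, which is the trade-off a coverage-per-unit-time model is meant to capture.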

deep web; query selection; crawling efficiency; document frequency

Yanhuan Tan, Yan Wang, Xin Jin, Lubin Wang

School of Information, Central University of Finance and Economics, Beijing, China

International conference

The 12th International Conference on Management of e-Commerce and e-Government (ICMeCG 2018) (第十二届电子商务与电子政务管理国际会议)

Zhengzhou

English

366-373

2018-09-21 (date first made available online on the Wanfang platform; not the paper's publication date)