Conference topic

A HIGH-PRECISION FORUM CRAWLER BASED ON VERTICAL CRAWLING

In this paper, we present a specialized crawler for Internet forums. Unlike a general crawler or a focused crawler, it obtains structured information directly, gathers the most valuable web resources while using minimal system resources, filters out useless information as far as possible, and finally supplies users with high-precision information. The crawler adopts a template-based processing method, which uses regular expressions to extract structured information. The URL queue is initialized with the URLs listed in a seeds file, and valuable URLs are extracted from web pages and added to the queue during the crawling process. When the time of a post falls outside the specified time span, or the web information is unchanged, the crawler skips it promptly to avoid wasting system resources. Experimental results demonstrate that our crawler can collect real-time forum information more efficiently and precisely than other crawlers.
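The workflow the abstract describes (seed-initialized URL queue, regular-expression templates, skipping stale or unchanged pages) can be illustrated with a minimal sketch. It is not the authors' implementation: the HTML layout, the names `POST_TEMPLATE`, `LINK_TEMPLATE`, `PAGES`, and `crawl`, and the use of a SHA-256 digest to detect unchanged pages are all assumptions made for illustration, and an in-memory dictionary stands in for real HTTP fetching.

```python
import hashlib
import re
from collections import deque
from datetime import datetime, timedelta

# Hypothetical template for one forum layout: named groups give structured fields.
POST_TEMPLATE = re.compile(
    r'<div class="post" data-time="(?P<time>[^"]+)">\s*'
    r'<h2>(?P<title>.*?)</h2>\s*<p>(?P<body>.*?)</p>\s*</div>',
    re.S,
)
# Hypothetical rule for "valuable" URLs: only thread links are queued.
LINK_TEMPLATE = re.compile(r'<a href="(?P<url>/thread/\d+)">')

# Stand-in for the web: URL -> page source (assumed layout, not a real forum).
PAGES = {
    "/board": '<a href="/thread/1">a</a> <a href="/thread/2">b</a>',
    "/thread/1": '<div class="post" data-time="2009-11-05 10:00">\n'
                 '<h2>New</h2>\n<p>fresh post</p>\n</div>',
    "/thread/2": '<div class="post" data-time="2009-01-01 09:00">\n'
                 '<h2>Old</h2>\n<p>stale post</p>\n</div>',
}


def crawl(seeds, fetch, now, span, seen_hashes):
    """Crawl from the seed URLs and return posts that fall inside the time span."""
    queue, visited, posts = deque(seeds), set(seeds), []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # page could not be fetched
        digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
        if seen_hashes.get(url) == digest:
            continue  # page unchanged since the previous crawl: skip it entirely
        seen_hashes[url] = digest
        for match in POST_TEMPLATE.finditer(html):
            posted = datetime.strptime(match.group("time"), "%Y-%m-%d %H:%M")
            if now - posted > span:
                continue  # post is older than the specified time span: skip it
            posts.append({
                "url": url,
                "time": posted,
                "title": match.group("title"),
                "body": match.group("body"),
            })
        for match in LINK_TEMPLATE.finditer(html):
            link = match.group("url")
            if link not in visited:
                visited.add(link)
                queue.append(link)
    return posts


cache = {}  # digests remembered between crawls
now = datetime(2009, 11, 6)
first = crawl(["/board"], PAGES.get, now, timedelta(days=7), cache)
second = crawl(["/board"], PAGES.get, now, timedelta(days=7), cache)
```

In this sketch the first crawl keeps only the post from `/thread/1`, because the post in `/thread/2` is older than the seven-day span. The second crawl returns nothing, because the seed page's digest matches the cached one and the page is skipped.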

vertical crawler; forum; high-precision; template; structured information

Qing Gao, Bo Xiao, Zhiqing Lin, Xiyao Chen, Bing Zhou

Pattern Recognition and Intelligent System Laboratory (PRIS), Beijing University of Posts and Telecommunications, Beijing

International conference

2009 IEEE International Conference on Network Infrastructure and Digital Content (IEEE IC-NIDC2009)

Beijing

English

362-367

2009-11-06 (date first posted online on the Wanfang platform; not the paper's publication date)