A HIGH-PRECISION FORUM CRAWLER BASED ON VERTICAL CRAWLING
In this paper, we present a specialized crawler for Internet forums. Unlike a general crawler or a focused crawler, it extracts structured information directly, obtains the most valuable web resources while using the fewest system resources, filters out useless information to the maximum extent, and finally supplies users with high-precision information. The crawler adopts a template-based processing method that uses regular expressions to extract structured information. The URL queue is initialized with the URLs listed in a seeds file, and valuable URLs are extracted from web pages and added to the queue during crawling. Once a post's time falls outside the specified time span, or the web information is unchanged, the crawler skips it promptly to avoid wasting system resources. Experimental results demonstrate that our crawler collects real-time forum information more efficiently and precisely than other crawlers.
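The crawl loop summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the regular-expression templates, the HTML layout, the content-hash test for "unchanged" pages, the in-memory `PAGES` dictionary standing in for HTTP fetches, and all function and variable names are assumptions made for illustration.

```python
import hashlib
import re
from collections import deque
from datetime import datetime, timedelta

# Hypothetical template: one regular expression describing a forum's post layout.
POST_TEMPLATE = re.compile(
    r'<div class="post" data-time="(?P<time>[^"]+)">\s*'
    r'<h2>(?P<title>.*?)</h2>\s*<p>(?P<body>.*?)</p>\s*</div>',
    re.S,
)
# Hypothetical rule for "valuable" URLs: only board and thread pages are queued.
LINK_TEMPLATE = re.compile(r'href="(?P<url>/(?:board|thread)/[^"]+)"')


def crawl(seeds, fetch, now, max_age, seen_hashes):
    """Crawl from the seed URLs and return the extracted post records."""
    queue = deque(seeds)      # URL queue initialized from the seeds
    visited = set(seeds)
    posts = []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
        if seen_hashes.get(url) == digest:
            continue          # page unchanged since the last crawl: skip it
        seen_hashes[url] = digest
        for m in POST_TEMPLATE.finditer(html):
            posted = datetime.fromisoformat(m.group("time"))
            if now - posted > max_age:
                continue      # post is outside the specified time span
            posts.append({"url": url, "title": m.group("title"),
                          "body": m.group("body"), "time": posted})
        for m in LINK_TEMPLATE.finditer(html):
            link = m.group("url")
            if link not in visited:
                visited.add(link)
                queue.append(link)
    return posts


# Made-up pages standing in for a real forum; PAGES.get plays the role of fetch.
PAGES = {
    "/board/1": '<a href="/thread/10">a</a> <a href="/thread/11">b</a> '
                '<a href="/ads/5">ad</a>',
    "/thread/10": '<div class="post" data-time="2009-11-05T10:00:00">'
                  '<h2>Fresh</h2><p>new post</p></div>',
    "/thread/11": '<div class="post" data-time="2009-01-01T10:00:00">'
                  '<h2>Stale</h2><p>old post</p></div>',
}

cache = {}
first = crawl(["/board/1"], PAGES.get, datetime(2009, 11, 6),
              timedelta(days=7), cache)
second = crawl(["/board/1"], PAGES.get, datetime(2009, 11, 6),
               timedelta(days=7), cache)
```

In this sketch the first pass keeps only the post inside the seven-day window and never queues the advertisement link, while the second pass stops at the seed page because its hash is unchanged. A real crawler would likely need a finer-grained change test, since skipping an unchanged board page also skips the threads it links to.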
vertical crawler; forum; high-precision; template; structured information
Qing Gao, Bo Xiao, Zhiqing Lin, Xiyao Chen, Bing Zhou
Pattern Recognition and Intelligent System Laboratory (PRIS), Beijing University of Posts and Telecommunications, Beijing
International conference
Beijing
English
362-367
2009-11-06 (date first posted online on the Wanfang platform; not the paper's publication date)