Classifier-guided topical crawler: a novel method of automatically labeling the positive URLs
Obtaining labeled training samples is a key factor in building a classifier-guided topical crawler. Currently, many such classifiers are trained on web pages that are labeled manually or extracted from the Open Directory Project (ODP), and the trained classifiers then judge the topical relevance of the web pages pointed to by hyperlinks in the crawler frontier. Although labeled web pages are comparatively easy to obtain, training the classifiers on web pages violates the basic machine-learning assumption that training and test sets are independent and identically distributed (i.i.d.), because the instances to be classified are hyperlinks (URLs) rather than web pages. For this reason, this paper investigates and proposes a novel template-based method for automatically labeling positive URLs in order to develop classifier-guided topical crawlers. A series of off-line and on-line experiments is performed. The results demonstrate that the classifier-guided topical crawler trained with labeled URLs achieves higher recall than the one trained with labeled web pages. The results also show that the classifier using the immediate vicinity of hyperlinks and the corresponding anchor texts leads the crawler to attain a harvest rate of about 87%.
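The abstract does not give implementation details, so the following is only a minimal sketch of two notions it mentions: the link context of a hyperlink (its anchor text plus the words in its immediate vicinity) and the harvest rate of a crawl. The names `LinkContextParser`, `link_contexts`, `harvest_rate`, and the `window` parameter are hypothetical and chosen for illustration. The harvest rate is taken here in its common sense, the fraction of crawled pages judged relevant, which may differ in detail from the paper's definition.

```python
from html.parser import HTMLParser


class LinkContextParser(HTMLParser):
    """Collect each hyperlink's URL, anchor text, and nearby words."""

    def __init__(self, window=5):
        super().__init__()
        self.window = window  # assumed: number of words kept on each side
        self.tokens = []      # all text words in document order
        self.links = []       # entries of [href, start_index, end_index]
        self._open = None     # the link currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._open = [href, len(self.tokens), None]

    def handle_endtag(self, tag):
        if tag == "a" and self._open is not None:
            self._open[2] = len(self.tokens)
            self.links.append(self._open)
            self._open = None

    def handle_data(self, data):
        self.tokens.extend(data.split())

    def contexts(self):
        out = []
        for href, start, end in self.links:
            before = self.tokens[max(0, start - self.window):start]
            after = self.tokens[end:end + self.window]
            out.append({
                "url": href,
                "anchor": " ".join(self.tokens[start:end]),
                "vicinity": " ".join(before + after),
            })
        return out


def link_contexts(html, window=5):
    """Return one context record per hyperlink found in the HTML string."""
    parser = LinkContextParser(window)
    parser.feed(html)
    parser.close()
    return parser.contexts()


def harvest_rate(relevant_flags):
    """Fraction of crawled pages judged relevant (0.0 for an empty crawl)."""
    flags = list(relevant_flags)
    return sum(flags) / len(flags) if flags else 0.0
```

In such a setup, the `anchor` and `vicinity` strings would be turned into features for a classifier (the keywords name an SVM), so that training and test instances are both link contexts rather than full pages.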
SVM classifier; topical crawler; link context; training set
CHEN Li; LI Zhi-shu; YU Zhong-hua; HAN Guo-hui
College of Computer Science, Sichuan University, Chengdu, Sichuan, China
International conference
Fifth International Conference on Semantics, Knowledge and Grid (SKG 2009)
Zhuhai
English
270-273
2009-10-12 (date first made available online on the Wanfang platform; not the paper's publication date)