Classifier-guided topical crawler: a novel method of automatically labeling the positive URLs

Abstract:

Obtaining labeled training samples is a key requirement for building a classifier-guided topical crawler. Currently, many such classifiers are trained on Web pages that are labeled manually or extracted from the Open Directory Project (ODP); the classifiers then judge the topical relevance of the Web pages pointed to by hyperlinks in the crawler frontier. Although labeled Web pages are comparatively easy to obtain, training on them violates the basic machine learning assumption that training and test sets are independent and identically distributed (i.i.d.), because the instances being classified are hyperlinks (URLs) rather than Web pages. For this reason, this paper proposes a novel template-based method for automatically labeling positive URLs in order to develop classifier-guided topical crawlers. A series of off-line and on-line experiments is performed. The results demonstrate that a classifier-guided topical crawler trained with labeled URLs achieves higher recall than one trained with labeled Web pages. The results also show that a classifier using the immediate vicinity of hyperlinks and the corresponding anchor texts leads the crawler to attain a harvest rate of about 87%.
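
As an illustration only (this record does not include the paper's implementation), the link context mentioned in the abstract, i.e. the anchor text plus the immediate vicinity of a hyperlink, and the harvest rate metric could be sketched in Python as follows. The names `LinkContextParser`, `link_contexts`, and `harvest_rate`, the five-token window, and the whitespace tokenization are all assumptions; harvest rate is taken in its usual sense as the fraction of fetched pages that are relevant.

```python
from html.parser import HTMLParser


class LinkContextParser(HTMLParser):
    """Collect the anchor text and nearby words of each hyperlink.

    Hypothetical helper: the window size and tokenization are
    assumptions, not details taken from the paper.
    """

    def __init__(self, window=5):
        super().__init__()
        self.window = window
        self.tokens = []   # all text tokens, in document order
        self.links = []    # [href, start_index, end_index] per link
        self._open = None  # the link currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._open = [href, len(self.tokens), None]

    def handle_endtag(self, tag):
        if tag == "a" and self._open is not None:
            self._open[2] = len(self.tokens)
            self.links.append(self._open)
            self._open = None

    def handle_data(self, data):
        self.tokens.extend(data.split())

    def contexts(self):
        """Return (url, anchor_text, vicinity_text) tuples."""
        out = []
        for href, start, end in self.links:
            anchor = self.tokens[start:end]
            before = self.tokens[max(0, start - self.window):start]
            after = self.tokens[end:end + self.window]
            out.append((href, " ".join(anchor), " ".join(before + after)))
        return out


def link_contexts(html, window=5):
    """Extract link contexts from an HTML string."""
    parser = LinkContextParser(window)
    parser.feed(html)
    parser.close()
    return parser.contexts()


def harvest_rate(relevance_flags):
    """Fraction of fetched pages judged relevant (0.0 if none fetched)."""
    flags = list(relevance_flags)
    return sum(flags) / len(flags) if flags else 0.0
```

The tuples returned by `link_contexts` are the kind of URL-level instances that a text classifier such as an SVM could be trained on, in place of whole Web pages.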

Keywords: SVM classifier; topical crawler; link context; training set

Authors: CHEN Li, LI Zhi-shu, YU Zhong-hua, HAN Guo-hui

Affiliation: College of Computer Science, Sichuan University, Chengdu, Sichuan, China

Conference type: International conference

Conference: Fifth International Conference on Semantics, Knowledge and Grid (SKG 2009)

Conference location: Zhuhai

Conference language: English

Pages: 270-273

Online publication date: 2009-10-12 (date first made available on the Wanfang platform; not the paper's publication date)
