Characteristics and Uses of Labeled Datasets – ODP Case Study

Abstract:

Labeled datasets are essential for text categorization: they are used to train classifiers and serve as benchmark collections for evaluating categorization algorithms. However, labeling a large-scale document set is extremely expensive because it requires much human labour, and the labeling process itself is subjective. Labels assigned to documents by only one human editor, as in some existing labeled document sets, may therefore be of limited use and may prove problematic for training a classifier or evaluating categorization algorithms. This research explores a socially constructed Web directory, the Open Directory Project (ODP), to generate a series of labeled document sets by extracting semantic characteristics from the ODP categories, which are annotated by a list of indexed Websites. The generated document sets are used to classify Web search results, and the results are encouraging.

Authors: Dengya Zhu, Heinz Dreher

Affiliation: School of Information Systems, Curtin University, GPO Box U1987, Perth, WA, 6845, Australia

Conference type: International conference

Conference name: Sixth International Conference on Semantics, Knowledge and Grids (SKG 2010)

Conference location: Ningbo

Conference language: English

Pages: 227-234

Online publication date: 2010-11-01 (date first posted on the Wanfang platform; not the paper's publication date)
