Conference Topic

Research on the categorization accuracy of different similarity measures on Chinese texts

This paper studies the k-Nearest Neighbor (kNN) algorithm, one of the most intensively studied text categorization algorithms. The purpose is to investigate how different similarity measures perform in kNN on Chinese texts. The two measures we focus on are cosine similarity and Jensen-Shannon divergence. We use both the Sogou corpus, whose data is extracted from the Sohu.com website, and datasets that we have processed from real-world sources. The results of our experiments indicate that the choice of similarity measure significantly affects categorization accuracy.
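As an illustration of the two measures named in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes documents are already segmented into words and represented as sparse term-weight dictionaries. The function names (`cosine_similarity`, `js_divergence`, `js_similarity`, `knn_classify`), the conversion of divergence to similarity as `1 - JSD`, and the toy data are all assumptions made for this example.

```python
import math
from collections import Counter


def cosine_similarity(p, q):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    dot = sum(w * q.get(t, 0.0) for t, w in p.items())
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2, so the value lies in [0, 1]).

    Inputs are non-negative term weights; each is normalized to a
    probability distribution before the divergence is computed.
    """
    sum_p, sum_q = sum(p.values()), sum(q.values())
    if sum_p == 0 or sum_q == 0:
        raise ValueError("both vectors need at least one non-zero weight")
    div = 0.0
    for t in set(p) | set(q):
        pt = p.get(t, 0.0) / sum_p
        qt = q.get(t, 0.0) / sum_q
        mt = 0.5 * (pt + qt)  # mixture distribution M = (P + Q) / 2
        if pt > 0:
            div += 0.5 * pt * math.log2(pt / mt)
        if qt > 0:
            div += 0.5 * qt * math.log2(qt / mt)
    return div


def js_similarity(p, q):
    """Assumed conversion: lower divergence means higher similarity."""
    return 1.0 - js_divergence(p, q)


def knn_classify(query, training, k, similarity):
    """Majority vote among the k training documents most similar to query.

    training is a list of (vector, label) pairs.
    """
    ranked = sorted(training, key=lambda item: similarity(query, item[0]),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy, hand-made data (hypothetical; not taken from the Sogou corpus).
training = [
    ({"ball": 2, "match": 1}, "sports"),
    ({"ball": 1, "team": 1}, "sports"),
    ({"stock": 2, "market": 1}, "finance"),
]
query = {"ball": 1, "match": 1}
label_cos = knn_classify(query, training, 2, cosine_similarity)
label_js = knn_classify(query, training, 2, js_similarity)
```

Passing the similarity function as a parameter keeps the kNN voting step identical across measures, so any difference in the predicted labels comes from the measure alone.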

Chinese text categorization; Similarity; Sogou Corpus; kNN algorithm

Xiangdong LI Hangyu LIU Han JIA Li HUANG

The School of Information Management, Wuhan University, Wuhan 430072, China; Wuhan University Library, Wuhan University, Wuhan 430072, China

International conference

2011 International Conference on Business Management and Electronic Information (BMEI2011)

Guangzhou

English

1-4

2011-05-13 (date first posted online on the Wanfang platform; does not indicate the paper's publication date)