Research on the categorization accuracy of different similarity measures on Chinese texts
This paper examines one of the most intensively studied classification algorithms, the k-Nearest Neighbor (kNN) algorithm. The purpose is to investigate how different similarity measures perform in kNN on Chinese texts. The two measures we focus on are the cosine value and the Jensen-Shannon divergence. We use both the corpus collected by Sogou, whose data are extracted from the Sohu.com website, and datasets that we have processed from the real world. The results of our experiments indicate that the choice of similarity metric significantly affects categorization accuracy.
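As an illustration of the two measures named in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes each document has already been word-segmented and reduced to a term-frequency dictionary; the function names (cosine_similarity, js_divergence, knn_classify) and the majority-vote rule are hypothetical choices made for this example.

```python
import math
from collections import Counter


def cosine_similarity(p, q):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(v * q.get(k, 0.0) for k, v in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)


def _normalize(counts):
    """Turn raw term counts into a probability distribution."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()} if total else {}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2), ranging from 0 to 1."""
    p, q = _normalize(p), _normalize(q)
    div = 0.0
    for k in set(p) | set(q):
        pk, qk = p.get(k, 0.0), q.get(k, 0.0)
        mk = 0.5 * (pk + qk)  # mixture distribution
        if pk > 0:
            div += 0.5 * pk * math.log2(pk / mk)
        if qk > 0:
            div += 0.5 * qk * math.log2(qk / mk)
    return div


def knn_classify(query, training, k, score):
    """Majority vote among the k training items with the highest score.

    `training` is a list of (term_frequency_dict, label) pairs.
    `score` must return larger values for more similar documents.
    """
    ranked = sorted(training, key=lambda item: score(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

Because the Jensen-Shannon divergence is smaller for more similar documents, it would be passed to knn_classify negated, for example score=lambda a, b: -js_divergence(a, b), whereas cosine_similarity can be passed directly.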
Chinese text categorization; Similarity; Sogou Corpus; kNN algorithm
Xiangdong LI, Hangyu LIU, Han JIA, Li HUANG
The School of Information Management, Wuhan University, Wuhan 430072, China; Wuhan University Library, Wuhan University, Wuhan 430072, China
International conference
Guangzhou
English
1-4
2011-05-13 (date first posted online on the Wanfang platform; does not represent the paper's publication date)