Conference Topic

Research on Text Categorization of KNN Based on K-means for Class Imbalanced Problem

  With the rapid development of the Web and the rapid expansion of text information, effectively organizing and managing this information is a major challenge for information science. Automatic text classification can organize large numbers of texts and help people retrieve information more efficiently, and it has become one of the most important research directions in information processing. Among the many mature text classification methods, the K-Nearest Neighbor (KNN) algorithm has good accuracy, is suitable for multi-class problems, and has been widely used in document classification. However, when the training set is class imbalanced, the classification results tend to be biased towards the majority class, so the accuracy of the classifier is greatly reduced. To solve this problem, this paper proposes two strategies to improve the traditional KNN algorithm: constructing samples based on clustering, and weighting KNN by sample density. Four datasets with different class imbalance rates are extracted from the entire corpus, and the classic KNN, NWKNN, and Kmeans-KNN algorithms are cross-validated on each dataset. The results show that, compared with the traditional KNN and NWKNN algorithms, the proposed method effectively improves classification accuracy and G-mean, and is more stable under class imbalance.
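The abstract reports G-mean alongside accuracy. The record does not give the paper's exact formula, so the following is only a minimal sketch under the common definition of G-mean as the geometric mean of per-class recall; the function name `g_mean` and the toy labels are hypothetical and not taken from the paper.

```python
import math
from collections import Counter


def g_mean(y_true, y_pred):
    """Geometric mean of per-class recall (assumed definition, not the paper's)."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    # Number of samples per true class.
    totals = Counter(y_true)
    # Number of correctly classified samples per true class.
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    # Recall of each class; a class with no correct prediction has recall 0.
    recalls = [hits[c] / totals[c] for c in totals]
    return math.prod(recalls) ** (1.0 / len(recalls))


# Toy imbalanced example: 8 majority samples, 2 minority samples.
y_true = ["maj"] * 8 + ["min"] * 2
# A classifier that always predicts the majority class.
always_majority = ["maj"] * 10
accuracy = sum(t == p for t, p in zip(y_true, always_majority)) / len(y_true)
print(accuracy, g_mean(y_true, always_majority))
```

In this toy case a classifier that ignores the minority class still reaches high accuracy while its G-mean collapses to zero, which is one reason G-mean is often reported for class-imbalanced data.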

Text categorization; K-Nearest Neighbor; K-means; Class imbalanced problem

Wang Yu, Xu Linying

School of Computer Science and Technology, Tianjin University, Tianjin, China

International conference

2016 Sixth International Conference on Instrumentation and Measurement, Computer, Communication and Control (IMCCC2016)

Harbin

English

579-583

2016-07-21 (date first made available online on the Wanfang platform; not the paper's publication date)