Research on Text Categorization of KNN Based on K-means for Class Imbalanced Problem
With the rapid development of the Web and the rapid growth of text information, organizing and managing this information effectively is a major challenge for information science. Automatic text classification can organize large volumes of text and help people retrieve information more efficiently, and it has become one of the most important research directions in information processing. Among the many mature text classification methods, the K-Nearest Neighbor (KNN) algorithm offers good accuracy, is suitable for multi-class problems, and has been widely used in document classification. However, when the training set is class-imbalanced, the classification results tend to be biased towards the majority class, which greatly reduces the accuracy of the classifier. To solve this problem, this paper proposes two strategies for improving the traditional KNN algorithm: constructing samples based on clustering, and weighting KNN by sample density. Four datasets with different class imbalance rates are extracted from the entire corpus, and the classic KNN, NWKNN, and Kmeans-KNN algorithms are each cross-validated on every dataset. The results show that, compared with the traditional KNN and NWKNN algorithms, the proposed method effectively improves classification accuracy and G-mean, and is more stable under class imbalance.
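The abstract reports G-mean alongside accuracy but does not define it. As a minimal sketch, assuming the common definition of G-mean as the geometric mean of per-class recall (the paper's exact formula is not given here), the following shows why accuracy alone can hide majority-class bias. The function name `g_mean` and the toy labels are illustrative assumptions, not taken from the paper.

```python
from collections import Counter
from math import prod


def g_mean(y_true, y_pred):
    """Geometric mean of per-class recall (assumed definition of G-mean)."""
    total = Counter(y_true)  # number of true samples per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    recalls = [correct[c] / total[c] for c in total]
    return prod(recalls) ** (1.0 / len(recalls))


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy imbalanced set (hypothetical): 8 majority samples, 2 minority samples.
y_true = ["maj"] * 8 + ["min"] * 2

# A classifier that always predicts the majority class.
always_majority = ["maj"] * 10

# A classifier that misses some majority samples but finds one minority sample.
mixed = ["maj"] * 6 + ["min"] * 2 + ["min", "maj"]

print(accuracy(y_true, always_majority), g_mean(y_true, always_majority))
print(accuracy(y_true, mixed), g_mean(y_true, mixed))
```

In this toy case the always-majority classifier has the higher accuracy but a G-mean of zero, because its recall on the minority class is zero. That is the kind of bias the abstract describes for KNN on imbalanced training sets.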
Text categorization; K-Nearest Neighbor; K-means; Class imbalanced problem
Wang Yu; Xu Linying
School of Computer Science and Technology, Tianjin University, Tianjin, China
International conference
Harbin
English
579-583
2016-07-21 (date first made available online on the Wanfang platform; not the paper's publication date)