A feature selection simultaneously based on intra-category and extra-category for text categorization
Text categorization is an important means of automatically processing information, the volume of which is growing exponentially. However, because of the high dimensionality of text corpora, many sophisticated classifiers cannot be used efficiently and effectively in text categorization, so feature selection has become a research focus in the field. In this paper, we propose a new feature selection method, named SIE, which simultaneously considers the number of documents that contain a feature within a category (intra-category) and outside it (extra-category). We compare the proposed method with four well-known feature selection methods using two classification algorithms on two text corpora. The experiments show that, in terms of accuracy, the proposed method performs significantly better than information gain, orthogonal centroid feature selection, and Poisson distribution, and performs comparably to the χ²-statistic when the Naive Bayes classifier and support vector machines are used.
feature selection; text categorization; dimensionality reduction
Zhiying Liu; Jieming Yang
College of Information Engineering, Northeast Dianli University, Jilin, Jilin, China; College of Information Engineering, Northeast Dianli University, Jilin, Jilin, China
International conference
Hangzhou
English
413-416
2011-08-26 (date first made available online on the Wanfang platform; not the paper's publication date)