Conference topic

A feature selection simultaneously based on intra-category and extra-category for text categorization

Text categorization is an important means of automatically processing information, the volume of which is increasing exponentially. However, because of the high dimensionality of text corpora, many sophisticated classifiers cannot be used efficiently and effectively in text categorization, so feature selection has become a research focus in this area. In this paper, we propose a new feature selection method, named SIE, which simultaneously considers the number of documents that contain a feature in the intra-category and in the extra-category. We compare the proposed method with four well-known feature selection methods using two classification algorithms on two text corpora. The experiments show that, in terms of accuracy, the proposed method performs significantly better than information gain, orthogonal centroid feature selection and Poisson distribution, and performs comparably to the χ²-statistic, when the Naive Bayes classifier and Support Vector Machines are used.

feature selection; text categorization; dimensionality reduction

Zhiying Liu; Jieming Yang

College of Information Engineering, Northeast Dianli University, Jilin, Jilin, China; College of Information Engineering, Northeast Dianli University, Jilin, Jilin, China

International conference

2011 Third International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC 2011)

Hangzhou

English

413-416

2011-08-26 (date first made available online on the Wanfang platform; not the paper's publication date)