AN IMPROVED DOCUMENT CLASSIFICATION APPROACH WITH MAXIMUM ENTROPY AND ENTROPY FEATURE SELECTION

Abstract:

Document classification is an important task in document management. Existing approaches have known weaknesses: the Bayesian model relies on a feature independence assumption, artificial neural networks suffer from overfitting, and the support vector machine (SVM) does not perform well with real-valued features. This paper proposes combining entropy with machine learning techniques for document classification. First, cross entropy and average mutual information are used to extract features effectively. Second, the support vector machine and the maximum entropy (ME) model are each applied to perform classification in the resulting feature space. Furthermore, an improved feature representation that replaces binary features with real-valued ones is presented, since prior knowledge about each word is helpful for document classification. Finally, we compare our method with traditional methods; in the experiments, our method improved the F-measure by 2.78% over the basic ME model and by 0.95% over a naive Bayes model smoothed with the Good-Turing algorithm.
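The feature selection step named in the abstract can be illustrated with a minimal sketch. The abstract does not give the formulas, so the code below assumes the common textbook definitions of average mutual information (between a term's presence and the class label) and expected cross entropy; the function names, the toy corpus, and the base-2 logarithm are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter

def average_mutual_information(docs, labels, term):
    """Mutual information (in bits) between presence of `term` and the class label.

    Assumed textbook definition, not taken from the paper.
    """
    n = len(docs)
    # Joint counts over (term present?, class label).
    joint = Counter((term in doc, label) for doc, label in zip(docs, labels))
    presence_counts = Counter()
    class_counts = Counter()
    for (present, label), count in joint.items():
        presence_counts[present] += count
        class_counts[label] += count
    score = 0.0
    for (present, label), count in joint.items():
        p_joint = count / n
        p_indep = (presence_counts[present] / n) * (class_counts[label] / n)
        score += p_joint * math.log2(p_joint / p_indep)
    return score

def expected_cross_entropy(docs, labels, term):
    """P(t) * sum_c P(c|t) * log2(P(c|t) / P(c)); assumed textbook definition."""
    n = len(docs)
    labels_with_term = [label for doc, label in zip(docs, labels) if term in doc]
    if not labels_with_term:
        return 0.0  # a term that never occurs carries no information
    p_term = len(labels_with_term) / n
    class_counts = Counter(labels)
    total = 0.0
    for label, count in Counter(labels_with_term).items():
        p_class_given_term = count / len(labels_with_term)
        p_class = class_counts[label] / n
        total += p_class_given_term * math.log2(p_class_given_term / p_class)
    return p_term * total

# Hypothetical toy corpus: each document is a list of tokens.
docs = [
    ["stock", "market", "report"],
    ["stock", "price", "report"],
    ["match", "score", "report"],
    ["team", "match", "win"],
]
labels = ["finance", "finance", "sport", "sport"]

vocab = sorted({word for doc in docs for word in doc})
ranked = sorted(
    vocab,
    key=lambda t: average_mutual_information(docs, labels, t),
    reverse=True,
)
```

In this toy corpus, terms that occur in only one class (such as "stock" and "match") rank above a term like "report" that appears in both. Keeping the top-ranked terms would give the feature space in which a classifier is then trained.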

Keywords: Support vector machine; Entropy; Feature extraction; Maximum entropy model; Document classification

Authors: XIU-LI PANG, YU-QIANG FENG, WEI JIANG

Affiliations: School of Management and Science, Harbin Institute of Technology; School of Computer Science and Technology, Harbin Institute of Technology, 150001, Harbin, China

Conference type: International conference

Conference name: 2007 International Conference on Machine Learning and Cybernetics (IEEE Sixth International Conference on Machine Learning and Cybernetics)

Conference location: Hong Kong

Conference language: English

Pages: 3911-3915

Online publication date: 2007-08-19 (date first posted on the Wanfang platform; not the paper's publication date)
