New Feature Selection Methods Based on Context Similarity for Text Categorization

Abstract:

High dimensionality of the feature space is one of the most important concerns in text categorization, and feature selection is widely used to reduce the dimensionality of the feature space and speed up computation without damaging performance. However, many traditional feature selection methods treat each feature separately and are context independent. To address this problem, this paper first studies four well-known frequency-based feature selection methods: Gini Index (GI), Document Frequency (DF), Class Discriminating Measure (CDM), and Accuracy Balanced (Acc2). To incorporate context information, we then calculate the importance of a feature by measuring the similarity of its contexts across the documents, rather than by the frequency of documents containing the feature. On this basis we propose four new context-similarity-based feature selection methods: GIcs, DFcs, CDMcs, and Acc2cs. They are evaluated on different data sets and compared against the four corresponding frequency-based methods. The experimental results show that the context-similarity-based methods outperform the corresponding frequency-based methods in terms of micro- and macro-F1 measures on both binary and multi-class classification problems. Benefiting from the multi-word information surrounding features, the context-similarity-based feature selection methods are effective for article categorization.
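As a rough illustration of the frequency-based baselines named in the abstract, the sketch below scores terms by Document Frequency and by Acc2. It is not the authors' implementation: the function names and the toy corpus are hypothetical, and Acc2 is assumed here to follow the common definition |tpr - fpr|, the absolute difference between the fraction of positive-class and negative-class documents containing the term. The proposed context-similarity variants are not reproduced, because the abstract does not define them.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each term, how many documents contain it at least once."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # set() so a term counts once per document
    return df


def acc2_scores(docs, labels, positive):
    """Score each term by |tpr - fpr| for one positive class (assumed Acc2 definition)."""
    pos_docs = [set(d) for d, y in zip(docs, labels) if y == positive]
    neg_docs = [set(d) for d, y in zip(docs, labels) if y != positive]
    vocab = set().union(*pos_docs, *neg_docs)
    scores = {}
    for term in vocab:
        # Fraction of positive / negative documents that contain the term.
        tpr = sum(term in d for d in pos_docs) / len(pos_docs) if pos_docs else 0.0
        fpr = sum(term in d for d in neg_docs) / len(neg_docs) if neg_docs else 0.0
        scores[term] = abs(tpr - fpr)
    return scores


# Hypothetical toy corpus: tokenized documents with one class label each.
docs = [
    ["stock", "market", "rises"],
    ["market", "falls"],
    ["team", "wins", "match"],
    ["team", "loses"],
]
labels = ["finance", "finance", "sport", "sport"]

df = document_frequency(docs)
acc2 = acc2_scores(docs, labels, positive="finance")

# Keep the k highest-scoring terms as the selected feature set.
k = 2
selected = sorted(acc2, key=acc2.get, reverse=True)[:k]
```

Both scores depend only on whether a term occurs in a document, not on the words around it, which is the context independence the abstract identifies as the limitation of the frequency-based methods.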

Authors: Yifei Chen, Bingqing Han, Ping Hou

Affiliations: School of Information Science, Nanjing Audit University, 86 Yushan Rd (W), Nanjing, P.R. China; Fondazione Bruno Kessler (FBK-irst), Trento, Italy

Conference type: International conference

Conference name: The 2014 10th International Conference on Natural Computation (ICNC 2014) and the 2014 11th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD 2014)

Conference location: Xiamen

Conference language: English

Pages: 607-613

Online publication date: 2014-08-19 (date first posted on the Wanfang platform; not the paper's publication date)
