Cluster Based Symbolic Representation and Feature Selection for Text Classification

Abstract:

In this paper, we propose a new method of representing documents based on clustering of term frequency vectors. For each class of documents, we propose to create multiple clusters to preserve the intraclass variations. The term frequency vectors of each cluster are used to form a symbolic representation by means of interval-valued features. Subsequently, we propose a novel symbolic method for feature selection. The corresponding symbolic text classification is also presented. To corroborate the efficacy of the proposed model, we conducted experiments on various datasets. Experimental results reveal that the proposed method gives better results than state-of-the-art techniques. In addition, as the method is based on a simple matching scheme, it requires negligible time.
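
The abstract does not give the construction in detail, so the following is only a minimal sketch of the general idea, under stated assumptions: the clusters are already available (the keywords suggest Fuzzy C Means, which is not implemented here), each term's interval is taken as the [min, max] of its frequencies within a cluster, and a document is assigned to the class whose cluster intervals contain the most of its term frequencies. The function names and the toy data are hypothetical and are not taken from the paper.

```python
# Illustrative sketch only; the interval definition and matching rule
# are assumptions, not the paper's exact formulation.

def interval_representation(tf_vectors):
    """Summarize one cluster of term-frequency vectors as per-term (low, high) intervals."""
    columns = zip(*tf_vectors)  # one column per term
    return [(min(col), max(col)) for col in columns]

def match_score(doc_vector, intervals):
    """Count the terms whose frequency falls inside the cluster's interval."""
    return sum(1 for v, (lo, hi) in zip(doc_vector, intervals) if lo <= v <= hi)

def classify(doc_vector, labelled_intervals):
    """Return the label of the cluster with the highest match score.

    labelled_intervals is a list of (label, intervals) pairs; a class may
    contribute several clusters, each with its own intervals.
    """
    best_label, _ = max(labelled_intervals,
                        key=lambda pair: match_score(doc_vector, pair[1]))
    return best_label

# Toy data: two clusters of term-frequency vectors over three terms.
sports_cluster = [[3, 0, 1], [4, 1, 0]]
finance_cluster = [[0, 5, 2], [1, 4, 3]]

model = [
    ("sports", interval_representation(sports_cluster)),
    ("finance", interval_representation(finance_cluster)),
]

label = classify([4, 0, 1], model)
```

Because classification in this sketch reduces to interval membership checks, its cost is linear in the number of terms per cluster, which is consistent with the abstract's remark about a simple matching scheme.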

Keywords: Text Document; Term Frequency Vector; Fuzzy C Means; Symbolic Representation; Interval Valued Features; Symbolic Feature Selection; Text Classification

Authors: B.S. Harish, D.S. Guru, S. Manjunath, R. Dinesh

Affiliations: Department of Studies in Computer Science, University of Mysore, Mysore 570 006, India; Honeywell Technologies Ltd, Bangalore, India

Conference type: International conference

Conference name: 6th International Conference on Advanced Data Mining and Applications (ADMA 2010)

Conference location: Chongqing

Conference language: English

Pages: 158-166

Online publication date: 2010-11-19 (the date the record first appeared on the Wanfang platform; not the paper's publication date)
