Feature Selection on Chinese Text Classification Using Character N-Grams

Abstract:

In this paper, we perform Chinese text classification using an n-gram text representation on TanCorp, a new large corpus built specifically for Chinese text classification that contains more than 14,000 texts divided into 12 classes. We use different n-gram features (1- and 2-grams, or 1-, 2-, and 3-grams) to represent documents. Different feature weights (absolute text frequency, relative text frequency, absolute n-gram frequency, and relative n-gram frequency) are compared. The sparseness of the document-by-feature matrices is analyzed in various cases. We use the C-SVC classifier, an SVM algorithm designed for multi-class classification. We perform our experiments on the TANAGRA platform. We found that the feature selection methods based on n-gram frequency (absolute or relative) always give better results and produce denser matrices.
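As a rough illustration of the representation the abstract describes, the sketch below extracts character 1- and 2-grams from a toy corpus and computes four per-n-gram statistics. The abstract does not define the four weights, so the definitions used here are assumptions: text frequency is taken as the number of documents containing an n-gram, n-gram frequency as its total number of occurrences, and the relative variants as those counts divided by the number of documents and the total number of n-gram occurrences respectively. The function names `char_ngrams`, `corpus_statistics`, and `select_features` are hypothetical and are not taken from the paper or from TANAGRA.

```python
from collections import Counter


def char_ngrams(text, n_values=(1, 2)):
    """Return all character n-grams of the given lengths, skipping whitespace."""
    chars = [c for c in text if not c.isspace()]
    grams = []
    for n in n_values:
        grams.extend("".join(chars[i:i + n]) for i in range(len(chars) - n + 1))
    return grams


def corpus_statistics(docs, n_values=(1, 2)):
    """Compute four statistics per n-gram (assumed definitions, see lead-in)."""
    ngram_freq = Counter()  # total occurrences across the corpus
    text_freq = Counter()   # number of documents containing the n-gram
    for doc in docs:
        grams = char_ngrams(doc, n_values)
        ngram_freq.update(grams)
        text_freq.update(set(grams))  # count each n-gram once per document
    total = sum(ngram_freq.values())
    n_docs = len(docs)
    return {
        g: {
            "abs_text_freq": text_freq[g],
            "rel_text_freq": text_freq[g] / n_docs,
            "abs_ngram_freq": ngram_freq[g],
            "rel_ngram_freq": ngram_freq[g] / total,
        }
        for g in ngram_freq
    }


def select_features(stats, key, threshold):
    """Keep the n-grams whose chosen statistic reaches the threshold."""
    return sorted(g for g, s in stats.items() if s[key] >= threshold)


docs = ["中文文本", "文本分类"]  # toy corpus, not TanCorp
stats = corpus_statistics(docs)
selected = select_features(stats, "abs_text_freq", 2)
```

Thresholding on a different key (for example `"abs_ngram_freq"`) selects a different feature set, which is the kind of comparison the abstract reports. The classification step with C-SVC is not covered by this sketch.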

Keywords: Chinese text classification; N-gram; Feature selection

Authors: Zhihua Wei, Duoqian Miao, Jean-Hugues Chauchat, Caiming Zhong

Affiliations: Tongji University, Key Laboratory of Embedded System and Service Computing, Ministry of Education, Shangh; Tongji University, Key Laboratory of Embedded System and Service Computing, Ministry of Education, Shangh; Université Lumière Lyon 2, Laboratoire ERIC, 5 avenue Pierre Mendès-France, 69676 Bron Cedex, France

Conference type: International conference

Conference name: The Third International Conference on Rough Sets and Knowledge Technology (RSKT 2008)

Conference location: Chengdu

Conference language: English

Pages: 500-507

Online publication date: 2008-05-17 (date first posted on the Wanfang platform; not the publication date of the paper)
