Conference topic

Feature Selection on Chinese Text Classification Using Character N-Grams

In this paper, we perform Chinese text classification using an n-gram text representation on TanCorp, a new large corpus built specifically for Chinese text classification, with more than 14,000 texts divided into 12 classes. We use different n-gram feature sets (1- and 2-grams, or 1-, 2- and 3-grams) to represent documents. Different feature weights (absolute text frequency, relative text frequency, absolute n-gram frequency and relative n-gram frequency) are compared. The sparseness of the document-by-feature matrices is analyzed in the various cases. We use the C-SVC classifier, an SVM algorithm designed for the multi-class classification task. We perform our experiments on the TANAGRA platform. We found that the feature selection methods based on n-gram frequency (absolute or relative) always give better results and produce denser matrices.
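The representation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation (the experiments were run in TANAGRA), and the function names are hypothetical. It assumes that "absolute n-gram frequency" means the total number of occurrences of an n-gram in the corpus and that "absolute text frequency" means the number of texts containing it. The relative variants are omitted because the abstract does not state how they are normalized.

```python
from collections import Counter


def char_ngrams(text, n_values=(1, 2)):
    """Return the character n-grams of `text` for each length in `n_values`."""
    chars = [c for c in text if not c.isspace()]  # ignore whitespace
    grams = []
    for n in n_values:
        for i in range(len(chars) - n + 1):
            grams.append("".join(chars[i:i + n]))
    return grams


def ngram_statistics(docs, n_values=(1, 2)):
    """Count n-grams per document and over the whole corpus.

    Assumed readings (not confirmed by the abstract):
      ngram_freq[g] = total occurrences of g in the corpus
      text_freq[g]  = number of documents that contain g
    """
    per_doc = [Counter(char_ngrams(d, n_values)) for d in docs]
    ngram_freq = Counter()
    text_freq = Counter()
    for counts in per_doc:
        ngram_freq.update(counts)        # adds occurrence counts
        text_freq.update(counts.keys())  # adds 1 per document
    return per_doc, ngram_freq, text_freq


def matrix_density(per_doc, vocabulary):
    """Fraction of non-zero cells in the document-by-feature matrix."""
    if not per_doc or not vocabulary:
        return 0.0
    nonzero = sum(1 for counts in per_doc for g in vocabulary if counts[g] > 0)
    return nonzero / (len(per_doc) * len(vocabulary))
```

Ranking n-grams by one of the two counters and keeping the top entries as `vocabulary` gives one feature selection per weight, and `matrix_density` then allows the sparseness of the resulting matrices to be compared.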

Chinese text classification; N-gram; Feature selection

Zhihua Wei; Duoqian Miao; Jean-Hugues Chauchat; Caiming Zhong

Tongji University, Key Laboratory of Embedded System and Service Computing, Ministry of Education, Shangh; Tongji University, Key Laboratory of Embedded System and Service Computing, Ministry of Education, Shangh; Université Lumière Lyon 2, Laboratoire ERIC, 5 avenue Pierre Mendès-France, 69676 Bron Cedex, France

International conference

The Third International Conference on Rough Sets and Knowledge Technology (RSKT 2008)

Chengdu

English

500-507

2008-05-17 (date first posted online on the Wanfang platform; not the paper's publication date)