Chinese Text Classification using Key Characters String Kernel

Abstract:

Most Chinese text classification methods are based on Chinese word segmentation and Bag of Words (BOW), so classification performance relies largely on segmentation accuracy. Unfortunately, perfectly precise and unambiguous segmentation cannot be achieved. To address this problem, a novel Chinese text classification method using a string kernel is presented. A string kernel computes the similarity of a pair of documents by comparing the substrings they have in common. Experimental results show that our method greatly improves classification on small training data sets. Although the performance of the traditional string kernel is comparable to that of BOW methods on a larger data set, the dimension of the feature space is so high that the computation is very time-consuming. Our proposed key characters string kernel technique addresses both the efficiency and the effectiveness problems. Experiments on a larger data set show that an SVM with the Key Characters String Kernel can achieve superior performance.
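
To illustrate the idea of comparing documents through shared substrings without word segmentation, the following is a minimal sketch of a p-spectrum kernel, one simple and well-known string kernel. It is not the Key Characters String Kernel proposed in the paper, whose details are not given in this abstract; the function names and the choice of p = 2 are assumptions made for illustration only.

```python
from collections import Counter
import math


def spectrum_features(text, p=2):
    """Count every contiguous substring of length p in text."""
    return Counter(text[i:i + p] for i in range(len(text) - p + 1))


def spectrum_kernel(a, b, p=2):
    """Unnormalized p-spectrum kernel: dot product of substring counts."""
    fa, fb = spectrum_features(a, p), spectrum_features(b, p)
    # Iterate over the smaller feature map; Counter returns 0 for
    # substrings that do not occur in the other document.
    if len(fa) > len(fb):
        fa, fb = fb, fa
    return sum(count * fb[sub] for sub, count in fa.items())


def normalized_kernel(a, b, p=2):
    """Kernel value scaled to [0, 1] so document length matters less."""
    denom = math.sqrt(spectrum_kernel(a, a, p) * spectrum_kernel(b, b, p))
    return spectrum_kernel(a, b, p) / denom if denom else 0.0


if __name__ == "__main__":
    # Two short character strings compared directly, with no segmentation.
    print(normalized_kernel("中文文本分类", "文本分类方法"))
```

Because the feature map contains every length-p substring that occurs in the corpus, its dimension grows quickly with corpus size, which is consistent with the efficiency concern the abstract raises about the traditional string kernel.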

Authors: Shiqiang Zheng, Yujiu Yang, Haiping Wu, Wenhuang Liu

Affiliation: Graduate School at Shenzhen, Tsinghua University, Shenzhen 518055, P.R. China

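Conference type: International conference

Conference name: Fifth International Conference on Semantics, Knowledge and Grid (SKG 2009)

Conference location: Zhuhai

Conference language: English

Pages: 113-119

Online publication date: 2009-10-12 (date first made available on the Wanfang platform; not the paper's publication date)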
