Research of Chinese Text Classification Methods Based on Semantic Vector and Semantic Similarity

Abstract:

To overcome the limitations of traditional bag-of-words text classification and to incorporate linguistic knowledge and conceptual indexing into the text vector space model, we draw on two thesauri, HowNet and Tongyici Cilin (hereinafter Cilin), and describe a document with a semantic vector instead of the traditional keyword vector. The semantic vector is built by merging words with high similarity, so that a single concept, rather than a series of words, describes each semantic feature. This both reduces the feature dimension and adds semantic information to the vector. We also classify texts using sentence (document) similarity based on simple vector distance, and carry out three groups of experiments. The results show that accuracy generally improves with semantic processing, which indicates that semantic mining is important and necessary for text classification.
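
The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `CONCEPT_OF` table is a hypothetical toy stand-in for concept groups derived from HowNet or Cilin, cosine similarity is assumed as the "simple vector distance", and the nearest-neighbour decision rule is likewise an assumption, since the abstract does not specify either.

```python
import math
from collections import Counter

# Hypothetical toy thesaurus: word -> concept label. A real system would
# derive these groups from HowNet or Cilin word-similarity scores.
CONCEPT_OF = {
    "football": "SPORT", "soccer": "SPORT", "match": "SPORT",
    "stock": "FINANCE", "share": "FINANCE", "market": "FINANCE",
}

def semantic_vector(tokens):
    """Count concepts instead of surface words; unknown words keep their own slot."""
    return Counter(CONCEPT_OF.get(t, t) for t in tokens)

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (assumed measure)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def classify(tokens, labelled_docs):
    """Return the label of the most similar labelled document (1-nearest neighbour)."""
    query = semantic_vector(tokens)
    best = max(labelled_docs, key=lambda item: cosine(query, semantic_vector(item[0])))
    return best[1]

# Toy labelled documents, already tokenized.
train = [
    (["football", "match"], "sports"),
    (["stock", "market"], "finance"),
]
label = classify(["soccer", "match"], train)
```

Because "football" and "soccer" are merged into one concept, the query shares a feature with the sports document even though the surface words differ, which is the effect the semantic vector is meant to capture, and the merged vector has fewer dimensions than a keyword vector would.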

Keywords: Text Classification; Semantic Vector; Semantic Similarity; HowNet; Tongyici Cilin

Authors: Xin Song, Jia Huang, Jing-min Zhou, Xi Chen

Affiliations: State Key Laboratory of Software Development Environment, Beihang University, Beijing 100191, China; Institute of Software, Chinese Academy of Sciences, Beijing 100190, China

Conference type: International conference

Conference name: 2009 International Forum on Computer Science-Technology and Applications (IFCSTA 2009)

Conference location: Chongqing

Conference language: English

Pages: 669-672

Online publication date: 2009-12-25 (date first made available on the Wanfang platform; not the paper's publication date)
