Document Representation Combining Concepts and Words in Chinese Text Categorization
Word-based representation is widely used in text categorization, but its performance suffers from problems caused by language variation. In this paper, we investigate a document representation that combines words and concepts to integrate the advantages of both types of representation. For words that are error-prone in word sense disambiguation, the approach takes the part of speech as the concept, which reduces disambiguation mistakes. The approach employs three ways to measure how much each representation form contributes to classification and selects the most productive form as the feature, dropping concepts that are unsuitable for representation without losing lexical semantic information. We conduct experiments comparing the performance of different types of representations on the Fudan University Chinese text categorization corpus. The results confirm the validity of our combination representation.
Text categorization; combination representation; concept-based representation
Chao CHE, HongFei TENG
Dalian University of Technology, Dalian, Liaoning, China
International conference
Dalian
English
1-5
2009-09-24 (date first posted online on the Wanfang platform; not the paper's publication date)