Conference Topic

Document Representation Combining Concepts and Words in Chinese Text Categorization

Word-based representation is widely used in text categorization, but its performance suffers from problems caused by language variation. In this paper, we investigate a document representation that combines words and concepts to integrate the advantages of both. For words that are error-prone in word sense disambiguation, the approach takes the part of speech as the concept, which reduces disambiguation mistakes. It also employs three ways to measure the contribution of each representation form to classification and selects the most productive form as the feature, dropping concepts unsuitable for representation without losing lexical semantic information. We conduct experiments comparing the performance of different types of representation on the Fudan University Chinese text categorization corpus. The results confirm the validity of the combined representation.

Text categorization; combination representation; concept-based representation

Chao CHE, HongFei TENG

Dalian University of Technology, Dalian, Liaoning, China

International conference

International Conference on Natural Language Processing and Knowledge Engineering (IEEE NLP-KE 2009)

Dalian

English

1-5

2009-09-24 (date first posted online on the Wanfang platform; not the paper's publication date)