Conference Topic

RESEARCH ON AUTOMATIC ACQUISITION OF DOMAIN TERMS

To solve various problems in natural language processing more precisely, it is important to build a system for the automatic acquisition of domain terms. This paper presents a method for automatically acquiring domain terms from raw, unsegmented text. The raw domain corpus is first pre-processed. Candidate words are then extracted automatically using information entropy and the log-likelihood ratio, and an open-domain lexicon is used to remove general words so that domain terms are preserved. Finally, a confidence measure is used to remove non-meaningful words from the candidate domain term set, which improves term acquisition accuracy, and the domain-specific lexicon is constructed. Experimental results show that this simple method is efficient in extracting most of the domain terms. The extracted domain terms have been applied effectively in a personalized Chinese word segmentation system.
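The abstract names information entropy and the log-likelihood ratio as the candidate-extraction statistics but does not give their exact formulation. The following is a minimal sketch under stated assumptions: entropy is taken to be the entropy of the characters adjacent to a candidate string (boundary entropy), and the log-likelihood ratio is taken to be Dunning's statistic over a 2x2 contingency table. Both are common choices for unsegmented text and are not necessarily the authors' definitions. The function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def _entropy(counts):
    """Shannon entropy (in bits) of a Counter of observations."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def boundary_entropy(corpus, candidate):
    """Return (left, right) entropy of the characters adjacent to `candidate`.

    Assumption: a high entropy on both sides suggests that the candidate
    occurs in varied contexts and may be a free-standing word.
    """
    left, right = Counter(), Counter()
    start = corpus.find(candidate)
    while start != -1:
        if start > 0:
            left[corpus[start - 1]] += 1
        end = start + len(candidate)
        if end < len(corpus):
            right[corpus[end]] += 1
        start = corpus.find(candidate, start + 1)
    return _entropy(left), _entropy(right)


def log_likelihood_ratio(k11, k12, k21, k22):
    """Dunning-style G^2 statistic for a 2x2 contingency table.

    k11: count of (A followed by B); k12: (A followed by not-B);
    k21: (not-A followed by B);     k22: (not-A followed by not-B).
    Larger values indicate a stronger association between A and B.
    """
    n = k11 + k12 + k21 + k22
    row = (k11 + k12, k21 + k22)
    col = (k11 + k21, k12 + k22)
    cells = (
        (k11, row[0], col[0]),
        (k12, row[0], col[1]),
        (k21, row[1], col[0]),
        (k22, row[1], col[1]),
    )
    # Empty cells contribute nothing to the sum (0 * log 0 is taken as 0).
    return 2.0 * sum(k * math.log(k * n / (r * c)) for k, r, c in cells if k > 0)


if __name__ == "__main__":
    toy_corpus = "abxcabxdabxe"  # hypothetical unsegmented text
    print(boundary_entropy(toy_corpus, "abx"))
    print(log_likelihood_ratio(10, 0, 0, 10))
```

In a pipeline like the one the abstract outlines, candidates scoring above chosen thresholds on such statistics would then be filtered against an open-domain lexicon; the thresholds themselves are not stated in this record.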

Automatic Term Extraction; Domain Terms; Information Entropy; Log-Likelihood Ratio; Natural Language Processing

JUAN LIU; YUAN-CHAO LIU; WEI JIANG; XIAO-LONG WANG

Department of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China

International conference

2008 International Conference on Machine Learning and Cybernetics

Kunming

English

3026-3031

2008-07-12 (date first made available online on the Wanfang platform; not the paper's publication date)