RESEARCH ON AUTOMATIC ACQUISITION OF DOMAIN TERMS
To address natural language processing problems more precisely, it is important to build a system that acquires domain terms automatically. This paper presents a method for automatically acquiring domain terms from raw, unsegmented text. The raw domain corpus is first pre-processed. Candidate words are then extracted automatically using information entropy and the log-likelihood ratio, and an open-domain lexicon is used to remove general words so that only domain terms are retained. Finally, a confidence measure is used to remove non-meaningful words from the candidate domain term set, improving acquisition accuracy, and the domain-specific lexicon is constructed. Experimental results show that this simple method is effective at extracting most of the domain terms. The extracted domain terms have been applied effectively in a personalized Chinese word segmentation system.
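The abstract does not give the formulas used, so the following is only a minimal sketch of the candidate-extraction and filtering steps under common textbook formulations: boundary entropy of the characters adjacent to a candidate string, and Dunning's log-likelihood ratio over a 2x2 contingency table. All function names, and the choice of these particular formulations, are assumptions for illustration and may differ from the authors' method. The confidence-based filtering step is not sketched, because the abstract does not define it.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in nats) of a frequency table."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def boundary_entropy(text, candidate):
    """Return (left, right) entropy of the characters adjacent to candidate.

    High entropy on both sides suggests the string is used freely in many
    contexts, which is a common signal for a word boundary in unsegmented text.
    """
    left, right = Counter(), Counter()
    start = text.find(candidate)
    while start != -1:
        end = start + len(candidate)
        if start > 0:
            left[text[start - 1]] += 1
        if end < len(text):
            right[text[end]] += 1
        start = text.find(candidate, start + 1)
    return entropy(left), entropy(right)


def log_likelihood_ratio(k11, k12, k21, k22):
    """Dunning's G^2 statistic for a 2x2 contingency table of counts.

    k11: both parts occur together, k12/k21: only one part occurs,
    k22: neither occurs. Larger values mean stronger association.
    """
    def h(*ks):
        n = sum(ks)
        return sum(k * math.log(k / n) for k in ks if k > 0)

    return 2.0 * (h(k11, k12, k21, k22)
                  - h(k11 + k12, k21 + k22)
                  - h(k11 + k21, k12 + k22))


def remove_general_words(candidates, general_lexicon):
    """Keep only candidates that are absent from an open-domain lexicon."""
    return [c for c in candidates if c not in general_lexicon]
```

In such a sketch, a candidate would be kept when both boundary entropies and the log-likelihood ratio exceed thresholds chosen for the corpus; the thresholds themselves are not stated in the abstract.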
Automatic Term Extraction; Domain Terms; Information Entropy; Log-Likelihood Ratio; Natural Language Processing
JUAN LIU YUAN-CHAO LIU WEI JIANG XIAO-LONG WANG
Department of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
International conference
2008 International Conference on Machine Learning and Cybernetics
Kunming
English
3026-3031
2008-07-12 (date first made available online on the Wanfang platform; does not indicate the paper's publication date)