RESEARCH ON AUTOMATIC ACQUISITION OF DOMAIN TERMS

Abstract:

To address natural language processing problems more precisely, it is important to build a system that acquires domain terms automatically. This paper presents a method for automatically acquiring domain terms from raw, unsegmented text. The raw domain corpus is first pre-processed. Candidate words are then extracted automatically using information entropy and the log-likelihood ratio, and an open-domain lexicon is used to remove general words so that only domain terms remain. Finally, a confidence measure is used to remove non-meaningful words from the candidate domain term set, which improves term acquisition accuracy, and the domain-specific lexicon is constructed. Experimental results show that this simple method is efficient in extracting most of the domain terms. The extracted domain terms have been applied effectively in a personalized Chinese word segmentation system.
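
Note: the abstract names information entropy and the log-likelihood ratio but gives no formulas. The sketch below is only an illustration of how those two statistics are commonly computed for candidate extraction from unsegmented text, not the paper's implementation. It assumes boundary (left/right neighbor) entropy over characters and Dunning's log-likelihood ratio on a 2x2 contingency table; the function names `branching_entropy` and `log_likelihood_ratio` are hypothetical.

```python
import math
from collections import Counter


def branching_entropy(corpus, candidate):
    """Return (left, right) entropy in bits of the characters adjacent to candidate.

    High entropy on both sides suggests the string occurs in varied contexts,
    which is one common signal that it is a free-standing word.
    """
    left, right = Counter(), Counter()
    start = corpus.find(candidate)
    while start != -1:
        if start > 0:
            left[corpus[start - 1]] += 1
        end = start + len(candidate)
        if end < len(corpus):
            right[corpus[end]] += 1
        start = corpus.find(candidate, start + 1)

    def entropy(counts):
        total = sum(counts.values())
        if total == 0:
            return 0.0
        return -sum(c / total * math.log2(c / total) for c in counts.values())

    return entropy(left), entropy(right)


def log_likelihood_ratio(k11, k12, k21, k22):
    """Dunning's G^2 statistic for a 2x2 contingency table of co-occurrence counts.

    k11: both parts together, k12: first part without the second,
    k21: second part without the first, k22: neither.
    """
    def h(*counts):
        total = sum(counts)
        return sum(c * math.log(c / total) for c in counts if c > 0)

    return 2 * (h(k11, k12, k21, k22)
                - h(k11 + k12, k21 + k22)
                - h(k11 + k21, k12 + k22))
```

Under these assumptions, a candidate string would be kept when both boundary entropies and the association score between its parts exceed chosen thresholds; the thresholds themselves, and the later lexicon and confidence filtering steps, are not specified in this record.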

Keywords: Automatic Term Extraction; Domain Terms; Information Entropy; Log-Likelihood Ratio; Natural Language Processing

Authors: JUAN LIU, YUAN-CHAO LIU, WEI JIANG, XIAO-LONG WANG

Affiliation: Department of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China

Conference type: International conference

Conference name: 2008 International Conference on Machine Learning and Cybernetics

Conference location: Kunming

Conference language: English

Pages: 3026-3031

Online publication date: 2008-07-12 (date first posted on the Wanfang platform; not the paper's publication date)
