NLPCC 2016 Shared Task Chinese Words Similarity Measure via Ensemble Learning based on Multiple Resources

Abstract:

Many Chinese word similarity measures have been introduced, since word similarity is a fundamental issue in various natural language processing tasks. Previous work focused mainly on using existing semantic knowledge bases or large-scale corpora. However, knowledge bases and corpora are limited in coverage and in how quickly their data is updated. Ensemble learning is therefore used to improve performance by combining similarities. This paper describes a Chinese word similarity measure that uses ensemble learning over knowledge-based and corpus-based algorithms. Specifically, the knowledge-based methods rely on TYCCL and Hownet. The two corpus-based methods compute similarities via retrieval from web search engines and via deep learning on large-scale corpora (news and microblog). All similarities are combined through support vector regression to obtain the final similarity. Evaluation suggests that the TYCCL-based method performs best on the test dataset. However, with appropriately tuned parameters, ensemble learning could outperform all the other algorithms. In addition, deep learning on news corpora performs better than the other corpus-based methods.
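The combination step described in the abstract can be illustrated with a minimal sketch. It assumes each word pair already has four similarity scores (TYCCL, Hownet, web retrieval, and embedding-based) and a human rating scaled to [0, 1]. The numbers, the function names (`predict`, `svr_objective`, `fit_linear_svr`), the hyperparameters, and the choice of a linear epsilon-insensitive support vector regression trained by plain subgradient descent are all assumptions made for illustration; the record does not state which kernel, solver, or settings the authors used.

```python
# Toy sketch: combine per-method similarity scores with a linear
# epsilon-insensitive support vector regression, in pure Python.
# All values below are made-up illustrative numbers, not data from the paper.

def predict(w, b, x):
    """Linear model: weighted sum of the similarity scores plus a bias."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def svr_objective(w, b, X, y, C, eps):
    """0.5*||w||^2 + C * sum of epsilon-insensitive absolute errors."""
    reg = 0.5 * sum(wi * wi for wi in w)
    loss = sum(max(0.0, abs(predict(w, b, x) - t) - eps) for x, t in zip(X, y))
    return reg + C * loss

def fit_linear_svr(X, y, C=1.0, eps=0.05, lr=0.001, epochs=2000):
    """Minimise the SVR objective by batch subgradient descent."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = list(w)  # gradient of the 0.5*||w||^2 term
        grad_b = 0.0
        for x, t in zip(X, y):
            err = predict(w, b, x) - t
            if abs(err) > eps:  # only points outside the epsilon tube contribute
                sign = 1.0 if err > 0 else -1.0
                for j in range(n_features):
                    grad_w[j] += C * sign * x[j]
                grad_b += C * sign
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

# Columns (hypothetical): TYCCL, Hownet, web retrieval, embedding similarity.
X = [
    [0.95, 0.90, 0.80, 0.85],
    [0.80, 0.75, 0.70, 0.80],
    [0.20, 0.30, 0.25, 0.30],
    [0.10, 0.15, 0.20, 0.10],
    [0.50, 0.45, 0.60, 0.55],
    [0.60, 0.65, 0.50, 0.60],
]
# Hypothetical human ratings rescaled to [0, 1].
y = [0.9, 0.8, 0.2, 0.1, 0.5, 0.6]

w, b = fit_linear_svr(X, y)
final_similarity = predict(w, b, [0.70, 0.65, 0.60, 0.75])
print("learned weights:", w, "bias:", b)
print("combined similarity for a new pair:", final_similarity)
```

In practice a library implementation with a tuned kernel and regularisation strength would likely replace the hand-written solver; the sketch only shows how the individual scores become features of a single regression model.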

Keywords: Chinese Word Similarity; TYCCL; Hownet; Deep Learning; Support Vector Regression

Authors: Shutian Ma, Xiaoyong Zhang, Chengzhi Zhang

Affiliation: Department of Information Management, Nanjing University of Science and Technology, Nanjing, China 2100

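Conference type: International conference

Conference name: The 5th Conference on Natural Language Processing and Chinese Computing (NLPCC-ICCPOL2016)

Conference location: Kunming

Conference language: English

Pages: 1-8

Online publication date: 2016-12-02 (date first posted on the Wanfang platform; does not represent the paper's publication date)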

Conference track:

NLPCC 2016 Shared Task Chinese Words Similarity Measure via Ensemble Learning based on Multiple Resources