NLPCC 2016 Shared Task: Chinese Words Similarity Measure via Ensemble Learning Based on Multiple Resources
Many algorithms for measuring Chinese word similarity have been introduced, since it is a fundamental issue in various natural language processing tasks. Previous work focused mainly on using existing semantic knowledge bases or large-scale corpora. However, knowledge bases and corpora are both limited in coverage and in how quickly they are updated. Ensemble learning is therefore used to improve performance by combining similarities. This paper describes a Chinese word similarity measure that uses ensemble learning over knowledge-based and corpus-based algorithms. Specifically, the knowledge-based methods are based on TYCCL and Hownet. The two corpus-based methods compute similarities by retrieving results from web search engines and by deep learning on large-scale corpora (news and microblog). All similarities are combined through support vector regression to obtain the final similarity. Evaluation on the testing dataset suggests that the TYCCL-based method performs best. However, with appropriately tuned parameters, ensemble learning could outperform all the other algorithms. In addition, deep learning on news corpora performs better than the other corpus-based methods.
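The combination step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the kernel, parameters, or feature layout, so this assumes a linear support vector regression trained by plain subgradient descent on the epsilon-insensitive loss. The feature order (TYCCL, Hownet, web search, embedding), the toy scores, and the [0, 1] target scale are all made up for illustration.

```python
# Sketch only: linear SVR (epsilon-insensitive loss + L2 penalty) trained by
# subgradient descent. All data and hyperparameters below are hypothetical.

def train_linear_svr(X, y, epsilon=0.05, lam=0.001, lr=0.01, epochs=5000):
    """Fit weights w and bias b so that |w.x + b - y| <= epsilon where possible."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    b = 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the L2 penalty
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sum(wj * xj for wj, xj in zip(w, xi)) + b - yi
            if abs(err) > epsilon:  # only points outside the tube contribute
                sign = 1.0 if err > 0 else -1.0
                for j in range(d):
                    grad_w[j] += sign * xi[j] / n
                grad_b += sign / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Combine the individual similarity scores into one final score."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


# Each row: hypothetical [TYCCL, Hownet, web search, embedding] similarities
# for one word pair; y holds hypothetical human ratings scaled to [0, 1].
X = [
    [0.90, 0.85, 0.80, 0.88],
    [0.75, 0.70, 0.60, 0.72],
    [0.50, 0.55, 0.45, 0.40],
    [0.30, 0.20, 0.35, 0.25],
    [0.10, 0.15, 0.05, 0.12],
]
y = [0.90, 0.70, 0.50, 0.25, 0.10]

w, b = train_linear_svr(X, y)

# Score an unseen (hypothetical) word pair from its four component similarities.
final_similarity = predict(w, b, [0.60, 0.65, 0.50, 0.55])
print(final_similarity)
```

In practice a library implementation with a tuned kernel and regularization would replace the hand-written trainer; the sketch only shows how the per-method similarities act as input features for a single regressor.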
Chinese Word Similarity; TYCCL; Hownet; Deep Learning; Support Vector Regression
Shutian Ma, Xiaoyong Zhang, Chengzhi Zhang
Department of Information Management, Nanjing University of Science and Technology, Nanjing, China 2100
International conference
The 5th Conference on Natural Language Processing and Chinese Computing (NLPCC-ICCPOL 2016)
Kunming
English
1-8
2016-12-02 (date first posted online on the Wanfang platform; not the paper's publication date)