How Unsupervised Learning Affects Character Tagging based Chinese Word Segmentation: A Quantitative Investigation
Integrating global information from unsupervised segmentation into Conditional Random Fields (CRF) learning has proven effective in enhancing the performance of character tagging based Chinese word segmentation. By comparing CRF models with and without unsupervised learning enhancement, we investigate how unsupervised learning affects performance. In particular, two kinds of segmented words, in-vocabulary and out-of-vocabulary words, are analyzed separately, case by case, to see which of those words are affected by unsupervised learning. In addition, the cost of the additional features derived from unsupervised segmentation is also taken into account and evaluated.
Unsupervised learning; Chinese word segmentation; in-vocabulary words; out-of-vocabulary words; frequent substring extraction
YAN SONG, CHUNYU KIT, RUIFENG XU, HAI ZHAO
Department of Chinese, Translation and Linguistics, City University of Hong Kong, 83 Tat Chee Ave., Kowloon, Hong Kong
International conference
2009 International Conference on Machine Learning and Cybernetics
Baoding
English
3481-3486
2009-07-12 (date first posted online on the Wanfang platform; not the paper's publication date)