Semi-supervised Learning for Mongolian Morphological Segmentation
Unlike previous Mongolian morphological segmentation methods, which rely on large labeled training data or on complicated rules compiled by linguists, we explore a novel semi-supervised method for a practical application, i.e., statistical machine translation (SMT), in a low-resource learning setting where a small amount of labeled data and a large amount of unlabeled data are available. First, CRF-based supervised learning is used to predict morpheme boundaries from the small labeled data set. Then, a lexicon-based segmentation model, which uses the small labeled data set as heuristic information, compensates for the weaknesses of the first step by exploiting the abundant unlabeled data. Finally, we present several error correction models to revise the segmentation results. Experimental results show that our method improves segmentation results compared with purely supervised learning. In addition, we integrate the morphological segmentation results into Chinese-Mongolian SMT and achieve satisfactory performance compared with the baseline.
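The first two steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the B/I tag scheme, the `min_stem` parameter, the greedy longest-suffix strategy, and the transliterated toy word and suffix lexicon are all assumptions made for illustration. The sketch shows how a gold segmentation could be turned into per-character boundary labels (the kind of target a CRF tagger would be trained on) and how a suffix lexicon might segment a word not covered by the labeled data.

```python
def boundary_labels(morphemes):
    """Map non-empty morphemes to per-character tags: B starts a morpheme, I continues it."""
    labels = []
    for morpheme in morphemes:
        labels.extend(["B"] + ["I"] * (len(morpheme) - 1))
    return labels


def labels_to_segments(word, labels):
    """Invert boundary_labels: rebuild morphemes from a word and its B/I tags."""
    segments = []
    for char, tag in zip(word, labels):
        if tag == "B" or not segments:
            segments.append(char)   # open a new morpheme
        else:
            segments[-1] += char    # extend the current morpheme
    return segments


def lexicon_segment(word, suffixes, min_stem=2):
    """Greedy longest-suffix stripping against a (hypothetical) suffix lexicon.

    Stops when no suffix matches or when stripping would leave a stem
    shorter than min_stem characters.
    """
    ordered = sorted((s for s in suffixes if s), key=len, reverse=True)
    stem, parts = word, []
    stripped = True
    while stripped:
        stripped = False
        for suffix in ordered:
            if stem.endswith(suffix) and len(stem) - len(suffix) >= min_stem:
                parts.insert(0, suffix)
                stem = stem[: -len(suffix)]
                stripped = True
                break
    return [stem] + parts


# Toy, transliterated example; the word and suffixes are illustrative only.
gold = ["nom", "uud", "aas"]
tags = boundary_labels(gold)
print(tags)
print(labels_to_segments("nomuudaas", tags))
print(lexicon_segment("nomuudaas", {"uud", "aas"}))
```

In a real system the B/I tags would be predicted by a trained sequence model rather than derived from gold data, and the suffix lexicon would presumably be induced from the labeled and unlabeled corpora; the abstract does not specify either detail.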
Semi-supervised learning; Morphological segmentation; Statistical machine translation; Low-resource language
Zhenxin Yang, Miao Li, Lei Chen, Weihui Zeng, Yi Gao, Sha Fu
Institute of Intelligent Machines, Chinese Academy of Sciences, Hefei 230031, China; University of Scien Institute of Intelligent Machines, Chinese Academy of Sciences, Hefei 230031, China; Yunnan Agricultural Expert System Leading Group Office, Kunming 650000, China
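Domestic conference
The 15th China National Conference on Computational Linguistics (CCL2016) and the 4th International Symposium on Natural Language Processing Based on Naturally Annotated Big Data (NLP-NABD-2016)
Yantai
English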
1-10
2016-10-14 (date first made available online on the Wanfang platform; does not represent the paper's publication date)