Semi-supervised Learning for Mongolian Morphological Segmentation

Abstract:

Unlike previous Mongolian morphological segmentation methods, which rely on large labeled training data or on complicated rules formulated by linguists, we explore a novel semi-supervised method for a practical application, i.e., statistical machine translation (SMT), in a low-resource learning setting in which a small amount of labeled data and a large amount of unlabeled data are available. First, CRF-based supervised learning is used to predict morpheme boundaries from the small labeled data. Then, a lexicon-based segmentation model, with the small labeled data as heuristic information, is used to compensate for the weakness of the first step by exploiting the abundant unlabeled data. Finally, we present several error correction models to revise the segmentation results. Experimental results show that our method improves the segmentation results compared with purely supervised learning. In addition, we integrate the morphological segmentation results into Chinese-Mongolian SMT and achieve satisfactory performance compared with the baseline.
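The abstract does not say how morpheme boundaries are encoded for the CRF. As a rough illustration only, the sketch below assumes the common practice of casting segmentation as character-level sequence labeling with a B/M/E/S tag scheme (begin, middle, end, single). The tag scheme, the function names, and the toy English word are all assumptions for illustration, not details taken from the paper.

```python
# Hypothetical illustration: the B/M/E/S scheme and these helper names are
# assumptions, not the paper's stated encoding.

def morphemes_to_tags(morphemes):
    """Map a list of morphemes to per-character B/M/E/S boundary tags."""
    tags = []
    for m in morphemes:
        if len(m) == 1:
            tags.append("S")  # single-character morpheme
        else:
            # first character, any interior characters, last character
            tags.extend(["B"] + ["M"] * (len(m) - 2) + ["E"])
    return tags


def tags_to_morphemes(word, tags):
    """Rebuild morphemes from a word and its (predicted) boundary tags."""
    if len(word) != len(tags):
        raise ValueError("word and tags must have the same length")
    morphemes, current = [], ""
    for ch, tag in zip(word, tags):
        current += ch
        if tag in ("E", "S"):  # a morpheme ends at this character
            morphemes.append(current)
            current = ""
    if current:  # tolerate a tag sequence that never closes the last morpheme
        morphemes.append(current)
    return morphemes


# Toy example (English, not Mongolian), used only to show the round trip.
tags = morphemes_to_tags(["walk", "ing"])
segments = tags_to_morphemes("walking", tags)
```

Under this encoding, a sequence labeler such as a CRF would be trained to predict one tag per character, and the predicted tags would be decoded back into morphemes with `tags_to_morphemes`.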

Keywords: Semi-supervised learning; Morphological segmentation; Statistical machine translation; Low-resource language

Authors: Zhenxin Yang, Miao Li, Lei Chen, Weihui Zeng, Yi Gao, Sha Fu

Affiliations: Institute of Intelligent Machines, Chinese Academy of Sciences, Hefei 230031, China; University of Scien; Institute of Intelligent Machines, Chinese Academy of Sciences, Hefei 230031, China; Yunnan Agricultural Expert System Leading Group Office, Kunming 650000, China

Conference type: Domestic conference

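Conference name: The 15th China National Conference on Computational Linguistics (CCL2016) and the 4th International Symposium on Natural Language Processing Based on Naturally Annotated Big Data (NLP-NABD-2016)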

Conference location: Yantai

Conference language: English

Pages: 1-10

Online publication date: 2016-10-14 (date first posted on the Wanfang platform; not the paper's publication date)