A Novel Approach to Improve the Mongolian Language Model using Intermediate Characters

Abstract:

In Mongolian, many words share the same presentation form but are in fact different words with different codes. Because typists usually input words according to their presentation forms and sometimes cannot distinguish the codes, Mongolian corpora contain many coding errors, which make statistics and retrieval on such corpora very difficult. To solve this problem, this paper proposes a method that merges words with the same presentation form by means of Intermediate characters, then uses the corpus in Intermediate-character form to build the Mongolian language model. Experimental results show that the proposed method reduces the perplexity and the word error rate of the 3-gram language model by 41% and 30%, respectively, compared with a model trained on the unprocessed corpus. The proposed approach significantly improves the performance of the Mongolian language model and greatly enhances the accuracy of Mongolian speech recognition.
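The merging step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the mapping table `TO_INTERMEDIATE`, the Latin stand-in symbols, and the function names are all hypothetical, chosen only to show how words with different codes but the same presentation form collapse into one token before 3-gram counts are collected.

```python
from collections import Counter

# Hypothetical mapping: code symbols assumed to render identically are sent
# to one shared intermediate symbol. Latin letters stand in for Mongolian codes.
TO_INTERMEDIATE = {"o": "W", "u": "W", "t": "D", "d": "D"}


def to_intermediate(word):
    """Rewrite a word symbol by symbol into its intermediate-character form."""
    return "".join(TO_INTERMEDIATE.get(ch, ch) for ch in word)


def trigram_counts(sentences):
    """Count 3-grams over sentences after converting each word."""
    counts = Counter()
    for sentence in sentences:
        words = [to_intermediate(w) for w in sentence.split()]
        tokens = ["<s>", "<s>"] + words + ["</s>"]
        for i in range(len(tokens) - 2):
            counts[tuple(tokens[i:i + 3])] += 1
    return counts


# Toy corpus: the two sentences are assumed to look identical when rendered
# but were typed with different codes.
corpus = ["bol ta", "bul da"]
raw_vocab = {w for s in corpus for w in s.split()}
merged_vocab = {to_intermediate(w) for w in raw_vocab}
counts = trigram_counts(corpus)
```

In this toy corpus the four raw word types collapse into two merged types, so each 3-gram is observed twice instead of once. Less fragmented counts of this kind are one plausible reason a model trained on the converted corpus could show lower perplexity.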

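Keywords: Mongolian language; Intermediate characters; N-gram language model; Speech recognition

Authors: Xiaofei Yan, Feilong Bao, Hongxi Wei, Xiangdong Su

Affiliation: College of Computer Science, Inner Mongolia University, Hohhot 010021, China

Conference type: Domestic conference

Conference name: The 15th China National Conference on Computational Linguistics (CCL 2016) and the 4th International Symposium on Natural Language Processing Based on Naturally Annotated Big Data (NLP-NABD 2016)

Conference location: Yantai

Conference language: English

Pages: 1-12

Online publication date: 2016-10-14 (date of first posting on the Wanfang platform; it does not represent the paper's publication date)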
