Harvest Uyghur-Chinese Aligned-Sentences Bitexts from Multilingual Sites Based on Word Embedding

Abstract:

Obtaining bilingual parallel data from multilingual websites is a long-standing research problem, and such data is highly beneficial for resource-scarce languages. In this paper, we present an approach for obtaining parallel data based on word embedding; our model relies only on a small-scale bilingual lexicon. Our approach benefits from recent advances in continuous word representations, which can reveal more context information than traditional methods. Our experiments show that high-precision and sizable parallel Uyghur-Chinese data can be obtained even when only a limited bilingual lexicon is available.

Keywords: bilingual parallel data; word embedding; resource-scarce languages

Authors: ShaoLin Zhu, Xiao Li, YaTing Yang, Lei Wang, ChengGang Mi

Affiliations: University of Chinese Academy of Sciences, Beijing, China; The Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences, Urumqi, China

Conference type: Domestic conference

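Conference name: The 16th China National Conference on Computational Linguistics and the 5th International Symposium on Natural Language Processing Based on Naturally Annotated Big Data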

Conference location: Nanjing

Conference language: English

Pages: 1-12

Online publication date: 2017-10-13 (date first posted on the Wanfang platform; does not indicate the paper's publication date)
