An Efficient Framework to Extract Parallel Units from Comparable Data

Abstract:

Since the quality of statistical machine translation (SMT) depends heavily on the size and quality of the training data, many approaches have been proposed for automatically mining bilingual text from comparable corpora. However, existing solutions are restricted to extracting either bilingual sentences or sub-sentential fragments. Instead, we present an efficient framework that extracts both sentential and sub-sentential units. At the sentential level, we treat parallel sentence identification as a classification problem and extract more representative and effective features. At the sub-sentential level, we borrow the idea of phrase-table acquisition in SMT to extract parallel fragments. A novel word alignment model is designed specifically for comparable sentence pairs, and parallel fragments are extracted based on this word alignment. We integrate the extraction tasks at the two levels into a unified framework. Experimental results on SMT show that the baseline SMT system achieves significant improvement when the additionally mined knowledge is added.

Keywords: statistical machine translation; comparable corpora; two-level parallel units extraction; parallel sentences; parallel sub-sentential fragments

Authors: Lu Xiang, Yu Zhou, Chengqing Zong

Affiliation: NLPR, Institute of Automation, Chinese Academy of Sciences, Beijing, China

Conference type: International conference

Conference name: Second CCF Conference, NLPCC 2013 (2nd Conference on Natural Language Processing and Chinese Computing)

Conference location: Chongqing

Conference language: English

Pages: 151-163

Online publication date: 2013-11-15 (date first posted on the Wanfang platform; not the paper's publication date)
