An Efficient Framework to Extract Parallel Units from Comparable Data
Since the quality of statistical machine translation (SMT) depends heavily on the size and quality of the training data, many approaches have been proposed for automatically mining bilingual text from comparable corpora. However, existing solutions are restricted to extracting either bilingual sentences or sub-sentential fragments. Instead, we present an efficient framework that extracts both sentential and sub-sentential units. At the sentential level, we treat parallel sentence identification as a classification problem and extract more representative and effective features. At the sub-sentential level, we draw on the idea of phrase table acquisition in SMT to extract parallel fragments. A novel word alignment model is designed specifically for comparable sentence pairs, and parallel fragments are extracted based on this word alignment. We integrate the two levels of extraction into a unified framework. Experimental results on SMT show that the baseline SMT system achieves significant improvement when this additionally mined knowledge is added.
statistical machine translation; comparable corpora; two-level parallel units extraction; parallel sentences; parallel sub-sentential fragments
Lu Xiang, Yu Zhou, Chengqing Zong
NLPR, Institute of Automation, Chinese Academy of Sciences, Beijing, China
International conference
Second CCF Conference, NLPCC 2013 (Second Conference on Natural Language Processing and Chinese Computing)
Chongqing
English
151-163
2013-11-15 (date first posted online on the Wanfang platform; not the paper's publication date)