Pseudo In-Domain Data Selection from Large-Scale Web Corpus for Spoken Language Translation

Abstract:

This paper explores efficient domain adaptation for statistical machine translation, based on extracting sentence pairs (pseudo in-domain subcorpora, those most relevant to the in-domain corpora) from a large-scale general-domain bilingual web corpus. These sentences are selected by our proposed unsupervised phrase-based data selection model. Compared with traditional bag-of-words models, our phrase-based data selection model is more effective because it captures contextual information by modeling the selection of a phrase as a whole, rather than the selection of single words in isolation. The pseudo in-domain subcorpora can then be used to train a small domain-adapted spoken language translation system that outperforms the system trained on the entire corpus, with an increase of 1.6 BLEU points. Performance improves further when the pseudo in-domain corpus/model is combined with the true in-domain corpus/model, with increases of 4.5 and 3.9 BLEU points over the single in-domain and general-domain baseline systems, respectively.

Keywords: domain adaptation; phrase-based data selection; pseudo in-domain subcorpora; spoken language translation

Authors: Shixiang Lu, Xingyuan Peng, Zhenbiao Chen, Bo Xu

Affiliation: Interactive Digital Media Technology Research Center (IDMTech), Institute of Automation, Chinese Academy of Sciences, Beijing, China

Conference type: International conference

Conference name: Second CCF Conference, NLPCC2013 (2nd Conference on Natural Language Processing and Chinese Computing)

Conference location: Chongqing

Conference language: English

Pages: 116-126

Online publication date: 2013-11-15 (date first posted on the Wanfang platform; not the paper's publication date)
