Conference paper

Pseudo In-Domain Data Selection from Large-Scale Web Corpus for Spoken Language Translation

  This paper explores efficient domain adaptation for statistical machine translation, based on extracting sentence pairs (pseudo in-domain subcorpora that are most relevant to the in-domain corpora) from a large-scale general-domain web bilingual corpus. These sentences are selected by our proposed unsupervised phrase-based data selection model. Compared with traditional bag-of-words models, our phrase-based data selection model is more effective because it captures contextual information by modeling the selection of a phrase as a whole, rather than the selection of single words in isolation. These pseudo in-domain subcorpora can then be used to train a small domain-adapted spoken language translation system, which outperforms the system trained on the entire corpus by 1.6 BLEU points. Performance improves further when we use the pseudo in-domain corpus/model in combination with the true in-domain corpus/model, with increases of 4.5 and 3.9 BLEU points over the single in-domain and general-domain baseline systems, respectively.

domain adaptation; phrase-based data selection; pseudo in-domain subcorpora; spoken language translation

Shixiang Lu, Xingyuan Peng, Zhenbiao Chen, Bo Xu

Interactive Digital Media Technology Research Center (IDMTech), Institute of Automation, Chinese Academy of Sciences, Beijing, China

International conference

Second CCF Conference, NLPCC 2013 (Second Conference on Natural Language Processing and Chinese Computing)

Chongqing

English

116-126

2013-11-15 (date first posted online on the Wanfang platform; does not represent the paper's publication date)