Conference topic

Collective Corpus Weighting and Phrase Scoring for SMT Using Graph-Based Random Walk

  Data quality is one of the key factors in Statistical Machine Translation (SMT). Previous research addressed the data quality problem in SMT by corpus weighting or phrase scoring, but these two types of methods were often investigated independently. To leverage the dependencies between them, we propose an intuitive approach to improve translation modeling by collective corpus weighting and phrase scoring. The method uses the mutual reinforcement between the sentence pairs and the extracted phrase pairs, based on the observation that better sentence pairs often lead to better phrase extraction and vice versa. An effective graph-based random walk is designed to estimate the quality of sentence pairs and phrase pairs simultaneously. Extensive experimental results show that our method improves performance significantly and consistently in several Chinese-to-English translation tasks.

data quality; corpus weighting; phrase scoring; graph-based random walk

Lei Cui, Dongdong Zhang, Shujie Liu, Mu Li, Ming Zhou

School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China; Microsoft Research Asia, Beijing, China

International conference

Second CCF Conference, NLPCC 2013 (Second Conference on Natural Language Processing and Chinese Computing)

Chongqing

English

176-187

2013-11-15 (date first posted online on the Wanfang platform; not the paper's publication date)