Collective Corpus Weighting and Phrase Scoring for SMT Using Graph-Based Random Walk

Abstract:

Data quality is one of the key factors in Statistical Machine Translation (SMT). Previous research addressed the data quality problem in SMT by corpus weighting or phrase scoring, but these two types of methods were often investigated independently. To leverage the dependencies between them, we propose an intuitive approach to improve translation modeling by collective corpus weighting and phrase scoring. The method uses the mutual reinforcement between the sentence pairs and the extracted phrase pairs, based on the observation that better sentence pairs often lead to better phrase extraction and vice versa. An effective graph-based random walk is designed to estimate the quality of sentence pairs and phrase pairs simultaneously. Extensive experimental results show that our method improves performance significantly and consistently in several Chinese-to-English translation tasks.

Keywords: data quality; corpus weighting; phrase scoring; graph-based random walk

Authors: Lei Cui, Dongdong Zhang, Shujie Liu, Mu Li, Ming Zhou

Affiliations: School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China; Microsoft Research Asia, Beijing, China

Conference type: International conference

Conference name: Second CCF Conference, NLPCC 2013 (Second Conference on Natural Language Processing and Chinese Computing)

Conference location: Chongqing

Conference language: English

Pages: 176-187

Online publication date: 2013-11-15 (date first posted on the Wanfang platform; does not represent the paper's publication date)
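
Illustrative note: the abstract describes mutual reinforcement between sentence pairs and the phrase pairs extracted from them. This record does not give the paper's actual graph construction, edge weights, or update rule, so the following is only a minimal sketch of the general idea under stated assumptions: a bipartite graph whose edges link each sentence pair to the phrase pairs extracted from it, uniform starting scores, and a simple normalized propagation step. The function name `mutual_reinforcement`, the `edges` format, and the toy data are all hypothetical.

```python
def mutual_reinforcement(edges, num_sentences, num_phrases,
                         iterations=100, tol=1e-12):
    """Score sentence pairs and phrase pairs jointly on a bipartite graph.

    edges maps (sentence_index, phrase_index) -> non-negative weight.
    The weights are an assumption of this sketch, not the paper's definition.
    """
    sent = [1.0 / num_sentences] * num_sentences
    phr = [1.0 / num_phrases] * num_phrases

    for _ in range(iterations):
        # Phrase pairs collect score from the sentence pairs they come from.
        new_phr = [0.0] * num_phrases
        for (i, j), w in edges.items():
            new_phr[j] += w * sent[i]
        total = sum(new_phr) or 1.0
        new_phr = [x / total for x in new_phr]

        # Sentence pairs collect score from the phrase pairs they contain.
        new_sent = [0.0] * num_sentences
        for (i, j), w in edges.items():
            new_sent[i] += w * new_phr[j]
        total = sum(new_sent) or 1.0
        new_sent = [x / total for x in new_sent]

        # Stop once both score vectors have stopped changing.
        delta = max(abs(a - b) for a, b in zip(sent, new_sent))
        delta += max(abs(a - b) for a, b in zip(phr, new_phr))
        sent, phr = new_sent, new_phr
        if delta < tol:
            break

    return sent, phr


# Toy graph: sentence pairs 0 and 1 share phrase pairs 0 and 1;
# sentence pair 2 contains phrase pair 1 plus a phrase pair seen nowhere else.
toy_edges = {
    (0, 0): 1.0, (0, 1): 1.0,
    (1, 0): 1.0, (1, 1): 1.0,
    (2, 1): 1.0, (2, 2): 1.0,
}
sentence_scores, phrase_scores = mutual_reinforcement(toy_edges, 3, 3)
```

In this toy graph, the sentence pair whose second phrase pair appears nowhere else ends up with a lower score than the two sentence pairs that corroborate each other, and that isolated phrase pair scores lowest. This is the qualitative behavior the abstract's observation suggests, not a reproduction of the paper's results.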
