Exploring Multiple Chinese Word Segmentation Results Based on Linear Model

Abstract:

In the process of developing a domain-specific Chinese-English machine translation system, the accuracy of Chinese word segmentation on large amounts of training text often decreases because of unknown words. The lack of domain-specific annotated corpora makes supervised learning approaches unable to adapt to a target domain. This problem results in many errors in translation knowledge extraction and therefore seriously lowers translation quality. To solve the domain adaptation problem, we implement Chinese word segmentation in two ways: by exploring n-gram statistical features in a large raw Chinese corpus, and by bilingually motivated Chinese word segmentation. Moreover, we propose a method of combining multiple Chinese word segmentation results based on a linear model to augment domain adaptation. For evaluation, we conduct experiments on Chinese word segmentation and Chinese-English machine translation using the data of the NTCIR-10 Chinese-English patent task. The experimental results show that the proposed method achieves improvements in both the F-measure of Chinese word segmentation and the BLEU score of the Chinese-English statistical machine translation system.

Keywords: Chinese Word Segmentation; Domain Adaptation; Bilingual Motivation; Linear Model; Machine Translation

Authors: Chen Su, Yujie Zhang, Zhen Guo, Jinan Xu

Affiliation: School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China

Conference type: International conference

Conference name: Second CCF Conference, NLPCC 2013 (Second Conference on Natural Language Processing and Chinese Computing)

Conference location: Chongqing

Conference language: English

Pages: 50-59

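Online publication date: 2013-11-15 (date first made available on the Wanfang platform; this does not represent the paper's publication date)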
