Exploring Multiple Chinese Word Segmentation Results Based on Linear Model
When developing a domain-specific Chinese-English machine translation system, the accuracy of Chinese word segmentation on large amounts of training text often decreases because of unknown words. The lack of domain-specific annotated corpora prevents supervised learning approaches from adapting to a target domain. This problem causes many errors in translation knowledge extraction and therefore seriously lowers translation quality. To address the domain adaptation problem, we implement Chinese word segmentation in two ways: by exploring n-gram statistical features in a large Chinese raw corpus, and by bilingually motivated Chinese word segmentation. Moreover, we propose a method that combines multiple Chinese word segmentation results using a linear model to improve domain adaptation. For evaluation, we conduct Chinese word segmentation and Chinese-English machine translation experiments using the data of the NTCIR-10 Chinese-English patent task. The experimental results show that the proposed method improves both the F-measure of Chinese word segmentation and the BLEU score of the Chinese-English statistical machine translation system.
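The abstract does not spell out how the linear model combines segmentation results, so the following is only an illustrative sketch under stated assumptions, not the authors' method. It assumes each segmenter votes on every character boundary, that the votes are summed with per-segmenter weights, and that a boundary is kept when the normalized score reaches a threshold. The names `boundaries` and `combine`, the weights, and the threshold are all hypothetical.

```python
from typing import List, Sequence, Set


def boundaries(words: Sequence[str]) -> Set[int]:
    """Return the character offsets where a word ends (sentence end excluded)."""
    cuts: Set[int] = set()
    pos = 0
    for word in words[:-1]:
        pos += len(word)
        cuts.add(pos)
    return cuts


def combine(
    segmentations: Sequence[Sequence[str]],
    weights: Sequence[float],
    threshold: float = 0.5,
) -> List[str]:
    """Combine several segmentations of one sentence by weighted boundary voting.

    Assumption: this is a simple linear vote over boundaries, chosen for
    illustration; the paper's actual features and weights are not given here.
    """
    if not segmentations or len(segmentations) != len(weights):
        raise ValueError("need one weight per segmentation")
    text = "".join(segmentations[0])
    if any("".join(seg) != text for seg in segmentations):
        raise ValueError("all segmentations must cover the same sentence")

    total = sum(weights)
    cut_sets = [boundaries(seg) for seg in segmentations]

    words: List[str] = []
    start = 0
    for pos in range(1, len(text)):
        # Linear score: weighted share of segmenters that place a cut here.
        score = sum(w for w, cuts in zip(weights, cut_sets) if pos in cuts) / total
        if score >= threshold:
            words.append(text[start:pos])
            start = pos
    words.append(text[start:])
    return words


if __name__ == "__main__":
    candidates = [
        ["机器", "翻译", "系统"],  # e.g. output of an n-gram based segmenter
        ["机器翻译", "系统"],      # e.g. output of a bilingually motivated segmenter
        ["机器", "翻译系统"],      # e.g. output of a general-domain segmenter
    ]
    print(combine(candidates, weights=[1.0, 3.0, 1.0]))
```

In a setting like the one the abstract describes, the weights would presumably be tuned on held-out data rather than set by hand as they are here.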
Chinese Word Segmentation; Domain Adaptation; Bilingual Motivation; Linear Model; Machine Translation
Chen Su, Yujie Zhang, Zhen Guo, Jinan Xu
School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China
International conference
Second CCF Conference, NLPCC 2013 (2nd Conference on Natural Language Processing and Chinese Computing)
Chongqing
English
50-59
2013-11-15 (date first posted online on the Wanfang platform; not the paper's publication date)