Construction of Chinese Conversational Corpora for Spontaneous Speech Recognition and Comparative Study on the Trilingual Parallel Corpora

Abstract:

In this paper, we describe the development of the segmented and POS-tagged Chinese conversational corpora currently used in the NICT/ATR speech-to-speech translation system. More than 500K manually checked utterances yield Chinese corpora of 3.5M words. As far as we know, these are the largest conversational text corpora in the travel domain. Together with the corresponding Japanese and English texts from which the Chinese was translated, they form a set of three parallel corpora. Based on these parallel corpora, we investigate the statistics of each language and the performance of language models and speech recognition, and identify the differences among the languages. Problems with the present Chinese corpora, and their solutions, are also analyzed and discussed.

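Authors: Xinhui Hu, Ryosuke Isotani, Satoshi Nakamura

Affiliation: National Institute of Information and Communications Technology, Japan

Conference type: International conference

Conference name: 2009 Oriental COCOSDA International Conference on Speech Database and Assessments (2009 国际语音交互标准数据评估技术大会)

Conference location: Beijing

Conference language: English

Pages: 56-59

Online publication date: 2009-08-10 (the date the record first went online on the Wanfang platform, not the paper's publication date)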
