Conference Topic

Unsupervised Approach of Data Selection for Language Model Adaptation using Generalized Word Posterior Probability

This paper reports an unsupervised approach to data selection for language model adaptation, used to improve spontaneous speech recognition in a speech-to-speech translation (S2ST) system. The approach is characterized by the following: (1) it obtains speech data in the travel domain from a real environment (sightseeing sites); (2) it uses the recognition results of the collected speech for language model adaptation; (3) it applies generalized word posterior probability (GWPP), computed over the N-best recognition hypotheses, as the basis of an utterance confidence measure for selecting adaptation utterances; and (4) it adds a collected proper noun lexicon to the baseline language model in the form of zeroton events, so that the system can recognize new proper nouns not previously contained in the recognition lexicon. In experiments on a Chinese speech test set collected from field experiments at five sightseeing areas in Japan, the adapted language model achieved an average absolute reduction of 7.6% in character error rate (CER) relative to the baseline language model. This reduction is over 77% of the 9.8% reduction obtained by supervised adaptation. By manually correcting a small number of utterances that were not selected because of their low confidence, and adding them to the adaptation data, nearly 83% of the reduction achieved by the supervised method can be obtained. The proposed approach effectively improves utterance selection, especially for utterances containing proper nouns, and is expected to reduce the cost of manual transcription.
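The selection step in item (3) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how word posteriors are combined into an utterance confidence, so the arithmetic mean used here is an assumption, as are the scaling weights `alpha` and `beta`, the threshold value, and all function names. Time alignment between hypotheses is ignored for brevity, so a word counts as supported by any hypothesis that contains it.

```python
import math

def hypothesis_weights(nbest, alpha=1.0, beta=1.0):
    """Normalize scaled N-best scores into hypothesis posteriors.

    nbest: list of (words, acoustic_logprob, lm_logprob), best first.
    alpha, beta: assumed scaling weights for the two scores.
    """
    scores = [alpha * am + beta * lm for _, am, lm in nbest]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def word_posteriors(nbest, alpha=1.0, beta=1.0):
    """Posterior of each 1-best word, summed over hypotheses containing it.

    Simplification: word identity only, no time-overlap check.
    """
    weights = hypothesis_weights(nbest, alpha, beta)
    best_words = nbest[0][0]
    return [
        sum(w for (words, _, _), w in zip(nbest, weights) if word in words)
        for word in best_words
    ]

def utterance_confidence(posteriors):
    """Assumed utterance-level measure: mean of the word posteriors."""
    if not posteriors:
        return 0.0
    return sum(posteriors) / len(posteriors)

def select_utterances(nbest_by_utt, threshold=0.7):
    """Return ids of utterances whose confidence reaches the threshold."""
    selected = []
    for utt_id, nbest in nbest_by_utt.items():
        conf = utterance_confidence(word_posteriors(nbest))
        if conf >= threshold:
            selected.append(utt_id)
    return selected
```

In a pipeline of this kind, the 1-best transcripts of the selected utterances would serve as adaptation text, while low-confidence utterances would be left out or passed to manual correction, as the abstract describes.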

Xinhui Hu, Shigeki Matsuda, Hideki Kashioka

National Institute of Information and Communications Technology, Hikaridai 3-5, Seikacho, Sorakugun,

International conference

2011 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC 2011)

Xi'an

English

1-5

2011-10-18 (date first posted online on the Wanfang platform; not the paper's publication date)