Conference Topic

LOTUS-BN: A Thai Broadcast News Corpus and Its Research Applications

This paper describes the design and construction of the LOTUS-BN corpus, a Thai television broadcast news corpus. In addition to audio recordings and their transcriptions, the corpus includes detailed annotation of many interesting characteristics of broadcast news data, such as acoustic conditions, overlapping speech, news topics and named entities. LOTUS-BN is an ongoing project with the goal of collecting 100 hours of speech. We report initial statistics from an analysis of 60 hours of speech, which show that the LOTUS-BN corpus has a rich vocabulary of approximately 26,000 words, one third of which are named entities. The corpus is therefore a good resource for developing an LVCSR system and for investigating named entity detection and recognition, in addition to other broadcast-news-related applications. Research applications on these topics are also discussed.

Ananlada Chotimongkol, Kwanchiva Saykhum, Patcharika Chootrakool, Nattanun Thatphithakkul, Chai Wutiwiwatchai

National Electronics and Computer Technology Center (NECTEC), 112 Phahonyothin Rd., Klong Nueng, Klong Luang, Pathumthani 12120, Thailand

International conference

2009 Oriental COCOSDA International Conference on Speech Database and Assessments

Beijing

English

44-50

2009-08-10 (date first made available online on the Wanfang platform; not the paper's publication date)