A Statistical Method for Uyghur Tokenization

Abstract:

Tokenization is an essential step in Uyghur language processing. Because Uyghur is an agglutinative language, its tokenization differs considerably from that of languages such as Chinese and English. In this paper we propose a two-step statistical tokenization method for Uyghur. Two related factors, the feature template scheme and the manually tokenized corpora, are also discussed. Preliminary experimental results demonstrate that the proposed method is effective: the F-measure of tokenization reaches 88.9% in the open test.
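As a reading aid for the F-measure quoted above, the following is a minimal sketch of how a tokenization F-measure is commonly computed, by comparing predicted token spans against a manually tokenized reference. The record does not state the paper's exact scoring procedure, so the span-matching definition, the function names `token_spans` and `tokenization_prf`, and the toy strings are all assumptions made for illustration.

```python
def token_spans(tokens):
    """Map a token sequence to the set of (start, end) character spans it covers."""
    spans, pos = set(), 0
    for tok in tokens:
        spans.add((pos, pos + len(tok)))
        pos += len(tok)
    return spans


def tokenization_prf(gold_tokens, pred_tokens):
    """Return (precision, recall, F-measure) under exact span matching.

    Assumption: a predicted token counts as correct only if both of its
    boundaries coincide with a token in the reference segmentation.
    """
    gold = token_spans(gold_tokens)
    pred = token_spans(pred_tokens)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Toy strings only (not data from the paper): the reference splits the
# word into three units, while the prediction merges the last two.
gold = ["abcde", "fgh", "ij"]
pred = ["abcde", "fghij"]
precision, recall, f_measure = tokenization_prf(gold, pred)
print(precision, recall, f_measure)
```

In this toy case one of the two predicted tokens matches the reference, so precision is 1/2, recall is 1/3, and the F-measure is their harmonic mean. An "open test" figure is normally obtained by applying such a score to text held out from training.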

Keywords: Xinjiang Uyghur Morpheme Uyghur Suffix Uyghur letter MEM

Authors: Batuer Aisha, Maosong Sun

Affiliations: Department of Computer Sci. & Tech. State Key Lab on Intelligent Tech. & Sys. Tsinghua University, B State Key Lab on Intelligent Tech. & Sys. National Lab for Information Sci. & Tech. Tsinghua Univers

Conference type: International conference

Conference name: International Conference on Natural Language Processing and Knowledge Engineering (IEEE International Conference on Natural Language Processing and Knowledge Engineering, IEEE NLP-KE 2009)

Conference location: Dalian

Conference language: English

Pages: 1-5

Online publication date: 2009-09-24 (date first posted on the Wanfang platform; not the paper's publication date)
