Chinese coding type identification based on sub-sentence length observation

Abstract:

This paper studies an algorithm for identifying the coding type of Chinese characters by analyzing sub-sentence length. We define a sub-sentence and analyze the probability density function (PDF) of sub-sentence length using sentence samples from the Lancaster corpus. We propose a new algorithm that recognizes the coding type of Chinese text by splitting sentences into sub-sentences at Chinese punctuation characters and evaluating the probability of the observed sub-sentence lengths. The algorithm uses both Bayesian rules and iterated sub-sentence length calculation for trust-region comparison. Because the set of Chinese punctuation characters is very small, the algorithm has a clear advantage in space complexity. Time complexity and identification performance are also studied at the end of the paper.
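The general idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the candidate list, the punctuation set, the geometric length model with an assumed mean of 8 characters, and all function names are hypothetical choices made for illustration. The paper's Bayesian step and trust-region comparison are not reproduced here. The sketch decodes the bytes under each candidate encoding, splits the result at Chinese punctuation, and prefers the encoding whose sub-sentence lengths look most plausible.

```python
import math

# Assumed candidate encodings (Python codec names) and punctuation set.
CANDIDATES = ("gb18030", "big5", "utf-8")
PUNCTUATION = "，。、；：？！"


def sub_sentence_lengths(text):
    """Lengths of the runs of characters between Chinese punctuation marks."""
    lengths, run = [], 0
    for ch in text:
        if ch in PUNCTUATION:
            if run:
                lengths.append(run)
            run = 0
        else:
            run += 1
    if run:
        lengths.append(run)
    return lengths


def log_prob(length, mean=8.0):
    """Log-probability of a length under an assumed geometric model.

    The mean of 8 is a placeholder, not a value estimated from a corpus.
    """
    p = 1.0 / mean
    return math.log(p) + (length - 1) * math.log(1.0 - p)


def score(data, encoding):
    """Mean log-probability per sub-sentence, or -inf if decoding fails."""
    try:
        text = data.decode(encoding)
    except UnicodeDecodeError:
        return float("-inf")
    lengths = sub_sentence_lengths(text)
    if not lengths:
        return float("-inf")
    return sum(log_prob(n) for n in lengths) / len(lengths)


def guess_encoding(data):
    """Return the best-scoring candidate, or None if none decodes."""
    best = max(CANDIDATES, key=lambda enc: score(data, enc))
    return best if score(data, best) > float("-inf") else None


if __name__ == "__main__":
    sample = "今天下雨，我在家休息。明天有空，可以出去走走。"
    for enc in CANDIDATES:
        print(enc, guess_encoding(sample.encode(enc)))
```

A wrong decoding either raises an error or yields text in which the punctuation marks do not appear, so the whole input becomes one implausibly long sub-sentence and scores poorly. Only the small punctuation set has to be stored, which is the source of the space advantage the abstract mentions.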

Keywords: Chinese coding; multi-octets coding; GB; BIG5; Unicode; UTF-8; Chinese decoding; type identification; Bayesian rules

Authors: Gang HE, Peidong PENG, Xiaochun WU, Luming CHEN

Affiliation: School of Information and Communication Engineering, Beijing University of Posts and Telecommunications, No. 10 Xi Tu Cheng Road, Beijing 100876, China

Conference type: International conference

Conference name: International Conference on Natural Language Processing and Knowledge Engineering (IEEE NLP-KE 2009)

Conference location: Dalian

Conference language: English

Pages: 1-5

Online publication date: 2009-09-24 (date first posted on the Wanfang platform; not necessarily the paper's publication date)
