Conference Topic

Chinese coding type identification based on sub-sentence length observation

This paper studies an algorithm for identifying the coding type of Chinese characters by analyzing sub-sentence length. A definition of a sub-sentence is given, and the probability density function (pdf) of sub-sentence length is analyzed using sentence samples from the Lancaster corpus. We propose a new algorithm that recognizes the coding type of Chinese characters by splitting sentences into sub-sentences at Chinese punctuation characters and analyzing the probability of the observed sub-sentence lengths. The algorithm uses both Bayesian rules and iterated sub-sentence length calculation for trust-region comparison. Because the set of Chinese punctuation characters is very small, the algorithm has a clear advantage in space complexity. Time complexity and identification performance are also studied at the end of the paper.
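The idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the delimiter set, the Poisson stand-in for the corpus-derived length distribution, the mean length of 9 characters, the equal priors over candidate encodings, and all function names are assumptions made for illustration only. Summing log-probabilities across candidates that yield different numbers of sub-sentences is also a simplification, and the iterated calculation and trust-region comparison mentioned in the abstract are not modeled.

```python
import math
import re

# Full-width Chinese punctuation used as sub-sentence delimiters.
# This particular set is an assumption; the paper defines its own.
_SPLIT_RE = re.compile("[,。;:?!、]")

# Hypothetical mean sub-sentence length in characters (not from the paper).
ASSUMED_MEAN_LENGTH = 9.0


def sub_sentence_lengths(text):
    """Split text at the delimiters and return the lengths of non-empty pieces."""
    return [len(piece) for piece in _SPLIT_RE.split(text) if piece]


def log_length_probability(n, mean=ASSUMED_MEAN_LENGTH):
    """Log of a Poisson pmf, standing in for the corpus-derived distribution."""
    return -mean + n * math.log(mean) - math.lgamma(n + 1)


def log_likelihood(lengths):
    """Sum of log-probabilities of the observed sub-sentence lengths."""
    if not lengths:
        return float("-inf")
    return sum(log_length_probability(n) for n in lengths)


def identify_encoding(data, candidates=("utf-8", "gbk", "big5")):
    """Return the candidate with the highest likelihood (equal priors assumed)."""
    best, best_score = None, float("-inf")
    for name in candidates:
        try:
            text = data.decode(name)
        except UnicodeDecodeError:
            continue  # the bytes are not valid in this encoding at all
        score = log_likelihood(sub_sentence_lengths(text))
        if score > best_score:
            best, best_score = name, score
    return best
```

A call such as `identify_encoding(raw_bytes)` returns the name of the best-scoring candidate, or `None` if no candidate decodes to at least one sub-sentence. Only the small delimiter set and a length distribution need to be stored, which is consistent with the space-complexity advantage the abstract claims.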

Chinese coding; multi-octets coding; GB; BIG5; Unicode; UTF-8; Chinese decoding; type identification; Bayesian rules

Gang HE, Peidong PENG, Xiaochun WU, Luming CHEN

School of Information and Communication Engineering, Beijing University of Posts and Telecommunications, No. 10 Xi Tu Cheng Road, 100876, Beijing, China

International conference

International Conference on Natural Language Processing and Knowledge Engineering (IEEE NLP-KE 2009)

Dalian

English

1-5

2009-09-24 (date first posted online on the Wanfang platform; this is not necessarily the paper's publication date)