Chinese coding type identification based on sub-sentence length observation
This paper studies an algorithm for identifying the coding type of Chinese characters by analyzing sub-sentence length. A definition of a sub-sentence is given, and the probability density function (PDF) of sub-sentence length is analyzed using sentence samples from the Lancaster corpus. We propose a new algorithm that recognizes the coding type of Chinese characters by splitting sentences into sub-sentences at Chinese punctuation characters and analyzing the probability of the observed sub-sentence lengths. The algorithm uses both Bayesian rules and iterated sub-sentence length calculation for trust-region comparison. Because the set of Chinese punctuation characters is very small, the algorithm has a clear advantage in space complexity. Time complexity and identification performance are also studied at the end of the paper.
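The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and everything in it is an assumption made for illustration: the punctuation set, the candidate codec names, the Poisson placeholder standing in for the empirical sub-sentence length PDF, the assumed mean length of 8 characters, and the function names. Each candidate coding is treated as a hypothesis with equal prior. The bytes are decoded under that hypothesis and split at punctuation, and the hypothesis whose sub-sentence lengths are most probable on average is selected. The iterated trust-region comparison mentioned in the abstract is not modelled here.

```python
import math

# Assumed delimiter set: a few fullwidth Chinese punctuation marks.
PUNCTUATION = set(",。;:?!、")
# Assumed candidate codings, given as Python codec names.
CANDIDATES = ("gb2312", "big5", "utf-8", "utf-16-le")
# Assumed mean sub-sentence length in characters (placeholder value).
MEAN_LENGTH = 8.0


def length_log_prob(n, mean=MEAN_LENGTH):
    """Log P(length = n) under a Poisson placeholder for the empirical PDF."""
    return -mean + n * math.log(mean) - math.lgamma(n + 1)


def sub_sentence_lengths(text):
    """Lengths of the non-empty character runs between punctuation marks."""
    lengths, run = [], 0
    for ch in text:
        if ch in PUNCTUATION:
            if run:
                lengths.append(run)
            run = 0
        else:
            run += 1
    if run:
        lengths.append(run)
    return lengths


def score(data, encoding):
    """Mean log-probability of the sub-sentence lengths seen under `encoding`.

    Bytes that cannot be decoded at all reject the hypothesis outright;
    this shortcut is an addition of this sketch.
    """
    try:
        text = data.decode(encoding)
    except UnicodeDecodeError:
        return float("-inf")
    lengths = sub_sentence_lengths(text)
    if not lengths:
        return float("-inf")
    return sum(length_log_prob(n) for n in lengths) / len(lengths)


def identify_coding(data):
    """Pick the candidate coding with the highest score (equal priors)."""
    return max(CANDIDATES, key=lambda enc: score(data, enc))
```

For example, calling `identify_coding(text.encode("gb2312"))` on a short punctuated Chinese passage would be expected to select `"gb2312"`, because decoding under the other candidates either fails or yields no recognizable punctuation and therefore one implausibly long sub-sentence. Averaging the log-probabilities rather than summing them is a simplification chosen here so that hypotheses producing different numbers of sub-sentences remain comparable.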
Chinese coding; multi-octets coding; GB; BIG5; Unicode; UTF-8; Chinese decoding; type identification; Bayesian rules
Gang HE, Peidong PENG, Xiaochun WU, Luming CHEN
School of Information and Communication Engineering, Beijing University of Posts and Telecommunications, No. 10 Xi Tu Cheng Road, Beijing 100876, China
International conference
Dalian
English
1-5
2009-09-24 (date first posted online on the Wanfang platform; does not represent the paper's publication date)