Conference Topic

A Hybrid Method to Segment Words

Word segmentation is the foundation of machine translation, text classification, and information retrieval. A method is proposed that combines dictionary-based word segmentation using reverse maximum matching with statistical word segmentation using suffix arrays. The input text is first segmented with the dictionary-based reverse maximum matching method. Two-way suffix arrays are then constructed and their longest common prefixes (LCP) are computed; candidate words are selected by applying a threshold, and the candidates are filtered using mutual information to obtain the true words. Ambiguous text is filtered using information entropy. The experiment shows that the accuracy of word segmentation can exceed 97%.
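As a rough illustration of the dictionary-based step named in the abstract, the following is a minimal sketch of reverse maximum matching. It is not the authors' implementation: the function name `reverse_maximum_match`, the `max_len` window size, the single-character fallback, and the toy dictionary are all assumptions made for this example.

```python
def reverse_maximum_match(text, dictionary, max_len=4):
    """Segment text right to left, taking the longest dictionary word each time.

    `dictionary` is any set-like collection of known words (hypothetical here);
    `max_len` is an assumed upper bound on word length.
    """
    words = []
    end = len(text)
    while end > 0:
        # Try the longest window ending at `end` first, then shrink it.
        for size in range(min(max_len, end), 0, -1):
            piece = text[end - size:end]
            # Assumed fallback: a single character is accepted even if unknown.
            if size == 1 or piece in dictionary:
                words.append(piece)
                end -= size
                break
    # Words were collected from the end of the text, so restore reading order.
    words.reverse()
    return words


if __name__ == "__main__":
    toy_dictionary = {"研究", "研究生", "生命", "起源"}
    print(reverse_maximum_match("研究生命起源", toy_dictionary))
```

With this toy dictionary the sketch segments "研究生命起源" as "研究", "生命", "起源", whereas a left-to-right longest match would first take "研究生". The statistical steps in the abstract (suffix arrays, LCP, mutual information, information entropy) are not covered by this sketch.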

Reverse maximum matching; Word segmentation based on dictionary; Suffix array; LCP

Yubiao Dai, Xueli Ren

Department of Computer Science and Engineering, Qujing Normal University, Qujing, China

International conference

2011 3rd International Conference on Computer and Network Technology (ICCNT 2011) (2011 3rd IEEE International Conference on Computer and Network Technology)

Taiyuan

English

12-15

2011-02-26 (date first made available online on the Wanfang platform; not the paper's publication date)