A Weibo-Oriented Method for Unknown Word Extraction
Unknown word recognition is one of the most prominent and challenging problems in Chinese language processing. Several effective approaches have been proposed; however, they do not work well on Chinese Twitter (i.e., Weibo) messages. This paper presents a method for recognizing unknown words in Weibo messages. Because Weibo messages exhibit great flexibility in wording and a high correlation between unknown words and unpredictable topics, the proposed method first groups the corpus into multiple categories using K-means and then derives a morpheme set from each category based on local term frequencies. Second, for each potential unknown word in every morpheme set, a newly introduced measure, named adjacency degree, is calculated to determine whether it is a correct unknown word. Experiments show that the proposed method is efficient, precise, and insensitive to the size of the Weibo corpus.
Unknown Word Extraction; Local Threshold; Adjacency Degree; Improved K-means
Shuai Zhang, Qianren Liu, Lei Wang
School of Information and Communication Engineering, Beijing University of Posts and Telecommunications, Beijing, China
International conference
2012 Eighth International Conference on Semantics, Knowledge and Grids (SKG2012)
Beijing
English
209-212
2012-10-22 (date first posted online on the Wanfang platform; not the paper's publication date)
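The pipeline outlined in the abstract (per-category candidates selected by local term frequency, then scored by a neighborhood-based measure) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the messages are assumed to be already grouped into categories (the K-means step is omitted), candidates are plain character n-grams, and `neighbor_variety` is a hypothetical stand-in for the adjacency degree, which the abstract names but does not define. All function names, parameters, and thresholds here are illustrative.

```python
from collections import Counter


def candidate_ngrams(messages, n_max=3, min_count=2):
    """Count character n-grams (length 2..n_max) inside one category and
    keep those seen at least min_count times (a local frequency threshold)."""
    counts = Counter()
    for msg in messages:
        for n in range(2, n_max + 1):
            for i in range(len(msg) - n + 1):
                counts[msg[i:i + n]] += 1
    return {gram: c for gram, c in counts.items() if c >= min_count}


def neighbor_variety(messages, gram):
    """Hypothetical stand-in for the adjacency degree (an assumption, not the
    paper's definition): the smaller of the number of distinct characters
    seen immediately to the left and immediately to the right of gram."""
    left, right = set(), set()
    n = len(gram)
    for msg in messages:
        start = msg.find(gram)
        while start != -1:
            if start > 0:
                left.add(msg[start - 1])
            if start + n < len(msg):
                right.add(msg[start + n])
            start = msg.find(gram, start + 1)
    return min(len(left), len(right))


def extract_candidates(categories, min_count=2, min_variety=2):
    """For each category, keep frequent n-grams whose neighbors vary enough."""
    result = {}
    for name, messages in categories.items():
        grams = candidate_ngrams(messages, min_count=min_count)
        result[name] = sorted(
            g for g in grams if neighbor_variety(messages, g) >= min_variety
        )
    return result
```

Computing frequencies per category rather than over the whole corpus is what makes the threshold local: a string that is rare overall can still pass if it is frequent within one topic. The neighbor check then discards fragments that always appear with the same surrounding characters, which suggests they are pieces of a longer word.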