A Weibo-Oriented Method for Unknown Word Extraction

Abstract:

Unknown word recognition is one of the most prominent and challenging problems in Chinese language processing. Some effective approaches have been proposed; however, they do not work well on Chinese Twitter (i.e., weibo) messages. In this paper, a method is presented to recognize unknown words from weibo. Because weibo messages exhibit great flexibility in wording and a high correlation between unknown words and unpredictable topics, the proposed method first groups the corpus into multiple categories using K-means; then, from each category, a morpheme set is derived based on local term frequencies. Second, for each potential unknown word in every morpheme set, a newly introduced measure (named adjacency degree) is calculated to determine whether a correct unknown word has been found. Experiments show that the proposed method is efficient, precise, and insensitive to the size of the weibo corpus.
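The two stages described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not define the adjacency degree, so the code substitutes a simple stand-in (the smaller of the number of distinct left and right neighbouring characters). The function names, the n-gram length limit, and both thresholds are hypothetical. The K-means step is not shown; the sketch assumes the messages have already been grouped into categories.

```python
from collections import Counter


def candidate_ngrams(messages, max_len=3, min_freq=2):
    """Collect character n-grams (length 2..max_len) whose frequency
    within ONE category reaches a local threshold (min_freq)."""
    counts = Counter()
    for msg in messages:
        for n in range(2, max_len + 1):
            for i in range(len(msg) - n + 1):
                counts[msg[i:i + n]] += 1
    return {gram: c for gram, c in counts.items() if c >= min_freq}


def neighbor_variety(messages, candidate):
    """Stand-in for the paper's adjacency degree (NOT its definition):
    min(#distinct left neighbours, #distinct right neighbours).
    A message boundary is recorded as None."""
    left, right = set(), set()
    for msg in messages:
        start = msg.find(candidate)
        while start != -1:
            end = start + len(candidate)
            left.add(msg[start - 1] if start > 0 else None)
            right.add(msg[end] if end < len(msg) else None)
            start = msg.find(candidate, start + 1)
    return min(len(left), len(right))


def extract_unknown_words(categories, min_freq=2, min_variety=2):
    """categories maps a cluster label to its list of messages.
    Returns, per label, the sorted candidates that pass both checks."""
    result = {}
    for label, messages in categories.items():
        candidates = candidate_ngrams(messages, min_freq=min_freq)
        result[label] = sorted(
            gram for gram in candidates
            if neighbor_variety(messages, gram) >= min_variety
        )
    return result
```

Computing frequencies per category rather than over the whole corpus is what makes the threshold "local": a string that is rare overall can still be frequent inside the one topic cluster where it is used.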

Keywords: Unknown Word Extraction; Local Threshold; Adjacency Degree; Improved K-means

Authors: Shuai Zhang, Qianren Liu, Lei Wang

Affiliation: School of Information and Communication Engineering, Beijing University of Posts and Telecommunications, Beijing, China

Conference type: International conference

Conference name: 2012 Eighth International Conference on Semantics, Knowledge and Grids (SKG2012)

Conference location: Beijing

Conference language: English

Pages: 209-212

Online publication date: 2012-10-22 (the date first posted on the Wanfang platform, not the paper's publication date)
