Chinese Unknown Words Extraction Based on Word-Level Characteristics

Abstract:

The automatic recognition of unknown words is an important problem in Chinese information processing. Based on the characteristics of words, this paper proposes a method for recognizing new words from high-frequency strings. First, the high-frequency strings in each single document are extracted as candidate strings. Then, candidates that do not satisfy the characteristics of word distribution and independent word usage are removed. Finally, the entire corpus is segmented with the remaining candidate strings, and their word frequencies are counted for further filtering. Experimental results show that, on documents about basketball downloaded from the Zaobao newspaper, this method achieves an F-score of 79.39%.
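
As a rough illustration of the first and last steps described in the abstract, the Python sketch below collects repeated character strings from a single document and then counts them across a corpus with a greedy longest-match scan. The function names `frequent_strings` and `corpus_frequency`, the length bounds, and the count threshold are assumptions made for this example and are not taken from the paper. The distribution and independent-usage filters are left out because the abstract does not define them.

```python
from collections import Counter


def frequent_strings(text, min_len=2, max_len=4, min_count=2):
    """Return substrings of length min_len..max_len seen at least min_count times.

    The length bounds and threshold are illustrative assumptions.
    """
    counts = Counter(
        text[i:i + n]
        for n in range(min_len, max_len + 1)
        for i in range(len(text) - n + 1)
    )
    return {s: c for s, c in counts.items() if c >= min_count}


def corpus_frequency(candidates, documents):
    """Count candidate occurrences over all documents.

    Uses a greedy longest-match scan as a stand-in for the segmentation
    step; the paper's actual segmenter is not described in the abstract.
    """
    ordered = sorted(candidates, key=len, reverse=True)
    totals = Counter()
    for doc in documents:
        i = 0
        while i < len(doc):
            for cand in ordered:
                if doc.startswith(cand, i):
                    totals[cand] += 1
                    i += len(cand)
                    break
            else:
                # No candidate starts here; advance one character.
                i += 1
    return totals


if __name__ == "__main__":
    docs = ["姚明得分姚明篮板", "姚明助攻"]
    candidates = frequent_strings(docs[0])
    print(candidates)
    print(corpus_frequency(candidates, docs))
```

In this sketch, the corpus-level counts would feed the final frequency filter, and any candidate-level filtering would sit between the two function calls.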

Keywords: Chinese unknown word; word distribution; independent usage

Authors: Wenbo Pang, Xiaozhong Fan, Yijun Gu, Jiangde Yu

Affiliations: School of Computer and Technology, Beijing Institute of Technology, Beijing 100081, China; School of Computer and Technology, Beijing Institute of Technology, Beijing 100081, China; College of Information Security & Engineering, Chinese People’s Public Security University, Beijing 10; School of Computer and Information Engineering, Anyang Normal University, Anyang 455000, China

Conference type: International conference

Conference name: 2009 Ninth International Conference on Hybrid Intelligent Systems (HIS 2009)

Conference location: Shenyang

Conference language: English

Pages: 1-6

Online publication date: 2009-08-12 (date first posted on the Wanfang platform; not the paper's publication date)
