Collaborative Recognition and Recovery of the Chinese Intercept Abbreviation
One important task in information content security is evaluating the theme words of a text. Because Chinese expression is highly varied, especially in its abbreviations, supervising theme words becomes harder. The goal of this paper is to quickly and accurately discover intercept abbreviations in text crawled over a short time period. The paper first segments the target texts and then uses a Support Vector Machine (SVM) to recognize abbreviations among the wrongly segmented texts as candidates. Second, the paper presents collaborative methods: an improved Conditional Random Field (CRF) model predicts the word corresponding to each character of the abbreviation; and, to handle the 1:n relationship, the ranking list from the prediction step is collaboratively merged with the matched results from a thesaurus of abbreviations. Experiments show that the method achieves 76.5% accuracy and 77.8% recall at the recognition stage. At the recovery stage, the accuracy is 62.1%, which is 20.8% higher than that of a method based on the Hidden Markov Model (HMM).
Collaborative Recovery; Improved CRF; Chinese Abbreviation
Jinshuo LIU; Yusen CHEN; Juan DENG; Donghong JI; Jeff PAN
Computer School, Wuhan University, Wuhan 430072, China; International School of Software, Wuhan University, Wuhan 430072, China; University of Aberdeen, Aberdeen, AB24 3FX, UK
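Domestic conference
The 16th China National Conference on Computational Linguistics and the 5th International Symposium on Natural Language Processing Based on Naturally Annotated Big Data
Nanjing
English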
1-12
2017-10-13 (date first made available online on the Wanfang platform; not the paper's publication date)
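The abstract's final recovery step, merging the ranked list of predicted full forms with thesaurus matches, can be illustrated with a minimal sketch. The record does not give the paper's actual merging rule, so everything here is an assumption made for illustration: the function name `merge_candidates`, the additive `thesaurus_bonus`, the scores, and the example candidates are all hypothetical.

```python
from typing import Dict, List, Tuple


def merge_candidates(
    crf_ranked: List[Tuple[str, float]],
    thesaurus_matches: List[str],
    thesaurus_bonus: float = 0.5,
) -> List[Tuple[str, float]]:
    """Merge ranked predictions with thesaurus hits into one ranking.

    Assumption: a thesaurus hit adds a fixed bonus to a candidate's score.
    This is an illustrative rule, not the rule used in the paper.
    """
    scores: Dict[str, float] = {}
    # Keep the best predicted score for each candidate full form.
    for full_form, score in crf_ranked:
        scores[full_form] = max(scores.get(full_form, 0.0), score)
    # Boost candidates that the abbreviation thesaurus also lists;
    # thesaurus-only entries enter the ranking with the bonus alone.
    for full_form in thesaurus_matches:
        scores[full_form] = scores.get(full_form, 0.0) + thesaurus_bonus
    # Sort by descending score, breaking ties by the string itself.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical candidates for the abbreviation "北大".
predicted = [("北方大学", 0.5), ("北京大学", 0.25)]
in_thesaurus = ["北京大学"]
merged = merge_candidates(predicted, in_thesaurus)
```

In this toy case the thesaurus match lifts "北京大学" above the candidate that the prediction step alone ranked first, which is the kind of 1:n disambiguation the abstract attributes to the collaborative merge.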