Tibetan Multi-word Expressions Identification Framework Based on News Corpora

Abstract:

This paper presents a framework for identifying Tibetan multi-word expressions (MWEs). The framework has two phases. In the first phase, sentences are segmented and high-frequency word-based n-grams are extracted using Nagao's N-gram statistical algorithm and the Statistical Substring Reduction Algorithm. In the second phase, Tibetan MWEs are identified by combining context analysis with language model-based analysis. The framework uses three strategies: context analysis, Two-word Coupling Degree, and Tibetan syllable inside word probability. In the experiments, we evaluate the effectiveness of the three strategies on a small test set and compare results at different granularities of context analysis. On the small test corpus, an F-score above 75% is achieved when words are segmented in preprocessing. On a larger corpus, P@N (N = 800) exceeds 85%, which indicates that the identification framework works well on larger corpora. The experimental results show acceptable performance for Tibetan MWE identification.
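The first phase described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: a plain `Counter` stands in for Nagao's sorted-pointer-table algorithm, the function names, the `min_freq` threshold, and the n-gram length range are hypothetical, and the substring reduction step follows the commonly described rule of dropping an n-gram when a longer n-gram containing it has the same frequency. The tokens are placeholder strings, not Tibetan words.

```python
from collections import Counter


def count_word_ngrams(sentences, min_n=2, max_n=4):
    """Count word-based n-grams over pre-segmented sentences (lists of words)."""
    counts = Counter()
    for words in sentences:
        for n in range(min_n, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return counts


def reduce_substrings(counts, min_freq=2):
    """Keep frequent n-grams, dropping any n-gram that is contained in a
    longer frequent n-gram with the same frequency (assumed reduction rule)."""
    frequent = {g: c for g, c in counts.items() if c >= min_freq}
    redundant = set()
    for gram, freq in frequent.items():
        # Checking the two immediate sub-grams is enough: if a shorter n-gram
        # shares its frequency with a much longer one, every n-gram in between
        # has that frequency too.
        for sub in (gram[:-1], gram[1:]):
            if frequent.get(sub) == freq:
                redundant.add(sub)
    return {g: c for g, c in frequent.items() if g not in redundant}


# Placeholder tokens standing in for segmented words.
sentences = [["a", "b", "c"], ["a", "b", "c"], ["a", "b", "d"]]
candidates = reduce_substrings(count_word_ngrams(sentences), min_freq=2)
```

In this toy input, `("b", "c")` is dropped because it never occurs outside `("a", "b", "c")`, while `("a", "b")` is kept because it also occurs in a different context. The surviving candidates would then be passed to the second-phase filters named in the abstract.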

Keywords: Tibetan Multi-word expression Two-word Coupling Degree inside word probability

Authors: Minghua Nuo, Congjun Lun, Huidan Liu

Affiliations: College of Computer Science-College of Software Engineering, Inner Mongolia University; Institute of Ethnology and Anthropology, Chinese Academy of Social Sciences; Institute of Software, Chinese Academy of Sciences

Conference type: International conference

Conference name: The 5th Conference on Natural Language Processing and Chinese Computing (NLPCC-ICCPOL2016)

Conference location: Kunming

Conference language: English

Pages: 1-12

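Online publication date: 2016-12-02 (the date the record first appeared on the Wanfang platform; it does not represent the paper's publication date)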
