Tibetan Multi-word Expressions Identification Framework Based on News Corpora
This paper presents a framework for identifying Tibetan multi-word expressions (MWEs). The framework has two phases. In the first phase, sentences are segmented and high-frequency word-based n-grams are extracted using Nagao's N-gram statistical algorithm and the Statistical Substring Reduction Algorithm. In the second phase, Tibetan MWEs are identified by combining context analysis with language model-based analysis. The framework uses three identification strategies: context analysis, two-word Coupling Degree, and Tibetan syllable inside word probability. In the experiments, we evaluate the effectiveness of the three strategies on a small test set and compare results at different granularities for context analysis. On the small test corpus, an F-score above 75% is achieved when words are segmented in preprocessing. On a larger corpus, P@N (N = 800) exceeds 85%, which indicates that the identification framework also works well on larger corpora. Overall, the experimental results show acceptable performance for Tibetan MWE identification.
Tibetan; Multi-word expression; Two-word Coupling Degree; inside word probability
Minghua Nuo, Congjun Lun, Huidan Liu
College of Computer Science-College of Software Engineering, Inner Mongolia University; Institute of Ethnology and Anthropology, Chinese Academy of Social Sciences; Institute of Software, Chinese Academy of Sciences
International conference
The 5th Conference on Natural Language Processing and Chinese Computing (NLPCC-ICCPOL2016)
Kunming
English
1-12
2016-12-02 (date first posted online on the Wanfang platform; not the paper's publication date)
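
The first phase described in the abstract (counting word-based n-grams, then pruning redundant substrings) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes sentences are already segmented into word lists, it uses a plain hash-based counter in place of Nagao's suffix-sorting algorithm, and it assumes the common form of statistical substring reduction in which an n-gram is dropped when a longer n-gram containing it has the same frequency. The function names `count_ngrams` and `reduce_substrings`, the parameters `max_n` and `min_freq`, and the placeholder tokens are all hypothetical.

```python
from collections import Counter


def count_ngrams(sentences, max_n=4, min_freq=2):
    """Count word-based n-grams (2 <= n <= max_n) and keep the frequent ones.

    A simple Counter stands in for Nagao's algorithm here (an assumption
    made for brevity; the frequencies it produces are the same).
    """
    counts = Counter()
    for words in sentences:
        for n in range(2, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_freq}


def reduce_substrings(ngrams):
    """Drop any n-gram that occurs inside a longer n-gram of equal frequency.

    Assumed reduction rule: equal frequency means the shorter n-gram never
    appears outside the longer one, so it is treated as redundant.
    """
    kept = dict(ngrams)
    for longer, freq in ngrams.items():
        n = len(longer)
        for size in range(2, n):
            for i in range(n - size + 1):
                sub = longer[i:i + size]
                if kept.get(sub) == freq:
                    del kept[sub]
    return kept


# Placeholder tokens stand in for segmented Tibetan words.
sentences = [["a", "b", "c"], ["a", "b", "c"], ["a", "b", "d"]]
frequent = count_ngrams(sentences, max_n=3, min_freq=2)
candidates = reduce_substrings(frequent)
```

In this toy input, `("b", "c")` occurs only inside `("a", "b", "c")`, so it is pruned, while `("a", "b")` is kept because it also occurs elsewhere. The surviving candidates would then be passed to the second-phase strategies named in the abstract.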