Conference Topic

Tibetan Multi-word Expressions Identification Framework Based on News Corpora

  This paper presents an identification framework for extracting Tibetan multi-word expressions (MWEs). The framework has two phases. In the first phase, sentences are segmented and high-frequency word-based n-grams are extracted using Nagao's N-gram statistical algorithm and the Statistical Substring Reduction Algorithm. In the second phase, Tibetan MWEs are identified by the proposed framework, which is based on a combination of context analysis and language model-based analysis. The framework uses three strategies: context analysis, two-word Coupling Degree, and Tibetan syllable inside word probability. In the experiments, we evaluate the effectiveness of the three strategies on small test data and compare results at different granularities for context analysis. On the small test corpus, an F-score above 75% is achieved when words are segmented in preprocessing. On a larger corpus, P@N (N = 800) exceeds 85%, which indicates that the identification framework works well on larger corpora. The experimental results show acceptable performance for Tibetan MWE identification.

Tibetan; Multi-word expression; Two-word Coupling Degree; inside word probability

Minghua Nuo, Congjun Lun, Huidan Liu

College of Computer Science-College of Software Engineering, Inner Mongolia University; Institute of Ethnology and Anthropology, Chinese Academy of Social Sciences; Institute of Software, Chinese Academy of Sciences

International conference

The 5th Conference on Natural Language Processing and Chinese Computing (NLPCC-ICCPOL2016)

Kunming

English

1-12

2016-12-02 (date first made available online on the Wanfang platform; not the paper's publication date)