Word Segmentation on Micro-blog Texts with External Lexicon and Heterogeneous Data

Abstract:

This paper describes our system designed for the NLPCC 2016 shared task on word segmentation on micro-blog texts (i.e., Weibo). We treat word segmentation as a character-wise sequence labeling problem, and explore two directions to enhance our CRF-based baseline. First, we employ a large-scale external lexicon to construct extra lexicon features in the model, which proves to be extremely useful. Second, we exploit two heterogeneous datasets, i.e., Penn Chinese Treebank 7 (CTB7) and People's Daily (PD), to help word segmentation on Weibo. We adopt two mainstream approaches, i.e., the guide-feature based approach and the recently proposed coupled sequence labeling approach. We combine the above techniques in different ways and obtain four well-performing models. Finally, we merge the outputs of the four models and obtain the final results via Viterbi-based re-decoding. On the Weibo test data, our proposed approach outperforms the baseline by 95.63 - 94.24 = 1.39% in terms of F1 score. Our final system ranks first among five participants in the open track in terms of F1 score, and is also the best among all 28 submissions.
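As a rough illustration of what "character-wise sequence labeling" means here, the sketch below converts a segmented sentence into one tag per character and back again. The four-tag B/M/E/S scheme (begin, middle, end, single) and the function names `words_to_tags` and `tags_to_words` are assumptions made for this example; the abstract does not state which tag set or implementation the system uses.

```python
def words_to_tags(words):
    """Map a list of words to one tag per character (assumed B/M/E/S scheme)."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            # first character B, last character E, everything between M
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars, tags):
    """Rebuild words from characters and their per-character tags."""
    words, buffer = [], ""
    for ch, tag in zip(chars, tags):
        buffer += ch
        if tag in ("E", "S"):  # a word ends at E or S
            words.append(buffer)
            buffer = ""
    if buffer:  # keep any trailing characters from an unfinished tag sequence
        words.append(buffer)
    return words


if __name__ == "__main__":
    words = ["我", "喜欢", "微博"]
    tags = words_to_tags(words)
    print(tags)
    print(tags_to_words("".join(words), tags))
```

Under this framing, a sequence model such as a CRF predicts the tag sequence, and the segmentation is read off from the predicted tags.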

Authors: Qingrong Xia, Zhenghua Li, Jiayuan Chao, Min Zhang

Affiliation: Soochow University, Suzhou, China

Conference type: International conference

Conference name: The Fifth Conference on Natural Language Processing and Chinese Computing (NLPCC-ICCPOL 2016)

Conference location: Kunming

Conference language: English

Pages: 1-11

Online publication date: 2016-12-02 (the date the record first went online on the Wanfang platform; not the paper's publication date)
