Word Segmentation on Micro-blog Texts with External Lexicon and Heterogeneous Data
This paper describes our system designed for the NLPCC 2016 shared task on word segmentation on micro-blog texts (i.e., Weibo). We treat word segmentation as a character-wise sequence labeling problem, and explore two directions to enhance our CRF-based baseline. First, we employ a large-scale external lexicon to construct extra lexicon features in the model, which proves extremely useful. Second, we exploit two heterogeneous datasets, i.e., Penn Chinese Treebank 7 (CTB7) and People's Daily (PD), to help word segmentation on Weibo. We adopt two mainstream approaches, i.e., the guide-feature based approach and the recently proposed coupled sequence labeling approach. We combine the above techniques in different ways and obtain four well-performing models. Finally, we merge the outputs of the four models and obtain the final results via Viterbi-based re-decoding. On the Weibo test data, our proposed approach outperforms the baseline by 95.63 - 94.24 = 1.39% in terms of F1 score. Our final system ranks first among five participants in the open track in terms of F1 score, and is also the best among all 28 submissions.
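To make the character-wise formulation concrete, the following is a minimal sketch of how a segmented sentence can be mapped to per-character labels and back, together with one possible form of lexicon feature. The B/M/E/S tag set, the function names, the `max_len` parameter, and the toy lexicon are all assumptions for illustration; the abstract does not specify the tag scheme or the feature templates actually used.

```python
def words_to_tags(words):
    """Map a segmented sentence to per-character tags (assumed B/M/E/S scheme)."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")  # single-character word
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags


def tags_to_words(chars, tags):
    """Recover words from characters and their predicted tags."""
    words, cur = [], ""
    for ch, t in zip(chars, tags):
        cur += ch
        if t in ("E", "S"):  # a word ends at this character
            words.append(cur)
            cur = ""
    if cur:  # tolerate a sequence that ends mid-word
        words.append(cur)
    return words


def lexicon_features(chars, i, lexicon, max_len=4):
    """Hypothetical lexicon features for position i: length of the longest
    lexicon word that begins at i, and of the longest that ends at i."""
    begin = max((n for n in range(1, max_len + 1)
                 if i + n <= len(chars) and "".join(chars[i:i + n]) in lexicon),
                default=0)
    end = max((n for n in range(1, max_len + 1)
               if i - n + 1 >= 0 and "".join(chars[i - n + 1:i + 1]) in lexicon),
              default=0)
    return {"lex_begin": begin, "lex_end": end}


words = ["我", "喜欢", "自然语言"]
chars = list("".join(words))
tags = words_to_tags(words)
toy_lexicon = {"喜欢", "自然", "自然语言"}  # illustrative entries only
feats = lexicon_features(chars, 3, toy_lexicon)
```

In a CRF-based segmenter, features such as `lex_begin` and `lex_end` would be added alongside the usual character n-gram features at each position; the tag sequence predicted by the model is then turned back into words with `tags_to_words`.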
Qingrong Xia, Zhenghua Li, Jiayuan Chao, Min Zhang
Soochow University, Suzhou, China
International conference
The 5th Conference on Natural Language Processing and Chinese Computing (NLPCC-ICCPOL 2016)
Kunming
English
1-11
2016-12-02 (date first posted online on the Wanfang platform; this is not the paper's publication date)