Enhancing LSTM-based Word Segmentation Using Unlabeled Data

Abstract:

Word segmentation is commonly treated as a sequence labeling problem. The traditional approach is a machine learning method, such as conditional random fields with hand-crafted features. Recently, deep learning approaches have achieved state-of-the-art performance on word segmentation, with LSTM networks among the most popular. This paper presents a method for introducing numerical statistics-based features, counted on unlabeled data, into LSTM networks and analyzes how they improve the performance of a word segmentation model. We add pre-trained character-bigram embeddings, pointwise mutual information, accessor variety, and punctuation variety to our model and compare their performance on several datasets, including three datasets from the CoNLL-2017 shared task and three simplified Chinese datasets. We achieve state-of-the-art performance on two of them and obtain comparable results on the rest.
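As an illustration of the kind of statistics the abstract mentions, the sketch below counts pointwise mutual information (PMI) for adjacent character pairs and accessor variety for short substrings over raw, unlabeled sentences. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `bigram_pmi` and `accessor_variety`, the `max_len` parameter, and the use of a single boundary marker per sentence edge are all hypothetical choices, and the abstract does not say how the paper discretizes or feeds these values into the LSTM.

```python
import math
from collections import Counter, defaultdict


def bigram_pmi(sentences):
    """Return {(x, y): PMI} for adjacent character pairs in raw text.

    PMI(x, y) = log( p(xy) / (p(x) * p(y)) ), with probabilities
    estimated by relative frequency (an assumption of this sketch).
    """
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        unigrams.update(s)
        bigrams.update(zip(s, s[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    return {
        (x, y): math.log(
            (c / n_bi) / ((unigrams[x] / n_uni) * (unigrams[y] / n_uni))
        )
        for (x, y), c in bigrams.items()
    }


def accessor_variety(sentences, max_len=2):
    """Return {substring: AV} for substrings of up to max_len characters.

    AV is taken here as min(#distinct left neighbours, #distinct right
    neighbours). Sentence edges count as one shared marker each, which
    is a simplification made for this sketch.
    """
    left, right = defaultdict(set), defaultdict(set)
    for s in sentences:
        for n in range(1, max_len + 1):
            for i in range(len(s) - n + 1):
                sub = s[i:i + n]
                left[sub].add(s[i - 1] if i > 0 else "<s>")
                right[sub].add(s[i + n] if i + n < len(s) else "</s>")
    return {sub: min(len(left[sub]), len(right[sub])) for sub in left}


corpus = ["abab", "abc"]  # toy stand-in for unlabeled text
pmi = bigram_pmi(corpus)
av = accessor_variety(corpus)
```

A high PMI suggests two characters tend to belong to the same word, while a high accessor variety suggests a substring appears in many different contexts and is therefore more likely to be a word on its own.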

Keywords: word segmentation; statistics-based features; neural network; unlabeled data

Authors: Bo Zheng, Wanxiang Che, Jiang Guo, Ting Liu

Affiliation: Research Center for Social Computing and Information Retrieval, Harbin Institute of Technology, China

Conference type: Domestic conference

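Conference name: The 16th China National Conference on Computational Linguistics and the 5th International Symposium on Natural Language Processing Based on Naturally Annotated Big Data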

Conference location: Nanjing

Conference language: English

Pages: 1-11

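Online publication date: 2017-10-13 (the date the paper first went online on the Wanfang platform; it does not represent the paper's publication date)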
