A Fast Matching Method Based on Semantic Similarity for Short Texts

Abstract:

With the emergence of various social media, short texts such as weibos and instant messages have become prevalent on today's websites. To mine semantically similar information from massive data, a fast and efficient matching method for short texts is urgently needed. However, conventional matching methods suffer from data sparsity in short documents. In this paper, we propose a novel matching method, referred to as semantically similar hashing (SSHash). The basic idea of SSHash is to train a topic model directly from the corpus rather than from individual documents, and then project texts into hash codes using the latent features. The major advantages of SSHash are that 1) it alleviates the sparsity problem in short texts, because the latent features are obtained from the whole corpus rather than at the document level; and 2) it can perform similarity matching interactively in real time by introducing a hashing method. We carry out extensive experiments on real-world short texts. The results demonstrate that our method significantly outperforms baseline methods on several evaluation metrics.
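The abstract gives only the outline of the pipeline (corpus-level latent features, then hash codes, then fast matching), so the following is a minimal sketch of that general idea and not the authors' SSHash implementation. Everything specific in it is an assumption made for illustration: the toy `WORD_TOPIC` table stands in for a topic model trained on the whole corpus, averaging word vectors stands in for whatever projection the paper uses, and the fixed threshold of 0.25 (that is, 1/K for K = 4 topics) is a hypothetical binarization rule.

```python
from typing import Dict, List

# Hypothetical word-to-topic weights, standing in for a topic model
# learned from the whole corpus (not from individual short documents).
WORD_TOPIC: Dict[str, List[float]] = {
    "goal":    [0.70, 0.10, 0.10, 0.10],
    "match":   [0.60, 0.20, 0.10, 0.10],
    "striker": [0.80, 0.10, 0.05, 0.05],
    "stock":   [0.10, 0.70, 0.10, 0.10],
    "market":  [0.10, 0.60, 0.20, 0.10],
    "shares":  [0.05, 0.80, 0.10, 0.05],
}


def latent_features(tokens: List[str], word_topic: Dict[str, List[float]]) -> List[float]:
    """Average the topic vectors of the known words in a short text."""
    k = len(next(iter(word_topic.values())))
    acc = [0.0] * k
    known = 0
    for token in tokens:
        vec = word_topic.get(token)
        if vec is None:
            continue  # unknown words contribute nothing
        for i, value in enumerate(vec):
            acc[i] += value
        known += 1
    return [a / known for a in acc] if known else acc


def to_hash_code(features: List[float], threshold: float) -> int:
    """Binarize each latent feature and pack the bits into one integer."""
    code = 0
    for value in features:
        code = (code << 1) | (1 if value > threshold else 0)
    return code


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hash codes."""
    return bin(a ^ b).count("1")


def nearest(query: int, codes: Dict[str, int]) -> str:
    """Return the id of the stored code closest to the query in Hamming distance."""
    return min(codes, key=lambda doc_id: hamming(query, codes[doc_id]))
```

With these toy values, two football-related texts map to the same code while a finance-related text differs in two bits, so matching reduces to cheap XOR and bit-count operations on integers. A real system would replace `WORD_TOPIC` with a trained topic model and would likely learn the thresholds from data.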

Keywords: Short Text; Semantically Similar Matching; Topic Model; Hash

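Authors: Jiaming Xu, Pengcheng Liu, Gaowei Wu, Zhengya Sun, Bo Xu, Hongwei Hao

Affiliation: Institute of Automation, Chinese Academy of Sciences, 100190, Beijing, P.R. China

Conference type: International conference

Conference name: Second CCF Conference, NLPCC 2013 (2nd Conference on Natural Language Processing and Chinese Computing)

Conference location: Chongqing

Conference language: English

Pages: 299-309

Online publication date: 2013-11-15 (the date the record first went online on the Wanfang platform; not the paper's publication date)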
