A Fast Matching Method Based on Semantic Similarity for Short Texts
With the emergence of various social media, short texts such as weibos and instant messages are prevalent on today's websites. Mining semantically similar information from massive data therefore calls for a fast and efficient matching method for short texts. However, conventional matching methods suffer from data sparsity in short documents. In this paper, we propose a novel matching method, referred to as semantically similar hashing (SSHash). The basic idea of SSHash is to train a topic model directly on the whole corpus rather than on individual documents, and then to project texts into hash codes using the latent features. The major advantages of SSHash are that 1) it alleviates the sparsity problem in short texts, because the latent features are obtained from the whole corpus rather than at the document level; and 2) it can perform similarity matching interactively in real time by introducing a hashing method. We carry out extensive experiments on real-world short texts. The results demonstrate that our method significantly outperforms baseline methods on several evaluation metrics.
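The general idea of projecting latent topic features into hash codes and matching by Hamming distance can be illustrated with a minimal sketch. This is not the SSHash implementation: the abstract does not specify the topic model, the binarization rule, or the code length, so the toy `WORD_TOPICS` table, the averaging in `latent_features`, and the fixed threshold in `hash_code` are all assumptions made purely for illustration.

```python
from typing import Dict, List

# Hypothetical word -> topic weights (toy values). They stand in for latent
# features that a topic model trained on the whole corpus would provide.
WORD_TOPICS: Dict[str, List[float]] = {
    "goal":   [0.9, 0.1, 0.0, 0.0],
    "match":  [0.7, 0.1, 0.1, 0.1],
    "team":   [0.6, 0.2, 0.1, 0.1],
    "stock":  [0.0, 0.1, 0.8, 0.1],
    "market": [0.1, 0.0, 0.7, 0.2],
    "price":  [0.0, 0.1, 0.6, 0.3],
}
NUM_TOPICS = 4


def latent_features(text: str) -> List[float]:
    """Average the topic weights of the known words in a short text (assumed rule)."""
    words = [w for w in text.lower().split() if w in WORD_TOPICS]
    if not words:
        return [0.0] * NUM_TOPICS
    return [
        sum(WORD_TOPICS[w][k] for w in words) / len(words)
        for k in range(NUM_TOPICS)
    ]


def hash_code(text: str, threshold: float = 1.0 / NUM_TOPICS) -> int:
    """Binarize each latent dimension against a fixed threshold (assumed rule)."""
    bits = 0
    for value in latent_features(text):
        bits = (bits << 1) | (1 if value > threshold else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hash codes."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    query = hash_code("goal team")
    for candidate in ("team match", "stock price"):
        print(candidate, hamming(query, hash_code(candidate)))
```

Because the comparison reduces to an XOR and a bit count on small integers, matching stays cheap even over many candidates, which is the property the abstract attributes to the hashing step.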
Short Text; Semantically Similar Matching; Topic Model; Hash
Jiaming Xu, Pengcheng Liu, Gaowei Wu, Zhengya Sun, Bo Xu, Hongwei Hao
Institute of Automation, Chinese Academy of Sciences, 100190, Beijing, P.R. China
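International conference
Second CCF Conference, NLPCC 2013 (2nd Conference on Natural Language Processing and Chinese Computing)
Chongqing
English
299-309
2013-11-15 (date first posted online on the Wanfang platform; not the paper's publication date)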