An improved Similarity Measure For Chinese Text Clustering

Abstract:

Measuring similarity between documents is a pivotal step in text processing. Traditional similarity measures consider only one aspect of the text features. The similarity measure proposed in this paper takes both the statistical information and the part of speech of feature terms into account. The relative proportion of statistical and semantic information, and the importance of different parts of speech, are obtained through experiment. The K-means algorithm and its variants are widely used for text clustering, especially on large datasets. The choice of initial cluster centers is important, as it can affect the number of iterations and the cluster quality. We propose a new method, based on previous research, that selects initial cluster centers by combining maximum distance and statistical features. The experiments show that the improved method improves cluster quality in terms of F-measure and takes less time.
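
The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no formulas, so the part-of-speech weights in POS_WEIGHT, the default weight, and the function names are all hypothetical. The sketch scales each term's statistical score (for example TF-IDF) by a part-of-speech weight, compares documents with cosine similarity, and picks initial centers using only the maximum-distance part of the selection rule. The statistical features that the paper combines with maximum distance are not modeled here.

```python
import math

# Hypothetical part-of-speech weights (noun, verb, adjective).
# The paper obtains its weights experimentally; these values are placeholders.
POS_WEIGHT = {"n": 1.0, "v": 0.8, "a": 0.5}


def weighted_vector(term_scores, term_pos, default_weight=0.3):
    """Scale each term's statistical score (e.g. TF-IDF) by its POS weight."""
    return {
        term: score * POS_WEIGHT.get(term_pos.get(term), default_weight)
        for term, score in term_scores.items()
    }


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def farthest_first_centers(vectors, k):
    """Pick k initial centers by maximum distance (lowest similarity).

    Starts from document 0 (an arbitrary assumption), then repeatedly adds
    the document whose highest similarity to any chosen center is lowest.
    """
    k = min(k, len(vectors))
    centers = [0] if k > 0 else []
    while len(centers) < k:
        candidates = (i for i in range(len(vectors)) if i not in centers)
        best = min(
            candidates,
            key=lambda i: max(cosine(vectors[i], vectors[c]) for c in centers),
        )
        centers.append(best)
    return centers
```

The returned indices could then seed any K-means implementation that accepts explicit initial centers.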

Keywords: similarity measure; part of speech; initial cluster center; LDA; TF-IDF

Authors: Shaolei Zhang, Zhong Wang, Wei Huang

Affiliation: Xian Research Institute of High Technology, Xian, China

Conference type: International conference

Conference name: 2016 2nd International Conference on Mechanical, Electronic and Information Technology Engineering

Conference location: Chongqing

Conference language: English

Pages: 141-144

Online publication date: 2016-03-21 (date first posted on the Wanfang platform; not the paper's publication date)
