An Improved Similarity Measure for Chinese Text Clustering
Measuring the similarity between documents is a pivotal step in text processing. Traditional similarity measures consider only one aspect of the text features. The similarity measure proposed in this paper takes into account both the statistical information and the part of speech of feature terms. The relative weight of statistical and semantic information, and the importance of each part of speech, are determined experimentally. The K-means algorithm and its variants are widely used for text clustering, especially on large datasets. The choice of initial cluster centers is important because it affects both the number of iterations and the cluster quality. Building on previous research, we propose a new method that selects initial cluster centers by combining maximum distance with statistical features. Experiments show that the improved method raises cluster quality in terms of F-measure and takes less time.
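The abstract does not give the formulas, so the following is only a minimal sketch of the two ideas it names, under stated assumptions: a similarity that blends a statistical score with a semantic score, with term weights scaled by part of speech, and a maximum-distance choice of initial cluster centers. The names `POS_WEIGHT`, `weighted_vector`, `combined_similarity`, `farthest_first_centers`, the mixing parameter `alpha`, and all numeric weights are hypothetical. The paper obtains its proportions experimentally and also uses statistical features when choosing centers, which this sketch omits.

```python
import math

# Hypothetical part-of-speech weights (noun, verb, adjective).
# The paper determines such weights by experiment; these values are placeholders.
POS_WEIGHT = {"n": 1.0, "v": 0.6, "a": 0.3}
DEFAULT_POS_WEIGHT = 0.1  # assumed fallback for any other tag


def weighted_vector(tfidf, pos_tags):
    """Scale each term's TF-IDF weight by the weight of its part of speech."""
    return {
        term: w * POS_WEIGHT.get(pos_tags.get(term), DEFAULT_POS_WEIGHT)
        for term, w in tfidf.items()
    }


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def combined_similarity(stat_u, stat_v, sem_u, sem_v, alpha=0.5):
    """Blend a statistical similarity with a semantic one.

    alpha is an assumed mixing proportion; the semantic vectors could be,
    for example, topic distributions, but that choice is also an assumption.
    """
    return alpha * cosine(stat_u, stat_v) + (1.0 - alpha) * cosine(sem_u, sem_v)


def farthest_first_centers(vectors, k):
    """Pick k initial centers by maximum distance (farthest-first traversal).

    Starts from document 0 (an arbitrary assumption), then repeatedly adds
    the document whose nearest chosen center is farthest away.
    """
    centers = [0]
    while len(centers) < k:
        candidates = [i for i in range(len(vectors)) if i not in centers]
        best = max(
            candidates,
            key=lambda i: min(1.0 - cosine(vectors[i], vectors[c]) for c in centers),
        )
        centers.append(best)
    return centers
```

The indices returned by `farthest_first_centers` could seed any K-means implementation that accepts explicit initial centers. Spreading the seeds apart is one common way to reduce the number of iterations, which matches the motivation stated in the abstract.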
similarity measure; part of speech; initial cluster center; LDA; TF-IDF
Shaolei Zhang, Zhong Wang, Wei Huang
Xian Research Institute of High Technology, Xian, China
International conference
Chongqing
English
141-144
2016-03-21 (date first posted online on the Wanfang platform; not the paper's publication date)