Conference Topic

An Improved Similarity Measure for Chinese Text Clustering

  Measuring the similarity between documents is a pivotal step in text processing. Traditional similarity measures consider only one aspect of the text features. The new similarity measure proposed in this paper takes both the statistical information and the part of speech of feature terms into account. The relative weights of statistical and semantic information, and the importance of different parts of speech, are obtained through experiments. The K-means algorithm and its variants are widely used for text clustering, especially on large datasets. The choice of initial cluster centers is important because it affects both the number of iterations and the cluster quality. Building on previous research, we propose a new method that selects initial cluster centers by combining maximum distance and statistical features. Experiments show that the improved method raises cluster quality in terms of F-measure and takes less time.
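  As an illustration only, the maximum-distance idea mentioned in the abstract can be sketched with the generic farthest-first (max-min distance) rule. This is not the authors' method: the abstract does not specify how the statistical features enter the selection, so that part is omitted. The function name max_distance_centers, the Euclidean distance, and the choice of the first center (the point farthest from the overall mean) are all assumptions made for this sketch.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def max_distance_centers(points, k):
    """Pick k initial centers with the farthest-first (max-min distance) rule.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and len(points)")

    # Assumption: the first center is the point farthest from the overall mean.
    dim = len(points[0])
    mean = [sum(p[i] for p in points) / len(points) for i in range(dim)]
    first = max(range(len(points)), key=lambda i: euclidean(points[i], mean))
    chosen = [first]

    # nearest[i] holds the distance from point i to its closest chosen center.
    nearest = [euclidean(p, points[first]) for p in points]

    while len(chosen) < k:
        # The next center is the point farthest from all centers chosen so far.
        nxt = max(range(len(points)), key=lambda i: nearest[i])
        chosen.append(nxt)
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], euclidean(p, points[nxt]))

    return [points[i] for i in chosen]
```

  The returned points could then be passed to any K-means implementation as its starting centers. Spreading the seeds apart in this way is one common way to reduce the number of iterations, which matches the motivation stated in the abstract.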

similarity measure; part of speech; initial cluster center; LDA; TF-IDF

Shaolei Zhang, Zhong Wang, Wei Huang

Xian Research Institute of High Technology, Xian, China

International conference

2016 2nd International Conference on Mechanical, Electronic and Information Technology Engineering (2016机械、电子和信息技术国际会议)

Chongqing

English

141-144

2016-03-21 (date first made available online on the Wanfang platform; does not represent the paper's publication date)