Scaling Information-Theoretic Text Clustering: A Sampling-based Approximate Method
Info-Kmeans, a K-means clustering method that uses KL-divergence as the proximity function, is one of the representative methods in information-theoretic clustering. With the explosive growth of online texts such as online reviews and user-generated content, text data is becoming sparser and much larger, which poses significant challenges to both the effectiveness and the efficiency of text clustering. In our prior work, we presented a Summation-bAsed Incremental Learning (SAIL) algorithm, which avoids the zero-feature dilemma of highly sparse texts. In this paper, we propose a sampling-based approximate approach for scaling the SAIL algorithm to large-scale text collections. In particular, instance-level random sampling is used to reduce the number of instances examined during each iteration, which substantially speeds up clustering on big text data. Furthermore, we prove that the margin of error introduced by random sampling can be kept within a small range. Extensive experiments on eight real-life text datasets demonstrate the advantage of the proposed sampling-based approximate clustering method, which shows merits in both effectiveness and efficiency.
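To make the idea in the abstract concrete, the following is a minimal sketch of K-means with KL-divergence as the proximity function, in which only a random subset of instances is re-examined in each iteration. It is an illustration under stated assumptions, not the SAIL algorithm or the authors' sampling scheme: the function names, the `sample_ratio` parameter, the round-robin initialization, and the `eps` clamp used for zero features are all hypothetical choices made for this example.

```python
import math
import random


def normalize(counts):
    """Turn a non-empty term-count vector into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p, q, eps=1e-12):
    """D(p || q) = sum_i p_i * log(p_i / q_i); terms with p_i == 0 contribute 0.

    Clamping q_i at eps is an assumed stand-in for handling zero features;
    it is not how SAIL handles them.
    """
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def centroids_from(docs, labels, k, previous=None):
    """Centroid of a cluster = normalized sum of its members' term counts."""
    dim = len(docs[0])
    sums = [[0.0] * dim for _ in range(k)]
    for doc, label in zip(docs, labels):
        for j, count in enumerate(doc):
            sums[label][j] += count
    # A cluster that has become empty keeps its previous centroid.
    return [normalize(s) if sum(s) > 0 else previous[c] for c, s in enumerate(sums)]


def sampled_info_kmeans(docs, k, sample_ratio=0.5, iters=20, seed=0):
    """Assumes len(docs) >= k, non-empty documents, and 0 < sample_ratio <= 1."""
    rng = random.Random(seed)
    n = len(docs)
    dists = [normalize(d) for d in docs]
    labels = [i % k for i in range(n)]  # round-robin initial partition
    sample_size = max(1, int(sample_ratio * n))
    centroids = None
    for _ in range(iters):
        centroids = centroids_from(docs, labels, k, centroids)
        # Only a random subset of instances is examined in this iteration.
        for i in rng.sample(range(n), sample_size):
            labels[i] = min(range(k), key=lambda c: kl_divergence(dists[i], centroids[c]))
    return labels
```

With `sample_ratio=1.0` the sketch reduces to an ordinary batch K-means pass under KL-divergence; smaller ratios trade some assignment accuracy per iteration for less work, which is the trade-off the abstract says can be bounded.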
Text Clustering; K-means; KL-divergence; Random Sampling; Approximate Algorithm
Zhexi Xu, Zhiang Wu, Jie Cao, Hengnong Xuan
School of Information Engineering, Nanjing University of Finance and Economics, Nanjing, China; Jiangsu Provincial Key Lab. of E-Business, Nanjing University of Finance and Economics, China
International conference
2014 2nd International Conference on Advanced Cloud and Big Data (CBD 2014)
Huangshan, Anhui
English
18-25
2014-11-20 (date first posted online on the Wanfang platform; does not represent the paper's publication date)