Spherical Credibilistic Clustering Algorithm for Text Data
Unlabelled document collections are becoming increasingly common and available, and mining such data sets is critical to information retrieval. Using words as features, text documents are often represented as high-dimensional, sparse vectors; a few thousand dimensions and a sparsity of 95 to 99% are typical. In this paper, we study a spherical credibilistic clustering algorithm for clustering such document vectors. The algorithm outputs k disjoint clusters, each with a concept vector that is the centroid of the cluster normalized to have unit Euclidean norm. The algorithm is an instance of fuzzy clustering: the credibility measure is introduced into text clustering for the first time and is combined with cosine similarity to express the degree of belongingness or compatibility. Moreover, the proposed algorithm has lower computational complexity than other fuzzy clustering algorithms, which makes it practical for clustering large data sets. The incorporation of credibility also allows the suppression of noise, which is common in text data sets. The selection of some parameters and the complexity of the algorithm are analyzed, and computational experiments are given to show the performance of the proposed algorithm.
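The two geometric ingredients named in the abstract, the concept vector and cosine similarity, can be illustrated with a minimal sketch. This is not the authors' credibilistic algorithm: the credibility-based membership step is not specified in this record and is omitted. The function names and the toy term-frequency vectors are assumptions made for illustration only.

```python
import math

def normalize(v):
    """Scale a vector to unit Euclidean norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def concept_vector(docs):
    """Centroid of the document vectors, renormalized to unit Euclidean norm."""
    dim = len(docs[0])
    centroid = [sum(d[i] for d in docs) / len(docs) for i in range(dim)]
    return normalize(centroid)

def cosine_similarity(u, v):
    """For unit-norm vectors, cosine similarity reduces to the dot product."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical toy data: two documents over a 4-term vocabulary.
cluster = [normalize(d) for d in ([2, 1, 0, 0], [1, 2, 0, 0])]
c = concept_vector(cluster)

# A document sharing the cluster's terms versus one with disjoint terms.
near = cosine_similarity(normalize([1, 1, 0, 0]), c)
far = cosine_similarity(normalize([0, 0, 3, 1]), c)
```

Because all vectors lie on the unit sphere, a document's similarity to a concept vector depends only on direction, not on document length; `near` is close to 1 while `far` is 0 in this toy case.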
Text mining; spherical credibilistic clustering; high-dimensional data; concept vectors
Xi Wang, Shiyao Chen, Jian Zhou
Department of Industrial Engineering, Tsinghua University, Beijing 100084, China; Department of Automation, Tsinghua University, Beijing 100084, China
International conference
Lushan, Jiangxi, China
English
80-87
2007-10-10 (date first made available online on the Wanfang platform; not the paper's publication date)