Spherical Credibilistic Clustering Algorithm for Text Data

Abstract:

Unlabelled document collections are becoming increasingly common and available, and mining such data sets is critical to information retrieval. Using words as features, text documents are often represented as high-dimensional and sparse vectors; a few thousand dimensions and a sparsity of 95 to 99% are typical. In this paper, we study a spherical credibility clustering algorithm for clustering such document vectors. The algorithm outputs k disjoint clusters, each with a concept vector that is the centroid of the cluster normalized to have unit Euclidean norm. The algorithm is an instance of fuzzy clustering: credibility measure is introduced into text clustering for the first time and is combined with cosine similarity to express the degree of belongingness or compatibility. Moreover, the proposed algorithm has lower computational complexity than other fuzzy clustering algorithms, which makes it practical for clustering large data sets. The incorporation of credibility also allows the suppression of noise, which is common in text data sets. The selection of some parameters and the complexity analysis are investigated, and computational experiments are given to show the performance of the proposed algorithm.
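The two geometric ingredients named in the abstract, the concept vector (cluster centroid rescaled to unit Euclidean norm) and cosine similarity, can be illustrated with a minimal sketch. This is not the authors' algorithm: the credibility-based membership computation is not described in this record and is omitted. The function names `normalize`, `concept_vector`, and `cosine_similarity` and the toy data are assumptions made for illustration only.

```python
import math

def normalize(v):
    """Scale a vector to unit Euclidean norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)
    return [x / norm for x in v]

def concept_vector(docs):
    """Centroid of the unit-normalized document vectors, rescaled to unit norm."""
    unit_docs = [normalize(d) for d in docs]
    dim = len(unit_docs[0])
    centroid = [sum(d[i] for d in unit_docs) / len(unit_docs) for i in range(dim)]
    return normalize(centroid)

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (dot product of their unit vectors)."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

# Hypothetical toy cluster of two term-frequency vectors over three terms.
cluster = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
c = concept_vector(cluster)
similarity = cosine_similarity([1.0, 0.0, 0.0], c)
```

Because every vector is projected onto the unit sphere, the cosine similarity reduces to a dot product, which is cheap to evaluate on sparse document vectors.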

Keywords: Text mining; spherical credibilistic clustering; high-dimensional data; concept vectors

Authors: Xi Wang, Shiyao Chen, Jian Zhou

Affiliations: Department of Industrial Engineering, Tsinghua University, Beijing 100084, China; Department of Automation, Tsinghua University, Beijing 100084, China

Conference type: International conference

Conference name: First China Conference on Intelligent Computing (第一届中国智能计算大会)

Conference location: Lushan, Jiangxi

Conference language: English

Pages: 80-87

Online publication date: 2007-10-10 (the date the record first went online on the Wanfang platform, not the paper's publication date)
