Clustering Algorithm Based on Condensed Set Dissimilarity for High Dimensional Sparse Data of Categorical Attributes

Abstract:

Clustering categorical data is challenging, especially when the data is high dimensional and sparse. This paper proposes a new algorithm, named CABOC, for clustering high dimensional sparse data with categorical attributes. Based on a newly defined concept, Condensed Set Dissimilarity, the algorithm directly computes the dissimilarity of all the objects with sparse categorical attributes in a set. Furthermore, during the computation the algorithm records only a Condensed Set Reduction vector for the set, which is defined to represent, simply and accurately, the information about all the objects in the set that is needed for clustering. As a result, the computational complexity of the algorithm is low. A numerical example of customer cluster analysis illustrates the effectiveness of the algorithm.

Keywords: categorical attributes; high dimensional sparse data; Condensed Set Dissimilarity; Condensed Set Reduction vector

Authors: Sen Wu, Juanjuan Liu, Guiying Wei

Affiliation: School of Economics and Management, University of Science and Technology Beijing, Beijing, China

Conference type: International conference

Conference name: 2011 3rd International Conference on Advanced Computer Control (IEEE ICACC2011)

Conference location: Harbin

Conference language: English

Pages: 445-448

Online publication date: 2011-01-18 (date first posted on the Wanfang platform; not the paper's publication date)
