Clustering Algorithm Based on Condensed Set Dissimilarity for High Dimensional Sparse Data of Categorical Attributes
Clustering categorical data is challenging, especially when the data are high-dimensional and sparse. This paper proposes a new algorithm, named CABOC, for clustering high-dimensional sparse data with categorical attributes. Based on a newly defined concept, Condensed Set Dissimilarity, the algorithm directly computes the dissimilarity of all the objects with sparse categorical attributes in a set. Furthermore, during the computation the algorithm records only a Condensed Set Reduction vector for the set, which is defined to represent, simply and accurately, the information about the objects in the set that the clustering requires. As a result, the computational complexity of the algorithm is low. A numeric example of customer cluster analysis illustrates the effectiveness of the algorithm.
categorical attributes; high dimensional sparse data; Condensed Set Dissimilarity; Condensed Set Reduction vector
Sen Wu, Juanjuan Liu, Guiying Wei
School of Economics and Management, University of Science and Technology Beijing, Beijing, China
International conference
2011 3rd International Conference on Advanced Computer Control (IEEE ICACC 2011)
Harbin
English
445-448
2011-01-18 (date first posted online on the Wanfang platform; not the paper's publication date)