Conference Special Topic

A Parallel Algorithm for Mining Frequent Itemsets Based on MapReduce

  In big data analysis, frequent itemsets mining plays a key role in mining associations, correlations, and causality. Some traditional frequent itemsets mining algorithms cannot handle datasets of massive small files effectively: they suffer from high memory cost, high I/O overhead, and low computing performance. In this paper, we therefore propose a novel parallel frequent itemsets mining algorithm based on the FP-Growth algorithm and discuss its applications. First, we introduce a small-files processing strategy for massive small-file datasets to compensate for the low read-write speed and low processing efficiency of Hadoop on such data. Second, we use MapReduce to redesign the FP-Growth algorithm for parallel computing, thereby improving the overall performance of frequent itemsets mining. Finally, we apply the proposed algorithm to association analysis of data from China's national college entrance examination and admissions. The experimental results show that the proposed algorithm is feasible and effective, achieves good speedup and higher mining efficiency, and can meet the practical requirements of frequent itemsets mining on massive small-file datasets.

Big data analysis; Frequent itemsets mining; Parallel FP-Growth; Small files problem; Hadoop MapReduce

Dawen Xia, Zhuobo Rong, Yanhui Zhou, Zili Zhang

School of Computer and Information Science, Southwest University, No.2, Tiansheng Road, Beibei District; School of Computer and Information Science, Southwest University, No.2, Tiansheng Road, Beibei District; School of Computer and Information Science, Southwest University, No.2, Tiansheng Road, Beibei District

International conference

The 12th National Doctoral Academic Annual Conference: Computer Science and Technology Session

Kunming

English

359-367

2014-05-01 (date first made available online on the Wanfang platform; not the paper's publication date)