A Parallel Algorithm for Mining Frequent Itemsets Based on MapReduce

Abstract:

In big data analysis, frequent itemset mining plays a key role in mining associations, correlations, and causality. Because some traditional frequent itemset mining algorithms cannot handle massive small-files datasets effectively, suffering from high memory cost, high I/O overhead, and low computing performance, we propose a novel parallel frequent itemset mining algorithm based on the FP-Growth algorithm and discuss its applications. First, we introduce a small-files processing strategy for massive small-files datasets to compensate for Hadoop's low read-write speed and low processing efficiency on such data. Second, we use MapReduce to redesign the FP-Growth algorithm for parallel computing, thereby improving the overall performance of frequent itemset mining. Finally, we apply the proposed algorithm to association analysis of data from China's national college entrance examination and admission. The experimental results show that the proposed algorithm is feasible and valid, achieves good speedup and higher mining efficiency, and can meet the practical requirements of frequent itemset mining for massive small-files datasets.

Keywords: Big data analysis; Frequent itemsets mining; Parallel FP-Growth; Small files problem; Hadoop; MapReduce

Authors: Dawen Xia, Zhuobo Rong, Yanhui Zhou, Zili Zhang

Affiliation: School of Computer and Information Science, Southwest University, No.2, Tiansheng Road, Beibei Distric

Conference type: International conference

Conference name: 12th National Doctoral Students Academic Annual Conference, Computer Science and Technology Session

Conference location: Kunming

Conference language: English

Pages: 359-367

Online publication date: 2014-05-01 (date first posted on the Wanfang platform; not the paper's publication date)
