A Parallel Algorithm for Mining Frequent Itemsets Based on MapReduce
In big data analysis, frequent itemset mining plays a key role in discovering associations, correlations, and causality. Some traditional frequent itemset mining algorithms cannot handle datasets made up of massive numbers of small files effectively: they suffer from high memory cost, high I/O overhead, and low computing performance. In this paper, we therefore propose a novel parallel frequent itemset mining algorithm based on the FP-Growth algorithm and discuss its applications. First, we introduce a small-file processing strategy for such datasets to compensate for the low read-write speed and low processing efficiency of Hadoop. Second, we use MapReduce to redesign the FP-Growth algorithm for parallel computing, thereby improving the overall performance of frequent itemset mining. Finally, we apply the proposed algorithm to association analysis of data from China's national college entrance examination and admission process. The experimental results show that the proposed algorithm is feasible and effective, achieving good speedup and higher mining efficiency, and that it can meet the practical requirements of frequent itemset mining on massive small-file datasets.
Big data analysis; Frequent itemsets mining; Parallel FP-Growth; Small files problem; Hadoop MapReduce
Dawen Xia, Zhuobo Rong, Yanhui Zhou, Zili Zhang
School of Computer and Information Science, Southwest University, No.2, Tiansheng Road, Beibei Distric
International conference
Kunming
English
359-367
2014-05-01 (date first posted online on the Wanfang platform; not the paper's publication date)