A PARALLEL JOIN ALGORITHM FOR BIG DATA BASED ON MAPREDUCE

Abstract:

Data mining is an important aspect of cloud computing, which allows huge amounts of data to be processed. MapReduce on Hadoop is recognized as a popular way to handle data in the cloud environment because of its excellent scalability and good fault tolerance. Join is a very important operation in databases and data analysis, but it also involves many difficulties. Among them, optimizing join methods to improve performance is the most essential. Many studies focus on the map or reduce functions, but the Hadoop framework provides many other mechanisms for data processing. In this paper, we analyze in detail the procedure between map and reduce, which is called shuffle, identify several points in the shuffle that can be optimized, and propose a new approach to join. Taking full advantage of the MapReduce mechanism, we arrive at a new join algorithm called Value-Sort Join. A performance comparison shows that this algorithm performs better.
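
For orientation, the general idea of letting the shuffle do part of the join work can be sketched in plain Python. This is only an illustrative, generic reduce-side join with a sorted source tag, not the paper's Value-Sort Join, whose details are not given in this record. The function names (`map_phase`, `shuffle`, `reduce_phase`, `join`) and the sample tables are hypothetical, and the `shuffle` function is a single-process stand-in for Hadoop's sort-and-group step.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(left, right):
    """Emit ((join_key, tag), row); tag 0 marks the left table, 1 the right."""
    for key, row in left:
        yield (key, 0), row
    for key, row in right:
        yield (key, 1), row


def shuffle(pairs):
    """Stand-in for the shuffle: sort by (key, tag), then group by key only."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=lambda kv: kv[0][0]):
        yield key, [(tag, row) for (_, tag), row in group]


def reduce_phase(key, tagged_rows):
    """Left rows arrive first because of the sort, so only they are buffered."""
    left_rows = []
    for tag, row in tagged_rows:
        if tag == 0:
            left_rows.append(row)
        else:
            for left_row in left_rows:
                yield key, left_row, row


def join(left, right):
    """Run the three simulated phases and collect the joined records."""
    result = []
    for key, tagged_rows in shuffle(map_phase(left, right)):
        result.extend(reduce_phase(key, tagged_rows))
    return result


# Hypothetical sample data: (join_key, payload) pairs.
users = [(1, "alice"), (2, "bob")]
orders = [(1, "book"), (1, "pen"), (3, "lamp")]
joined = join(users, orders)
```

Because the sort places every left-table row ahead of the right-table rows for the same key, the reducer can stream the right-table rows instead of holding both sides in memory.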

Keywords: MapReduce; Parallel calculation; Big data; Join

Authors: Xinxin Ge, Bin Wu, Jia Huang, Yu Jia, Yahong Guo

Affiliation: Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia, Beijing University Institute of Network Technology, Beijing University of Posts and Telecommunications, Beijing 100876, Ch

Conference type: International conference

Conference name: 2012 2nd IEEE International Conference on Cloud Computing and Intelligence Systems (IEEE CCIS2012)

Conference location: Hangzhou

Conference language: English

Pages: 414-418

Online publication date: 2012-10-30 (date first made available on the Wanfang platform; it does not represent the paper's publication date)
