A PARALLEL JOIN ALGORITHM FOR BIG DATA BASED ON MAPREDUCE
Data mining is an important aspect of cloud computing, which allows huge amounts of data to be processed. MapReduce on Hadoop is a popular way to handle data in the cloud environment because of its excellent scalability and good fault tolerance. Join is a very important operation in databases and data analysis, but it also involves many difficulties. Among them, how to optimize join methods and thereby improve their performance is the most essential. Many studies focus on the map or reduce functions, but the Hadoop framework provides many other mechanisms for data processing. In this paper, we analyze the detailed procedure between map and reduce, called shuffle, identify several points in shuffle that can be optimized, and propose a new approach to join. Taking full advantage of the MapReduce mechanism, we arrive at a new join algorithm called Value-Sort Join. The performance comparison shows that this algorithm performs better.
MapReduce; Parallel calculation; Big data; Join
Xinxin Ge, Bin Wu, Jia Huang, Yu Jia, Yahong Guo
Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia, Beijing University Institute of Network Technology, Beijing University of Posts and Telecommunications, Beijing 100876, Ch
International conference
Hangzhou
English
414-418
2012-10-30 (date first posted online on the Wanfang platform; not the paper's publication date)