Partial Task Shuffle First Strategy for Spark
Apache Spark is an in-memory distributed computing framework that is better suited to iterative jobs than MapReduce. However, the shuffle process requires synchronization of tasks across nodes, which can waste the cluster's computing resources and ultimately degrade its computing performance; this is an important factor limiting the performance of Spark. In this paper, we propose a Partial Task Shuffle First (PTSF) strategy that dynamically generates Shuffle Write tasks and performs shuffle operations on partially completed tasks. The strategy increases the degree of parallelism between data computation and transmission, lowering the peak load of the shuffle stage and keeping the cluster more balanced during execution. Finally, experiments show that the proposed strategy can improve shuffle execution efficiency.
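The paper's PTSF implementation is not reproduced in this record, but its core idea, starting shuffle writes for tasks that have already finished instead of waiting for the whole stage to complete, can be sketched with a minimal simulation. Everything below (the function name, the thread-based writer, the queue) is an illustrative assumption, not the authors' code:

```python
import threading
import queue

def run_stage(num_tasks, partial_shuffle=True):
    """Toy simulation of overlapping shuffle writes with map-task completion.

    With partial_shuffle=True (the PTSF-like case), a shuffle-writer thread
    consumes finished tasks as soon as they complete, so computation and
    data transfer proceed in parallel.  With partial_shuffle=False (the
    classic case), the writer only starts after every task has finished,
    producing the peak load at the shuffle stage that PTSF aims to avoid.
    """
    done = queue.Queue()   # finished map tasks waiting to be shuffle-written
    shuffled = []          # tasks whose output has been written/transferred

    def shuffle_writer():
        for _ in range(num_tasks):
            shuffled.append(done.get())  # write one task's output at a time

    writer = threading.Thread(target=shuffle_writer)
    if partial_shuffle:
        writer.start()                   # writer runs alongside the map tasks
    for task_id in range(num_tasks):
        done.put(task_id)                # a map task finishes
    if not partial_shuffle:
        writer.start()                   # classic: shuffle only after the stage
    writer.join()
    return shuffled

result = run_stage(4)
```

Both modes produce the same shuffled output; the difference PTSF targets is *when* the transfer work happens relative to the computation, not *what* is transferred.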
Big data; Spark; shuffle; task
Tianlei Zhou, Yuyang Wang
School of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing 400065, China; Chongqing Engineering Research Center of Mobile Internet Data Application, Chongqing 400065, China
International conference
Chongqing
English
66-72
2019-05-30 (date the paper first appeared on the Wanfang platform; does not indicate the paper's publication date)