Partial Task Shuffle First Strategy for Spark
Apache Spark is an in-memory distributed computing framework that is better suited to iterative jobs than MapReduce. However, the shuffle process requires synchronization of tasks across nodes, which can waste the cluster's computing resources and ultimately degrade its computing performance; this is an important factor limiting the performance of Spark. In this paper, we propose a Partial Task Shuffle First (PTSF) strategy that dynamically generates Shuffle Write tasks and performs shuffle operations on partially completed tasks. The strategy increases the degree of parallelism between data computation and transmission, lowering the peak load of the shuffle stage and keeping the cluster more balanced during execution. Finally, experiments show that the proposed strategy can improve shuffle execution efficiency.
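The paper's PTSF implementation is not reproduced in this record, but its core idea, starting shuffle writes for tasks that have already finished instead of waiting for the whole stage to complete, can be sketched with a minimal simulation. Everything below (the function name, the thread-based writer, the queue) is an illustrative assumption, not the authors' code:

```python
import threading
import queue

def run_stage(num_tasks, partial_shuffle=True):
    """Toy simulation of overlapping shuffle writes with map-task completion.

    With partial_shuffle=True (the PTSF-like case), a shuffle-writer thread
    consumes finished tasks as soon as they complete, so computation and
    data transfer proceed in parallel.  With partial_shuffle=False (the
    classic case), the writer only starts after every task has finished,
    producing the peak load at the shuffle stage that PTSF aims to avoid.
    """
    done = queue.Queue()   # finished map tasks waiting to be shuffle-written
    shuffled = []          # tasks whose output has been written/transferred

    def shuffle_writer():
        for _ in range(num_tasks):
            shuffled.append(done.get())  # write one task's output at a time

    writer = threading.Thread(target=shuffle_writer)
    if partial_shuffle:
        writer.start()                   # writer runs alongside the map tasks
    for task_id in range(num_tasks):
        done.put(task_id)                # a map task finishes
    if not partial_shuffle:
        writer.start()                   # classic: shuffle only after the stage
    writer.join()
    return shuffled

result = run_stage(4)
```

Both modes produce the same shuffled output; the difference PTSF targets is *when* the transfer work happens relative to the computation, not *what* is transferred.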
Big data; Spark; shuffle; task
Tianlei Zhou, Yuyang Wang
School of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing 400065, China; Chongqing Engineering Research Center of Mobile Internet Data Application, Chongqing 400065, China
International conference
Chongqing
English
66-72
2019-05-30 (date the paper first appeared on the Wanfang platform; does not indicate the paper's publication date)