Reducing Communication Overhead in the High Performance Conjugate Gradient Benchmark on Tianhe-2
The High Performance Conjugate Gradient (HPCG) benchmark, proposed in 2013, has drawn increasing attention from both academia and industry. Unlike the High Performance Linpack (HPL) benchmark, which has a very high computation-to-communication ratio, HPCG contains both neighboring and global communication that may severely degrade parallel performance. To reduce the overhead of neighboring communication, we overlap halo updates with halo-independent computations. To hide the cost of the global reductions in vector dot products, we make use of two reformulated CG algorithms, namely Gropp’s asynchronous CG and pipelined CG. Further optimizations are applied to decrease the extra overhead introduced by the reformulated CG algorithms. We show by experiments on Tianhe-2, the world’s largest heterogeneous system, that the optimized HPCG code scales to 256 nodes (49,920 cores) with a nearly ideal weak scalability of over 90% and an aggregate performance of 10.51 Tflops.
HPCG; communication-computation overlap; pipelined CG; asynchronous CG; Tianhe-2
Fangfang Liu, Chao Yang, Yiqun Liu, Xianyi Zhang, Yutong Lu
Institute of Software, Chinese Academy of Sciences, Beijing 100190, China Institute of Software, Chinese Academy of Sciences, Beijing 100190, China; State Key Laboratory of Compu Institute of Software, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy Dept. of Computer Science & Technology, National University of Defense Technology, Changsha, Hunan 41007
International conference
Xianning, Hubei
English
13-18
2014-11-24 (date first posted online on the Wanfang platform; does not represent the paper's publication date)