HiAL-Ckpt: A Hierarchical Application-Level Checkpointing for CPU-GPU Hybrid Systems
Owing to their powerful computing capacity and high energy efficiency, GPUs (graphics processing units) have become a focus of HPC (high-performance computing) research, and CPU-GPU heterogeneous parallel systems have become a new trend in supercomputer development. However, the inherent unreliability of GPU hardware degrades the reliability of such supercomputers. We study fault-tolerance (FT) techniques for CPU-GPU heterogeneous parallel systems and introduce a new checkpointing mechanism for them: hierarchical application-level checkpointing. The basic idea is to checkpoint at two independent levels, the CPU level and the GPU level, to tolerate CPU faults and GPU faults respectively. Based on this idea, we have designed and implemented a hierarchical application-level checkpointing tool, HiAL-Ckpt. With this tool, programmers insert two kinds of directives, CPU directives and GPU directives, into a program, and the compiler transforms each directive into CPU or GPU checkpointing code according to its kind. A case study of SWIM, a benchmark from the SPEC2000 benchmark suite, demonstrates the validity of the hierarchical application-level checkpointing technique. The experimental results show that the fault-tolerance time overhead of HiAL-Ckpt for SWIM is only 2.25% relative to the execution time of SWIM without any FT work.
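The two-level idea described in the abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions, not HiAL-Ckpt itself: the abstract does not give the directive syntax or the generated code, so every name here (`kernel`, `gpu_level_step`, `run`, the fault classes) is hypothetical, the "GPU kernel" is an ordinary Python function, and faults are injected by hand. It assumes that a GPU-level checkpoint keeps the kernel inputs so a faulty kernel can be re-run alone, while a CPU-level checkpoint periodically saves the whole host state.

```python
import copy


class GpuFault(Exception):
    """Simulated transient GPU fault (hypothetical, for illustration)."""


class CpuFault(Exception):
    """Simulated transient CPU fault (hypothetical, for illustration)."""


def kernel(values, inject_fault):
    # Stand-in for a GPU kernel: adds 1 to every element.
    if inject_fault:
        raise GpuFault()
    return [v + 1 for v in values]


def gpu_level_step(values, gpu_faults, step, log):
    # GPU-level checkpoint: keep a copy of the kernel inputs.
    gpu_ckpt = list(values)
    while True:
        inject = step in gpu_faults
        gpu_faults.discard(step)  # the fault is transient: it occurs once
        try:
            return kernel(list(gpu_ckpt), inject)
        except GpuFault:
            # Recover at the GPU level only: re-run the kernel from its inputs.
            log.append(("gpu_restore", step))


def run(total_steps, cpu_interval, gpu_faults=(), cpu_faults=()):
    gpu_faults, cpu_faults = set(gpu_faults), set(cpu_faults)
    log = []
    state = {"step": 0, "values": [0, 0, 0]}
    cpu_ckpt = copy.deepcopy(state)  # CPU-level checkpoint of the host state
    while state["step"] < total_steps:
        try:
            step = state["step"]
            if step in cpu_faults:
                cpu_faults.discard(step)
                raise CpuFault()
            state["values"] = gpu_level_step(state["values"], gpu_faults, step, log)
            state["step"] = step + 1
            if state["step"] % cpu_interval == 0:
                cpu_ckpt = copy.deepcopy(state)
        except CpuFault:
            # Recover at the CPU level: roll back to the last host checkpoint.
            log.append(("cpu_restore", cpu_ckpt["step"]))
            state = copy.deepcopy(cpu_ckpt)
    return state["values"], log
```

For example, `run(6, 2, gpu_faults={3}, cpu_faults={5})` finishes with the same values as a fault-free run. The GPU fault at step 3 costs only one kernel re-execution, whereas the CPU fault at step 5 rolls the host back to the checkpoint taken at step 4. Keeping the two levels independent in this way means a GPU fault does not force a full host rollback.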
GPU; heterogeneous systems; fault-tolerance; checkpointing
Xinhai Xu, Yufei Lin, Tao Tang, Yisong Lin
National Laboratory for Parallel and Distributed Processing, National University of Defense Technology, Changsha, China
International conference
The 5th International Conference on Computer Science & Education (ICCSE10)
Hefei
English
1288-1292
2010-08-24 (date first posted online on the Wanfang platform; not the paper's publication date)