Conference Topic

The Checkpoint-Timing for Backward Fault-Tolerant Schemes

  To improve the performance of backward fault-tolerant schemes in long-running parallel applications, a general checkpoint-timing method is proposed that determines unequal checkpointing intervals for an arbitrary failure rate, in order to reduce the total execution time. First, a new model is introduced to evaluate the mean expected execution time. Second, the optimality condition for a constant failure rate is derived from this model, from which the optimal equal checkpointing interval can be obtained easily. Subsequently, a general method is derived to determine the checkpoint timing for other failure rates. The final results show that the proposal is practical for trading off the re-processing overhead against the checkpointing overhead in backward fault-tolerant schemes.
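
As a rough illustration of the constant-failure-rate case, the sketch below uses Young's classical first-order approximation rather than the model of this paper, whose equations are not reproduced in this record. It assumes a constant failure rate `failure_rate` (failures per second) and a fixed cost `checkpoint_cost` (seconds) per checkpoint; both names, and the example values, are hypothetical. Under these assumptions the wasted fraction of time is approximately `checkpoint_cost / interval + failure_rate * interval / 2`, which is smallest at `interval = sqrt(2 * checkpoint_cost / failure_rate)`.

```python
import math


def young_interval(checkpoint_cost: float, failure_rate: float) -> float:
    """Equal checkpoint interval from Young's first-order approximation.

    Assumption: constant failure rate and a checkpoint cost that is small
    compared with the mean time between failures (1 / failure_rate).
    """
    return math.sqrt(2.0 * checkpoint_cost / failure_rate)


def overhead_fraction(interval: float, checkpoint_cost: float,
                      failure_rate: float) -> float:
    """Approximate wasted time per unit of useful work.

    First term: checkpointing overhead (one checkpoint per interval).
    Second term: expected re-processing overhead (on average half an
    interval of work is lost per failure).
    """
    return checkpoint_cost / interval + failure_rate * interval / 2.0


if __name__ == "__main__":
    # Hypothetical values: 60 s per checkpoint, one failure per day on average.
    cost = 60.0
    rate = 1.0 / 86400.0
    best = young_interval(cost, rate)
    for interval in (0.5 * best, best, 2.0 * best):
        waste = overhead_fraction(interval, cost, rate)
        print(f"interval = {interval:9.1f} s, overhead = {waste:.4f}")
```

Shorter intervals raise the checkpointing term and longer intervals raise the re-processing term, which is the trade-off the abstract refers to. For a failure rate that varies over time, a single equal interval of this kind no longer applies, which is the case the paper's general method addresses.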

Parallel computation; Fault tolerance; Checkpointing; Failure rate

Min Zhang

Lianyungang JARI Electronics Co., Ltd. of CSIC, Lianyungang, China

International conference

The 12th Conference on Advanced Computer Architecture (ACA 2018) (2018 National Annual Conference on Computer Architecture)

Yingkou, Liaoning

English

210-218

2018-08-10 (date first posted online on the Wanfang platform; not the paper's publication date)