A Scalable Checkpoint Encoding Algorithm for Diskless Checkpointing

Abstract:

Diskless checkpointing is an efficient technique for saving the state of a long-running application in a distributed environment without relying on stable storage. In this paper, we introduce several scalable encoding strategies into diskless checkpointing and reduce the overhead to survive k failures in p processes from 2 log p · k((β+2γ)m + α) to (1 + O(1/√m)) · k(β+2γ)m, where α is the communication latency, 1/β is the network bandwidth between processes, 1/γ is the rate at which calculations are performed, and m is the size of the local checkpoint per process. The introduced algorithm is scalable in the sense that the overhead to survive k failures in p processes does not increase as the number of processes p increases. We evaluate the performance overhead of the introduced algorithm using a preconditioned conjugate gradient equation solver as an example. Experimental results demonstrate that the introduced techniques are highly scalable.
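The two overhead expressions in the abstract can be compared numerically. The following is a minimal sketch, not the authors' code: the function names, the base-2 logarithm, and the constant `c` standing in for the unspecified factor hidden in O(1/√m) are all assumptions made for illustration.

```python
import math


def baseline_overhead(p, k, m, alpha, beta, gamma):
    """Baseline model from the abstract: 2 log p * k((beta + 2*gamma)*m + alpha).

    Assumption: the logarithm is taken base 2.
    """
    return 2 * math.log2(p) * k * ((beta + 2 * gamma) * m + alpha)


def scalable_overhead(k, m, beta, gamma, c=1.0):
    """Improved model from the abstract: (1 + O(1/sqrt(m))) * k(beta + 2*gamma)*m.

    Assumption: c is a hypothetical constant replacing the O(.) term;
    the abstract does not give its value.
    """
    return (1 + c / math.sqrt(m)) * k * (beta + 2 * gamma) * m


if __name__ == "__main__":
    # Illustrative parameter values only; they are not taken from the paper.
    alpha, beta, gamma = 1e-5, 1e-9, 1e-10
    k, m = 2, 10**8
    for p in (64, 1024, 16384):
        print(p,
              baseline_overhead(p, k, m, alpha, beta, gamma),
              scalable_overhead(k, m, beta, gamma))
```

Under these assumptions the baseline cost grows with log p, while the improved expression has no dependence on p, which is the sense of "scalable" used in the abstract.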

Keywords: checkpoint, diskless checkpointing, fault tolerance, high performance computing, parallel and distributed systems, Reed-Solomon encoding

Authors: Zizhong Chen, Jack Dongarra

Affiliations: Colorado School of Mines, Department of Mathematical and Computer Sciences, Golden, CO 80401-1887, USA; University of Tennessee, Knoxville, Department of Electrical Engineering and Computer Science, Knoxville,

Conference type: International conference

Conference name: 11th IEEE High Assurance Systems Engineering Symposium (HASE 2008)

Conference location: Nanjing

Conference language: English

Pages: 71-79

Online publication date: 2008-12-03 (date first posted on the Wanfang platform; not the paper's publication date)
