Conference Special Topic

A Scalable Checkpoint Encoding Algorithm for Diskless Checkpointing

Diskless checkpointing is an efficient technique to save the state of a long-running application in a distributed environment without relying on stable storage. In this paper, we introduce several scalable encoding strategies into diskless checkpointing and reduce the overhead to survive k failures in p processes from 2 log p · k((β+2γ)m+α) to (1+O(1/√m)) · k(β+2γ)m, where α is the communication latency, 1/β is the network bandwidth between processes, 1/γ is the rate to perform calculations, and m is the size of the local checkpoint per process. The introduced algorithm is scalable in the sense that the overhead to survive k failures in p processes does not increase as the number of processes p increases. We evaluate the performance overhead of the introduced algorithm by using a preconditioned conjugate gradient equation solver as an example. Experimental results demonstrate that the introduced techniques are highly scalable.
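The two overhead expressions in the abstract can be compared numerically. The following is a minimal sketch, not the authors' implementation. It assumes the logarithm is base 2 and models the hidden constant in the O(1/√m) term as a hypothetical parameter c. The function names and all parameter values are illustrative assumptions, not measurements from the paper.

```python
import math


def baseline_overhead(p, k, m, alpha, beta, gamma):
    """Baseline cost from the abstract: 2 log p * k((beta + 2*gamma)*m + alpha).

    Assumption: the logarithm is base 2.
    """
    return 2 * math.log2(p) * k * ((beta + 2 * gamma) * m + alpha)


def scalable_overhead(k, m, beta, gamma, c=1.0):
    """Reduced cost from the abstract: (1 + O(1/sqrt(m))) * k(beta + 2*gamma)*m.

    Assumption: the O(1/sqrt(m)) term is modeled as c / sqrt(m), where c is
    a hypothetical constant that the abstract does not specify.
    """
    return (1 + c / math.sqrt(m)) * k * (beta + 2 * gamma) * m


# Illustrative (made-up) parameters: latency, inverse bandwidth, inverse
# compute rate, checkpoint size in bytes, and number of tolerated failures.
alpha, beta, gamma = 1e-5, 1e-9, 5e-10
m, k = 10**8, 2

for p in (64, 1024, 16384):
    base = baseline_overhead(p, k, m, alpha, beta, gamma)
    scal = scalable_overhead(k, m, beta, gamma)
    print(f"p={p}: baseline={base:.3f}s, scalable={scal:.3f}s")
```

Under these assumptions the baseline grows with log p, while the second expression has no dependence on p, which is the sense in which the abstract calls the algorithm scalable.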

Checkpoint; diskless checkpointing; fault tolerance; high performance computing; parallel and distributed systems; Reed-Solomon encoding

Zizhong Chen; Jack Dongarra

Colorado School of Mines, Department of Mathematical and Computer Sciences, Golden, CO 80401-1887, USA; University of Tennessee, Knoxville, Department of Electrical Engineering and Computer Science, Knoxville,

International conference

11th IEEE High Assurance Systems Engineering Symposium (HASE 2008)

Nanjing

English

71-79

2008-12-03 (date first posted online on the Wanfang platform; not the paper's publication date)