Conference topic

CDMCR: Multi-level Fault-tolerant System for Distributed Applications in Cloud

  Cloud computing gives users a new model for utilizing computing infrastructure, with the ability to perform parallel and distributed computations on large, elastic virtual clusters. However, the multi-level structure and complexity of cloud computing systems make them more prone to failure. In this paper we present CDMCR, a multi-level fault-tolerant system for distributed applications in the cloud. CDMCR periodically backs up the complete state of applications, including file system state, using a snapshot-based distributed checkpointing protocol. Thus, we can not only recover processes but also roll back data. We propose a multi-level recovery strategy that comprises process-level recovery, virtual machine (VM) recreation, and host rescheduling, enabling comprehensive and efficient fault tolerance for the different components of a cloud. We deploy CDMCR as PaaS, so that users are freed from node management and system configuration and can conveniently access the fault-tolerance service. We have implemented the system on the Xen virtualization platform and the OpenNebula cloud platform. Experiments on the prototype demonstrate the correctness of our system. Analysis shows that CDMCR causes neither message loss nor data loss, and that the backup time remains nearly constant as the number of nodes in the virtual cluster increases.

Cloud; Virtual cluster; Distributed applications; Fault-tolerant

Domestic conference

8th Chinese Conference on Trusted Computing and Information Security (第八届中国可信计算与信息安全学术会议)

Enshi, Hubei

English

1-10

2014-09-13 (date first posted online on the Wanfang platform; not the paper's publication date)