Conference topic

A Scalable, Low Network Overhead Data Routing Algorithm for Cluster Deduplication

  Inline cluster deduplication is widely used in disk-based backup systems and data centers to improve storage efficiency. The data routing algorithm has a crucial impact on the performance of a cluster deduplication system, including its deduplication factor, throughput, and scalability. In this paper, we propose a stateful data routing algorithm called DS-Dedupe. To make full use of the similarity in data streams, DS-Dedupe first assigns data chunks by consistent hashing, during which a super-chunk-granularity similarity index is built up in each client node to track the super-chunks that have been routed. When routing a new super-chunk, we calculate a similarity coefficient from the index and compare it with a threshold to decide whether the super-chunk should be assigned directly to a storage node or routed by the consistent hash, which strikes a sensible tradeoff between deduplication factor and network overhead. Our experimental results on two real-world datasets, Mail and Linux, demonstrate that DS-Dedupe achieves a deduplication factor as high as a stateful strategy while its communication overhead is reduced significantly, to a level as low as that of a stateless one. Besides, because data routing is performed by each client node, the system remains scalable: the metadata server is kept from becoming a bottleneck.
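The routing decision described in the abstract can be sketched as follows. This is a minimal illustration only: the ClientRouter class, the Jaccard-style similarity coefficient, the threshold value, and the fingerprint scheme are assumptions for the sketch and are not taken from the paper.

import hashlib
from bisect import bisect_right

class ClientRouter:
    """Client-side sketch of a DS-Dedupe-style routing decision (assumed details)."""

    def __init__(self, node_ids, threshold=0.5, vnodes=64):
        self.threshold = threshold
        # Per-storage-node similarity index of chunk fingerprints already
        # routed from this client (super-chunk granularity in the paper).
        self.index = {n: set() for n in node_ids}
        # Consistent-hash ring: sorted (hash_point, node_id) pairs with virtual nodes.
        self.ring = sorted(
            (self._hash(f"{n}#{v}"), n) for n in node_ids for v in range(vnodes)
        )

    @staticmethod
    def _hash(data):
        if isinstance(data, str):
            data = data.encode()
        return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")

    def _consistent_hash(self, fingerprint):
        points = [p for p, _ in self.ring]
        i = bisect_right(points, self._hash(fingerprint)) % len(self.ring)
        return self.ring[i][1]

    def route(self, chunk_fingerprints):
        """Route one super-chunk (a list of chunk fingerprints) to a storage node."""
        chunks = set(chunk_fingerprints)
        if not chunks:
            raise ValueError("empty super-chunk")
        # Assumed similarity coefficient: fraction of this super-chunk's
        # fingerprints already seen at each node.
        best_node, best_sim = None, 0.0
        for node, seen in self.index.items():
            sim = len(chunks & seen) / len(chunks)
            if sim > best_sim:
                best_node, best_sim = node, sim
        if best_sim >= self.threshold:
            target = best_node                      # stateful: exploit stream similarity
        else:
            target = self._consistent_hash(min(chunks))  # stateless fallback
        self.index[target] |= chunks                # update the client-side index
        return target

For example, ClientRouter(["node-0", "node-1", "node-2"]).route(["fp1", "fp2"]) returns the chosen node id; because the index lives on the client, the decision needs no round trip to a metadata server.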

data deduplication; data routing algorithm; scalability; network overhead

Zhen Sun, Nong Xiao, Fang Liu

State Key Laboratory of High Performance Computing, National University of Defense Technology, Changsha, China

International conference

第十二届全国博士生学术年会——计算机科学与技术专题 (The 12th National Doctoral Academic Forum: Computer Science and Technology Track)

Kunming

English

66-74

2014-05-01 (date the paper first went online on the Wanfang platform; not the paper's publication date)