A Scalable, Low Network Overhead Data Routing Algorithm for Cluster Deduplication
Inline cluster deduplication has been widely used in disk-based backup systems and data centers to improve storage efficiency. The data routing algorithm has a crucial impact on the performance of a cluster deduplication system, including its deduplication factor, throughput, and scalability. In this paper, we propose a stateful data routing algorithm called DS-Dedupe. To make full use of the similarity in data streams, DS-Dedupe first assigns data chunks by consistent hashing, during which each client node builds a super-chunk-granularity similarity index that traces the super-chunks already routed. When routing a new super-chunk, we calculate a similarity coefficient from the index and compare it against a threshold to decide whether the super-chunk should be assigned directly to a storage node or by consistent hashing, thereby striking a sensible tradeoff between deduplication factor and network overhead. Experimental results on two real-world datasets, Mail and Linux, demonstrate that DS-Dedupe achieves a deduplication factor as high as a stateful strategy while reducing communication overhead to a level as low as that of a stateless one. Moreover, because data routing is performed by each client node, the system remains scalable and the metadata server does not become a bottleneck.
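The routing decision described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the class name, the Jaccard-style similarity coefficient, the threshold value, and the use of the smallest fingerprint as a super-chunk's representative for consistent hashing are all assumptions for the sake of a runnable example.

```python
import hashlib

class DSDedupeRouter:
    """Sketch of threshold-based stateful routing at a client node (assumed design)."""

    def __init__(self, nodes, threshold=0.5):
        self.nodes = nodes            # storage node identifiers
        self.threshold = threshold    # similarity threshold (assumed value)
        # per-client similarity index: node id -> fingerprints of super-chunks routed there
        self.index = {n: set() for n in nodes}

    def _consistent_hash(self, rep_fp):
        # Simple stand-in for a consistent-hash ring: pick a node from the hash
        # of the super-chunk's representative fingerprint.
        h = int(hashlib.sha1(rep_fp.encode()).hexdigest(), 16)
        return self.nodes[h % len(self.nodes)]

    def route(self, super_chunk):
        """super_chunk: set of chunk fingerprints. Returns the chosen node id."""
        # Similarity coefficient against each node's index (overlap fraction;
        # the abstract does not give the exact formula, so this is assumed).
        best_node, best_sim = None, 0.0
        for n in self.nodes:
            sim = len(super_chunk & self.index[n]) / len(super_chunk)
            if sim > best_sim:
                best_node, best_sim = n, sim
        if best_sim >= self.threshold:
            target = best_node                             # stateful: most similar node
        else:
            target = self._consistent_hash(min(super_chunk))  # stateless fallback
        self.index[target] |= super_chunk                  # update local index
        return target
```

A super-chunk that largely overlaps one routed earlier is sent to the same node without any query to other nodes, which is why the similarity index is kept at the client rather than at a central metadata server.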
data deduplication; data routing algorithm; scalability; network overhead
Zhen Sun Nong Xiao Fang Liu
State Key Laboratory of High Performance Computing, National University of Defense Technology Changsha, China
International conference
Kunming
English
66-74
2014-05-01 (date the paper first appeared on the Wanfang platform, which may differ from its publication date)