A Scalable, Low Network Overhead Data Routing Algorithm for Cluster Deduplication
Inline cluster deduplication has been widely used in disk-based backup systems and data centers to improve storage efficiency. The data routing algorithm has a crucial impact on the performance of a cluster deduplication system, including its deduplication factor, throughput, and scalability. In this paper, we propose a stateful data routing algorithm called DS-Dedupe. To make full use of the similarity in data streams, DS-Dedupe first assigns data chunks by consistent hashing, during which each client node builds a super-chunk-granularity similarity index that traces the super-chunks already routed. When routing a new super-chunk, we calculate a similarity coefficient from the index and compare it against a threshold to decide whether the super-chunk should be assigned directly to a storage node or by consistent hashing, thereby striking a sensible tradeoff between deduplication factor and network overhead. Experimental results on two real-world datasets, Mail and Linux, demonstrate that DS-Dedupe achieves a deduplication factor as high as a stateful strategy while reducing communication overhead to a level as low as that of a stateless one. Moreover, because data routing is performed by each client node, the system remains scalable and the metadata server does not become a bottleneck.
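The routing decision described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the class name, the Jaccard-style similarity coefficient, the threshold value, and the use of the smallest fingerprint as a super-chunk's representative for consistent hashing are all assumptions for the sake of a runnable example.

```python
import hashlib

class DSDedupeRouter:
    """Sketch of threshold-based stateful routing at a client node (assumed design)."""

    def __init__(self, nodes, threshold=0.5):
        self.nodes = nodes            # storage node identifiers
        self.threshold = threshold    # similarity threshold (assumed value)
        # per-client similarity index: node id -> fingerprints of super-chunks routed there
        self.index = {n: set() for n in nodes}

    def _consistent_hash(self, rep_fp):
        # Simple stand-in for a consistent-hash ring: pick a node from the hash
        # of the super-chunk's representative fingerprint.
        h = int(hashlib.sha1(rep_fp.encode()).hexdigest(), 16)
        return self.nodes[h % len(self.nodes)]

    def route(self, super_chunk):
        """super_chunk: set of chunk fingerprints. Returns the chosen node id."""
        # Similarity coefficient against each node's index (overlap fraction;
        # the abstract does not give the exact formula, so this is assumed).
        best_node, best_sim = None, 0.0
        for n in self.nodes:
            sim = len(super_chunk & self.index[n]) / len(super_chunk)
            if sim > best_sim:
                best_node, best_sim = n, sim
        if best_sim >= self.threshold:
            target = best_node                             # stateful: most similar node
        else:
            target = self._consistent_hash(min(super_chunk))  # stateless fallback
        self.index[target] |= super_chunk                  # update local index
        return target
```

A super-chunk that largely overlaps one routed earlier is sent to the same node without any query to other nodes, which is why the similarity index is kept at the client rather than at a central metadata server.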
data deduplication; data routing algorithm; scalability; network overhead
Zhen Sun Nong Xiao Fang Liu
State Key Laboratory of High Performance Computing, National University of Defense Technology Changsha, China
International conference
Kunming
English
66-74
2014-05-01 (date the paper first appeared on the Wanfang platform, which may differ from its publication date)