Design of global data deduplication for a scale-out distributed storage system

Scale-out distributed storage systems can sustain balanced growth in capacity and performance on demand. However, storing and managing the large volumes of content generated by the explosion of data remains a challenge. One promising way to mitigate big data issues is data deduplication, which removes redundant data across the many nodes of a storage system. Nevertheless, applying a conventional deduplication design to scale-out storage is non-trivial, for the following reasons. First, chunk lookup for deduplication is not as scalable and extensible as the underlying storage system. Second, managing the metadata associated with deduplication requires extensive design and implementation changes to the existing distributed storage system. Lastly, the data processing and additional I/O traffic imposed by deduplication can significantly degrade the performance of the scale-out storage. To address these challenges, we propose a new deduplication method that is highly scalable and compatible with existing scale-out storage. Specifically, our deduplication method employs a double hashing algorithm that leverages the hashes used by the underlying scale-out storage, which addresses the limits of current fingerprint hashing. In addition, our design integrates the meta-information of the file system and of deduplication into a single object, and it controls the deduplication ratio online, based on post-processing, by being aware of system demands. We implemented the proposed deduplication method on an open-source scale-out storage system. The experimental results show that our design can save more than 90% of total storage space under diverse standard storage workloads, while offering the same or similar performance as conventional scale-out storage.
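The double hashing idea in the abstract can be illustrated with a small sketch. This is not the paper's implementation. It assumes that a chunk is first fingerprinted by content, and that the fingerprint is then used as the object ID fed to the storage system's own placement hash, so that identical chunks always land on the same node without a separate global fingerprint index. The names `fingerprint`, `place`, and `Cluster`, the fixed-size chunking, and the use of CRC32 modulo the node count as a stand-in for a real placement function are all hypothetical choices made for illustration.

```python
import hashlib
import zlib

CHUNK_SIZE = 4  # bytes; deliberately tiny so the example is easy to follow


def fingerprint(chunk: bytes) -> str:
    """First hash: a content fingerprint that doubles as the chunk's object ID."""
    return hashlib.sha256(chunk).hexdigest()


def place(object_id: str, num_nodes: int) -> int:
    """Second hash: an assumed stand-in for the storage system's placement hash."""
    return zlib.crc32(object_id.encode("ascii")) % num_nodes


class Cluster:
    """Toy cluster: each node maps chunk ID -> [chunk bytes, reference count]."""

    def __init__(self, num_nodes: int = 4):
        self.nodes = [dict() for _ in range(num_nodes)]

    def _node_for(self, chunk_id: str) -> dict:
        return self.nodes[place(chunk_id, len(self.nodes))]

    def write(self, data: bytes) -> list:
        """Store data chunk by chunk and return its recipe (list of chunk IDs)."""
        recipe = []
        for start in range(0, len(data), CHUNK_SIZE):
            chunk = data[start:start + CHUNK_SIZE]
            chunk_id = fingerprint(chunk)
            node = self._node_for(chunk_id)
            if chunk_id in node:
                node[chunk_id][1] += 1        # duplicate: only bump the refcount
            else:
                node[chunk_id] = [chunk, 1]   # new content: store it once
            recipe.append(chunk_id)
        return recipe

    def read(self, recipe: list) -> bytes:
        """Rebuild the data by locating each chunk through the same two hashes."""
        return b"".join(self._node_for(cid)[cid][0] for cid in recipe)

    def stored_bytes(self) -> int:
        """Total bytes of unique chunk data held across all nodes."""
        return sum(len(entry[0]) for node in self.nodes for entry in node.values())
```

In this sketch, the lookup for a duplicate chunk is just a normal object lookup on the node chosen by the placement hash, which is one plausible reading of how reusing the storage system's hashes could keep chunk lookup as scalable as the storage itself.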
Publisher
Institute of Electrical and Electronics Engineers Inc.
Issue Date
2018-07-02
Language
English
Citation

38th IEEE International Conference on Distributed Computing Systems, ICDCS 2018, pp.1063 - 1073

DOI
10.1109/ICDCS.2018.00106
URI
http://hdl.handle.net/10203/269571
Appears in Collection
EE-Conference Papers (Academic Conference Papers)