Deduplication

Goal

Write Once Read Many (WORM): use less storage by keeping track of data parts and not writing them multiple times to storage.

Definitions

In band / offline

In-band deduplication means we handle duplicates before writing a new copy to disk. It can be expensive, but it doesn't use more space than needed.

Offline deduplication is when deduplication is handled asynchronously, using IOPS when available. It's less costly, but it uses more space since data is first written duplicated.

Block level / file level

Block-level deduplication uses file blocks, either of fixed or variable size, as the unit of deduplication. It can be inefficient on some data, but does not require knowing the content of the data being deduplicated.

File-level deduplication uses whole file contents; it depends on precise knowledge of the data (partition type, storage). It can be tricky with huge files that change a lot (like database files).

Existing Implementation

Deduplication is built in, or available through third-party tools, in some filesystems, like BTRFS, XFS or ReiserFS.

https://btrfs.wiki.kernel.org/index.php/User_notes_on_dedupe for BTRFS

Proposition

I think we should handle the deduplication ourselves, so we can support various storage backends and leverage this knowledge to limit the data transferred between systems, especially when transferring to S3/Glacier.

In-band, fixed-size block (1-4 MB) based deduplication

  • compute a hash of the block
  • if the block is not already in the store:
    • optionally encrypt and compress the block
    • store the block in a real filesystem
  • store the block usage (which backup uses it, at which offset in the source disk) in an index
  • when deleting a block, remove its entry from the index; delete the file only if it's not used anymore
  • use a storage-independent lock store (Redis) to protect blocks (a sketch of this flow follows the list)
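A minimal sketch of this flow in TypeScript, assuming a Node.js environment. The BlockStore class, the on-disk layout and the in-memory index are illustrative assumptions, not an existing API; a real implementation would persist the index and protect it with the Redis lock store mentioned above.

```ts
import { createHash } from 'crypto'
import { mkdir, unlink, writeFile } from 'fs/promises'
import { dirname, join } from 'path'

const BLOCK_SIZE = 2 * 1024 * 1024 // fixed-size blocks, in the 1-4 MB range

interface BlockUsage {
  backupId: string // which backup uses the block
  offset: number // at which offset in the source disk
}

class BlockStore {
  // hash -> usages; a real implementation would persist this index and
  // protect it with a storage-independent lock store (e.g. Redis)
  private index = new Map<string, BlockUsage[]>()

  constructor(private blockDir: string) {}

  private blockPath(hash: string): string {
    // spread blocks over subdirectories to keep directory listings small
    return join(this.blockDir, hash.slice(0, 2), hash)
  }

  async writeBlock(data: Buffer, usage: BlockUsage): Promise<string> {
    const hash = createHash('sha256').update(data).digest('hex')
    if (!this.index.has(hash)) {
      // the block is not in the store yet: optionally encrypt and compress it,
      // then write it once to a real filesystem
      const path = this.blockPath(hash)
      await mkdir(dirname(path), { recursive: true })
      await writeFile(path, data)
      this.index.set(hash, [])
    }
    // always record the usage, even when the block was already stored
    this.index.get(hash)!.push(usage)
    return hash
  }

  async deleteBlock(hash: string, backupId: string): Promise<void> {
    const usages = (this.index.get(hash) ?? []).filter(u => u.backupId !== backupId)
    if (usages.length === 0) {
      // no backup references the block anymore: the file itself can go
      this.index.delete(hash)
      await unlink(this.blockPath(hash))
    } else {
      this.index.set(hash, usages)
    }
  }
}
```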

V2

If native file-level dedup is present: use it.

  • proxy installed on remote, local FS

  • overload writeFile / outputStream to (see the sketch after this list)

    • compute the hash
    • if it does not already exist:
      • really write the file in the dedup folder hierarchy
      • add the checksum as an extended attribute: setfattr -n user.checksum.sha256 -v 267607e76403760d5a2c07863ae592273105514065c67f7d5e217b9497d5f9fc ./linked.json. This will be accessible from all hard-linked copies
    • else touch the file to mark its content as new (for immutability)
    • hard link the file to its intended place
  • overload unlink (WARNING: possible race condition?)

    • get the number of links with stat.nlink
    • if the file has exactly 2 links (itself + the source):
      • get the hash from the extended attributes: getfattr -n user.checksum.sha256 ./turbo.json
      • delete the hard link in the vhd folder
    • delete the local file
  • overload rename:

    • if the destination exists:
      • unlink it properly before writing the new file
    • move the link; no need to update counters, the number of links stays the same
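A sketch of these overloads in TypeScript, assuming a Node.js proxy with direct access to the local FS. The DedupFs name and the dedup folder layout are assumptions, and the setfattr/getfattr CLI tools are shelled out to because Node has no built-in xattr API; the possible race condition on unlink mentioned above is not addressed here.

```ts
import { createHash } from 'crypto'
import { execFile } from 'child_process'
import { promisify } from 'util'
import { link, mkdir, rename, stat, unlink, utimes, writeFile } from 'fs/promises'
import { dirname, join } from 'path'

const run = promisify(execFile)

class DedupFs {
  constructor(private dedupDir: string) {}

  private dedupPath(hash: string): string {
    return join(this.dedupDir, hash.slice(0, 2), hash)
  }

  // overloaded writeFile: write the content once in the dedup hierarchy,
  // then hard link it to its intended place
  async writeFile(path: string, data: Buffer): Promise<void> {
    const hash = createHash('sha256').update(data).digest('hex')
    const source = this.dedupPath(hash)
    try {
      await stat(source)
      // already deduplicated: touch it to mark its content as new (immutability)
      const now = new Date()
      await utimes(source, now, now)
    } catch {
      // really write the file, then attach the checksum as an extended
      // attribute, visible from every hard-linked copy
      await mkdir(dirname(source), { recursive: true })
      await writeFile(source, data)
      await run('setfattr', ['-n', 'user.checksum.sha256', '-v', hash, source])
    }
    await link(source, path)
  }

  // overloaded unlink: WARNING, the stat/unlink sequence is racy
  async unlink(path: string): Promise<void> {
    const { nlink } = await stat(path)
    if (nlink === 2) {
      // only this copy and the dedup source are left: drop the dedup source too
      const { stdout } = await run('getfattr', [
        '--only-values',
        '-n',
        'user.checksum.sha256',
        path,
      ])
      await unlink(this.dedupPath(stdout.trim()))
    }
    await unlink(path)
  }

  // overloaded rename: unlink the destination properly, then move the link
  async rename(from: string, to: string): Promise<void> {
    try {
      await this.unlink(to)
    } catch {
      // destination did not exist
    }
    // the number of links is unchanged by the move, no counter to update
    await rename(from, to)
  }
}
```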

Bonus: merge only deletes blocks if they aren't used anywhere else, which should speed merges up.
