Deduplication

Goal

Write Once Read Many (WORM): use less storage by keeping track of data parts and not writing them multiple times to storage.

Definitions

In band / offline

In-band deduplication means we handle duplicates before writing a new copy to disk. It can be expensive, but it doesn't use more space than needed.

Offline deduplication is when deduplication is handled asynchronously, using IOPS when available. It's less costly, but it uses more space since data is first written duplicated.

Block level / file level

Block-level deduplication uses file blocks, either of fixed or variable size, as the unit of deduplication. It can be inefficient on some data, but does not require knowing the content of the data being deduplicated.

File-level deduplication uses whole file contents; it depends on precise knowledge of the data (partition type, storage). It can be tricky with huge files that change a lot (like database files).

Existing Implementation

Deduplication is built in, or available through third-party tools, in some filesystems, like BTRFS, XFS or ReiserFS.

https://btrfs.wiki.kernel.org/index.php/User_notes_on_dedupe for BTRFS

Proposition

I think we should handle the deduplication ourselves, so we can support various storage backends and leverage this knowledge to limit the data transferred between systems, especially when transferring to S3/Glacier.

In-band, fixed-size block (1-4 MB) based deduplication

  • compute a hash of the block
  • if the block is not already in the store:
    • optionally encrypt and compress the block
    • store the block in a real filesystem
  • store the block usage (which backup uses it, at which offset in the source disk) in an index
  • when deleting a block, remove its entry from the index; delete the file only if it's not used anymore
  • use a storage-independent lock store (Redis) to protect blocks (a sketch of this flow follows the list)
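A minimal sketch of this flow in TypeScript, assuming a Node.js environment. The BlockStore class, the on-disk layout and the in-memory index are illustrative assumptions, not an existing API; a real implementation would persist the index and protect it with the Redis lock store mentioned above.

```ts
import { createHash } from 'crypto'
import { mkdir, unlink, writeFile } from 'fs/promises'
import { dirname, join } from 'path'

const BLOCK_SIZE = 2 * 1024 * 1024 // fixed-size blocks, in the 1-4 MB range

interface BlockUsage {
  backupId: string // which backup uses the block
  offset: number // at which offset in the source disk
}

class BlockStore {
  // hash -> usages; a real implementation would persist this index and
  // protect it with a storage-independent lock store (e.g. Redis)
  private index = new Map<string, BlockUsage[]>()

  constructor(private blockDir: string) {}

  private blockPath(hash: string): string {
    // spread blocks over subdirectories to keep directory listings small
    return join(this.blockDir, hash.slice(0, 2), hash)
  }

  async writeBlock(data: Buffer, usage: BlockUsage): Promise<string> {
    const hash = createHash('sha256').update(data).digest('hex')
    if (!this.index.has(hash)) {
      // the block is not in the store yet: optionally encrypt and compress it,
      // then write it once to a real filesystem
      const path = this.blockPath(hash)
      await mkdir(dirname(path), { recursive: true })
      await writeFile(path, data)
      this.index.set(hash, [])
    }
    // always record the usage, even when the block was already stored
    this.index.get(hash)!.push(usage)
    return hash
  }

  async deleteBlock(hash: string, backupId: string): Promise<void> {
    const usages = (this.index.get(hash) ?? []).filter(u => u.backupId !== backupId)
    if (usages.length === 0) {
      // no backup references the block anymore: the file itself can go
      this.index.delete(hash)
      await unlink(this.blockPath(hash))
    } else {
      this.index.set(hash, usages)
    }
  }
}
```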

V2

If native file-level dedup is present: use it.

  • proxy installed on remote, local FS

  • overload writeFile / outputStream to (see the sketch after this list)

    • compute the hash
    • if it does not already exist:
      • really write the file in the dedup folder hierarchy
      • add the checksum as an extended attribute: setfattr -n user.checksum.sha256 -v 267607e76403760d5a2c07863ae592273105514065c67f7d5e217b9497d5f9fc ./linked.json. This will be accessible from all hard-linked copies
    • else touch the file to mark its content as new (for immutability)
    • hard link the file to its intended place
  • overload unlink (WARNING: possible race condition?)

    • get the number of links with stat.nlink
    • if the file has exactly 2 links (itself + the source):
      • get the hash from the extended attributes: getfattr -n user.checksum.sha256 ./turbo.json
      • delete the hard link in the vhd folder
    • delete the local file
  • overload rename:

    • if the destination exists:
      • unlink it properly before writing the new file
    • move the link; no need to update counters, the number of links stays the same
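A sketch of these overloads in TypeScript, assuming a Node.js proxy with direct access to the local FS. The DedupFs name and the dedup folder layout are assumptions, and the setfattr/getfattr CLI tools are shelled out to because Node has no built-in xattr API; the possible race condition on unlink mentioned above is not addressed here.

```ts
import { createHash } from 'crypto'
import { execFile } from 'child_process'
import { promisify } from 'util'
import { link, mkdir, rename, stat, unlink, utimes, writeFile } from 'fs/promises'
import { dirname, join } from 'path'

const run = promisify(execFile)

class DedupFs {
  constructor(private dedupDir: string) {}

  private dedupPath(hash: string): string {
    return join(this.dedupDir, hash.slice(0, 2), hash)
  }

  // overloaded writeFile: write the content once in the dedup hierarchy,
  // then hard link it to its intended place
  async writeFile(path: string, data: Buffer): Promise<void> {
    const hash = createHash('sha256').update(data).digest('hex')
    const source = this.dedupPath(hash)
    try {
      await stat(source)
      // already deduplicated: touch it to mark its content as new (immutability)
      const now = new Date()
      await utimes(source, now, now)
    } catch {
      // really write the file, then attach the checksum as an extended
      // attribute, visible from every hard-linked copy
      await mkdir(dirname(source), { recursive: true })
      await writeFile(source, data)
      await run('setfattr', ['-n', 'user.checksum.sha256', '-v', hash, source])
    }
    await link(source, path)
  }

  // overloaded unlink: WARNING, the stat/unlink sequence is racy
  async unlink(path: string): Promise<void> {
    const { nlink } = await stat(path)
    if (nlink === 2) {
      // only this copy and the dedup source are left: drop the dedup source too
      const { stdout } = await run('getfattr', [
        '--only-values',
        '-n',
        'user.checksum.sha256',
        path,
      ])
      await unlink(this.dedupPath(stdout.trim()))
    }
    await unlink(path)
  }

  // overloaded rename: unlink the destination properly, then move the link
  async rename(from: string, to: string): Promise<void> {
    try {
      await this.unlink(to)
    } catch {
      // destination did not exist
    }
    // the number of links is unchanged by the move, no counter to update
    await rename(from, to)
  }
}
```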

Bonus: merge only deletes blocks if they aren't used anywhere else, which should speed merges up.
