scy/borg-restic-dedup.md

## borg-restic-dedup.md

      
This short post exists because I asked whether Borg’s deduplication can only deal with data being appended to a file, and not with data inserted at a position before the end of the file.
A discussion emerged, and I said that I’d look into it some more.
### tl;dr

I was wrong: Borg deals with that, although apparently not quite as efficiently as restic.
My initial test (that led to my tweet) was simply set up wrong.
### Testing Borg’s deduplication

```sh
# Create two files with 100 MB of random data as well as an (encrypted) Borg repo.
$ head -c 100M /dev/urandom > a
$ head -c 100M /dev/urandom > b
$ borgbackup init -e repokey-blake2 sizetest

# Add first file.
$ borgbackup create sizetest::one a
$ du -sh sizetest
124M    sizetest

# Add first file again.
$ borgbackup create sizetest::two a
$ du -sh sizetest
124M    sizetest
# Deduplicated, no significant size change.

# Add second file.
$ borgbackup create sizetest::three b
$ du -sh sizetest
244M    sizetest
# Size roughly doubled, as expected.

# Prepend "x\n" to first file, save as "a1", add that.
$ cat <(echo x) a > a1
$ borgbackup create sizetest::four a1
$ du -sh sizetest
250M    sizetest
# Size has only increased slightly.

# Concatenate all three files that are already in the repo to a new file "c", add that.
$ cat a a1 b > c
$ borgbackup create sizetest::five c
$ du -sh sizetest
280M    sizetest
# Size has increased substantially (because of chunking), but deduplication still shows.
```
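
For contrast, here is a rough sketch of the behavior I had originally feared, using a naive fixed-size chunker built from coreutils. This is explicitly not how Borg or restic split files. The file names, the 1 MiB test size and the 64 KiB chunk size are arbitrary assumptions for illustration, and the script assumes GNU coreutils (`split`, `sha256sum`, `comm`).

```sh
# Illustration only: fixed-size chunking, which is NOT what Borg or restic do.
# File names, sizes and the 64 KiB chunk size are arbitrary assumptions.
workdir=$(mktemp -d)
cd "$workdir"

# 1 MiB of random data, one copy with "x\n" appended, one with it prepended.
head -c 1048576 /dev/urandom > a
{ cat a; echo x; } > a_appended
{ echo x; cat a; } > a_prepended

# Print the sorted, unique SHA-256 hashes of a file's fixed-size 64 KiB chunks.
chunk_hashes() {
    mkdir "$1.chunks"
    split -b 65536 "$1" "$1.chunks/"
    for chunk in "$1.chunks"/*; do
        sha256sum < "$chunk"
    done | sort -u
}

chunk_hashes a > a.hashes
chunk_hashes a_appended > appended.hashes
chunk_hashes a_prepended > prepended.hashes

# Count the chunks each modified file has in common with the original.
shared_appended=$(comm -12 a.hashes appended.hashes | wc -l | tr -d ' ')
shared_prepended=$(comm -12 a.hashes prepended.hashes | wc -l | tr -d ' ')
echo "appended:  $shared_appended of 16 chunks shared"
echo "prepended: $shared_prepended of 16 chunks shared"

# Clean up the scratch directory.
cd / && rm -rf "$workdir"
```

With fixed chunk boundaries, the appended copy shares all 16 chunks with the original, while the two prepended bytes shift every boundary and the prepended copy shares none. The Borg run above grew by only a few megabytes for `a1`, so its chunk boundaries clearly don't depend on absolute file offsets like this.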

### Comparing it with restic

```sh
# Create a repo.
$ rm -rf sizetest
$ restic -r sizetest init

# Add first file.
$ restic -r sizetest backup a
$ du -sh sizetest
106M    sizetest

# Add first file again.
$ restic -r sizetest backup a
$ du -sh sizetest
106M    sizetest
# Deduplicated, no significant size change.

# Add second file.
$ restic -r sizetest backup b
$ du -sh sizetest
211M    sizetest
# Size roughly doubled, as expected.

# Add the file that has two bytes prepended.
$ restic -r sizetest backup a1
$ du -sh sizetest
213M    sizetest
# Size has only increased slightly.

# Add the file that’s the concatenation of all three already in the repo.
$ restic -r sizetest backup c
$ du -sh sizetest
218M    sizetest
# Again, size has increased only slightly. Notably, restic’s repo is now 22 % smaller than Borg’s.
```
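
The 22 % figure follows directly from the two final `du -sh` readings (280M for Borg, 218M for restic). A minimal sketch of the arithmetic, with variable names of my own choosing:

```sh
# Final repo sizes in MB, as reported by `du -sh` at the end of each run.
borg_mb=280
restic_mb=218

# Integer percentage by which the restic repo is smaller than the Borg repo.
pct_smaller=$(( (borg_mb - restic_mb) * 100 / borg_mb ))
echo "restic repo is ${pct_smaller}% smaller"
```

Since `du -sh` rounds its output, the result is only accurate to about a percentage point.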

### My initial test setup

The reason I had looked into Borg’s deduplication in the first place was that I was evaluating the following scenario:

1. Create backups of volume A into a Borg repo on volume B.
2. Then, create backups of volume B into a Borg repo on volume C.
3. Will files present on volume A and (outside of the A repo) on volume B be deduplicated against each other?

In order for this to work at all, surely A’s repo must be unencrypted, so that the contents of duplicate files show up in the repo’s data directory and can be deduplicated against other files on volume B (where the repo resides).
So I tried it:
```sh
$ rm -rf encrypted unencrypted
$ borgbackup init -e authenticated-blake2 unencrypted
$ borgbackup create unencrypted::one c
$ du -sh unencrypted
218M    unencrypted
$ borgbackup init -e repokey-blake2 encrypted
$ borgbackup create encrypted::one c
$ du -sh encrypted
207M    encrypted
$ borgbackup create encrypted::two unencrypted
$ du -sh encrypted
390M    encrypted
```

So, apparently, it’s not being deduplicated to any useful degree: adding the 218M unencrypted repo grew the encrypted one from 207M to 390M.
I haven’t done additional digging to find out why, but there are two things that came to mind:
First, I should probably have specified -C none, otherwise the file contents might become unrecognizable due to compression.
However, I did some additional testing later on and -C none didn’t help either.
Which brings me to the second suspicion:
The file contents in the repo might actually be interspersed with metadata instead of forming a contiguous stream of the original contents.
If that’s the case, some deduplication might still occur, but as we’ve seen above, it’s not perfect and leads to a certain portion per chunk being redundant.
If these chunks are small enough, the redundancy might add up to such a degree that deduplication is pretty much useless.
By the way, I didn’t check this repo-in-a-repo setup with restic at all, because restic doesn’t support unencrypted repos.
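
To put a rough number on how much the repo-in-a-repo backup saved, the `du -sh` readings from that test can be compared. This is only a back-of-the-envelope sketch: the variable names are mine, the sizes are rounded, and the result is an upper bound on what deduplication contributed, since other effects can change the repo size too.

```sh
# Sizes in MB, taken from the `du -sh` output of the repo-in-a-repo test.
unencrypted_mb=218   # the inner (unencrypted) repo that was backed up
before_mb=207        # encrypted repo before adding the inner repo
after_mb=390         # encrypted repo after adding the inner repo

# How much the encrypted repo grew, and how much less that is than the input.
growth_mb=$(( after_mb - before_mb ))
saved_mb=$(( unencrypted_mb - growth_mb ))
saved_pct=$(( saved_mb * 100 / unencrypted_mb ))
echo "grew by ${growth_mb} MB, saved at most ${saved_mb} MB (${saved_pct}%)"
```

So at most about a sixth of the inner repo was stored without taking up new space, which fits the suspicion that only part of each chunk is recognized again.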