@ThomDietrich, last active August 12, 2020
Benchmark to understand the effect of compressed archives on a restic repository
#!/bin/bash
BASE="$(pwd)/temp_test_deduplication"
SOURCE="$BASE/input"
REPO_BASE="$BASE/repo"
NUM_FILES=16
FILE_SIZE="8M"
export RESTIC_PASSWORD="password123"
############################################################
echo "Starting with a clean folder..."
TEMP="$BASE/temp"
rm -rf "$BASE"
mkdir -p "$SOURCE" "$TEMP"
echo -e "\nInitializing restic repos..."
restic init --repo="$REPO_BASE-input"
restic init --repo="$REPO_BASE-gzip"
restic init --repo="$REPO_BASE-bzip2"
restic init --repo="$REPO_BASE-xz"
restic init --repo="$REPO_BASE-rsyncable-gzip"
restic init --repo="$REPO_BASE-rsyncable-pigz"
restic init --repo="$REPO_BASE-rsyncable-zstd"
for i in $(seq -f "%03g" 1 "$NUM_FILES"); do
    INDEX=$(tr -dc 'a-z0-9' < /dev/urandom | head -c 8)
    echo "============================================================"
    echo "Adding file $i under $SOURCE/$INDEX.txt"
    tr -dc '[:alnum:] \n' < /dev/urandom | head -c "$FILE_SIZE" > "$SOURCE/$INDEX.txt"
    ls -lh "$SOURCE"

    REPO="$REPO_BASE-input"
    echo -e "\n$REPO"
    restic --repo="$REPO" backup "$SOURCE"

    for ALGO in gzip bzip2 xz; do
        echo "------------------------------------------------------------"
        REPO="$REPO_BASE-$ALGO"
        echo -e "\n$REPO"
        /usr/bin/time -f "Compression took %e seconds" \
            tar -cv --$ALGO -f "$TEMP/archive.tar.z" "$SOURCE"
        echo
        restic --repo="$REPO" backup "$TEMP"
        rm -rf "$TEMP" && mkdir "$TEMP"
    done
    echo "------------------------------------------------------------"
    REPO="$REPO_BASE-rsyncable-gzip"
    echo -e "\n$REPO"
    # Equivalent: tar -cv "$SOURCE" | gzip --rsyncable > "$TEMP/archive.tar.z"
    /usr/bin/time -f "Compression took %e seconds" \
        tar -cv --use-compress-program="gzip --rsyncable" -f "$TEMP/archive.tar.z" "$SOURCE"
    echo
    restic --repo="$REPO" backup "$TEMP"
    rm -rf "$TEMP" && mkdir "$TEMP"
    echo "------------------------------------------------------------"
    REPO="$REPO_BASE-rsyncable-pigz"
    echo -e "\n$REPO"
    /usr/bin/time -f "Compression took %e seconds" \
        tar -cv --use-compress-program="pigz --rsyncable" -f "$TEMP/archive.tar.z" "$SOURCE"
    echo
    restic --repo="$REPO" backup "$TEMP"
    rm -rf "$TEMP" && mkdir "$TEMP"
    echo "------------------------------------------------------------"
    # Note: zstd gained --rsyncable in v1.3.8 (https://github.com/facebook/zstd/releases/tag/v1.3.8)
    REPO="$REPO_BASE-rsyncable-zstd"
    echo -e "\n$REPO"
    /usr/bin/time -f "Compression took %e seconds" \
        tar -cv --use-compress-program="zstd --rsyncable" -f "$TEMP/archive.tar.z" "$SOURCE"
    echo
    restic --repo="$REPO" backup "$TEMP"
    rm -rf "$TEMP" && mkdir "$TEMP"
done
rm -rf "$TEMP"
echo -e "\nFinal repo sizes, compared to an input of $NUM_FILES files of $FILE_SIZE each:"
du -hs "$BASE"/*
ThomDietrich commented Aug 11, 2020
The benchmark models a typical scenario: an application changes parts of its data over time, and the provided backup command or script produces a compressed archive. A compressed archive is generally attractive because it is easier to handle and occupies less storage. Problems arise as soon as deduplicating backup tools like restic come into play. These tools identify changes between backup runs and store only the differences. However, two compressed archives with only small differences in content can look completely different at the byte level, and the backup repository balloons in size.
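The byte-level divergence can be reproduced with nothing but gzip and cmp. The sketch below (file names are hypothetical, created under mktemp) changes roughly 200 bytes scattered through a ~1.4 MB file and then counts how many byte positions of the two gzip streams differ:

```shell
#!/bin/sh
# Sketch: a tiny content change makes two gzip streams differ almost everywhere.
set -e
dir=$(mktemp -d)
seq 1 200000 > "$dir/v1"                  # compressible test data, ~1.4 MB
sed 's/000$/000x/' "$dir/v1" > "$dir/v2"  # touch every 1000th line (~200 bytes)
gzip -c "$dir/v1" > "$dir/v1.gz"
gzip -c "$dir/v2" > "$dir/v2.gz"
total=$(wc -c < "$dir/v1.gz")
diff_bytes=$(cmp -l "$dir/v1.gz" "$dir/v2.gz" | wc -l)  # count differing byte positions
echo "raw change: ~200 bytes; differing compressed bytes: $diff_bytes of $total"
rm -rf "$dir"
```

Even though far less than 0.1 % of the input changed, the two compressed streams differ at the majority of byte positions, which is exactly what defeats restic's deduplication.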

The script above measures the effect of different compression algorithms on repository size, growing the backup set over multiple backup runs. In one test run with 16 loop iterations and a file size of 8 MB, the resulting repositories had the following sizes:

% du -hs *
129M	/home/th/restic-test/temp_test_deduplication/input
130M	/home/th/restic-test/temp_test_deduplication/repo-input
748M	/home/th/restic-test/temp_test_deduplication/repo-gzip
758M	/home/th/restic-test/temp_test_deduplication/repo-bzip2
835M	/home/th/restic-test/temp_test_deduplication/repo-xz
149M	/home/th/restic-test/temp_test_deduplication/repo-rsyncable-gzip
151M	/home/th/restic-test/temp_test_deduplication/repo-rsyncable-pigz
292M	/home/th/restic-test/temp_test_deduplication/repo-rsyncable-zstd

It is therefore highly recommended to either use a compression algorithm with an "rsyncable" option or to hand the uncompressed files to restic directly.
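Why does "rsyncable" help? The compressor periodically resets its state at content-defined positions, so after a localized change the two compressed streams become byte-identical again, and a content-defined chunker (the technique restic uses) re-synchronizes on the shared runs. The sketch below assumes gzip >= 1.7 (for --rsyncable) and uses a deliberately simplified rolling-sum chunker, not restic's actual algorithm:

```shell
#!/bin/sh
# Sketch: compress two nearly identical inputs with `gzip --rsyncable`,
# cut each compressed stream into content-defined chunks with a toy
# rolling-sum chunker, and count how many chunks the two streams share.
set -e
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT

seq 1 300000 > "$dir/v1"        # compressible test data, ~2 MB
sed '5d' "$dir/v1" > "$dir/v2"  # delete one line near the start

gzip --rsyncable -c "$dir/v1" > "$dir/v1.gz"
gzip --rsyncable -c "$dir/v2" > "$dir/v2.gz"

# Toy chunker: a chunk boundary wherever the rolling sum of the last
# 64 bytes is divisible by 2048 (expected chunk ~2 KB, minimum 256 B).
boundaries() {
    od -An -v -tu1 "$1" | awk '
        { for (i = 1; i <= NF; i++) {
            pos++; len++
            sum += $i - buf[pos % 64]
            buf[pos % 64] = $i
            if (len >= 256 && sum % 2048 == 0) { print pos; len = 0 }
          } }
        END { if (len > 0) print pos }'
}

# Hash every chunk of a file, one md5 per line, sorted for comm(1).
hashes() {
    prev=0
    boundaries "$1" | while read -r end; do
        tail -c +$((prev + 1)) "$1" | head -c $((end - prev)) | md5sum
        prev=$end
    done | sort
}

hashes "$dir/v1.gz" > "$dir/h1"
hashes "$dir/v2.gz" > "$dir/h2"
total=$(wc -l < "$dir/h1")
shared=$(comm -12 "$dir/h1" "$dir/h2" | wc -l)
echo "shared chunks: $shared of $total"
```

Because the change sits near the start of the input, almost all chunks after the first compressor reset are shared; run the same comparison on plain `gzip` output and virtually no chunks match.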


Side aspect: how fast were the individual compression algorithms? On the test machine, with the generated test data, compressing all 16 files took:

/home/th/restic-test/temp_test_deduplication/repo-gzip
Compression took 6.26 seconds

/home/th/restic-test/temp_test_deduplication/repo-bzip2
Compression took 13.75 seconds

/home/th/restic-test/temp_test_deduplication/repo-xz
Compression took 58.08 seconds

/home/th/restic-test/temp_test_deduplication/repo-rsyncable-gzip
Compression took 7.02 seconds

/home/th/restic-test/temp_test_deduplication/repo-rsyncable-pigz
Compression took 1.20 seconds

/home/th/restic-test/temp_test_deduplication/repo-rsyncable-zstd
Compression took 0.57 seconds
