@dch
Last active February 9, 2024 06:23
tarsnap hacky parallel restore script
#!/bin/sh
# recover all files in parallel from the most recent archive
# MIT license
# https://git.io/vdrbG
# "works on my machine"
# lots of assumptions, notably path depth (--strip-components)
# pick the latest archive: archive names sort chronologically
ARCHIVE=`tarsnap --keyfile /tmp/tarsnap.key --list-archives | sort | tail -1`
# list the files in the archive, ordered by descending size (field 5 is size, field 9 is the name)
FILES=`tarsnap --keyfile /tmp/tarsnap.key -tvf ${ARCHIVE} | cut -w -f 5,9 | sort -rn | cut -w -f 2`
# spawn 10 invocations in parallel (use -P 0 for unlimited)
echo $FILES | xargs -P 10 -n 1 -t \
    time tarsnap \
        --retry-forever \
        -S \
        --strip-components 6 \
        --print-stats \
        --humanize-numbers \
        --keyfile /tmp/tarsnap.key \
        --chroot \
        -xv \
        -f ${ARCHIVE}
# profit
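
For reference, one hypothetical way to run it (the key location, restore directory, and script name below are assumptions, not part of the gist): put the key where the script expects it, cd into the directory the restored files should land in, and run the script.

# hypothetical usage; /root/tarsnap.key, /data/restore and the script name are made up
cp /root/tarsnap.key /tmp/tarsnap.key
cd /data/restore && sh parallel-restore.sh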
@cperciva

Leaving a comment here in case anyone finds this and tries to use it: this will use lots and lots of bandwidth if you have many files! It spawns a tarsnap process for each file in the archive (xargs -n 1), and each tarsnap process has to read all of the tar headers in the archive to find the right file. So it ends up being O(N^2) in the number of files.
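
A possible mitigation, sketched below as an untested variation rather than anything from the gist itself: hand xargs a batch of filenames per tarsnap invocation (the -n 50 batch size is arbitrary), so each full pass over the tar headers extracts many files instead of one. That keeps the header traffic at roughly O(N^2 / batch) bytes instead of O(N^2).

# untested batching sketch; reuses ARCHIVE and FILES from the gist above
echo $FILES | xargs -P 10 -n 50 -t \
    tarsnap \
        --retry-forever \
        -S \
        --strip-components 6 \
        --keyfile /tmp/tarsnap.key \
        --chroot \
        -x \
        -f ${ARCHIVE}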

@dch

dch commented Jan 31, 2024

A very good point: it is expensive in bandwidth. But does it give better overall throughput during a restore? Are there alternative options for better throughput?

@cperciva

cperciva commented Feb 9, 2024

If you have a small number of files then it might get more throughput. It would depend on the average size of the files you're downloading vs the overhead of downloading all of the tar headers (512 bytes * number of files). Running 10 processes in parallel, I guess it would complete faster if the number of files is less than the average file size divided by 50 bytes?
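
To make that rule of thumb concrete, here is a rough check of my own (a sketch, not something from the thread): each process re-reads about 512 bytes of tar header per file, and spread across 10 parallel processes that is roughly 50 bytes of overhead per file, so the parallel restore only wins while the file count stays below average file size / 50. Reusing the ${ARCHIVE} and keyfile from the gist:

# back-of-the-envelope break-even check (assumed sketch, not from the gist)
tarsnap --keyfile /tmp/tarsnap.key -tvf ${ARCHIVE} | cut -w -f 5 | \
    awk '{n++; s+=$1} END {printf "files: %d  avg: %d bytes  break-even: ~%d files\n", n, s/n, s/n/50}'

For example, if the average file is about 10 MB, the break-even point is around 200,000 files; well below that, the parallelism should help despite the wasted header reads.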
