Parallel download of blast databases using rsync+GNU Parallel
  1. Select which database you want to download. Here I will use the nucleotide database, nt.

  2. Using rsync, we retrieve the names of the files that make up the database from the NCBI server.

rsync --list-only 'rsync://ftp.ncbi.nlm.nih.gov/blast/db/nt*.gz'

  3. Using grep, we filter out the warning/welcome message and keep only the compressed files.

rsync --list-only 'rsync://ftp.ncbi.nlm.nih.gov/blast/db/nt*.gz' | grep '\.tar\.gz$'

  4. The output of rsync --list-only is similar to that of ls -l, so we can use awk to extract the last column (the file name) and prepend the server path to it.

rsync --list-only 'rsync://ftp.ncbi.nlm.nih.gov/blast/db/nt*.gz' | grep '\.tar\.gz$' | awk '{print "ftp.ncbi.nlm.nih.gov/blast/db/" $NF}'
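
To see what the grep and awk stages do without contacting the server, here is a minimal sketch that runs them on a made-up listing. The helper name `to_links` and the sample sizes, dates, and welcome text are assumptions for illustration only, not real server output.

```shell
# Hypothetical helper: turn `rsync --list-only` output (read from stdin)
# into server paths, one per line.
to_links() {
    # Keep only lines ending in .tar.gz, then print the last column
    # (the file name) with the server path in front of it.
    grep '\.tar\.gz$' | awk '{print "ftp.ncbi.nlm.nih.gov/blast/db/" $NF}'
}

# Made-up listing in the same column layout as `rsync --list-only`.
sample_listing='Welcome to the rsync server

-rw-r--r--    1,234,567 2020/01/01 12:00:00 nt.00.tar.gz
-rw-r--r--    1,234,567 2020/01/01 12:00:00 nt.01.tar.gz'

printf '%s\n' "$sample_listing" | to_links
```

The welcome lines are dropped because they do not end in .tar.gz, and each remaining line is reduced to its file name with the path prefix added.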

  5. The output of the one-liner can be redirected to an intermediate file (or piped directly to parallel).

rsync --list-only 'rsync://ftp.ncbi.nlm.nih.gov/blast/db/nt*.gz' | grep '\.tar\.gz$' | awk '{print "ftp.ncbi.nlm.nih.gov/blast/db/" $NF}' > nt.links.list

  6. Now we can use parallel to download several files at the same time. The -j option of parallel sets how many downloads run simultaneously; I will use 4.

cat nt.links.list | parallel -j4 'rsync -h --progress rsync://{} .'

Note: The number of parallel downloads should be chosen carefully, as it depends on several factors such as the speed of the connection and the disk write speed. The number of simultaneous downloads also shouldn't be too high, to avoid being disconnected by the remote server.
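
One way to guard against an overly high job count is to wrap the download in a small function that checks the number first. This is only a sketch: the cap of 8, the function names, and the log file name are my own assumptions, not limits published by NCBI. The `--retries` and `--joblog` options are GNU Parallel options that retry failed jobs and record each job's exit status.

```shell
# Assumed upper bound on simultaneous downloads (adjust to your setup).
MAX_JOBS=8

# Hypothetical check: succeed only for a whole number between 1 and MAX_JOBS.
check_jobs() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # empty or contains a non-digit
    esac
    [ "$1" -ge 1 ] && [ "$1" -le "$MAX_JOBS" ]
}

# Hypothetical wrapper: download every entry of a links file with N jobs.
download_list() {
    list=$1
    jobs=$2
    if ! check_jobs "$jobs"; then
        echo "jobs must be a number between 1 and $MAX_JOBS" >&2
        return 1
    fi
    # Retry each failed transfer up to 3 times and log the results.
    parallel -j "$jobs" --retries 3 --joblog nt.download.log \
        'rsync -h --progress rsync://{} .' < "$list"
}

# Usage (not run here, since it needs network access):
#   download_list nt.links.list 4
```

After a run, the job log shows which files failed, so only those need to be downloaded again.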

  7. After the download is complete, we can decompress the files faster by using parallel.

find . -name '*.gz' | parallel 'echo {}; tar -zxf {}'
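
The extraction step can be tried on a small local archive before running it on the real database. The sketch below makes two assumptions of its own: the helper `extract_all` is hypothetical, and it extracts each archive next to itself (using `tar -C` and the `{//}` dirname replacement string of GNU Parallel) rather than into the current directory. It falls back to a plain loop if GNU Parallel is not installed.

```shell
# Hypothetical helper: extract every .tar.gz found under a directory,
# placing the contents in the same directory as each archive.
extract_all() {
    dir=${1:-.}
    if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
        # {} is the archive path, {//} is its directory.
        find "$dir" -name '*.tar.gz' | parallel 'tar -zxf {} -C {//}'
    else
        # Sequential fallback when GNU Parallel is not available.
        find "$dir" -name '*.tar.gz' | while IFS= read -r archive; do
            tar -zxf "$archive" -C "$(dirname "$archive")"
        done
    fi
}

# Small local demo: build a throwaway archive, delete the original file,
# then extract it again.
demo=$(mktemp -d)
echo "hello" > "$demo/payload.txt"
tar -czf "$demo/sample.tar.gz" -C "$demo" payload.txt
rm "$demo/payload.txt"
extract_all "$demo"
ls "$demo"
```

Extracting next to the archive keeps the result predictable when find descends into subdirectories.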
