@lh3
Last active December 30, 2017 22:22
Downloading gzip'd fastq
| Source | Dst. file type | Protocol | Time (s) | Command line |
|---|---|---|---|---|
| NCBI | .sra | ftp | 296 | wget |
| NCBI | .fastq.gz | sra toolkit | ~23000 | fastq-dump -Z --gzip --split-spot |
| local file | sra=>fastq.gz | sra toolkit | ~15000 | fastq-dump --gzip --split-spot --split-3 |
| EBI | .fastq.gz | aspera | 513+492 | aspera -QT -l 300m |
| EBI | .fastq.gz | ftp | 1876+1946 | wget |

Notes:

  • Destination: a supercomputer at the Broad Institute (yes, the connection is that fast: 22GB in 5 minutes via ftp).

  • 513+492 means downloading read1 took 513 wall-clock seconds and read2 492 seconds. The two files were downloaded in parallel.

  • Times for downloading and file conversion with SRA toolkit are extrapolated from the sizes of partially downloaded/converted files, so they are rough estimates.

  • SRA toolkit may be spending significant time on gzip compression. It is probably faster to convert to plain FASTQs first and then run gzip afterwards.
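
The last note can be sketched as a two-step conversion: dump plain FASTQ first, then compress separately. This is a minimal sketch, not a measured recipe. The function name `dump_then_gzip` and the `RUN` dry-run switch are hypothetical helpers introduced here; the `fastq-dump` flags are the ones from the table, minus `--gzip`. It assumes paired-end data, so that `--split-3` writes `<accession>_1.fastq` and `<accession>_2.fastq` to the current directory.

```shell
# Sketch only: dump plain FASTQ first, then compress as a separate step.
# RUN is a hypothetical dry-run switch (not part of SRA toolkit):
# with RUN=echo the commands are printed instead of executed.
dump_then_gzip() (
    sra="$1"                          # path to a local .sra file
    acc="$(basename "$sra" .sra)"     # e.g. SRR2088062
    # Same flags as in the table, minus --gzip
    ${RUN:-} fastq-dump --split-spot --split-3 "$sra" || exit 1
    # Assumes paired-end data: --split-3 then writes <acc>_1.fastq and <acc>_2.fastq
    ${RUN:-} gzip "${acc}_1.fastq" "${acc}_2.fastq"
)

# Dry run: print the two commands without touching any files
RUN=echo dump_then_gzip SRR2088062.sra
```

Prefixing the real call with `time` (for example `time dump_then_gzip SRR2088062.sra`) would show whether the split approach actually beats `--gzip` on a given machine.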

URLs:

# SRA ftp, SRA format
ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR208/SRR2088062/SRR2088062.sra
# ENA ftp, gzip'd FASTQ format
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR208/002/SRR2088062/SRR2088062_1.fastq.gz
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR208/002/SRR2088062/SRR2088062_2.fastq.gz
# ENA aspera, gzip'd FASTQ format
era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/SRR208/002/SRR2088062/SRR2088062_1.fastq.gz
era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/SRR208/002/SRR2088062/SRR2088062_2.fastq.gz
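
The URLs above follow a regular pattern, so they can be built from the run accession. The sketch below infers the directory layout purely from these examples and only handles 10-character accessions such as SRR2088062; other accession lengths may use a different layout and are rejected. All function names are hypothetical.

```shell
# Directory layout inferred from the example URLs above; only 10-character
# run accessions (such as SRR2088062) are handled, other lengths are rejected.
ena_fastq_path() (
    acc="$1"; mate="$2"                 # accession and mate number (1 or 2)
    [ "${#acc}" -eq 10 ] || { echo "unsupported accession: $acc" >&2; exit 1; }
    prefix="${acc%????}"                # first 6 characters, e.g. SRR208
    last="${acc#?????????}"             # last digit, e.g. 2
    echo "vol1/fastq/${prefix}/00${last}/${acc}/${acc}_${mate}.fastq.gz"
)

# ENA ftp URL and aspera source share the same path
ena_ftp_url()    { echo "ftp://ftp.sra.ebi.ac.uk/$(ena_fastq_path "$1" "$2")"; }
ena_aspera_src() { echo "era-fasp@fasp.sra.ebi.ac.uk:/$(ena_fastq_path "$1" "$2")"; }

ncbi_sra_url() (
    acc="$1"
    [ "${#acc}" -eq 10 ] || { echo "unsupported accession: $acc" >&2; exit 1; }
    prefix="${acc%????}"                # e.g. SRR208
    kind="${acc%???????}"               # e.g. SRR
    echo "ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/${kind}/${prefix}/${acc}/${acc}.sra"
)

# Example: reproduce the first ENA ftp URL listed above
ena_ftp_url SRR2088062 1
```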

rchikhi commented Apr 9, 2016

There are significant differences in the performance of fastq-dump depending on whether you ask for gzip output (3-4x slower) and whether you write to files with --split-3 instead of stdout (1.5x slower).

| Gzip | To stdout | Command line | Wall-clock time |
|---|---|---|---|
| no | yes | fastq-dump -Z --split-spot | 228s |
| no | no | fastq-dump --split-spot --split-3 | 352s |
| yes | yes | fastq-dump -Z --gzip --split-spot | 1060s |
| yes | no | fastq-dump --gzip --split-spot --split-3 | 1155s |

Tested on a local file, SRR1028232.sra (1 GB), with sratoolkit version 2.5.7 centos binaries, on a server in Pennsylvania. Time to download the SRA file using prefetch: 90 seconds.

In the post above, local .sra=>.fastq.gz is 50x slower than downloading the .sra. In this experiment, it is 13x slower. Still, I think fastq-dump needs to be faster.
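
For reference, the slowdown factors can be recomputed directly from the wall-clock times reported in the two tables. The `ratio` helper is a hypothetical one-liner around `awk`; the numbers are the ones given in the post and in this comment.

```shell
# ratio A B: print A/B rounded to one decimal place
ratio() { awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'; }

ratio 15000 296    # post above: local .sra => .fastq.gz vs. wget of the .sra  -> 50.7
ratio 1155 90      # this test: same conversion vs. prefetch download          -> 12.8
ratio 1060 228     # cost of --gzip when writing to stdout                     -> 4.6
ratio 1155 352     # cost of --gzip with --split-3                             -> 3.3
ratio 352 228      # cost of --split-3 without gzip                            -> 1.5
ratio 1155 1060    # cost of --split-3 with gzip                               -> 1.1
```

On these numbers the --split-3 penalty mostly disappears once gzip is enabled, which would be consistent with compression dominating the run time.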


lh3 commented May 1, 2016

Thanks a lot, @rchikhi. Just saw your comment. I was using --gzip because most of the time we would not want to keep plain fastq. On an additional note, in my table, running fastq-dump on remote SRA accessions seems much slower than wget download + local fastq-dump. Have you observed this? If this is true, NCBI probably should not hide the FTP download links to SRA files.
