#!/bin/bash
# Usage: deinterleave_fastq.sh f.fastq r.fastq [compress] < interleaved.fastq
#
# Deinterleaves a FASTQ file of paired reads into the two FASTQ
# files specified on the command line. Optionally gzip-compresses the output
# FASTQ files using pigz if the 3rd command line argument is the word "compress".
#
# Can deinterleave 100 million paired reads (200 million total
# reads; a 43 Gbyte file), in memory (/dev/shm), in 4m15s (255s).
#
# Latest code: https://gist.github.com/3521724
# Also see my interleaving script: https://gist.github.com/4544979
#
# Inspired by Torsten Seemann's blog post:
# http://thegenomefactory.blogspot.com.au/2012/05/cool-use-of-unix-paste-with-ngs.html

# Set up some defaults
GZIP_OUTPUT=0
PIGZ_COMPRESSION_THREADS=10

# If the third argument is the word "compress" then we'll compress the output using pigz
if [[ "$3" == "compress" ]]; then
    GZIP_OUTPUT=1
fi

if [[ ${GZIP_OUTPUT} == 0 ]]; then
    # paste joins 8 consecutive lines (one read pair) into one tab-separated line;
    # tee splits it so fields 1-4 (mate 1) and 5-8 (mate 2) are each turned back
    # into 4-line FASTQ records with tr and written to the two output files.
    paste - - - - - - - - | tee >(cut -f 1-4 | tr "\t" "\n" > "$1") | cut -f 5-8 | tr "\t" "\n" > "$2"
else
    paste - - - - - - - - | tee >(cut -f 1-4 | tr "\t" "\n" | pigz --best --processes ${PIGZ_COMPRESSION_THREADS} > "$1") | cut -f 5-8 | tr "\t" "\n" | pigz --best --processes ${PIGZ_COMPRESSION_THREADS} > "$2"
fi
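For reference, a typical invocation looks like this (the file names are just examples):

# Reads the interleaved FASTQ on stdin and writes the forward/reverse mates.
chmod +x deinterleave_fastq.sh
./deinterleave_fastq.sh f.fastq r.fastq < interleaved.fastq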
Nice, indeed fast & simple!
I have thought about setting up and using named pipes (fifos) to do the deinterleaving on the fly, so as to avoid the overhead of storing both interleaved and deinterleaved files. However, I can't seem to get them to work properly! I've been trying something like this:
Wow, I just wanted to say thank you for this simple yet fast and useful solution!
Indeed very nice! FYI, put the deinterleaver into the background to make it work with fifos:
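A minimal sketch of that fifo setup (an illustration with assumed file names, not the commenter's original snippet):

# Create named pipes for the two mates, then run the deinterleaver in the
# background; it blocks until something starts reading both fifos.
mkfifo f.fifo r.fifo
./deinterleave_fastq.sh f.fifo r.fifo < interleaved.fastq &

# Start the downstream consumers (hypothetical here) so nothing deadlocks,
# then clean up the fifos afterwards.
wc -l f.fifo &
wc -l r.fifo
wait
rm f.fifo r.fifo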
That's pretty cool.
How can I modify this to deinterleave a gzipped FASTQ file (i.e. fastq.gz)?
This should work:
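For instance, decompressing on the fly and feeding the script via stdin (a sketch with assumed file names):

zcat interleaved.fastq.gz | ./deinterleave_fastq.sh f.fastq r.fastq
# or, equivalently:
gzip -dc interleaved.fastq.gz | ./deinterleave_fastq.sh f.fastq r.fastq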
Or in python:
I love this and I'd like to include it as part of my routine workflow. Could you please include a license or a comment line to indicate that this is OK with you (or what you're OK with)? Thanks!
Awesome. I'm also using this. Thanks a lot!
I am having trouble getting the files to compress. I am running it as:
gzip -dc test.fq.gz | deinterleave_fastq.sh f.fastq r.fastq compress
but my files come out as plain fastq, not fastq.gz. Should I be running it as:
gzip -dc test.fq.gz | deinterleave_fastq.sh f.fastq.gz r.fastq.gz compress
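Judging from the script above, the output names are taken verbatim from the first two arguments, and with "compress" the data is piped through pigz regardless of what the files are called, so the second form (with .fastq.gz names) is the one that gives correctly named, gzip-compressed outputs. A sketch:

# The contents are pigz-compressed once "compress" is given; naming the
# outputs .fastq.gz simply makes the file names match their contents.
gzip -dc test.fq.gz | ./deinterleave_fastq.sh f.fastq.gz r.fastq.gz compress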
This is really handy. I'd make a minimal improvement to protect against the input process being terminated at the end of a valid record (by chance), just by adding the following near the start:
If overly cautious, you could add a bit more:
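As a sketch of that kind of guard (this is an assumption about what such an addition could look like, not necessarily the commenter's original suggestion):

# Near the start of the script: stop on the first failing command and make a
# pipeline fail if any stage fails, instead of silently writing truncated output.
set -e
set -o pipefail

# If overly cautious, a post-hoc check that both outputs hold the same number
# of whole 4-line records (zcat -f also handles the pigz-compressed case):
reads_1=$(( $(zcat -f "$1" | wc -l) / 4 ))
reads_2=$(( $(zcat -f "$2" | wc -l) / 4 ))
[[ ${reads_1} -eq ${reads_2} ]] || echo "WARNING: mate counts differ (${reads_1} vs ${reads_2})" >&2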
Worked well and fast, but it created 4 empty lines at the end of the de-interleaved files that interfered with downstream applications and had to be removed manually. Maybe something that can be fixed in a future version.
Exactly as a-kroh mentioned, it somehow generates empty lines at the end of each output file. When I ran Trimmomatic, it exited during the run; since I had seen the comments, it did not take me long to figure out what had happened. This deinterleave_fastq.sh is pretty handy, and it would be great if this problem got fixed.
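Until that is fixed, a possible workaround (assuming the blank lines only appear in the outputs and the data never contains legitimately empty lines):

# Strip empty lines from the deinterleaved outputs in place (GNU sed).
sed -i '/^$/d' f.fastq r.fastq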
Pretty cool!
Thank you all.
This is the fastest FASTQ deinterleaver I've seen: it uses the native Linux commands paste, tee, tr and cut to process the file, and there are no calculations required, just reformatting.
It assumes each read occupies 4 lines and that read pairs are interleaved, i.e. a block of 8 lines contains a pair of reads.
The bottleneck is usually disk I/O, so I've taken to mounting a tmpfs on my large-memory machine and doing the deinterleaving there.
Using this script it takes 58s to deinterleave a 4.1 Gbyte FASTQ file residing on a tmpfs, writing to the same tmpfs.
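A sketch of that tmpfs workflow (the mount point, size and file names are assumptions; /dev/shm, as mentioned in the script header, works just as well):

# Mount an in-memory filesystem and do the deinterleaving there to avoid disk I/O.
sudo mount -t tmpfs -o size=64G tmpfs /mnt/ram
cp interleaved.fastq /mnt/ram/
cd /mnt/ram
/path/to/deinterleave_fastq.sh f.fastq r.fastq < interleaved.fastq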