deinterleave FASTQ files
#!/bin/bash
# Usage: deinterleave_fastq.sh < interleaved.fastq f.fastq r.fastq [compress]
#
# Deinterleaves a FASTQ file of paired reads into two FASTQ
# files specified on the command line. Optionally GZip compresses the output
# FASTQ files using pigz if the 3rd command line argument is the word "compress"
#
# Can deinterleave 100 million paired reads (200 million total
# reads; a 43Gbyte file), in memory (/dev/shm), in 4m15s (255s)
#
# Latest code: https://gist.github.com/3521724
# Also see my interleaving script: https://gist.github.com/4544979
#
# Inspired by Torsten Seemann's blog post:
# http://thegenomefactory.blogspot.com.au/2012/05/cool-use-of-unix-paste-with-ngs.html
# Set up some defaults
GZIP_OUTPUT=0
PIGZ_COMPRESSION_THREADS=10
# If the third argument is the word "compress" then we'll compress the output using pigz
if [[ $3 == "compress" ]]; then
  GZIP_OUTPUT=1
fi
# Output file names are quoted so that paths containing spaces still work
if [[ ${GZIP_OUTPUT} == 0 ]]; then
  paste - - - - - - - - | tee >(cut -f 1-4 | tr "\t" "\n" > "$1") | cut -f 5-8 | tr "\t" "\n" > "$2"
else
  paste - - - - - - - - | tee >(cut -f 1-4 | tr "\t" "\n" | pigz --best --processes ${PIGZ_COMPRESSION_THREADS} > "$1") | cut -f 5-8 | tr "\t" "\n" | pigz --best --processes ${PIGZ_COMPRESSION_THREADS} > "$2"
fi
@nathanhaigh

Owner

nathanhaigh commented Aug 30, 2012

This is the fastest FASTQ deinterleaver I've seen as it uses native Linux commands paste, tee, tr and cut to process the file and there are no calculations required - just reformatting.

It assumes each read occupies 4 lines and that read pairs are interleaved, i.e. each block of 8 lines contains one pair of reads.
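To make the 8-line assumption concrete, here is a minimal sketch that runs the stages one at a time on a tiny, made-up read pair. The read names, sequences and temporary file are illustrative only and not part of the script.

```shell
# Tiny, made-up interleaved FASTQ: one read pair = 8 lines.
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

printf '%s\n' \
  '@pair1/1' 'ACGT' '+' 'IIII' \
  '@pair1/2' 'TTGG' '+' 'JJJJ' > "$tmp/interleaved.fastq"

# Stage 1: paste joins every 8 input lines into one tab-separated row.
row=$(paste - - - - - - - - < "$tmp/interleaved.fastq")

# Stage 2: columns 1-4 hold the first read of the pair, columns 5-8 the second.
# tr turns the tabs back into newlines to restore the 4-line records.
forward=$(printf '%s\n' "$row" | cut -f 1-4 | tr '\t' '\n')
reverse=$(printf '%s\n' "$row" | cut -f 5-8 | tr '\t' '\n')

printf 'forward:\n%s\nreverse:\n%s\n' "$forward" "$reverse"
```

The script itself does both cuts in a single pass by sending the pasted rows through tee and a process substitution, so the input is only read once.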

The bottleneck is usually disk IO so I've taken to mounting a tmpfs on my large memory machine and doing deinterleaving from there.

Using this script it takes:
58s to deinterleave a 4.1GByte FASTQ file residing on a tmpfs and writing to the same tmpfs.

@nathanhaigh

Owner

nathanhaigh commented Aug 31, 2012

Using this script it takes:
2m26s to deinterleave the same 4.1GByte FASTQ file residing on a RAID5 filesystem (5*15k rpm disks) and writing to the same filesystem.

@HenrivdGeest

HenrivdGeest commented Sep 12, 2012

Nice, indeed fast & simple!

@nathanhaigh

Owner

nathanhaigh commented Sep 18, 2012

I have thought about setting up named pipes (FIFOs) to do the deinterleaving on the fly, so as to avoid the overhead of storing both the interleaved and the deinterleaved files. However, I can't seem to get them to work properly! I've been trying something like this:

mkfifo f.fastq
mkfifo r.fastq

# deinterleave a paired-end/matepair FASTQ file to the named pipes
deinterleave_paste.sh < infile.fastq f.fastq r.fastq

# setup background process to read/process reads from each of the named pipes
cat r.fastq &
cat f.fastq &
@guillermo-carrasco

guillermo-carrasco commented Apr 25, 2013

Wow, I just wanted to say thank you for this simple yet fast and useful solution, thanks!

@thasso

thasso commented Nov 11, 2013

Indeed very nice! FYI, put the deinterleaver into the background to make it work with FIFOs:

mkfifo f.fastq
mkfifo r.fastq

# deinterleave a paired-end/matepair FASTQ file to the named pipes (IN BACKGROUND)
deinterleave_paste.sh f.fastq r.fastq < infile.fastq &

# setup process to read/process reads from each of the named pipes (IN FOREGROUND)
cat r.fastq f.fastq
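A variation on the above, as a sketch only: reading the two pipes one after the other could stall on larger inputs, because the pipe that is not yet being read can fill up and block tee. Starting one reader per pipe before running the deinterleaver avoids that. The `deinterleave` function below is a stand-in that copies the uncompressed branch of the script, `wc -l` stands in for a real consumer, and the input data are made up. It needs bash for the process substitution.

```shell
#!/bin/bash
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# Stand-in for the script: same pipeline as its uncompressed branch.
deinterleave() {
  paste - - - - - - - - \
    | tee >(cut -f 1-4 | tr '\t' '\n' > "$1") \
    | cut -f 5-8 | tr '\t' '\n' > "$2"
}

# Made-up input: two read pairs (16 lines).
printf '%s\n' \
  '@p1/1' 'ACGT' '+' 'IIII' '@p1/2' 'TTGG' '+' 'JJJJ' \
  '@p2/1' 'CCAA' '+' 'IIII' '@p2/2' 'GGTT' '+' 'JJJJ' > "$tmp/in.fastq"

mkfifo "$tmp/f.fastq" "$tmp/r.fastq"

# One background reader per pipe, started before the writer.
wc -l < "$tmp/f.fastq" > "$tmp/f.count" &
wc -l < "$tmp/r.fastq" > "$tmp/r.count" &

# The writer can now run in the foreground; both pipes are drained concurrently.
deinterleave "$tmp/f.fastq" "$tmp/r.fastq" < "$tmp/in.fastq"
wait

f_count=$(tr -d '[:space:]' < "$tmp/f.count")
r_count=$(tr -d '[:space:]' < "$tmp/r.count")
echo "forward lines: $f_count, reverse lines: $r_count"
```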
@inodb

inodb commented Dec 5, 2013

That's pretty cool

@sentausa

sentausa commented Apr 27, 2015

How to modify this to deinterleave a zipped fastq file (i.e. fastq.gz)?

@leffj

leffj commented Jul 26, 2016

this should work: gzip -dc test.fq.gz | deinterleave_fastq.sh f.fastq r.fastq

@GlastonburyC

GlastonburyC commented Apr 6, 2017

Or in python:

# Lines 0-3 of every 8-line block go to read 1, lines 4-7 to read 2.
with open('interleaved.fastq') as interleaved, \
     open('r1.fastq', 'w') as r1, \
     open('r2.fastq', 'w') as r2:
    for i, line in enumerate(interleaved):
        (r1 if i % 8 < 4 else r2).write(line)


@rrohwer

rrohwer commented Aug 16, 2017

I love this and I'd like to include it as part of my routine workflow. Could you please include a license or a comment line to indicate that this is OK with you (or what you're OK with)? Thanks!!

@mahmadza

mahmadza commented Nov 17, 2017

Awesome. I'm also using this. Thanks a lot!

@spongebob22

spongebob22 commented Apr 12, 2018

I am having trouble getting the files to compress. I am running as:

gzip -dc test.fq.gz | deinterleave_fastq.sh f.fastq r.fastq compress

But my files come out named fastq, not fastq.gz. Should I be running it as:

gzip -dc test.fq.gz | deinterleave_fastq.sh f.fastq.gz r.fastq.gz compress
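Reading the script as posted, it writes to exactly the file names it is given, so with "compress" the gzip data would end up under whatever name is passed, with or without a .gz suffix. A rough way to check whether a file really holds gzip data regardless of its name is `gzip -t`. The sketch below uses plain gzip rather than pigz, and the `is_gzip` helper and file names are made up for illustration.

```shell
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# A gzip stream stored under a name without the .gz suffix ...
printf '@r1\nACGT\n+\nIIII\n' | gzip -c > "$tmp/f.fastq"
# ... and a plain-text FASTQ file for comparison.
printf '@r1\nACGT\n+\nIIII\n' > "$tmp/plain.fastq"

# Hypothetical helper: succeeds only if the file content is valid gzip data.
is_gzip() {
  gzip -t < "$1" 2>/dev/null
}

if is_gzip "$tmp/f.fastq"; then f_status=compressed; else f_status=plain; fi
if is_gzip "$tmp/plain.fastq"; then plain_status=compressed; else plain_status=plain; fi

echo "f.fastq: $f_status, plain.fastq: $plain_status"
```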

@keiranmraine

keiranmraine commented Jul 11, 2018

This is really handy, I'd make a minimal improvement to protect against the input process being terminated at the end of a valid record (by chance). Just the addition of the following near the start:

set -e
set -o pipefail

If you want to be extra cautious, you could add a bit more tee and confirm that the same number of lines is written to read_1 and read_2 (if not, it's a good indicator that the file may not be well formed) and that their sum equals the number of input lines:

# tmpdir would be necessary for the count_* files
tee >(paste - - - - - - - -  | tee >(cut -f 1-4 | tr "\t" "\n" | tee >(wc -l > count_a)\
 | gzip -c > $1) | cut -f 5-8 | tr "\t" "\n" | tee >(wc -l > count_b)\
 | gzip -c > $2) | wc -l > count_all
# then some checking of  values
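The "checking of values" step could look roughly like the sketch below. The `check_counts` helper is hypothetical, the count files are filled with made-up numbers rather than produced by the pipeline above, and the multiple-of-8 test is an extra assumption based on each read pair occupying 8 lines.

```shell
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# Hypothetical helper.
# $1, $2: files holding the line counts of the two outputs
# $3:     file holding the line count of the interleaved input
check_counts() {
  a=$(tr -d '[:space:]' < "$1")
  b=$(tr -d '[:space:]' < "$2")
  all=$(tr -d '[:space:]' < "$3")
  # Both outputs equal in size, together as long as the input,
  # and the input made up of whole 8-line read pairs.
  [ "$a" -eq "$b" ] && [ $((a + b)) -eq "$all" ] && [ $((all % 8)) -eq 0 ]
}

# Made-up counts standing in for the files written by the wc -l calls.
echo 8 > "$tmp/count_a"
echo 8 > "$tmp/count_b"
echo 16 > "$tmp/count_all"

if check_counts "$tmp/count_a" "$tmp/count_b" "$tmp/count_all"; then
  result=ok
else
  result=mismatch
fi
echo "$result"
```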
@a-kroh

a-kroh commented Sep 6, 2018

Worked well and fast, but it created 4 empty lines at the end of the deinterleaved files, which interfered with downstream applications and had to be removed manually. Maybe that is something that can be fixed in a future version.
It took me a while to figure out the problem (the error message of the downstream application was cryptic), so I thought this might be useful for others to know.

@zhenzhen3008

zhenzhen3008 commented Sep 7, 2018

Exactly as a-kroh mentioned, it somehow generates empty lines at the end of each output file, and Trimmomatic exits partway through its run when given those files. Since I had seen the comments, it did not take me long to figure out what had happened. This deinterleave_fastq.sh is pretty handy, and it would be great if this problem got fixed.
I used
$ egrep -v '^$' EMPTYLINE.fastq > NO_EMPTYLINE.fastq
to remove the empty lines.
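One likely cause, judging from how paste behaves: a single blank line at the end of the input starts a new row of 8 fields, all of them empty, and that row comes out as 4 empty lines in each output file. The sketch below reproduces this on made-up data and shows dropping empty lines before paste instead of afterwards. Note that a legitimately zero-length read has empty sequence and quality lines, which this filter (like the egrep above) would also remove.

```shell
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# Made-up input: one read pair followed by a single trailing blank line.
printf '@p/1\nACGT\n+\nIIII\n@p/2\nTTGG\n+\nJJJJ\n\n' > "$tmp/in.fastq"

# As in the script: the blank 9th line begins a second, all-empty row.
paste - - - - - - - - < "$tmp/in.fastq" \
  | cut -f 1-4 | tr '\t' '\n' > "$tmp/f_raw.fastq"

# Guard: drop empty lines first, so no all-empty row is ever formed.
grep -v '^$' "$tmp/in.fastq" | paste - - - - - - - - \
  | cut -f 1-4 | tr '\t' '\n' > "$tmp/f_clean.fastq"

raw_lines=$(wc -l < "$tmp/f_raw.fastq" | tr -d '[:space:]')
clean_lines=$(wc -l < "$tmp/f_clean.fastq" | tr -d '[:space:]')
echo "without guard: $raw_lines lines, with guard: $clean_lines lines"
```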
