
@nathanhaigh
Last active January 22, 2024 15:25
deinterleave FASTQ files
#!/bin/bash
# Usage: deinterleave_fastq.sh < interleaved.fastq f.fastq r.fastq [compress]
#
# Deinterleaves a FASTQ file of paired reads into two FASTQ
# files specified on the command line. Optionally GZip compresses the output
# FASTQ files using pigz if the 3rd command line argument is the word "compress"
#
# Can deinterleave 100 million paired reads (200 million total
# reads; a 43Gbyte file), in memory (/dev/shm), in 4m15s (255s)
#
# Latest code: https://gist.github.com/3521724
# Also see my interleaving script: https://gist.github.com/4544979
#
# Inspired by Torsten Seemann's blog post:
# http://thegenomefactory.blogspot.com.au/2012/05/cool-use-of-unix-paste-with-ngs.html

# Set up some defaults
GZIP_OUTPUT=0
PIGZ_COMPRESSION_THREADS=10

# If the third argument is the word "compress", compress the output using pigz
if [[ $3 == "compress" ]]; then
    GZIP_OUTPUT=1
fi

if [[ ${GZIP_OUTPUT} == 0 ]]; then
    paste - - - - - - - - | tee >(cut -f 1-4 | tr "\t" "\n" > "$1") | cut -f 5-8 | tr "\t" "\n" > "$2"
else
    paste - - - - - - - - | tee >(cut -f 1-4 | tr "\t" "\n" | pigz --best --processes ${PIGZ_COMPRESSION_THREADS} > "$1") | cut -f 5-8 | tr "\t" "\n" | pigz --best --processes ${PIGZ_COMPRESSION_THREADS} > "$2"
fi
@nathanhaigh

This is the fastest FASTQ deinterleaver I've seen: it uses the native Linux commands paste, tee, tr and cut to process the file, and no calculations are required, just reformatting.

It assumes each read occupies 4 lines and that read pairs are interleaved, i.e. each block of 8 lines contains one pair of reads.
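The core trick can be seen on a toy 8-line record (the read names below are made up for illustration): paste folds each block of 8 lines into one tab-separated row, and cut picks out the columns belonging to each mate.

```shell
# Toy interleaved FASTQ: one pair = 8 lines.
# paste - x8 joins them into a single tab-separated row;
# cut -f 1-4 keeps the forward read; tr restores the newlines.
printf '@r1/1\nACGT\n+\nIIII\n@r1/2\nTGCA\n+\nJJJJ\n' \
  | paste - - - - - - - - \
  | cut -f 1-4 | tr '\t' '\n'
# prints:
# @r1/1
# ACGT
# +
# IIII
```

Swapping `cut -f 1-4` for `cut -f 5-8` yields the reverse read; the script runs both via `tee` so the input is only read once.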

The bottleneck is usually disk I/O, so I've taken to mounting a tmpfs on my large-memory machine and doing the deinterleaving from there.

Using this script, it takes 58s to deinterleave a 4.1 GByte FASTQ file residing on a tmpfs and writing to the same tmpfs.

@nathanhaigh

Using this script, it takes 2m26s to deinterleave the same 4.1 GByte FASTQ file residing on a RAID5 filesystem (5*15k rpm disks) and writing to the same filesystem.

@HenrivdGeest

Nice, indeed fast & simple!

@nathanhaigh

I have thought about setting up and using named pipes (FIFOs) to do the deinterleaving on the fly, so as to avoid the overhead of storing both the interleaved and deinterleaved files. However, I can't seem to get them to work properly! I've been trying something like this:

mkfifo f.fastq
mkfifo r.fastq

# deinterleave a paired-end/matepair FASTQ file to the named pipes
deinterleave_paste.sh < infile.fastq f.fastq r.fastq

# setup background process to read/process reads from each of the named pipes
cat r.fastq &
cat f.fastq &

@guillermo-carrasco

Wow, I just wanted to say thank you for this simple yet fast and useful solution!

@thasso

thasso commented Nov 11, 2013

Indeed very nice! FYI, put the deinterleaver into the background to make it work with FIFOs:

mkfifo f.fastq
mkfifo r.fastq

# deinterleave a paired-end/matepair FASTQ file to the named pipes (IN BACKGROUND)
deinterleave_paste.sh f.fastq r.fastq < infile.fastq &

# setup process to read/process reads from each of the named pipes (IN FOREGROUND)
cat r.fastq f.fastq
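The backgrounding matters because opening a FIFO blocks until both ends are connected: a writer opening the pipe waits for a reader, and vice versa. A minimal self-contained demo with generic commands (not the FASTQ pipeline itself):

```shell
# Opening a FIFO for writing blocks until a reader opens the other
# end, so the writer must run in the background while the reader
# runs in the foreground.
fifo=$(mktemp -u)          # unused temp name for the pipe
mkfifo "$fifo"
echo "hello" > "$fifo" &   # writer in background
cat "$fifo"                # reader in foreground; prints "hello"
rm "$fifo"
```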

@inodb

inodb commented Dec 5, 2013

That's pretty cool

@sentausa

How to modify this to deinterleave a zipped fastq file (i.e. fastq.gz)?

@leffj

leffj commented Jul 26, 2016

This should work: gzip -dc test.fq.gz | deinterleave_fastq.sh f.fastq r.fastq

@GlastonburyC

Or in python:

r1 = open('r1.fastq', 'w')
r2 = open('r2.fastq', 'w')
# Lines 0-3 of each 8-line block belong to the forward read, 4-7 to the reverse
for i, line in enumerate(open('interleaved.fastq')):
    (r1 if i % 8 < 4 else r2).write(line)
r1.close()
r2.close()


@rrohwer

rrohwer commented Aug 16, 2017

I love this and I'd like to include it as part of my routine workflow. Could you please include a license or a comment line to indicate that this is OK with you (or what you're OK with)? Thanks!!

@mahmadza

Awesome. I'm also using this. Thanks a lot!

@spongebob22

I am having trouble getting the files to compress. I am running:

gzip -dc test.fq.gz | deinterleave_fastq.sh f.fastq r.fastq compress

But my files come out as plain fastq, not fastq.gz. Should I be running it as:

gzip -dc test.fq.gz | deinterleave_fastq.sh f.fastq.gz r.fastq.gz compress
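Looking at the script, pigz's output is redirected verbatim to whatever names are passed as $1 and $2, so the second form (with .gz names) is the one to use: the files are compressed either way, and the extension is purely cosmetic. A quick demonstration with plain gzip:

```shell
# gzip (like pigz) writes compressed bytes to wherever stdout is
# redirected; the filename you choose doesn't affect the contents,
# so name the outputs *.fastq.gz yourself when using "compress".
echo "ACGT" | gzip -c > reads.fastq.gz
gzip -dc reads.fastq.gz   # prints ACGT
rm reads.fastq.gz
```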

@keiranmraine

This is really handy. I'd make a minimal improvement to protect against the input process happening to be terminated at the end of a valid record. Just add the following near the start:

set -e
set -o pipefail

If you're being overly cautious, you could add a bit more tee and confirm that the numbers of lines written to read_1 and read_2 are equal (if not, it's a good indicator that the file may not be well formed) and that their sum matches the line count of the input:

# tmpdir would be necessary for the count_* files
tee >(paste - - - - - - - -  | tee >(cut -f 1-4 | tr "\t" "\n" | tee >(wc -l > count_a)\
 | gzip -c > $1) | cut -f 5-8 | tr "\t" "\n" | tee >(wc -l > count_b)\
 | gzip -c > $2) | wc -l > count_all
# then some checking of values
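The "checking of values" step might look like the sketch below, assuming the count_a, count_b and count_all files written by the pipeline above (here faked with plausible values for a two-pair input, so the snippet is self-contained):

```shell
# Fake the counts a 2-pair (16-line) input would produce
echo 8 > count_a; echo 8 > count_b; echo 16 > count_all

# Sketch: R1 and R2 must have equal line counts, and together
# they must account for every line of the input
a=$(cat count_a); b=$(cat count_b); all=$(cat count_all)
if [ "$a" -ne "$b" ] || [ "$((a + b))" -ne "$all" ]; then
    echo "ERROR: input looks malformed (R1=$a R2=$b total=$all)" >&2
    exit 1
fi
echo OK

rm count_a count_b count_all
```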

@a-kroh

a-kroh commented Sep 6, 2018

Worked well and fast, but created 4 empty lines at the end of the deinterleaved files that interfered with downstream applications and had to be removed manually. Maybe something that can be fixed in a future version.
It took me a while to figure out the problem (the downstream application's error message was cryptic), so I thought this might be useful for others to know.

@zhenzhen3008

zhenzhen3008 commented Sep 7, 2018

Exactly as a-kroh mentioned, it somehow generates empty lines at the end of each output file. When I ran Trimmomatic, it exited during the run; having seen the comments here, it did not take me long to figure out what had happened. This deinterleave_fastq.sh is pretty handy; it would be great if this problem got fixed.
I used
$ egrep -v '^$' EMPTYLINE.fastq > NO_EMPTYLINE.fastq
to remove the empty lines.

@ugayujy

ugayujy commented Jun 14, 2019

Pretty cool!
Thanks

@sinamajidian

Thank you all.
I've edited it a bit:

cat input.fastq | paste - - - - - - - - | cut -f 1-4 | tr "\t" "\n" | egrep -v '^$' > R1.fastq
cat input.fastq | paste - - - - - - - - | cut -f 5-8 | tr "\t" "\n" | egrep -v '^$' > R2.fastq

@michaelsilverstein

michaelsilverstein commented Jun 18, 2021

Thanks everyone! Here is how I have been deinterleaving an entire directory of compressed fastq.gz:
https://gist.github.com/michaelsilverstein/04c880b8e7728982ee57399599cfb56d#file-deinterleave_dir-sh

#!/bin/bash 

# Deinterleave entire directory of compressed .fastq.gz files and re-compress mates

# Usage: deinterleave_dir.sh indir outdir

mkdir -p "$2"

for file in "$1"/*
do
        echo "$file"
        out1=$2/$(basename "${file%.fastq.gz}")_R1.fastq.gz
        out2=$2/$(basename "${file%.fastq.gz}")_R2.fastq.gz
        pigz -dc --processes 16 "$file" | deinterleave_fastq.sh "$out1" "$out2" compress
done

This script will read compressed files from indir, deinterleave them, and save them to outdir with _R1.fastq.gz and _R2.fastq.gz file extensions.

@telatin

telatin commented Oct 1, 2021

Hi! SeqFu bundles seqfu interleave and seqfu deinterleave. It's fast (compiled) and provides an easier, less error-prone CLI experience.
If you want to give it a try, see the SeqFu website.

It can be installed via miniconda: conda install -c bioconda seqfu.
