Last active May 6, 2024 06:38
deinterleave FASTQ files
# Usage: < interleaved.fastq f.fastq r.fastq [compress]
# Deinterleaves a FASTQ file of paired reads into two FASTQ
# files specified on the command line. Optionally GZip compresses the output
# FASTQ files using pigz if the 3rd command line argument is the word "compress"
# Can deinterleave 100 million paired reads (200 million total
# reads; a 43Gbyte file), in memory (/dev/shm), in 4m15s (255s)
# Latest code:
# Also see my interleaving script:
# Inspired by Torsten Seemann's blog post:
# Set up some defaults
GZIP_OUTPUT=0
PIGZ_COMPRESSION_THREADS=10  # assumed default; tune to the available cores
# If the third argument is the word "compress" then we'll compress the output using pigz
if [[ $3 == "compress" ]]; then
  GZIP_OUTPUT=1
fi
if [[ ${GZIP_OUTPUT} == 0 ]]; then
  paste - - - - - - - - | tee >(cut -f 1-4 | tr "\t" "\n" > "$1") | cut -f 5-8 | tr "\t" "\n" > "$2"
else
  paste - - - - - - - - | tee >(cut -f 1-4 | tr "\t" "\n" | pigz --best --processes ${PIGZ_COMPRESSION_THREADS} > "$1") | cut -f 5-8 | tr "\t" "\n" | pigz --best --processes ${PIGZ_COMPRESSION_THREADS} > "$2"
fi
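
To see what the pipeline does, here is a minimal sketch on a toy input of two read pairs. The function names `deinterleave` and `make_toy_fastq` and the read names are made up for illustration; the pipeline body is the uncompressed branch of the script. `paste - - - - - - - -` joins every 8 input lines (one read pair) into a single tab-separated row, and the two `cut` calls split that row back into mate 1 (fields 1-4) and mate 2 (fields 5-8).

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the uncompressed branch of the script.
# Reads interleaved FASTQ on stdin; writes mate 1 to $1 and mate 2 to $2.
deinterleave() {
  paste - - - - - - - - \
    | tee >(cut -f 1-4 | tr "\t" "\n" > "$1") \
    | cut -f 5-8 | tr "\t" "\n" > "$2"
}

# Toy interleaved input: two pairs, 8 lines per pair (names are invented).
make_toy_fastq() {
  printf '%s\n' \
    '@pair1/1' 'ACGT' '+' 'IIII' \
    '@pair1/2' 'TTTT' '+' 'IIII' \
    '@pair2/1' 'GGCC' '+' 'IIII' \
    '@pair2/2' 'AAAA' '+' 'IIII'
}
```

Running `make_toy_fastq | deinterleave r1.fastq r2.fastq` should leave the `/1` records in `r1.fastq` and the `/2` records in `r2.fastq`, 8 lines each.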

I am having trouble getting the files to compress. I am running as:

gzip -dc test.fq.gz | f.fastq r.fastq compress

But my files are deinterleaved as fastq, not fastq.gz. Should I be running it as:

gzip -dc test.fq.gz | f.fastq.gz r.fastq.gz compress
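
Judging from the script, the output names are used exactly as given, so a `.gz` suffix is never added automatically and has to come from the caller. Whether a file is actually gzip-compressed can be checked independently of its name. A minimal sketch, with `is_gzipped` as a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the file's contents are valid gzip data,
# regardless of what the file is called. Reading from stdin sidesteps any
# suffix handling in gzip itself.
is_gzipped() {
  gzip -t < "$1" 2>/dev/null
}
```

For example, `is_gzipped f.fastq && echo "compressed despite the name"`.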


This is really handy. I'd make one minimal improvement to protect against the input process being terminated, by chance, at the end of a valid record. Just add the following near the start:

set -e
set -o pipefail
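
To illustrate what `pipefail` changes: without it, a pipeline's exit status is that of its last command, so a failure in an earlier stage goes unnoticed. The sketch below uses `false` as a stand-in for a failing upstream stage; the function names are made up. Note that the option only affects pipelines run by the shell that sets it, so for a decompressor piped into the script from outside, the calling shell would need it as well.

```shell
#!/usr/bin/env bash
# Print the exit status of "failing stage | healthy stage" without pipefail.
# The subshell keeps the option change from leaking into the caller.
status_without_pipefail() {
  ( set +o pipefail
    if false | cat > /dev/null; then echo 0; else echo "$?"; fi )
}

# Same pipeline with pipefail: the failure of the first stage is reported.
status_with_pipefail() {
  ( set -o pipefail
    if false | cat > /dev/null; then echo 0; else echo "$?"; fi )
}
```

The first function prints 0 and the second prints 1.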

If you want to be overly cautious, you could add a bit more tee and confirm that the numbers of lines written to read_1 and read_2 are equal (if not, it's a good indicator that the file may not be well formed) and that their sum matches the number of input lines:

# tmpdir would be necessary for the count_* files
tee >(paste - - - - - - - -  | tee >(cut -f 1-4 | tr "\t" "\n" | tee >(wc -l > count_a)\
 | gzip -c > $1) | cut -f 5-8 | tr "\t" "\n" | tee >(wc -l > count_b)\
 | gzip -c > $2) | wc -l > count_all
# then some checking of  values
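
The "checking of values" step could look roughly like the sketch below. `check_counts` is a hypothetical helper, and it assumes the three count files are complete when it runs; process substitutions finish asynchronously, so a real script may have to wait for them first.

```shell
#!/usr/bin/env bash
# Hypothetical checker for the count files written by the pipeline above.
# Usage: check_counts count_a count_b count_all
check_counts() {
  local a b all
  # Arithmetic expansion also strips the padding some wc versions print.
  a=$(( $(cat "$1") ))
  b=$(( $(cat "$2") ))
  all=$(( $(cat "$3") ))

  if [ "$a" -ne "$b" ]; then
    echo "mate files differ in length: $a vs $b lines" >&2
    return 1
  fi
  if [ $(( a + b )) -ne "$all" ]; then
    echo "output lines ($(( a + b ))) do not match input lines ($all)" >&2
    return 1
  fi
  if [ $(( a % 4 )) -ne 0 ]; then
    echo "line count $a is not a multiple of 4" >&2
    return 1
  fi
  return 0
}
```

The sum check catches truncated input because `paste` pads an incomplete final row with empty fields, which makes the outputs longer than the input.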


a-kroh commented Sep 6, 2018

Worked well and fast, but it created 4 empty lines at the end of the de-interleaved files that interfered with downstream applications and had to be removed manually. Maybe this is something that can be fixed in a future version.
It took me a while to figure out the problem (the error message of the downstream application was cryptic), so I thought this might be useful for others to know.
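
One plausible cause (an assumption, not confirmed in this thread) is a single blank line at the end of the input. `paste - - - - - - - -` then emits one extra row of eight empty fields, and `cut -f 1-4 | tr "\t" "\n"` turns that row into exactly four empty lines. A small sketch that counts them, with a made-up function name:

```shell
#!/usr/bin/env bash
# Count the empty lines that end up in mate 1 for the FASTQ given on stdin.
# Only the paste/cut/tr steps of the script are reproduced here.
count_empty_r1_lines() {
  paste - - - - - - - - | cut -f 1-4 | tr "\t" "\n" | grep -c '^$'
}
```

Feeding it one complete read pair followed by a single blank line reports 4; the same pair without the blank line reports 0.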


zhenzhen3008 commented Sep 7, 2018

Exactly as a-kroh mentioned, it somehow generates empty lines at the end of each output file. When I ran Trimmomatic, it exited partway through the run. Because I had seen the comments, it did not take me long to figure out what happened. This is pretty handy; it would be great if this problem got fixed.
I used
$ egrep -v '^$' EMPTYLINE.fastq > NO_EMPTYLINE.fastq
to remove the empty lines.
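
The same clean-up can be wrapped in a small helper that also checks the result. This is only a sketch: `strip_empty_lines` is a hypothetical name, and plain `grep` is used because recent GNU grep releases flag `egrep` as obsolescent.

```shell
#!/usr/bin/env bash
# Hypothetical helper: copy a FASTQ file ($1) to $2 without its empty lines,
# then succeed only if what remains is a whole number of 4-line records.
strip_empty_lines() {
  local lines
  grep -v '^$' "$1" > "$2"
  lines=$(wc -l < "$2")
  [ $(( lines % 4 )) -eq 0 ]
}
```

For example, `strip_empty_lines EMPTYLINE.fastq NO_EMPTYLINE.fastq || echo "record count looks wrong" >&2`.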


ugayujy commented Jun 14, 2019

Pretty cool!


Thank you all.
I've edited it a bit:

paste - - - - - - - - < input.fastq | cut -f 1-4 | tr "\t" "\n" | grep -v '^$' > R1.fastq
paste - - - - - - - - < input.fastq | cut -f 5-8 | tr "\t" "\n" | grep -v '^$' > R2.fastq


michaelsilverstein commented Jun 18, 2021

Thanks everyone! Here is how I have been deinterleaving an entire directory of compressed fastq.gz files:


# Deinterleave entire directory of compressed .fastq.gz files and re-compress mates

#Usage: indir outdir

mkdir -p "$2"

for file in "$1"/*; do
        echo "$file"
        out1=$2/$(basename "${file%.fastq.gz}")_R1.fastq.gz
        out2=$2/$(basename "${file%.fastq.gz}")_R2.fastq.gz
        pigz --best --processes 16 -dc "$file" | "$out1" "$out2" compress
done

This script will read compressed files from indir, deinterleave them, and save them to outdir with _R1.fastq.gz and _R2.fastq.gz file extensions.
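
A self-contained variant of the same idea is sketched below. It is not the commenter's script: `deinterleave_gz` and `deinterleave_dir` are hypothetical names, the deinterleaving step is inlined as a function, and `gzip` replaces `pigz` so the sketch has no extra dependency.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the deinterleave script with compressed output.
# Reads interleaved FASTQ on stdin; writes gzipped mates to $1 and $2.
deinterleave_gz() {
  paste - - - - - - - - \
    | tee >(cut -f 1-4 | tr "\t" "\n" | gzip -c > "$1") \
    | cut -f 5-8 | tr "\t" "\n" | gzip -c > "$2"
}

# Deinterleave every *.fastq.gz in indir ($1) into outdir ($2).
deinterleave_dir() {
  local indir=$1 outdir=$2 file base
  mkdir -p "$outdir"
  for file in "$indir"/*.fastq.gz; do
    # With no matches the glob stays literal, so skip non-existent paths.
    [ -e "$file" ] || continue
    base=$(basename "${file%.fastq.gz}")
    gzip -dc "$file" \
      | deinterleave_gz "$outdir/${base}_R1.fastq.gz" "$outdir/${base}_R2.fastq.gz"
  done
}
```

Restricting the glob to `*.fastq.gz` keeps other files in the directory from being fed to the decompressor.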


telatin commented Oct 1, 2021

Hi! SeqFu bundles seqfu interleave and seqfu deinterleave. It's fast (compiled) and provides an easier, less error-prone CLI experience.
If you want to give it a try, see the SeqFu website.

Can be installed via miniconda: conda install -c bioconda seqfu.
