#!/bin/bash
# Usage: deinterleave_fastq.sh < interleaved.fastq f.fastq r.fastq [compress]
#
# Deinterleaves a FASTQ file of paired reads into two FASTQ
# files specified on the command line. Optionally GZip compresses the output
# FASTQ files using pigz if the 3rd command line argument is the word "compress"
#
# Can deinterleave 100 million paired reads (200 million total
# reads; a 43Gbyte file), in memory (/dev/shm), in 4m15s (255s)
#
# Latest code: https://gist.github.com/3521724
# Also see my interleaving script: https://gist.github.com/4544979
#
# Inspired by Torsten Seemann's blog post:
# http://thegenomefactory.blogspot.com.au/2012/05/cool-use-of-unix-paste-with-ngs.html
# Set up some defaults
GZIP_OUTPUT=0
PIGZ_COMPRESSION_THREADS=10
# If the third argument is the word "compress" then we'll compress the output using pigz
if [[ $3 == "compress" ]]; then
    GZIP_OUTPUT=1
fi
if [[ ${GZIP_OUTPUT} == 0 ]]; then
    paste - - - - - - - - | tee >(cut -f 1-4 | tr "\t" "\n" > $1) | cut -f 5-8 | tr "\t" "\n" > $2
else
    paste - - - - - - - - | tee >(cut -f 1-4 | tr "\t" "\n" | pigz --best --processes ${PIGZ_COMPRESSION_THREADS} > $1) | cut -f 5-8 | tr "\t" "\n" | pigz --best --processes ${PIGZ_COMPRESSION_THREADS} > $2
fi
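To see why the one-liner works, here is a minimal sketch on a toy two-pair input. The file names, read names and bases are invented for illustration. `paste - - - - - - - -` joins every eight input lines into one tab-separated row, so fields 1-4 hold the forward record and fields 5-8 the reverse record. The sketch writes the rows to an intermediate file instead of using `tee`, purely to keep each step visible.

```shell
#!/bin/bash
# Toy interleaved FASTQ: two read pairs, 16 lines (names and bases are invented)
demo_dir=$(mktemp -d)
printf '%s\n' \
    '@pair1/1' 'ACGT' '+' 'IIII' \
    '@pair1/2' 'TTTT' '+' 'JJJJ' \
    '@pair2/1' 'GGGG' '+' 'IIII' \
    '@pair2/2' 'CCCC' '+' 'JJJJ' > "$demo_dir/interleaved.fastq"

# Each output row holds one read pair as 8 tab-separated fields
paste - - - - - - - - < "$demo_dir/interleaved.fastq" > "$demo_dir/rows.tsv"

# Fields 1-4 are the forward read, fields 5-8 the reverse read;
# tr turns the tabs back into newlines to restore 4-line FASTQ records
cut -f 1-4 "$demo_dir/rows.tsv" | tr '\t' '\n' > "$demo_dir/f.fastq"
cut -f 5-8 "$demo_dir/rows.tsv" | tr '\t' '\n' > "$demo_dir/r.fastq"

head -n 1 "$demo_dir/f.fastq"   # name of the first forward read
head -n 1 "$demo_dir/r.fastq"   # name of the first reverse read
```

The real script avoids the intermediate file by sending the rows through `tee` into a process substitution, which is what makes it a single streaming pass.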
Nice, indeed fast & simple!
I have thought about setting up and using named pipes (FIFOs) to do the deinterleaving on the fly, so as to avoid the overhead of storing both the interleaved and deinterleaved files. However, I can't seem to get them to work properly! I've been trying something like this:
mkfifo f.fastq
mkfifo r.fastq
# deinterleave a paired-end/matepair FASTQ file to the named pipes
deinterleave_paste.sh < infile.fastq f.fastq r.fastq
# setup background process to read/process reads from each of the named pipes
cat r.fastq &
cat f.fastq &
Wow, I just wanted to say thank you for this simple yet fast and useful solution, thanks!
Indeed very nice! FYI, put the deinterleaver into the background to make it work with FIFOs:
mkfifo f.fastq
mkfifo r.fastq
# deinterleave a paired-end/matepair FASTQ file to the named pipes (IN BACKGROUND)
deinterleave_paste.sh f.fastq r.fastq < infile.fastq &
# setup process to read/process reads from each of the named pipes (IN FOREGROUND)
cat r.fastq f.fastq
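A possible variant, sketched under a few assumptions: reading the two FIFOs one after the other relies on pipe buffering, and on a large input the writer may stall on the FIFO that has no reader yet. Giving each FIFO its own concurrent reader avoids that. Here `deinterleave` is a stand-in function copied from the script's uncompressed branch so the sketch runs on its own, `wc -l` is only a placeholder for a real downstream tool, and the input is an invented one-pair file.

```shell
#!/bin/bash
# Stand-in for deinterleave_fastq.sh (uncompressed branch only)
deinterleave() {
    paste - - - - - - - - | tee >(cut -f 1-4 | tr "\t" "\n" > "$1") | cut -f 5-8 | tr "\t" "\n" > "$2"
}

fifo_dir=$(mktemp -d)
# Toy input: one read pair (invented names and bases)
printf '%s\n' '@p1/1' 'ACGT' '+' 'IIII' '@p1/2' 'TGCA' '+' 'JJJJ' > "$fifo_dir/infile.fastq"
mkfifo "$fifo_dir/f.fastq" "$fifo_dir/r.fastq"

# Writer in the background; it blocks until both FIFOs have a reader
deinterleave "$fifo_dir/f.fastq" "$fifo_dir/r.fastq" < "$fifo_dir/infile.fastq" &

# One consumer per FIFO, both in the background so neither side waits on the other
wc -l < "$fifo_dir/f.fastq" > "$fifo_dir/f.count" &
wc -l < "$fifo_dir/r.fastq" > "$fifo_dir/r.count" &

# Block until the writer and both consumers have finished
wait
```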
That's pretty cool
How can this be modified to deinterleave a gzipped FASTQ file (i.e. fastq.gz)?
this should work: gzip -dc test.fq.gz | deinterleave_fastq.sh f.fastq r.fastq
Or in python:
with open('interleaved.fastq') as interleaved, \
     open('r1.fastq', 'w') as r1, open('r2.fastq', 'w') as r2:
    for i, line in enumerate(interleaved):
        # Lines 0-3 of every 8-line block belong to read 1, lines 4-7 to read 2
        (r1 if i % 8 < 4 else r2).write(line)
I love this and I'd like to include it as part of my routine workflow. Could you please include a license or a comment line to indicate that this is OK with you (or what you're OK with)? Thanks!!
Awesome. I'm also using this. Thanks a lot!
I am having trouble getting the files to compress. I am running as:
gzip -dc test.fq.gz | deinterleave_fastq.sh f.fastq r.fastq compress
But my files are being deinterleaved as fastq, not fastq.gz. Should I be running it as:
gzip -dc test.fq.gz | deinterleave_fastq.sh f.fastq.gz r.fastq.gz compress
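Going by the script at the top, the output paths are used exactly as given on the command line, so with `compress` the data should be gzip-compressed whatever the file is called. Passing names that end in `.gz` just makes that visible. The sketch below illustrates the point on the reverse-read half of the pipeline. It makes two assumptions: `gzip -c` stands in for `pigz` so that it runs where pigz is not installed, and the file names and reads are invented.

```shell
#!/bin/bash
gz_dir=$(mktemp -d)
# Toy input: one read pair (invented names and bases)
printf '%s\n' '@p1/1' 'ACGT' '+' 'IIII' '@p1/2' 'TGCA' '+' 'JJJJ' > "$gz_dir/in.fastq"

# Reverse-read half of the compress branch, with gzip -c standing in for pigz.
# The output name deliberately lacks a .gz suffix.
paste - - - - - - - - < "$gz_dir/in.fastq" | cut -f 5-8 | tr "\t" "\n" | gzip -c > "$gz_dir/r.fastq"

# The content is gzip data regardless of the file name
gzip -t < "$gz_dir/r.fastq" && echo "r.fastq holds gzip-compressed data"

# Decompress to confirm the record survived intact
gzip -dc < "$gz_dir/r.fastq" | head -n 1
```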
This is really handy. I'd make a small improvement to protect against the input process being terminated at the end of a valid record (by chance): just add the following near the start:
set -e
set -o pipefail
If you're being overly cautious, you could add a bit more tee and confirm that the numbers of lines written to read_1 and read_2 are equal (if not, it's a good indicator that the file may not be well formed) and that their sum matches the input:
# tmpdir would be necessary for the count_* files
tee >(paste - - - - - - - - | tee >(cut -f 1-4 | tr "\t" "\n" | tee >(wc -l > count_a)\
| gzip -c > $1) | cut -f 5-8 | tr "\t" "\n" | tee >(wc -l > count_b)\
| gzip -c > $2) | wc -l > count_all
# then some checking of values
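The "checking of values" step could look something like the sketch below. `check_counts` is a hypothetical helper, not part of the original script. It takes the three line counts (read 1, read 2, whole input) and returns non-zero if they look inconsistent. The multiple-of-8 test is an extra assumption on top of the two checks described above, based on each read pair occupying eight lines.

```shell
#!/bin/bash
# Hypothetical helper: validate the line counts of read 1, read 2 and the input
check_counts() {
    local a=$1 b=$2 all=$3
    if [ "$a" -ne "$b" ]; then
        echo "read 1 and read 2 differ: $a vs $b lines" >&2
        return 1
    fi
    if [ $((a + b)) -ne "$all" ]; then
        echo "outputs total $((a + b)) lines but the input had $all" >&2
        return 1
    fi
    # Extra check (assumption): an interleaved file has 8 lines per read pair
    if [ $((all % 8)) -ne 0 ]; then
        echo "input line count $all is not a multiple of 8" >&2
        return 1
    fi
}

# Usage with the count_* files written by the pipeline above:
#   check_counts "$(cat count_a)" "$(cat count_b)" "$(cat count_all)" || exit 1
```

Because `paste` pads an incomplete final row, the two output counts tend to match even for a truncated input, so the comparison against the input count is the check most likely to catch a problem.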
Worked well and fast, but it created 4 empty lines at the end of the deinterleaved files, which interfered with downstream applications and had to be removed manually. Maybe this is something that can be fixed in a future version.
It took me a while to figure out the problem (the error message of the downstream application was cryptic), so I thought this might be useful for others to know.
Exactly as a-kroh mentioned, it somehow generates empty lines at the end of each output file. When I ran Trimmomatic on the output, it exited partway through the run. Since I had seen the comments, it did not take me too long to figure out what had happened. This deinterleave_fastq.sh is pretty handy; it would be great if this problem got fixed.
I used
$ egrep -v '^$' EMPTYLINE.fastq > NO_EMPTYLINE.fastq
to remove the empty lines.
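One plausible cause, which is an assumption and not confirmed in this thread: `paste` pads the last row when the input line count is not a multiple of 8, for example when the file ends with a stray blank line. That single blank line becomes a row of 8 empty fields and so 4 empty lines in each output. The sketch below reproduces this on an invented one-pair input and shows an alternative filter that drops only the all-empty rows before they are split, so an empty line inside a real record is left alone.

```shell
#!/bin/bash
pad_dir=$(mktemp -d)
# One complete read pair followed by a stray blank line (9 lines in total)
printf '%s\n' '@p1/1' 'ACGT' '+' 'IIII' '@p1/2' 'TGCA' '+' 'JJJJ' '' > "$pad_dir/in.fastq"

# Unfiltered: the blank line becomes a second row of 8 empty fields,
# which turns into 4 empty lines at the end of the output
paste - - - - - - - - < "$pad_dir/in.fastq" | cut -f 1-4 | tr "\t" "\n" > "$pad_dir/padded.fastq"

# Filtered: drop rows that contain nothing but tabs before splitting them up
paste - - - - - - - - < "$pad_dir/in.fastq" | grep -v '^[[:space:]]*$' \
    | cut -f 1-4 | tr "\t" "\n" > "$pad_dir/clean.fastq"

wc -l < "$pad_dir/padded.fastq"   # line count including the padding
wc -l < "$pad_dir/clean.fastq"    # line count after dropping the empty row
```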
Pretty cool!
Thanks
Thank you all.
I've edited it a bit:
cat input.fastq | paste - - - - - - - - | cut -f 1-4 | tr "\t" "\n" | egrep -v '^$' > R1.fastq
cat input.fastq | paste - - - - - - - - | cut -f 5-8 | tr "\t" "\n" | egrep -v '^$' > R2.fastq
Thanks everyone! Here is how I have been deinterleaving an entire directory of compressed fastq.gz:
https://gist.github.com/michaelsilverstein/04c880b8e7728982ee57399599cfb56d#file-deinterleave_dir-sh
#!/bin/bash
# Deinterleave entire directory of compressed .fastq.gz files and re-compress mates
# Usage: deinterleave_dir.sh indir outdir
# -p: do not fail if outdir already exists
mkdir -p "$2"
# Only pick up the .fastq.gz files; quote paths so names with spaces survive
for file in "$1"/*.fastq.gz
do
    echo "$file"
    out1="$2/$(basename "${file%.fastq.gz}")_R1.fastq.gz"
    out2="$2/$(basename "${file%.fastq.gz}")_R2.fastq.gz"
    pigz --best --processes 16 -dc "$file" | deinterleave_fastq.sh "$out1" "$out2" compress
done
This script will read compressed files from indir, deinterleave them, and save them to outdir with _R1.fastq.gz and _R2.fastq.gz file extensions.
Hi! SeqFu bundles seqfu interleave and seqfu deinterleave. It's fast (compiled), and provides an easier and less error-prone CLI experience.
If you want to give it a try, see the SeqFu website.
It can be installed via Miniconda: conda install -c bioconda seqfu
Using this script, it takes 2m26s to deinterleave the same 4.1GByte FASTQ file residing on a RAID5 filesystem (5*15k rpm disks) and writing to the same filesystem.