Split a FASTQ (or pair) into chunks of 1M reads (4,000,000 lines) using GNU split and pigz. Modified from an original script by @ekg.
#!/usr/bin/env bash
## Split a gzipped FASTQ (or pair) into fixed-size chunks.
## Usage: fastq_splitter.sh reads_1.fastq.gz [reads_2.fastq.gz]

first_reads=$1
second_reads=$2

ddir=$(dirname "$first_reads")
obase_first=$(basename "$first_reads" .fastq.gz)
obase_second=$(basename "$second_reads" .fastq.gz)

## 4,000,000 lines per chunk = 1,000,000 reads (4 lines per FASTQ record).
splitsz=4000000

if [ -n "${first_reads}" ] && [ -e "${first_reads}" ]
then
    ## Decompress, cut into numbered chunks, and recompress each chunk on the fly.
    time pigz -p4 -cd "${first_reads}" | \
        split -d -a 6 -l ${splitsz} --filter='pigz -p4 > $FILE.gz' - "${ddir}/${obase_first}.fastq.part"
else
    echo "ERROR: no file ${first_reads} found." >&2
    exit 1
fi

## The second file is optional, so single-end runs simply skip this block.
if [ -n "${second_reads}" ] && [ -e "${second_reads}" ]
then
    time pigz -p4 -cd "${second_reads}" | \
        split -d -a 6 -l ${splitsz} --filter='pigz -p4 > $FILE.gz' - "${ddir}/${obase_second}.fastq.part"
fi
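
For reference, a typical invocation looks like the following (the sample filenames are hypothetical). Because split is given -d -a 6 and a --filter that recompresses each chunk, the outputs land next to the inputs as <basename>.fastq.partNNNNNN.gz:

# Hypothetical paired-end input; chunks are written into the same directory.
./fastq_splitter.sh sample_1.fastq.gz sample_2.fastq.gz

# Expected outputs:
#   sample_1.fastq.part000000.gz  sample_1.fastq.part000001.gz  ...
#   sample_2.fastq.part000000.gz  sample_2.fastq.part000001.gz  ...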
edawson commented Sep 26, 2018

If you want more parallelism (and your disk I/O can keep up), you can combine this script with GNU parallel or LaunChair to also parallelize across FASTQ files.

GNU parallel on all FASTQs in a directory, using four jobs (e.g. for a 16-core system, since each job uses four pigz threads):

ls | grep ".fastq.gz$" | parallel -j 4 './fastq_splitter.sh {}'
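
The one-liner above hands each FASTQ to the script on its own, so mates are split in separate jobs. If your pairs follow an _1/_2 naming convention (an assumption; adjust the pattern to your data), a small sketch like this keeps both mates in one job:

# Hypothetical *_1.fastq.gz / *_2.fastq.gz naming; one job per read pair.
for r1 in *_1.fastq.gz
do
    echo "./fastq_splitter.sh $r1 ${r1/_1.fastq.gz/_2.fastq.gz}"
done | parallel -j 4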

LaunChair example on a 16-core system:

git clone --recursive https://github.com/edawson/LaunChair

for i in `ls | grep -v "ERR894723" | grep "fastq.gz"`;
do 
   echo $i; echo "./fastq_splitter.sh $i" >> jobfile.txt
done
python LaunChair/launcher.py -i jobfile.txt -c 4 -n 16
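
For clarity, jobfile.txt is just a plain text file with one shell command per line, which LaunChair's launcher.py then runs across the available cores; with hypothetical sample names it would look like:

./fastq_splitter.sh sampleA_1.fastq.gz
./fastq_splitter.sh sampleA_2.fastq.gz
./fastq_splitter.sh sampleB_1.fastq.gz
./fastq_splitter.sh sampleB_2.fastq.gz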
