Skip to content

Instantly share code, notes, and snippets.

@Brainiarc7
Last active September 6, 2023 13:59
Show Gist options
  • Star 15 You must be signed in to star a gist
  • Fork 2 You must be signed in to fork a gist
  • Save Brainiarc7/7af2ab5e88ef238da2d9f36b4be203c0 to your computer and use it in GitHub Desktop.
Save Brainiarc7/7af2ab5e88ef238da2d9f36b4be203c0 to your computer and use it in GitHub Desktop.
GNU Parallel usage in Bioinformatics with examples: A primer.

GNU Parallel for Bioinformatics: A primer

Installation:

A quick installation does not require root access, as shown:

(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash

For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README

On Ubuntu (as tested on 16.04), it is already available in the repositories:

sudo apt-get install parallel

EXAMPLE: Replace a for-loop

It is often faster to write a command using GNU Parallel than making a for loop:

for i in *gz; do 
  zcat $i > $(basename $i .gz).unpacked
done

Can also be written as:

parallel 'zcat {} > {.}.unpacked' ::: *.gz

The added benefit is that the zcats are run in parallel - one per CPU core.

EXAMPLE: Parallelizing BLAT:

This will start a blat process for each processor and distribute foo.fa to these in 1 MB blocks:

cat foo.fa | parallel --round-robin --pipe --recstart ">" "blat -noHead genome.fa stdin >(cat) >&2" >foo.psl

EXAMPLE: Blast on multiple machines:

Assume you have a 1 GB fasta file that you want blast, GNU Parallel can then split the fasta file into 100 KB chunks and run 1 jobs per CPU core:

cat 1gb.fasta | parallel --block 100k --recstart '>' --pipe blastp -evalue 0.01 -outfmt 6 -db db.fa -query - > results

If you have access to the local machine (and can ideally log into either of them using SSH keys), here named server1 and server2, GNU Parallel can distribute the jobs to each of the servers. It will automatically detect how many CPU cores are on each of the servers:

cat 1gb.fasta | parallel -S :,server1,server2 --block 100k --recstart '>' --pipe blastp -evalue 0.01 -outfmt 6 -db db.fa -query - > result

EXAMPLE: Run bigWigToWig for each chromosome:

If you have one file per chomosome, it is easy to parallelize processing each file. Here we do bigWigToWig for chromosome 1..19 + X Y M. These will run in parallel but only one job per CPU core. The {} will be substituted with arguments following the separator ':::'.

parallel bigWigToWig -chrom=chr{} wgEncodeCrgMapabilityAlign36mer_mm9.bigWig mm9_36mer_chr{}.map ::: {1..19} X Y M

EXAMPLE: Running composed commands:

GNU Parallel is not limited to running a single command. It can run a composed command. Here is now you process multiple FASTA files using Biopieces (which uses pipes to communicate):

parallel 'read_fasta -i {} | extract_seq -l 5 | write_fasta -o {.}_trim.fna -x' ::: *.fna

See this also for more information.

EXAMPLE: Running experiments:

Experiments often have several parameters where every combination should be tested. Assume we have a program called experiment that takes 3 arguments: --age, --sex and --chr:

experiment --age 18 --sex M --chr 22

Now we want to run experiment for every combination of ages 1..80, sex M/F, chr 1..22+XY:

parallel experiment --age {1} --sex {2} --chr {3} ::: {1..80} ::: M F ::: {1..22} X Y

To save the output in different files you could do:

parallel experiment --age {1} --sex {2} --chr {3} '>' output.{1}.{2}.{3} ::: {1..80} ::: M F ::: {1..22} X Y

But GNU Parallel can structure the output into directories so you avoid having thousands of output files in a single directory for neatness:

parallel --results outputdir experiment --age {1} --sex {2} --chr {3} ::: {1..80} ::: M F ::: {1..22} X Y

This will create files like outputdir/1/80/2/M/3/X/stdout containing the standard output of the job.

If you have many different parameters it may be handy to name them:

parallel --result outputdir --header : experiment --age {AGE} --sex {SEX} --chr {CHR} ::: AGE {1..80} ::: SEX M F ::: CHR {1..22} X Y

Then the output files will be named like outputdir/AGE/80/CHR/Y/SEX/F/stdout

If one of your parameters take on many different values, these can be read from a file using '::::'

echo AGE > age_file
seq 1 80 >> age_file
parallel --results outputdir --header : experiment --age {AGE} --sex {SEX} --chr {CHR} :::: age_file ::: SEX M F ::: CHR {1..22} X Y

Advanced example: Using GNU Parallel to parallelize you own scripts:

Assume you have BASH/Perl/Python script called launch. It takes one arguments, ID:

launch ID

Using parallel you can run multiple IDs in parallel using:

parallel launch ::: ID1 ID2 ...

But you would like to hide this complexity from the user, so the user only has to do:

launch ID1 ID2 ...

You can do that using --shebang-wrap. Change the shebang line from:

#!/usr/bin/env bash
#!/usr/bin/env perl
#!/usr/bin/env python

to:

#!/usr/bin/parallel --shebang-wrap bash
#!/usr/bin/parallel --shebang-wrap perl
#!/usr/bin/parallel --shebang-wrap python

You further develop your script so it now takes an ID and a DIR:

launch ID DIR

You would like it to take multiple IDs but only one DIR, and run the IDs in parallel. Again, simply change the shebang line to:

#!/usr/bin/parallel --shebang-wrap bash

And now you can run:

launch ID1 ID2 ID3 ::: DIR

If you want/need to build a cluster of nodes that can run jobs with parallel over a shared sshfs mount, see this gist for the definitive guide.

@HeLi-80
Copy link

HeLi-80 commented Sep 6, 2023

Hello @Brainiarc7 ,
with the help of this page I was able to parallelize Blast. Thank you very much.
I hope you can help me with a problem I am having when introducing custom alignment view with the outfmt option.

For example, standard Blastx works fine with these parameters:

blastx
-query query.fasta
-db nr
-out out.txt
-taxids 5331
-outfmt '6 qaccver saccver stitle pident qcovs length evalue ssciname'

But when I parallelize it, the outmft line is not recognised well:

cat query.fasta |
parallel
--block 100k
--recstart '>'
--pipe blastx
-taxids 5331
-outfmt '6 qaccver saccver stitle pident qcovs length evalue ssciname'
-db nr
-query - > out.txt

The result, repeated as often as the number of parallel processes, is:

Error: Too many positional arguments (1), the offending value: qaccver
Error: (CArgException::eSynopsis) Too many positional arguments (1), the offending value: qaccver

Thank you in advance, I hope you can help me.

Elio

@HeLi-80
Copy link

HeLi-80 commented Sep 6, 2023

Never mind, I found the solution:

-outfmt "'6 qaccver saccver stitle pident qcovs length evalue ssciname'"

Thank you anyway

@Brainiarc7
Copy link
Author

Brainiarc7 commented Sep 6, 2023 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment