Brainiarc7/gnu-parallel-bioinformatics.md

## gnu-parallel-bioinformatics.md

      
    Raw
  

              gnu-parallel-bioinformatics.md
            
          
    GNU Parallel for Bioinformatics: A primer

Installation:
A quick installation does not require root access, as shown:
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash

For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README
On Ubuntu (as tested on 16.04), it is already available in the repositories:
sudo apt-get install parallel

EXAMPLE: Replace a for-loop
It is often faster to write a command using GNU Parallel than making a for loop:
for i in *gz; do 
  zcat $i > $(basename $i .gz).unpacked
done

Can also be written as:
parallel 'zcat {} > {.}.unpacked' ::: *.gz

The added benefit is that the zcats are run in parallel - one per CPU core.
EXAMPLE: Parallelizing BLAT:
This will start a blat process for each processor and distribute foo.fa to these in 1 MB blocks:
cat foo.fa | parallel --round-robin --pipe --recstart ">" "blat -noHead genome.fa stdin >(cat) >&2" >foo.psl

EXAMPLE: Blast on multiple machines:
Assume you have a 1 GB fasta file that you want blast, GNU Parallel can then split the fasta file into 100 KB chunks and run 1 jobs per CPU core:
cat 1gb.fasta | parallel --block 100k --recstart '>' --pipe blastp -evalue 0.01 -outfmt 6 -db db.fa -query - > results

If you have access to the local machine (and can ideally log into either of them using SSH keys), here named server1 and server2, GNU Parallel can distribute the jobs to each of the servers. It will automatically detect how many CPU cores are on each of the servers:
cat 1gb.fasta | parallel -S :,server1,server2 --block 100k --recstart '>' --pipe blastp -evalue 0.01 -outfmt 6 -db db.fa -query - > result

EXAMPLE: Run bigWigToWig for each chromosome:
If you have one file per chomosome, it is easy to parallelize processing each file. Here we do bigWigToWig for chromosome 1..19 + X Y M. These will run in parallel but only one job per CPU core. The {} will be substituted with arguments following the separator ':::'.
parallel bigWigToWig -chrom=chr{} wgEncodeCrgMapabilityAlign36mer_mm9.bigWig mm9_36mer_chr{}.map ::: {1..19} X Y M

EXAMPLE: Running composed commands:
GNU Parallel is not limited to running a single command. It can run a composed command. Here is now you process multiple FASTA files using Biopieces (which uses pipes to communicate):
parallel 'read_fasta -i {} | extract_seq -l 5 | write_fasta -o {.}_trim.fna -x' ::: *.fna

See this also for more information.
EXAMPLE: Running experiments:
Experiments often have several parameters where every combination should be tested. Assume we have a program called experiment that takes 3 arguments: --age, --sex and  --chr:
experiment --age 18 --sex M --chr 22

Now we want to run experiment for every combination of ages 1..80, sex M/F, chr 1..22+XY:
parallel experiment --age {1} --sex {2} --chr {3} ::: {1..80} ::: M F ::: {1..22} X Y

To save the output in different files you could do:
parallel experiment --age {1} --sex {2} --chr {3} '>' output.{1}.{2}.{3} ::: {1..80} ::: M F ::: {1..22} X Y

But GNU Parallel can structure the output into directories so you avoid having thousands of output files in a single directory for neatness:
parallel --results outputdir experiment --age {1} --sex {2} --chr {3} ::: {1..80} ::: M F ::: {1..22} X Y

This will create files like outputdir/1/80/2/M/3/X/stdout containing the standard output of the job.
If you have many different parameters it may be handy to name them:
parallel --result outputdir --header : experiment --age {AGE} --sex {SEX} --chr {CHR} ::: AGE {1..80} ::: SEX M F ::: CHR {1..22} X Y

Then the output files will be named like outputdir/AGE/80/CHR/Y/SEX/F/stdout
If one of your parameters take on many different values, these can be read from a file using '::::'
echo AGE > age_file
seq 1 80 >> age_file
parallel --results outputdir --header : experiment --age {AGE} --sex {SEX} --chr {CHR} :::: age_file ::: SEX M F ::: CHR {1..22} X Y

Advanced example: Using GNU Parallel to parallelize you own scripts:
Assume you have BASH/Perl/Python script called launch. It takes one arguments, ID:
launch ID

Using parallel you can run multiple IDs in parallel using:
parallel launch ::: ID1 ID2 ...

But you would like to hide this complexity from the user, so the user only has to do:
launch ID1 ID2 ...

You can do that using --shebang-wrap. Change the shebang line from:
#!/usr/bin/env bash
#!/usr/bin/env perl
#!/usr/bin/env python

to:
#!/usr/bin/parallel --shebang-wrap bash
#!/usr/bin/parallel --shebang-wrap perl
#!/usr/bin/parallel --shebang-wrap python

You further develop your script so it now takes an ID and a DIR:
launch ID DIR

You would like it to take multiple IDs but only one DIR, and run the IDs in parallel. Again, simply change the shebang line to:
#!/usr/bin/parallel --shebang-wrap bash

And now you can run:
launch ID1 ID2 ID3 ::: DIR

If you want/need to build a cluster of nodes that can run jobs with parallel over a shared sshfs mount, see this gist for the definitive guide.