How to parallelise UCSC BLAT with gnu parallel
==============================================
I spent a long time working out how to parallelise UCSC's blat with GNU parallel. Most tricks for specifying the query file didn't work (e.g. "-", "</dev/stdin", etc.), so I am posting what did work for me:
cat cdna.fa | parallel --pipe --recstart ">" "blat -noHead genome.fa stdin >(cat) >/dev/null" >out.psl
If you leave out the >/dev/null, blat's stdout messages such as "Loaded X letters in Y sequences. Searched A bases in B sequences" end up in your output.
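The one-liner relies on process substitution, which needs a shell such as bash. A more verbose alternative, sketched here and not tested against real data, is to let GNU parallel write each FASTA block to its own file and then run one blat job per file. The helper name `blat_by_chunks`, the 10M block size and the temporary directory layout are all assumptions made for illustration; only `--pipe`, `--recstart`, `--block`, `{#}` and `{.}` are standard GNU parallel features.

```shell
# Sketch only: blat_by_chunks is a hypothetical helper, not part of blat or GNU parallel.
# Usage: blat_by_chunks genome.fa cdna.fa out.psl
# Assumes GNU parallel and blat are on PATH and that paths contain no spaces.
blat_by_chunks() (
    genome=$1 query=$2 out=$3

    # Refuse to start if an input is unreadable or no output name is given.
    if [ ! -r "$genome" ] || [ ! -r "$query" ] || [ -z "$out" ]; then
        echo "usage: blat_by_chunks genome.fa query.fa out.psl" >&2
        exit 2
    fi

    # On failure the chunk directory is left behind for inspection.
    tmp=$(mktemp -d) || exit 1

    # 1. Split the query at FASTA record starts into blocks of roughly 10 MB.
    #    {#} is the job number, so every block lands in its own file.
    parallel --pipe --recstart '>' --block 10M "cat > $tmp/chunk_{#}.fa" < "$query" || exit 1

    # 2. One blat job per chunk; {.} is the chunk path without its extension.
    #    blat's status messages go to /dev/null, the PSL goes to a named file.
    parallel "blat -noHead $genome {} {.}.psl > /dev/null" ::: "$tmp"/chunk_*.fa || exit 1

    # 3. Concatenate the per-chunk results and clean up.
    cat "$tmp"/chunk_*.psl > "$out"
    rm -rf "$tmp"
)
```

Called as `blat_by_chunks genome.fa cdna.fa out.psl`, this should produce the same kind of headerless PSL as the one-liner. As with the one-liner, every job loads genome.fa on its own, so memory use grows with the number of parallel jobs.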
@jgbradley1

Hi, I've been attempting to use your script to parallelize a BLAT job. All of the parallel parameters make sense to me but I've noticed a couple of issues.

  1. When passing in a large fasta file (625 MB), out.psl is empty afterwards (0 bytes). I fixed that by using the --results flag.
  2. Running pslCheck on the output file shows some lines have errors. Specifically, the number of tab delimiters found doesn't match the expected number.
  3. When I run separate blat jobs per chromosome, the final combined output doesn't match the output from the parallel job. I'm not talking about the order of the records: the files have significantly different sizes, and their total line counts don't match.

Because of issue #1, I am inclined to think that not all output is being collected.
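Issues 2 and 3 can be narrowed down with a quick field count before reaching for pslCheck. The sketch below assumes headerless PSL (as written with -noHead), which has 21 tab-separated columns; `psl_bad_lines` is a hypothetical helper name, and the check is only a rough stand-in for pslCheck, not a replacement.

```shell
# Sketch only: psl_bad_lines is a hypothetical helper.
# Prints how many lines of a headerless PSL file do not have exactly
# 21 tab-separated fields (empty lines count as malformed too).
psl_bad_lines() {
    awk -F '\t' 'NF != 21 { bad++ } END { print bad + 0 }' "$1"
}
```

Running `psl_bad_lines out.psl` on both the parallel output and the combined per-chromosome output, together with `wc -l` on each file, would show whether the size difference comes from truncated or merged lines or from whole records going missing.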
