Sujai Kumar sujaikumar

Keybase proof

I hereby claim:

  • I am sujaikumar on github.
  • I am sujai (https://keybase.io/sujai) on keybase.
  • I have a public key ASCyGc7HYwAEytv2WA0UK2tyl-xK4E0khlopKkyYaj7HYQo

To claim this, I am signing this object:

@sujaikumar
sujaikumar / gist:06c168bc224a0b4b940391e7f22346ad
Last active October 13, 2017 11:28
confluence markdown import test
# heading1
Para
## heading2
- bullet1
- bullet2
[Heligmosomoides bakeri][Heligmosomoides bakeri]
@sujaikumar
sujaikumar / UniRef90.md
Last active January 29, 2024 08:14
UniRef90 protein blast database with taxon IDs

Goal

  • To create UniRef90 protein databases for NCBI BLAST and DIAMOND
  • To create a tab-delimited taxid mapping file with two columns: sequenceID\tNCBITaxonID
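
To make the target format concrete, here is a minimal sketch of producing the two-column mapping. It is not the method used in this gist (which starts from the XML file); it assumes FASTA input whose header lines carry a `TaxID=` field, as UniRef FASTA headers do. The function name `fasta_to_taxid_map` and the file names in the usage comment are hypothetical.

```shell
# Hypothetical helper: read FASTA on stdin, write "sequenceID<TAB>NCBITaxonID".
# Assumes headers like ">UniRef90_XXXX description n=1 Tax=... TaxID=12345 RepID=...".
fasta_to_taxid_map() {
  awk '/^>/ {
    id = substr($1, 2)              # drop the leading ">"
    taxid = "NA"                    # placeholder when no TaxID= field is present
    for (i = 2; i <= NF; i++) {
      if ($i ~ /^TaxID=/) { taxid = substr($i, 7) }
    }
    print id "\t" taxid
  }'
}

# Usage (file names are placeholders):
#   zcat uniref90.fasta.gz | fasta_to_taxid_map > uniref90.taxid_map.tsv
```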

Steps:

Download the UniRef90 XML file first (warning: this is ~15 GB and will take a while)

@sujaikumar
sujaikumar / 2015-12-07-UNC-15-largest-scaffolds.md
Last active December 10, 2015 19:58
Best hits for UNC tardigrade genome assembly

UNC Tardigrade genome assembly - 15 largest scaffolds

A single command to get the list of UNC's own 'best-hit' annotations for their 15 longest scaffolds:

curl http://weatherby.genetics.utah.edu/seq_transf/tg.default.final.gff.gz \
| zgrep -P "\tmRNA\t" | sort -k2,2gr -t 'e' \
| sort -k 1V \
| awk '{print; if(/scaffold15/){exit}}' \
| perl -plne 's/maker\tmRNA\t//; s/\.\t.*?\(/(/;' \
@sujaikumar
sujaikumar / 2015-11-30-blastp-bug.md
Last active November 2, 2023 04:46
NCBI blastp bug - changing max_target_seqs returns incorrect top hits

NCBI blastp seems to have a bug where it reports different top hits when -max_target_seqs is changed. This is a serious problem because the first 20 hits (for example) should be the same whether -max_target_seqs 100 or -max_target_seqs 500 is used.

The bug is reproducible on the command line when searching NCBI's nr blast database (dated 25-Nov-2015) using NCBI BLAST+ versions 2.2.28+, 2.2.30+ and 2.2.31+.

At first I thought it was something to do with my local executable or blast database, but the same problem is also apparent on the NCBI blastp web interface (as of 30-Nov-2015).
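
One way to check this kind of discrepancy is to compare the first N hits per query between two runs. The sketch below assumes both runs were saved in BLAST tabular format (`-outfmt 6`, query ID in column 1, subject ID in column 2, hits in reported order). The function name `top_hits` and the file names in the usage comment are hypothetical.

```shell
# Hypothetical helper: from BLAST tabular output on stdin, print the first N
# "query<TAB>subject" pairs for each query, in the order they were reported.
top_hits() {
  awk -F'\t' -v n="$1" '{
    seen[$1]++
    if (seen[$1] <= n) print $1 "\t" $2
  }'
}

# Usage in bash (file names are placeholders for the two runs):
#   diff <(top_hits 20 < hits_max100.tsv) <(top_hits 20 < hits_max500.tsv)
# No diff output means the top 20 hits agree between the two runs.
```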

@sujaikumar
sujaikumar / How to parallelise UCSC BLAT with gnu parallel
Last active August 29, 2015 13:56
How to parallelise UCSC BLAT with gnu parallel
==============================================
I spent a long time working out how to parallelise UCSC's blat with GNU parallel. Most tricks for specifying the query file didn't work (e.g. "-", "</dev/stdin"), so I am posting what did work for me:
cat cdna.fa | parallel --pipe --recstart ">" "blat -noHead genome.fa stdin >(cat) >/dev/null" >out.psl
If you omit the >/dev/null, blat's stdout messages (such as "Loaded X letters in Y sequences. Searched A bases in B sequences") end up in your output.
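
If those status messages have already ended up in an output file, they can be filtered out afterwards. This is a minimal sketch, assuming the messages start with the words "Loaded" or "Searched" as quoted above and that alignment lines never start with those words (they begin with numeric columns). The function name `strip_blat_messages` and the file names are hypothetical.

```shell
# Hypothetical helper: drop blat status lines from PSL output read on stdin.
# Assumes status lines begin with "Loaded " or "Searched ".
strip_blat_messages() {
  grep -v -E '^(Loaded|Searched) '
}

# Usage (file names are placeholders):
#   strip_blat_messages < out_with_messages.psl > out.psl
```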