Skip to content

Instantly share code, notes, and snippets.

@sujaikumar
Last active November 2, 2023 04:46
Show Gist options
  • Star 22 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save sujaikumar/504b3b7024eaf3a04ef5 to your computer and use it in GitHub Desktop.
Save sujaikumar/504b3b7024eaf3a04ef5 to your computer and use it in GitHub Desktop.
NCBI blastp bug - changing max_target_seqs returns incorrect top hits

NCBI blastp seems to have a bug where it reports different top hits when -max_target_seqs is changed. This is a serious problem because the first 20 hits (for example) should be the same whether -max_target_seqs 100 or -max_target_seqs 500 is used.

The bug is reproducible on the command line when searching NCBI's nr blast database (dated 25-Nov-2015) using NCBI 2.2.28+, 2.2.30+ and 2.2.31+.

At first I thought it was something to do with my local exe/blastdb, but the same problem is also apparent on the NCBI blastp web interface (as of 30-Nov-2015)

To test online, go to http://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE=Proteins

Enter the following FASTA sequence in the query text box:

>nHd.2.3.1.t00019-RA
MSNLGITDPCVDAMNSLGLKLEELQDLEVDAGLGNGGLGRLAACFMDSLATLSIPAIGYGIRYEFGIFNQRVINGEQVEE
RDDWLEFGDPWEKLRQDKKISVYFNGKTYVDKEGRSHWVDTQQIVD

Database nr (default) Leave other fields blank Check that the algorithm parameters say: Max target sequences 100 (default) Expect threshold: 1e-5 (instead of default 100) Leave rest of parameters as default. Click BLAST:

  • The first 20 hits are all eukaryotic (top hit Trichuris trichura 8e-36). No bacterial hits in results at all.

If you now change the Max target sequences to 500, and rerun, you see:

  • The top hit is Bacteria (Burkholderia kururiensis 2e-40)

This is reproducible on the command line (using the versions of blastp mentioned above):

blastp -query input.fasta -db nr -outfmt 6 -max_target_seqs 100 -evalue 1e-5 >out.1e-5.max100.txt
blastp -query input.fasta -db nr -outfmt 6 -max_target_seqs 500 -evalue 1e-5 >out.1e-5.max500.txt

Can someone else confirm that they have seen this bug? It's possible I am doing something silly, but if not, then this is a serious bug, because max_target_seqs is only supposed to change the number of matches returned, not the TOP hits. See http://www.ncbi.nlm.nih.gov/books/NBK279682/

Screenshots attached:

  1. max100.png - https://www.dropbox.com/s/sbfaviez7hon4it/max100.png?dl=0
  2. max500.png - https://www.dropbox.com/s/8z1kso2d53k3l5k/max500.png?dl=0
@peterjc
Copy link

peterjc commented May 23, 2019

Belatedly adding links to two blog posts on this issue, which I think wraps things up.

Dec 2018: Examining the way BLAST database order is defined, and implications when building your own databases: https://blastedbio.blogspot.com/2018/12/blast-tie-break-db-order.html

Jan 2019: A summary post looking at the other issue which was causing the strange results in the Shah et al 2018 test case (fixed in BLAST+ 2.8.1): https://blastedbio.blogspot.com/2019/01/blast-overly-aggressive-optimization.html

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment