Skip to content

Instantly share code, notes, and snippets.

@sujaikumar
Last active November 2, 2023 04:46
Show Gist options
  • Star 22 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save sujaikumar/504b3b7024eaf3a04ef5 to your computer and use it in GitHub Desktop.
Save sujaikumar/504b3b7024eaf3a04ef5 to your computer and use it in GitHub Desktop.
NCBI blastp bug - changing max_target_seqs returns incorrect top hits

NCBI blastp seems to have a bug where it reports different top hits when -max_target_seqs is changed. This is a serious problem because the first 20 hits (for example) should be the same whether -max_target_seqs 100 or -max_target_seqs 500 is used.

The bug is reproducible on the command line when searching NCBI's nr blast database (dated 25-Nov-2015) using NCBI 2.2.28+, 2.2.30+ and 2.2.31+.

At first I thought it was something to do with my local exe/blastdb, but the same problem is also apparent on the NCBI blastp web interface (as of 30-Nov-2015)

To test online, go to http://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE=Proteins

Enter the following FASTA sequence in the query text box:

>nHd.2.3.1.t00019-RA
MSNLGITDPCVDAMNSLGLKLEELQDLEVDAGLGNGGLGRLAACFMDSLATLSIPAIGYGIRYEFGIFNQRVINGEQVEE
RDDWLEFGDPWEKLRQDKKISVYFNGKTYVDKEGRSHWVDTQQIVD

Database nr (default) Leave other fields blank Check that the algorithm parameters say: Max target sequences 100 (default) Expect threshold: 1e-5 (instead of default 100) Leave rest of parameters as default. Click BLAST:

  • The first 20 hits are all eukaryotic (top hit Trichuris trichura 8e-36). No bacterial hits in results at all.

If you now change the Max target sequences to 500, and rerun, you see:

  • The top hit is Bacteria (Burkholderia kururiensis 2e-40)

This is reproducible on the command line (using the versions of blastp mentioned above):

blastp -query input.fasta -db nr -outfmt 6 -max_target_seqs 100 -evalue 1e-5 >out.1e-5.max100.txt
blastp -query input.fasta -db nr -outfmt 6 -max_target_seqs 500 -evalue 1e-5 >out.1e-5.max500.txt

Can someone else confirm that they have seen this bug? It's possible I am doing something silly, but if not, then this is a serious bug, because max_target_seqs is only supposed to change the number of matches returned, not the TOP hits. See http://www.ncbi.nlm.nih.gov/books/NBK279682/

Screenshots attached:

  1. max100.png - https://www.dropbox.com/s/sbfaviez7hon4it/max100.png?dl=0
  2. max500.png - https://www.dropbox.com/s/8z1kso2d53k3l5k/max500.png?dl=0
@jarekbryk
Copy link

Note that the Shah et al paper suggest that this parameter behaves as expected in the online version of BLAST! This is a relevant fragment (emphasis mine):

"This functionality was first reported as a bug to NCBI by Sujai Kumar in 2015 (2015), and later docu- mented in a blog post (Cock, 2015) by Peter Cock. The functionality remains unchanged to this day, and the BLAST documentation (2008-) (last modified in 2016) fails to clarify the misconception a reasonable user would have upon reading the manual. The confusion is further compounded by the fact that in the online BLAST portal, the max_target_seqs parameter behaves in the expected way – the best (rather than first) N hits are returned."

Could anyone confirm this?

@peterjc
Copy link

peterjc commented Oct 23, 2018

Noting here, given I've just replied to a comment on https://blastedbio.blogspot.com/2015/12/blast-max-target-sequences-bug.html

Regarding the different argument limit options for human readable output (-num_descriptions and -num_alignments) and computer readable output (-max_target_seqs), internally both are mapped to a limit on the number of alignments, and this same surprising behaviour can be demonstrated with plain text output and -num_descriptions and -num_alignments in exactly the same as the computer readable formats like tabular with -max_target_seqs.

I am working on a followup post of some sort with more details..

@peterjc
Copy link

peterjc commented Nov 1, 2018

Over at https://github.com/peterjc/blast_max_target_seqs I have created a small test case based on @sujaikumar's example above.

I have a draft blog post introducing the new test case, and showing the that the same surprising behaviour seen with -max_target_seqs also happens with -num_descriptions and -num_alignments ready to go live today or tomorrow (it needs proof reading first).

@peterjc
Copy link

peterjc commented Nov 2, 2018

My new blog post is up at https://blastedbio.blogspot.com/2018/11/blast-max-alignment-limits-repartee-one.html

This introduces the minimal test case https://github.com/peterjc/blast_max_target_seqs based on @sujaik's 2015 report (this gist), and highlights that both -max_target_seqs AND -num_descriptions / -num_alignments can change the top hit.

Also makes the point that these are documented under different sections of the command line help, as noted above https://gist.github.com/sujaikumar/504b3b7024eaf3a04ef5#gistcomment-2185621 by @terrycojones

@peterjc
Copy link

peterjc commented Nov 2, 2018

Part two is up at https://blastedbio.blogspot.com/2018/11/blast-max-alignment-limits-repartee-two.html

Here I focus on the question of if database order is important (as claimed in Shar et al. 2018), and how exactly the internal alignment number limit works.

As a post script I discussed https://www.biostars.org/p/341227 which is an elegant minimal test case, again based on Sujai's original.

@peterjc
Copy link

peterjc commented Nov 15, 2018

Nidhi Shah has shared a test case at https://github.com/shahnidhi/BLAST_maxtargetseq_analysis (they got in touch via a comment on my blog).

Blog part three looks at the test case from Nidhi Shah https://blastedbio.blogspot.com/2018/11/blast-max-alignment-limits-part-three.html

Blog part four at the internal alignment number limit in the context of nucleotide databases (where composition based statistics are not used) https://blastedbio.blogspot.com/2018/11/blast-max-alignment-limits-part-four.html

@peterjc
Copy link

peterjc commented May 23, 2019

Belatedly adding links to two blog posts on this issue, which I think wraps things up.

Dec 2018: Examining the way BLAST database order is defined, and implications when building your own databases: https://blastedbio.blogspot.com/2018/12/blast-tie-break-db-order.html

Jan 2019: A summary post looking at the other issue which was causing the strange results in the Shah et al 2018 test case (fixed in BLAST+ 2.8.1): https://blastedbio.blogspot.com/2019/01/blast-overly-aggressive-optimization.html

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment