Skip to content

Instantly share code, notes, and snippets.

@sujaikumar
Last active November 2, 2023 04:46
Show Gist options
  • Star 22 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save sujaikumar/504b3b7024eaf3a04ef5 to your computer and use it in GitHub Desktop.
Save sujaikumar/504b3b7024eaf3a04ef5 to your computer and use it in GitHub Desktop.
NCBI blastp bug - changing max_target_seqs returns incorrect top hits

NCBI blastp seems to have a bug where it reports different top hits when -max_target_seqs is changed. This is a serious problem because the first 20 hits (for example) should be the same whether -max_target_seqs 100 or -max_target_seqs 500 is used.

The bug is reproducible on the command line when searching NCBI's nr blast database (dated 25-Nov-2015) using NCBI 2.2.28+, 2.2.30+ and 2.2.31+.

At first I thought it was something to do with my local exe/blastdb, but the same problem is also apparent on the NCBI blastp web interface (as of 30-Nov-2015)

To test online, go to http://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE=Proteins

Enter the following FASTA sequence in the query text box:

>nHd.2.3.1.t00019-RA
MSNLGITDPCVDAMNSLGLKLEELQDLEVDAGLGNGGLGRLAACFMDSLATLSIPAIGYGIRYEFGIFNQRVINGEQVEE
RDDWLEFGDPWEKLRQDKKISVYFNGKTYVDKEGRSHWVDTQQIVD

Database nr (default) Leave other fields blank Check that the algorithm parameters say: Max target sequences 100 (default) Expect threshold: 1e-5 (instead of default 100) Leave rest of parameters as default. Click BLAST:

  • The first 20 hits are all eukaryotic (top hit Trichuris trichura 8e-36). No bacterial hits in results at all.

If you now change the Max target sequences to 500, and rerun, you see:

  • The top hit is Bacteria (Burkholderia kururiensis 2e-40)

This is reproducible on the command line (using the versions of blastp mentioned above):

blastp -query input.fasta -db nr -outfmt 6 -max_target_seqs 100 -evalue 1e-5 >out.1e-5.max100.txt
blastp -query input.fasta -db nr -outfmt 6 -max_target_seqs 500 -evalue 1e-5 >out.1e-5.max500.txt

Can someone else confirm that they have seen this bug? It's possible I am doing something silly, but if not, then this is a serious bug, because max_target_seqs is only supposed to change the number of matches returned, not the TOP hits. See http://www.ncbi.nlm.nih.gov/books/NBK279682/

Screenshots attached:

  1. max100.png - https://www.dropbox.com/s/sbfaviez7hon4it/max100.png?dl=0
  2. max500.png - https://www.dropbox.com/s/8z1kso2d53k3l5k/max500.png?dl=0
@terrycojones
Copy link

@tshalev, you ask "so would it be best to just run everything with default settings and then filter later?"

Unfortunately not. The problem is that with the default settings a very high-scoring match may be prematurely discarded in the ungapped stage. So if you just leave the defaults you'll never see it. You need to pass some increased value for -max_target_seqs (for -outfmt >= 4) or the other options (-outfmt < 4) but there is absolutely no guidance as to what value you should use. It's a terrible situation.

In my case (above) I now find that if I run with -max_target_seqs 506 I get the expected match in the 3rd highest position. If I use -max_target_seqs 505 that match is not present in the output. How can a user know what to set -max_target_seqs to when they don't even know what they might be missing? This problem is so serious it makes BLAST suddenly seem almost useless (or at least terribly unreliable), as you may have false negatives that are actually the very best matches. Maybe the only thing to do is set -max_target_seqs to some extremely high value (like the number of sequences in your database) and then filter the output. That approach is what the NCBI mail is suggesting we do (though presumably with a smaller -max_target_seqs than the entire database size - but how small?). I.e., do not rely on -max_target_seqs to filter things for you because it is applied too early in the BLAST matching procedure.

I hope I don't sound too cranky here. I just can't believe this is the default behavior of BLAST, that the documentation of this option is so poor, that the option's name is really misleading, that the impact is so severe (losing the very best hits!), and that there must be many many thousands of people using BLAST who (understandably) have no inkling or expectation that this would be the default behavior. All this in a program that has been around for so long, has had so much work done on it, etc. And where do people go to to discuss such a thing? Not to an issue tracker, or similar, but to gist.

Copy link

ghost commented Nov 2, 2017

@terrycojones thanks for your comment. Did you ever get an answer from blast-help@ncbi.nlm.nih.gov ?

@voiDnyx
Copy link

voiDnyx commented Nov 14, 2017

Is this still an issue? I found nothing in the release notes of recent ncbi-blast versions saying something changed. Also is this commonly known today? Because I was totally not expecting this behaviour.

@peterjc
Copy link

peterjc commented Sep 26, 2018

My blog post about this https://blastedbio.blogspot.com/2015/12/blast-max-target-sequences-bug.html (mentioned above) just got cited in a new paper about this NCBI BLAST+ bug/feature:

Shah et al. (2018)
Misunderstood parameter of NCBI BLAST impacts the correctness of bioinformatics workflows.
https://doi.org/10.1093/bioinformatics/bty833

@jarekbryk
Copy link

Note that the Shah et al paper suggest that this parameter behaves as expected in the online version of BLAST! This is a relevant fragment (emphasis mine):

"This functionality was first reported as a bug to NCBI by Sujai Kumar in 2015 (2015), and later docu- mented in a blog post (Cock, 2015) by Peter Cock. The functionality remains unchanged to this day, and the BLAST documentation (2008-) (last modified in 2016) fails to clarify the misconception a reasonable user would have upon reading the manual. The confusion is further compounded by the fact that in the online BLAST portal, the max_target_seqs parameter behaves in the expected way – the best (rather than first) N hits are returned."

Could anyone confirm this?

@peterjc
Copy link

peterjc commented Oct 23, 2018

Noting here, given I've just replied to a comment on https://blastedbio.blogspot.com/2015/12/blast-max-target-sequences-bug.html

Regarding the different argument limit options for human readable output (-num_descriptions and -num_alignments) and computer readable output (-max_target_seqs), internally both are mapped to a limit on the number of alignments, and this same surprising behaviour can be demonstrated with plain text output and -num_descriptions and -num_alignments in exactly the same as the computer readable formats like tabular with -max_target_seqs.

I am working on a followup post of some sort with more details..

@peterjc
Copy link

peterjc commented Nov 1, 2018

Over at https://github.com/peterjc/blast_max_target_seqs I have created a small test case based on @sujaikumar's example above.

I have a draft blog post introducing the new test case, and showing the that the same surprising behaviour seen with -max_target_seqs also happens with -num_descriptions and -num_alignments ready to go live today or tomorrow (it needs proof reading first).

@peterjc
Copy link

peterjc commented Nov 2, 2018

My new blog post is up at https://blastedbio.blogspot.com/2018/11/blast-max-alignment-limits-repartee-one.html

This introduces the minimal test case https://github.com/peterjc/blast_max_target_seqs based on @sujaik's 2015 report (this gist), and highlights that both -max_target_seqs AND -num_descriptions / -num_alignments can change the top hit.

Also makes the point that these are documented under different sections of the command line help, as noted above https://gist.github.com/sujaikumar/504b3b7024eaf3a04ef5#gistcomment-2185621 by @terrycojones

@peterjc
Copy link

peterjc commented Nov 2, 2018

Part two is up at https://blastedbio.blogspot.com/2018/11/blast-max-alignment-limits-repartee-two.html

Here I focus on the question of if database order is important (as claimed in Shar et al. 2018), and how exactly the internal alignment number limit works.

As a post script I discussed https://www.biostars.org/p/341227 which is an elegant minimal test case, again based on Sujai's original.

@peterjc
Copy link

peterjc commented Nov 15, 2018

Nidhi Shah has shared a test case at https://github.com/shahnidhi/BLAST_maxtargetseq_analysis (they got in touch via a comment on my blog).

Blog part three looks at the test case from Nidhi Shah https://blastedbio.blogspot.com/2018/11/blast-max-alignment-limits-part-three.html

Blog part four at the internal alignment number limit in the context of nucleotide databases (where composition based statistics are not used) https://blastedbio.blogspot.com/2018/11/blast-max-alignment-limits-part-four.html

@peterjc
Copy link

peterjc commented May 23, 2019

Belatedly adding links to two blog posts on this issue, which I think wraps things up.

Dec 2018: Examining the way BLAST database order is defined, and implications when building your own databases: https://blastedbio.blogspot.com/2018/12/blast-tie-break-db-order.html

Jan 2019: A summary post looking at the other issue which was causing the strange results in the Shah et al 2018 test case (fixed in BLAST+ 2.8.1): https://blastedbio.blogspot.com/2019/01/blast-overly-aggressive-optimization.html

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment