Skip to content

Instantly share code, notes, and snippets.

@sujaikumar
Last active November 2, 2023 04:46
Show Gist options
  • Star 22 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save sujaikumar/504b3b7024eaf3a04ef5 to your computer and use it in GitHub Desktop.
Save sujaikumar/504b3b7024eaf3a04ef5 to your computer and use it in GitHub Desktop.
NCBI blastp bug - changing max_target_seqs returns incorrect top hits

NCBI blastp seems to have a bug where it reports different top hits when -max_target_seqs is changed. This is a serious problem because the first 20 hits (for example) should be the same whether -max_target_seqs 100 or -max_target_seqs 500 is used.

The bug is reproducible on the command line when searching NCBI's nr blast database (dated 25-Nov-2015) using NCBI 2.2.28+, 2.2.30+ and 2.2.31+.

At first I thought it was something to do with my local exe/blastdb, but the same problem is also apparent on the NCBI blastp web interface (as of 30-Nov-2015)

To test online, go to http://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE=Proteins

Enter the following FASTA sequence in the query text box:

>nHd.2.3.1.t00019-RA
MSNLGITDPCVDAMNSLGLKLEELQDLEVDAGLGNGGLGRLAACFMDSLATLSIPAIGYGIRYEFGIFNQRVINGEQVEE
RDDWLEFGDPWEKLRQDKKISVYFNGKTYVDKEGRSHWVDTQQIVD

Database nr (default) Leave other fields blank Check that the algorithm parameters say: Max target sequences 100 (default) Expect threshold: 1e-5 (instead of default 100) Leave rest of parameters as default. Click BLAST:

  • The first 20 hits are all eukaryotic (top hit Trichuris trichura 8e-36). No bacterial hits in results at all.

If you now change the Max target sequences to 500, and rerun, you see:

  • The top hit is Bacteria (Burkholderia kururiensis 2e-40)

This is reproducible on the command line (using the versions of blastp mentioned above):

blastp -query input.fasta -db nr -outfmt 6 -max_target_seqs 100 -evalue 1e-5 >out.1e-5.max100.txt
blastp -query input.fasta -db nr -outfmt 6 -max_target_seqs 500 -evalue 1e-5 >out.1e-5.max500.txt

Can someone else confirm that they have seen this bug? It's possible I am doing something silly, but if not, then this is a serious bug, because max_target_seqs is only supposed to change the number of matches returned, not the TOP hits. See http://www.ncbi.nlm.nih.gov/books/NBK279682/

Screenshots attached:

  1. max100.png - https://www.dropbox.com/s/sbfaviez7hon4it/max100.png?dl=0
  2. max500.png - https://www.dropbox.com/s/8z1kso2d53k3l5k/max500.png?dl=0
@harmn
Copy link

harmn commented Dec 3, 2015

The same actually holds true for the e-value theshold (at least in version 2.2.28+), this is from an email from the NCBI User Services:

The cutoffs are applied during an ungapped extension phase, and some
alignments may improve in score during gapped extension. The safest
approach is to relax (increase) the threshold and deal with a larger
result set.

@peterjc
Copy link

peterjc commented Dec 18, 2015

Thanks @harmn - that's another surprising design choice. I wonder if this was true back in "legacy" BLAST too?

P.S. I put up a short blog post on this bug: http://blastedbio.blogspot.co.uk/2015/12/blast-max-target-sequences-bug.html

@tshalev
Copy link

tshalev commented Apr 23, 2016

Really interesting - so would it be best to just run everything with default settings and then filter later? I also wonder whether blasting into archival format (11) with only default settings and then using max_target_seqs when formatting to other formats would get you the results you desire...

@terrycojones
Copy link

terrycojones commented Aug 25, 2017

I've just run across this gist - thanks all for the digging and commenting. I saw a really alarming result last night and spent most of today trying to figure out what was happening.

The short summary is that I get a strong hit for a sequence from one database but when I make a slightly bigger database and re-run BLAST the hit does not appear at all. I eventually managed to find the hit in the output by increasing any one of the -num_alignments, -num_descriptions, or -max_target_seqs options (depending on -outfmt).

This seems like a DREADFUL PROBLEM. One of the top hits (the 5th) simply vanishes from the (default) BLAST output just because the database is slightly larger. How many unwitting BLAST users are there out there who are assuming that BLAST always shows them the top results, i.e., that -max_target_seqs and its ilk just affect what is displayed and not the search itself. I've been using BLAST for some years now and am a bit in shock.... how many strong hits have I missed because of this? And what about the wider world of BLAST users? The mind boggles.

Below is my write up.


I have a strange situation. I have two BLAST databases (20170821 and 20170824) that both contain a certain subject sequence LC074724. When I query them both with a given query, Q, in one case (20170824) there is a very strong match and in the other (20170821) no match at all.

I'm trying to figure out why.

BLAST version

$ blastn -version
blastn: 2.6.0+
 Package: blast 2.6.0, build May 11 2017 22:22:40

The subject sequence is the same in both databases

Both databases contain an identical subject sequence:

$ blastdbcmd -entry LC074724 -db 20170824 | md5
fcfbc167c7f81dfd75aad1bf3db6220b

$ blastdbcmd -entry LC074724 -db 20170821 | md5
fcfbc167c7f81dfd75aad1bf3db6220b

Doing a match

There is a good match of Q against LC074724 in 20170824:

$ blastn -outfmt '6 qaccver saccver bitscore' -db 20170824 -query Q.fasta \
  -task blastn -max_hsps 1 | grep LC074724
Q	LC074724.1	2581

Note that the bitscore of the LC074724 match is 2581.

But there is no match of LC074724 at all in 20170821:

$ blastn -outfmt '6 qaccver saccver bitscore' -db 20170821 -query Q.fasta \
-task blastn -max_hsps 1 | grep LC074724

The option -outfmt '6 qaccver saccver bitscore' tells BLAST to print the query accession number and version, the subject accession and version, and the bitscore of the match.

What are the best hits for 20170821

So what are the bitscores of the Q matches in 20170821?

$ blastn -outfmt '6 bitscore' -db 20170821 -query Q.fasta -task blastn -max_hsps 1 | sort -nr | head
3301
2616
2590
2590
2563
2554
2547
2544
2540
2540

The bitscore of Q against LC074724 in 20170824 (2581) would have placed it in 5th position in the top 10 matches of Q, but LC074724 is not matched at all in 20170821! The lowest bitscores matched in 20170821 is 2354.

What's the difference between the databases?

Get the ids of the subjects in each database

$ blastdbcmd -entry all -db 20170824 | egrep '^>' | cut -c2- | sort > 20170824.ids
$ blastdbcmd -entry all -db 20170821 | egrep '^>' | cut -c2- | sort > 20170821.ids
$ wc -l 2017082?.ids
    9754 20170821.ids
    9111 20170824.ids
   18865 total

20170821 has 643 sequences that are not in 20170824:

$ comm -23 20170821.ids 20170824.ids | wc -l
     643

20170824 has no sequences that are not in 20170821:

$ comm -13 20170821.ids 20170824.ids | wc -l
       0

and the two have 9111 sequences in common:

$ comm -12 20170821.ids 20170824.ids | wc -l
    9111

These numbers match with the numbers from wc -l above.

So 20170821 is just 20170824 with an extra 643 sequences.

Maybe LC074724 matches, but with a low bit score?

BLAST only shows the top 500 matched subjects by default. Let's ask it to show more:

$ blastn -outfmt '6 qaccver saccver bitscore' -db 20170821 -query Q.fasta \
  -task blastn -max_hsps 1 -max_target_seqs 10000 > 20170821-10000-matches.txt

And....

$ head 20170821-10000-matches.txt
Q	Q	3301
Q	X234A	2616
Q	CS388973.1	2590
Q	DM059402.1	2590
Q	LC074724.1	2581
Q	KJ843188.1	2576
Q	AB697496.1	2576
Q	AB697499.1	2576
Q	AB697503.1	2576
Q	AB697508.1	2576

Holy shit, there it is! And it's in 5th position, just as it should be (based on the bitscores discussed above).

Just to confirm, taking away the -max_target_seqs option gets no match:

$ blastn -outfmt '6 qaccver saccver bitscore' -db 20170821 -query Q.fasta \
  -task blastn -max_hsps 1 | grep LC074724
$

HOW ON EARTH CAN THAT BE HAPPENING?

Three confusingly named BLAST options

I've always found the names and descriptions of the following three BLAST command-line options very confusing.

From the Formatting options section of blastn -help:

-num_descriptions <Integer, >=0>
  Number of database sequences to show one-line descriptions for
  Not applicable for outfmt > 4
  Default = `500'
   * Incompatible with:  max_target_seqs

-num_alignments <Integer, >=0>
  Number of database sequences to show alignments for
  Default = `250'
   * Incompatible with:  max_target_seqs

And from the Restrict search or results section:

-max_target_seqs <Integer, >=1>
  Maximum number of aligned sequences to keep
  Not applicable for outfmt <= 4
  Default = `500'
   * Incompatible with:  num_descriptions, num_alignments

From the section names, it seems like only the latter option (-max_target_seqs) would have any effect on the search, and that the former two are just about what is displayed.

But... using either just -num_descriptions or -num_alignments with a big value also gets a match:

$ blastn -db 20170821 -query Q.fasta -task blastn -max_hsps 1 -num_descriptions 10000 | grep LC074724
LC074724.1  Hepatitis B virus DNA, complete genome, isolate: p621     2581    0.0
>LC074724.1 Hepatitis B virus DNA, complete genome, isolate: p621
$ blastn -db 20170821 -query Q.fasta -task blastn -max_hsps 1 -num_alignments 10000 | grep LC074724
LC074724.1  Hepatitis B virus DNA, complete genome, isolate: p621     2581    0.0
>LC074724.1 Hepatitis B virus DNA, complete genome, isolate: p621

So BLAST is apparently finding the LC074724 match even without the high -max_target_seqs option, but it's not displaying it unless there is a high -num_alignments or -num_descriptions value.

That makes no sense at all. BLAST is finding the match without the -max_target_seqs option and the match has the 5th highest score, so why is it not being displayed?

I don't understand the difference between -num_alignments and -num_descriptions.

Maybe -num_alignments is per query and -num_descriptions is in the overall result set (i.e., across all queries). BLAST seems to use "alignment" to refer to a match between a query and a single subject (and that match may have multiple high-scoring pairs (HSPs), each with its own bit score). But why is there a -num_descriptions option? I can only think that BLAST is allocating memory to hold subject descriptions and this option is used to allocate some fixed storage for the overall search. So, possibly, if the -num_descriptions limit is reached during the search, BLAST gives up because it knows it cannot store more subject descriptions. But only setting -num_alignments also gets me a match, so I don't know.

Just to confirm, when none of the 3 options are given and no -outfmt option is either, the match is not found:

$ blastn -db 20170821 -query Q.fasta -task blastn -max_hsps 1 | grep LC074724
$

WTF?

I've been using BLAST for the last 4 years or so, and have only today realized that something like this could occur.

It's of course super important.

And despite all the above, I still don't understand what's going on.

@terrycojones
Copy link

@tshalev, you ask "so would it be best to just run everything with default settings and then filter later?"

Unfortunately not. The problem is that with the default settings a very high-scoring match may be prematurely discarded in the ungapped stage. So if you just leave the defaults you'll never see it. You need to pass some increased value for -max_target_seqs (for -outfmt >= 4) or the other options (-outfmt < 4) but there is absolutely no guidance as to what value you should use. It's a terrible situation.

In my case (above) I now find that if I run with -max_target_seqs 506 I get the expected match in the 3rd highest position. If I use -max_target_seqs 505 that match is not present in the output. How can a user know what to set -max_target_seqs to when they don't even know what they might be missing? This problem is so serious it makes BLAST suddenly seem almost useless (or at least terribly unreliable), as you may have false negatives that are actually the very best matches. Maybe the only thing to do is set -max_target_seqs to some extremely high value (like the number of sequences in your database) and then filter the output. That approach is what the NCBI mail is suggesting we do (though presumably with a smaller -max_target_seqs than the entire database size - but how small?). I.e., do not rely on -max_target_seqs to filter things for you because it is applied too early in the BLAST matching procedure.

I hope I don't sound too cranky here. I just can't believe this is the default behavior of BLAST, that the documentation of this option is so poor, that the option's name is really misleading, that the impact is so severe (losing the very best hits!), and that there must be many many thousands of people using BLAST who (understandably) have no inkling or expectation that this would be the default behavior. All this in a program that has been around for so long, has had so much work done on it, etc. And where do people go to to discuss such a thing? Not to an issue tracker, or similar, but to gist.

Copy link

ghost commented Nov 2, 2017

@terrycojones thanks for your comment. Did you ever get an answer from blast-help@ncbi.nlm.nih.gov ?

@voiDnyx
Copy link

voiDnyx commented Nov 14, 2017

Is this still an issue? I found nothing in the release notes of recent ncbi-blast versions saying something changed. Also is this commonly known today? Because I was totally not expecting this behaviour.

@peterjc
Copy link

peterjc commented Sep 26, 2018

My blog post about this https://blastedbio.blogspot.com/2015/12/blast-max-target-sequences-bug.html (mentioned above) just got cited in a new paper about this NCBI BLAST+ bug/feature:

Shah et al. (2018)
Misunderstood parameter of NCBI BLAST impacts the correctness of bioinformatics workflows.
https://doi.org/10.1093/bioinformatics/bty833

@jarekbryk
Copy link

Note that the Shah et al paper suggest that this parameter behaves as expected in the online version of BLAST! This is a relevant fragment (emphasis mine):

"This functionality was first reported as a bug to NCBI by Sujai Kumar in 2015 (2015), and later docu- mented in a blog post (Cock, 2015) by Peter Cock. The functionality remains unchanged to this day, and the BLAST documentation (2008-) (last modified in 2016) fails to clarify the misconception a reasonable user would have upon reading the manual. The confusion is further compounded by the fact that in the online BLAST portal, the max_target_seqs parameter behaves in the expected way – the best (rather than first) N hits are returned."

Could anyone confirm this?

@peterjc
Copy link

peterjc commented Oct 23, 2018

Noting here, given I've just replied to a comment on https://blastedbio.blogspot.com/2015/12/blast-max-target-sequences-bug.html

Regarding the different argument limit options for human readable output (-num_descriptions and -num_alignments) and computer readable output (-max_target_seqs), internally both are mapped to a limit on the number of alignments, and this same surprising behaviour can be demonstrated with plain text output and -num_descriptions and -num_alignments in exactly the same as the computer readable formats like tabular with -max_target_seqs.

I am working on a followup post of some sort with more details..

@peterjc
Copy link

peterjc commented Nov 1, 2018

Over at https://github.com/peterjc/blast_max_target_seqs I have created a small test case based on @sujaikumar's example above.

I have a draft blog post introducing the new test case, and showing the that the same surprising behaviour seen with -max_target_seqs also happens with -num_descriptions and -num_alignments ready to go live today or tomorrow (it needs proof reading first).

@peterjc
Copy link

peterjc commented Nov 2, 2018

My new blog post is up at https://blastedbio.blogspot.com/2018/11/blast-max-alignment-limits-repartee-one.html

This introduces the minimal test case https://github.com/peterjc/blast_max_target_seqs based on @sujaik's 2015 report (this gist), and highlights that both -max_target_seqs AND -num_descriptions / -num_alignments can change the top hit.

Also makes the point that these are documented under different sections of the command line help, as noted above https://gist.github.com/sujaikumar/504b3b7024eaf3a04ef5#gistcomment-2185621 by @terrycojones

@peterjc
Copy link

peterjc commented Nov 2, 2018

Part two is up at https://blastedbio.blogspot.com/2018/11/blast-max-alignment-limits-repartee-two.html

Here I focus on the question of if database order is important (as claimed in Shar et al. 2018), and how exactly the internal alignment number limit works.

As a post script I discussed https://www.biostars.org/p/341227 which is an elegant minimal test case, again based on Sujai's original.

@peterjc
Copy link

peterjc commented Nov 15, 2018

Nidhi Shah has shared a test case at https://github.com/shahnidhi/BLAST_maxtargetseq_analysis (they got in touch via a comment on my blog).

Blog part three looks at the test case from Nidhi Shah https://blastedbio.blogspot.com/2018/11/blast-max-alignment-limits-part-three.html

Blog part four at the internal alignment number limit in the context of nucleotide databases (where composition based statistics are not used) https://blastedbio.blogspot.com/2018/11/blast-max-alignment-limits-part-four.html

@peterjc
Copy link

peterjc commented May 23, 2019

Belatedly adding links to two blog posts on this issue, which I think wraps things up.

Dec 2018: Examining the way BLAST database order is defined, and implications when building your own databases: https://blastedbio.blogspot.com/2018/12/blast-tie-break-db-order.html

Jan 2019: A summary post looking at the other issue which was causing the strange results in the Shah et al 2018 test case (fixed in BLAST+ 2.8.1): https://blastedbio.blogspot.com/2019/01/blast-overly-aggressive-optimization.html

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment