Create a gist now

Instantly share code, notes, and snippets.

What would you like to do?
NCBI blastp bug - changing max_target_seqs returns incorrect top hits

NCBI blastp seems to have a bug where it reports different top hits when -max_target_seqs is changed. This is a serious problem because the first 20 hits (for example) should be the same whether -max_target_seqs 100 or -max_target_seqs 500 is used.

The bug is reproducible on the command line when searching NCBI's nr blast database (dated 25-Nov-2015) using NCBI 2.2.28+, 2.2.30+ and 2.2.31+.

At first I thought it was something to do with my local exe/blastdb, but the same problem is also apparent on the NCBI blastp web interface (as of 30-Nov-2015)

To test online, go to http://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE=Proteins

Enter the following FASTA sequence in the query text box:

>nHd.2.3.1.t00019-RA
MSNLGITDPCVDAMNSLGLKLEELQDLEVDAGLGNGGLGRLAACFMDSLATLSIPAIGYGIRYEFGIFNQRVINGEQVEE
RDDWLEFGDPWEKLRQDKKISVYFNGKTYVDKEGRSHWVDTQQIVD

Database nr (default) Leave other fields blank Check that the algorithm parameters say: Max target sequences 100 (default) Expect threshold: 1e-5 (instead of default 100) Leave rest of parameters as default. Click BLAST:

  • The first 20 hits are all eukaryotic (top hit Trichuris trichura 8e-36). No bacterial hits in results at all.

If you now change the Max target sequences to 500, and rerun, you see:

  • The top hit is Bacteria (Burkholderia kururiensis 2e-40)

This is reproducible on the command line (using the versions of blastp mentioned above):

blastp -query input.fasta -db nr -outfmt 6 -max_target_seqs 100 -evalue 1e-5 >out.1e-5.max100.txt
blastp -query input.fasta -db nr -outfmt 6 -max_target_seqs 500 -evalue 1e-5 >out.1e-5.max500.txt

Can someone else confirm that they have seen this bug? It's possible I am doing something silly, but if not, then this is a serious bug, because max_target_seqs is only supposed to change the number of matches returned, not the TOP hits. See http://www.ncbi.nlm.nih.gov/books/NBK279682/

Screenshots attached:

  1. max100.png - https://www.dropbox.com/s/sbfaviez7hon4it/max100.png?dl=0
  2. max500.png - https://www.dropbox.com/s/8z1kso2d53k3l5k/max500.png?dl=0
@sujaikumar
Owner

Thanks for testing and reproducing, Peter.

Nick Loman points out that this might be intended behaviour (https://twitter.com/pathogenomenick/status/671351106866438145) but it doesn't seem to be based on documentation at http://www.ncbi.nlm.nih.gov/books/NBK279682/ . Although it doesn't explicitly say that the best hits will always be returned on that page, am sure I've seen it somewhere (can someone find out where) that the results are always returned sorted by best e-value.

@peterjc
peterjc commented Nov 30, 2015

Reproducing with BLAST+ 2.2.18, other than slight changes to the e-values, the same seems to occur - so this is a long standing bug/feature:

$ md5sum /home/pc40583/bin/ncbi_blast/2.2.18+/blastp
8524801e5e1004fe61e484142878f78b  /home/pc40583/bin/ncbi_blast/2.2.18+/blastp
$ /home/pc40583/bin/ncbi_blast/2.2.18+/blastp  -query input.fasta -db /mnt/scratch/local/blast/ncbi/nr -outfmt 6 -max_target_seqs 500 -evalue 1e-5 > out.1e-5.max500.txt
$ /home/pc40583/bin/ncbi_blast/2.2.18+/blastp  -query input.fasta -db /mnt/scratch/local/blast/ncbi/nr -outfmt 6 -max_target_seqs 100 -evalue 1e-5 > out.1e-5.max100.txt
$ wc -l out.1e-5.max*
  100 out.1e-5.max100.txt
  500 out.1e-5.max500.txt
  600 total
$ head out.1e-5.max100.txt 
nHd.2.3.1.t00019-RA gi|669329196|gb|KFD69381.1| 61.98   121 46  0   1   121 466 586 1e-31    141
nHd.2.3.1.t00019-RA gi|669305262|gb|KFD48812.1| 61.98   121 46  0   1   121 419 539 2e-31    141
nHd.2.3.1.t00019-RA gi|730369345|gb|KHJ41189.1| 61.98   121 46  0   1   121 39  159 2e-31    141
nHd.2.3.1.t00019-RA gi|669226499|emb|CDW52156.1|    61.98   121 46  0   1   121 97  217 2e-31    141
nHd.2.3.1.t00019-RA gi|339246111|ref|XP_003374689.1|    63.11   122 45  0   1   122 154 275 8e-31    139
nHd.2.3.1.t00019-RA gi|658900651|ref|XP_008433149.1|    61.60   125 47  1   1   125 100 223 9e-30    135
nHd.2.3.1.t00019-RA gi|432921347|ref|XP_004080113.1|    60.80   125 48  1   1   125 100 223 3e-29    134
nHd.2.3.1.t00019-RA gi|195034519|ref|XP_001988914.1|    59.68   124 49  1   1   124 100 222 5e-29    133
nHd.2.3.1.t00019-RA gi|617359225|ref|XP_007556854.1|    59.20   125 50  1   1   125 100 223 5e-29    133
nHd.2.3.1.t00019-RA gi|551508704|ref|XP_005805880.1|    59.20   125 50  1   1   125 100 223 6e-29    132
$ head out.1e-5.max500.txt 
nHd.2.3.1.t00019-RA gi|754947128|ref|WP_042303394.1|    58.68   121 49  1   4   124 93  212 4e-35    153
nHd.2.3.1.t00019-RA gi|516385675|ref|WP_017775351.1|    58.68   121 49  1   4   124 93  212 4e-35    153
nHd.2.3.1.t00019-RA gi|654766970|ref|WP_028221716.1|    56.20   121 52  1   4   124 93  212 5e-34    149
nHd.2.3.1.t00019-RA gi|737555247|ref|WP_035527941.1|    55.37   121 53  1   4   124 93  212 5e-33    146
nHd.2.3.1.t00019-RA gi|654754649|ref|WP_028209749.1|    53.72   121 55  1   4   124 93  212 6e-33    146
nHd.2.3.1.t00019-RA gi|654776929|ref|WP_028231331.1|    53.72   121 55  1   4   124 93  212 6e-33    146
nHd.2.3.1.t00019-RA gi|754910044|ref|WP_042268382.1|    52.89   121 56  1   4   124 93  212 2e-32    144
nHd.2.3.1.t00019-RA gi|557035281|gb|ESQ12730.1| 55.12   127 53  2   3   125 103 229 2e-32    144
nHd.2.3.1.t00019-RA gi|667791014|ref|XP_007874646.1|    51.16   129 59  2   1   125 134 262 3e-32    144
nHd.2.3.1.t00019-RA gi|654285791|ref|WP_027800859.1|    55.37   121 53  1   4   124 93  212 4e-32    143

I tried making a small test case using lots of sequence similar to this example input, either as a single database, or via a .pal alias to better match the NR case. However, thus far these have all been well behaved.

@peterjc
peterjc commented Nov 30, 2015

I was looking for a pattern in which chunk the hits came from, e.g. the "best" hit when limiting to 100 is from chunk 24:

$ blastdbcmd -db /mnt/scratch/local/blast/ncbi/nr.24 -entry "gi|669226499|emb|CDW52156.1|"
>gi|669226499|emb|CDW52156.1| Phosphorylase domain containing protein [Trichuris trichiura]
MLTDRDRRKQISIRGIAQVENVANMKKTFNQHLHFTMIKDRNVATPRDYYFSLAHTVRDYLVSRWIRTQQHYHEKDPKRV
YYLSLEFYMGRTLSNTMLNLGIQATCDEAMYQLGLDIEELQEIEEDAGLGNGGLGRLAACFLDSMATLGVPAYGYGLRYE
YGIFKQAIKGGMQVEEPDDWLRFGNPWEKARPEYMLPINFYGRVIKDEKGRSHWIDTMLLFAMPYDTPVPGYQNNVVNTL
RLWSAKAENHFNLTFFNDGDYIQAVLDRNSAENITRVLYPNDNFFEGRELRLKQQYFLVAATLQDIVRRYRTYKDSKGNW
RSFDHFPDKVAIQLNDTHPSLAIPELMRLLVDVEGLPWEKAWNISVKTFAYTNHTVLPEALERWPVSLLGHLLPRHLEII
YEINQKFLDEVLRRWPADMDRIRRMSLVEEADQYGEKRINMAHLCIVGSHAINGVAALHSEILPFTTFTNCFRIASKIRQ
TESRPVRIGELWITDLAHLSQLKQYADNTGFLESLRRVKQENKMRAVQWLADVYKIEVNPSSMFDIQVKRIHEYKRQLLN
LMHVITMYNRIKKDPSTPVVPRSVMIGGKAAPGYHMAKQIIRLINVVADVINNDPVVGDKLKLIYLENYRVTLAEKIIPA
ADLSQQISTAGTEASGTGNMKFMLNGALTIGTLDGANVEMAEEMGRENIFIFGMTVQEVEALQAKGYNANDYIQRNPELK
QIIDQIETGFFTPSNPDMFKDVANVLKNHDRCNFFTFQHCLFAMRRLRSLHKVSRGSEQNDPPRWLRMSLYNIASSGKFS
SDRTIKDYCREIWDVPTTLERLPAPFEGPPSAEQVSKPEPAPTAAAAAAPAPAPARAPIAPTKQAATQSSLPHVKHVGSA
TGRQKGA

However, the expected best hit with no limit or rather limiting to 500, is from chunk 29:

$ blastdbcmd -db /mnt/scratch/local/blast/ncbi/nr.29 -entry "gi|754947128|ref|WP_042303394.1|"
>gi|754947128|ref|WP_042303394.1| maltodextrin phosphorylase [Burkholderia kururiensis]
MTTVDLEFDQLNSTVEALRRSISNRLMYGVGKDAVAARPQDWLHAAALAVRDRLVARWMKTTRMQYEQDAKRVYYLSMEF
LIGRTFTNALLALGIYDPMKEALASLGVDMEALTDLEPDAALGNGGLGRLAACFLDSMATLGIPGFGYGIRYEYGMFRQE
IVDGEQVEAPDYWLRAGNPWEFPRPEVQYVVHFGGRTVQREGRIEWIDTQHVNAMAYDTVIPGFATSATNTLRLWSARAT
EEFDLFAFNQGDYQRAVEARNASENVSRLLYPDDSTPAGRELRLRQEYFFVSATMQDLIRRYLRTHSTFGRFSEKVAVHL
NDTHPVLAIPELLRLLVDVHHVPWDKALAHVQEMFSYTNHTLMPEALETWDVELLARLLPRHLEIIFDINAEFLKQVSEH
SDHDVDLIRRISLVDEYGQRRVRMAHLAIVASQKVNGVSKLHSQLMTRDIFADFARMYPERFTNVTNGITPRRWLAQASP
SLSSLIDAQIGTHWRTDLFELAKLRELRNDPAFIEAFREAKRQNKIRLGHRLMHQTGVSFDPDALFDLQVKRIHEYKRQL
LNVLHVIVRYNRIRENPQRDWVPRVVMFAGKAASAYRMAKTIIKLINDVAAKINNDPAVGDRLKVVFVPNYGVSVAELII
PAADLSEQISMAGTEASGTGNMKLALNGALTIGTLDGANIEIRDAVGKDNIFIFGHSSDEVDDLRAGGYRPRQIYEENAE
LHTALDQIRTGFFSPGDPLRFSDIFHTLVDWGDHYMVLADFASFAKTQEEVDRRFRDSRAWDQSAIENVAGMGFFSSDRT
IAEYARDIWRVKPLEMG

I wrote a quick Python script to get the chunk number via repeated calls to blastdbcmd (slow but it works),

#Reads BLAST tabular output, e.g. from
#/home/pc40583/bin/ncbi_blast/2.2.31+/blastp  -query input.fasta -db /mnt/scratch/local/blast/ncbi/nr -outfmt 6 -max_target_seqs 100 -evalue 1e-5 > out.1e-5.max100.txt

import os
import sys

chunked_db = "/mnt/scratch/local/blast/ncbi/nr"
chunks = 40  # currently 0 to 39, so set chunks to 40

for filename in sys.argv[1:]:
    sys.stderr.write("NR chunk checking for results in %s\n" % filename)
    for line in open(filename):
        hit = line.split("\t", 3)[1]
        chunk = None
        for i in range(chunks):
            cmd = 'blastdbcmd -db %s.%02i -entry "%s" > /dev/null 2> /dev/null' % (chunked_db, i, hit)
            if 0 == os.system(cmd):
                chunk = i
                break
        try:
            if chunk is None:
                print("???\t" + line.rstrip("\n"))
            else:
                print("%02i\t%s" % (chunk, line.rstrip("\n")))
        except IOError:
            # Probably a broken pipe, abort
            sys.exit(1)

The output using the BLAST+ 2.2.18 results:

$ head out.1e-5.max100.txt 
nHd.2.3.1.t00019-RA gi|669329196|gb|KFD69381.1| 61.98   121 46  0   1   121 466 586 1e-31    141
nHd.2.3.1.t00019-RA gi|669305262|gb|KFD48812.1| 61.98   121 46  0   1   121 419 539 2e-31    141
nHd.2.3.1.t00019-RA gi|730369345|gb|KHJ41189.1| 61.98   121 46  0   1   121 39  159 2e-31    141
nHd.2.3.1.t00019-RA gi|669226499|emb|CDW52156.1|    61.98   121 46  0   1   121 97  217 2e-31    141
nHd.2.3.1.t00019-RA gi|339246111|ref|XP_003374689.1|    63.11   122 45  0   1   122 154 275 8e-31    139
nHd.2.3.1.t00019-RA gi|658900651|ref|XP_008433149.1|    61.60   125 47  1   1   125 100 223 9e-30    135
nHd.2.3.1.t00019-RA gi|432921347|ref|XP_004080113.1|    60.80   125 48  1   1   125 100 223 3e-29    134
nHd.2.3.1.t00019-RA gi|195034519|ref|XP_001988914.1|    59.68   124 49  1   1   124 100 222 5e-29    133
nHd.2.3.1.t00019-RA gi|617359225|ref|XP_007556854.1|    59.20   125 50  1   1   125 100 223 5e-29    133
nHd.2.3.1.t00019-RA gi|551508704|ref|XP_005805880.1|    59.20   125 50  1   1   125 100 223 6e-29    132
$ python chunk.py out.1e-5.max100.txt | head
NR chunk checking for results in out.1e-5.max100.txt
24  nHd.2.3.1.t00019-RA gi|669329196|gb|KFD69381.1| 61.98   121 46  0   1   121 466 586 1e-31    141
24  nHd.2.3.1.t00019-RA gi|669305262|gb|KFD48812.1| 61.98   121 46  0   1   121 419 539 2e-31    141
27  nHd.2.3.1.t00019-RA gi|730369345|gb|KHJ41189.1| 61.98   121 46  0   1   121 39  159 2e-31    141
24  nHd.2.3.1.t00019-RA gi|669226499|emb|CDW52156.1|    61.98   121 46  0   1   121 97  217 2e-31    141
04  nHd.2.3.1.t00019-RA gi|339246111|ref|XP_003374689.1|    63.11   122 45  0   1   122 154 275 8e-31    139
23  nHd.2.3.1.t00019-RA gi|658900651|ref|XP_008433149.1|    61.60   125 47  1   1   125 100 223 9e-30    135
09  nHd.2.3.1.t00019-RA gi|432921347|ref|XP_004080113.1|    60.80   125 48  1   1   125 100 223 3e-29    134
20  nHd.2.3.1.t00019-RA gi|195034519|ref|XP_001988914.1|    59.68   124 49  1   1   124 100 222 5e-29    133
16  nHd.2.3.1.t00019-RA gi|617359225|ref|XP_007556854.1|    59.20   125 50  1   1   125 100 223 5e-29    133
13  nHd.2.3.1.t00019-RA gi|551508704|ref|XP_005805880.1|    59.20   125 50  1   1   125 100 223 6e-29    132

And from the full / 500 limit output:

$ head out.1e-5.max500.txt 
nHd.2.3.1.t00019-RA gi|754947128|ref|WP_042303394.1|    58.68   121 49  1   4   124 93  212 4e-35    153
nHd.2.3.1.t00019-RA gi|516385675|ref|WP_017775351.1|    58.68   121 49  1   4   124 93  212 4e-35    153
nHd.2.3.1.t00019-RA gi|654766970|ref|WP_028221716.1|    56.20   121 52  1   4   124 93  212 5e-34    149
nHd.2.3.1.t00019-RA gi|737555247|ref|WP_035527941.1|    55.37   121 53  1   4   124 93  212 5e-33    146
nHd.2.3.1.t00019-RA gi|654754649|ref|WP_028209749.1|    53.72   121 55  1   4   124 93  212 6e-33    146
nHd.2.3.1.t00019-RA gi|654776929|ref|WP_028231331.1|    53.72   121 55  1   4   124 93  212 6e-33    146
nHd.2.3.1.t00019-RA gi|754910044|ref|WP_042268382.1|    52.89   121 56  1   4   124 93  212 2e-32    144
nHd.2.3.1.t00019-RA gi|557035281|gb|ESQ12730.1| 55.12   127 53  2   3   125 103 229 2e-32    144
nHd.2.3.1.t00019-RA gi|667791014|ref|XP_007874646.1|    51.16   129 59  2   1   125 134 262 3e-32    144
nHd.2.3.1.t00019-RA gi|654285791|ref|WP_027800859.1|    55.37   121 53  1   4   124 93  212 4e-32    143
$ python chunk.py out.1e-5.max500.txt | head
NR chunk checking for results in out.1e-5.max500.txt
29  nHd.2.3.1.t00019-RA gi|754947128|ref|WP_042303394.1|    58.68   121 49  1   4   124 93  212 4e-35    153
11  nHd.2.3.1.t00019-RA gi|516385675|ref|WP_017775351.1|    58.68   121 49  1   4   124 93  212 4e-35    153
23  nHd.2.3.1.t00019-RA gi|654766970|ref|WP_028221716.1|    56.20   121 52  1   4   124 93  212 5e-34    149
26  nHd.2.3.1.t00019-RA gi|737555247|ref|WP_035527941.1|    55.37   121 53  1   4   124 93  212 5e-33    146
23  nHd.2.3.1.t00019-RA gi|654754649|ref|WP_028209749.1|    53.72   121 55  1   4   124 93  212 6e-33    146
23  nHd.2.3.1.t00019-RA gi|654776929|ref|WP_028231331.1|    53.72   121 55  1   4   124 93  212 6e-33    146
29  nHd.2.3.1.t00019-RA gi|754910044|ref|WP_042268382.1|    52.89   121 56  1   4   124 93  212 2e-32    144
13  nHd.2.3.1.t00019-RA gi|557035281|gb|ESQ12730.1| 55.12   127 53  2   3   125 103 229 2e-32    144
09  nHd.2.3.1.t00019-RA gi|667791014|ref|XP_007874646.1|    51.16   129 59  2   1   125 134 262 3e-32    144
23  nHd.2.3.1.t00019-RA gi|654285791|ref|WP_027800859.1|    55.37   121 53  1   4   124 93  212 4e-32    143

I see no obvious pattern here in terms of the NR database chunk number the hits came from in how the subset from -max_target_seqs 100 is being selected. They both contain hits from chunk 13 for example.

@sujaikumar
Owner

Issue resolved thanks to prompt reply from @NCBI blast-help@ncbi.nlm.nih.gov

Hello,
Thank you for the report. We don't consider this a bug, but I agree that we should document this possibility better. This can happen because limits, including max target sequences, are applied in an early ungapped phase of the algorithm, as well as later. In some cases a final HSP will improve enough in the later gapped phase to rise to the top hits. In your case, relaxing the limit to 200 appears to have allowed hits that would have been excluded in the ungapped phase at 100 max target sequences to rise.

@peterjc
peterjc commented Nov 30, 2015

I'd agree with the message from the NCBI that this needs to be documented better. However, I still think most users would regard it as a bug rather than a feature, and as you've shown, a "feature" capable of giving some potentially very surprising behaviour. Thanks for raising this Sujai.

@damiankao

So that means max_target_seqs has nothing to do with reporting. It is actually a parameter that modifies the algorithm behavior. This is extremely poor documentation on NCBI's part.

@peterjc
peterjc commented Dec 1, 2015

Sujai kindly CC'd me on his email to the NCBI, so I also got their reply (posted above). I've have replied adding:

Quoting my email: Could I also ask does the same apply to -num_descriptions and -num_alignments, or are those actually applied as a final filter? If the later, this partly explains my confusion and frustration with #5 on http://blastedbio.blogspot.co.uk/2014/12/blast-christmas-wish-list.html

@peterjc
peterjc commented Dec 1, 2015

P.S. Here's the original Tweet from Sujai which also garnered a lot of comments https://twitter.com/sujaik/status/671333856461660160 including confirmation of the surprising behaviour in the NCBI online BLAST server by @drchriscole

@harmn
harmn commented Dec 3, 2015

The same actually holds true for the e-value theshold (at least in version 2.2.28+), this is from an email from the NCBI User Services:

The cutoffs are applied during an ungapped extension phase, and some
alignments may improve in score during gapped extension. The safest
approach is to relax (increase) the threshold and deal with a larger
result set.

@peterjc
peterjc commented Dec 18, 2015

Thanks @harmn - that's another surprising design choice. I wonder if this was true back in "legacy" BLAST too?

P.S. I put up a short blog post on this bug: http://blastedbio.blogspot.co.uk/2015/12/blast-max-target-sequences-bug.html

@tshalev
tshalev commented Apr 23, 2016 edited

Really interesting - so would it be best to just run everything with default settings and then filter later? I also wonder whether blasting into archival format (11) with only default settings and then using max_target_seqs when formatting to other formats would get you the results you desire...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment