Create a gist now

Instantly share code, notes, and snippets.

What would you like to do?
NCBI blastp bug - changing max_target_seqs returns incorrect top hits

NCBI blastp seems to have a bug where it reports different top hits when -max_target_seqs is changed. This is a serious problem because the first 20 hits (for example) should be the same whether -max_target_seqs 100 or -max_target_seqs 500 is used.

The bug is reproducible on the command line when searching NCBI's nr blast database (dated 25-Nov-2015) using NCBI 2.2.28+, 2.2.30+ and 2.2.31+.

At first I thought it was something to do with my local exe/blastdb, but the same problem is also apparent on the NCBI blastp web interface (as of 30-Nov-2015)

To test online, go to http://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE=Proteins

Enter the following FASTA sequence in the query text box:

>nHd.2.3.1.t00019-RA
MSNLGITDPCVDAMNSLGLKLEELQDLEVDAGLGNGGLGRLAACFMDSLATLSIPAIGYGIRYEFGIFNQRVINGEQVEE
RDDWLEFGDPWEKLRQDKKISVYFNGKTYVDKEGRSHWVDTQQIVD

Database nr (default) Leave other fields blank Check that the algorithm parameters say: Max target sequences 100 (default) Expect threshold: 1e-5 (instead of default 100) Leave rest of parameters as default. Click BLAST:

  • The first 20 hits are all eukaryotic (top hit Trichuris trichura 8e-36). No bacterial hits in results at all.

If you now change the Max target sequences to 500, and rerun, you see:

  • The top hit is Bacteria (Burkholderia kururiensis 2e-40)

This is reproducible on the command line (using the versions of blastp mentioned above):

blastp -query input.fasta -db nr -outfmt 6 -max_target_seqs 100 -evalue 1e-5 >out.1e-5.max100.txt
blastp -query input.fasta -db nr -outfmt 6 -max_target_seqs 500 -evalue 1e-5 >out.1e-5.max500.txt

Can someone else confirm that they have seen this bug? It's possible I am doing something silly, but if not, then this is a serious bug, because max_target_seqs is only supposed to change the number of matches returned, not the TOP hits. See http://www.ncbi.nlm.nih.gov/books/NBK279682/

Screenshots attached:

  1. max100.png - https://www.dropbox.com/s/sbfaviez7hon4it/max100.png?dl=0
  2. max500.png - https://www.dropbox.com/s/8z1kso2d53k3l5k/max500.png?dl=0
Owner

sujaikumar commented Nov 30, 2015

Thanks for testing and reproducing, Peter.

Nick Loman points out that this might be intended behaviour (https://twitter.com/pathogenomenick/status/671351106866438145) but it doesn't seem to be based on documentation at http://www.ncbi.nlm.nih.gov/books/NBK279682/ . Although it doesn't explicitly say that the best hits will always be returned on that page, am sure I've seen it somewhere (can someone find out where) that the results are always returned sorted by best e-value.

peterjc commented Nov 30, 2015

Reproducing with BLAST+ 2.2.18, other than slight changes to the e-values, the same seems to occur - so this is a long standing bug/feature:

$ md5sum /home/pc40583/bin/ncbi_blast/2.2.18+/blastp
8524801e5e1004fe61e484142878f78b  /home/pc40583/bin/ncbi_blast/2.2.18+/blastp
$ /home/pc40583/bin/ncbi_blast/2.2.18+/blastp  -query input.fasta -db /mnt/scratch/local/blast/ncbi/nr -outfmt 6 -max_target_seqs 500 -evalue 1e-5 > out.1e-5.max500.txt
$ /home/pc40583/bin/ncbi_blast/2.2.18+/blastp  -query input.fasta -db /mnt/scratch/local/blast/ncbi/nr -outfmt 6 -max_target_seqs 100 -evalue 1e-5 > out.1e-5.max100.txt
$ wc -l out.1e-5.max*
  100 out.1e-5.max100.txt
  500 out.1e-5.max500.txt
  600 total
$ head out.1e-5.max100.txt 
nHd.2.3.1.t00019-RA gi|669329196|gb|KFD69381.1| 61.98   121 46  0   1   121 466 586 1e-31    141
nHd.2.3.1.t00019-RA gi|669305262|gb|KFD48812.1| 61.98   121 46  0   1   121 419 539 2e-31    141
nHd.2.3.1.t00019-RA gi|730369345|gb|KHJ41189.1| 61.98   121 46  0   1   121 39  159 2e-31    141
nHd.2.3.1.t00019-RA gi|669226499|emb|CDW52156.1|    61.98   121 46  0   1   121 97  217 2e-31    141
nHd.2.3.1.t00019-RA gi|339246111|ref|XP_003374689.1|    63.11   122 45  0   1   122 154 275 8e-31    139
nHd.2.3.1.t00019-RA gi|658900651|ref|XP_008433149.1|    61.60   125 47  1   1   125 100 223 9e-30    135
nHd.2.3.1.t00019-RA gi|432921347|ref|XP_004080113.1|    60.80   125 48  1   1   125 100 223 3e-29    134
nHd.2.3.1.t00019-RA gi|195034519|ref|XP_001988914.1|    59.68   124 49  1   1   124 100 222 5e-29    133
nHd.2.3.1.t00019-RA gi|617359225|ref|XP_007556854.1|    59.20   125 50  1   1   125 100 223 5e-29    133
nHd.2.3.1.t00019-RA gi|551508704|ref|XP_005805880.1|    59.20   125 50  1   1   125 100 223 6e-29    132
$ head out.1e-5.max500.txt 
nHd.2.3.1.t00019-RA gi|754947128|ref|WP_042303394.1|    58.68   121 49  1   4   124 93  212 4e-35    153
nHd.2.3.1.t00019-RA gi|516385675|ref|WP_017775351.1|    58.68   121 49  1   4   124 93  212 4e-35    153
nHd.2.3.1.t00019-RA gi|654766970|ref|WP_028221716.1|    56.20   121 52  1   4   124 93  212 5e-34    149
nHd.2.3.1.t00019-RA gi|737555247|ref|WP_035527941.1|    55.37   121 53  1   4   124 93  212 5e-33    146
nHd.2.3.1.t00019-RA gi|654754649|ref|WP_028209749.1|    53.72   121 55  1   4   124 93  212 6e-33    146
nHd.2.3.1.t00019-RA gi|654776929|ref|WP_028231331.1|    53.72   121 55  1   4   124 93  212 6e-33    146
nHd.2.3.1.t00019-RA gi|754910044|ref|WP_042268382.1|    52.89   121 56  1   4   124 93  212 2e-32    144
nHd.2.3.1.t00019-RA gi|557035281|gb|ESQ12730.1| 55.12   127 53  2   3   125 103 229 2e-32    144
nHd.2.3.1.t00019-RA gi|667791014|ref|XP_007874646.1|    51.16   129 59  2   1   125 134 262 3e-32    144
nHd.2.3.1.t00019-RA gi|654285791|ref|WP_027800859.1|    55.37   121 53  1   4   124 93  212 4e-32    143

I tried making a small test case using lots of sequence similar to this example input, either as a single database, or via a .pal alias to better match the NR case. However, thus far these have all been well behaved.

peterjc commented Nov 30, 2015

I was looking for a pattern in which chunk the hits came from, e.g. the "best" hit when limiting to 100 is from chunk 24:

$ blastdbcmd -db /mnt/scratch/local/blast/ncbi/nr.24 -entry "gi|669226499|emb|CDW52156.1|"
>gi|669226499|emb|CDW52156.1| Phosphorylase domain containing protein [Trichuris trichiura]
MLTDRDRRKQISIRGIAQVENVANMKKTFNQHLHFTMIKDRNVATPRDYYFSLAHTVRDYLVSRWIRTQQHYHEKDPKRV
YYLSLEFYMGRTLSNTMLNLGIQATCDEAMYQLGLDIEELQEIEEDAGLGNGGLGRLAACFLDSMATLGVPAYGYGLRYE
YGIFKQAIKGGMQVEEPDDWLRFGNPWEKARPEYMLPINFYGRVIKDEKGRSHWIDTMLLFAMPYDTPVPGYQNNVVNTL
RLWSAKAENHFNLTFFNDGDYIQAVLDRNSAENITRVLYPNDNFFEGRELRLKQQYFLVAATLQDIVRRYRTYKDSKGNW
RSFDHFPDKVAIQLNDTHPSLAIPELMRLLVDVEGLPWEKAWNISVKTFAYTNHTVLPEALERWPVSLLGHLLPRHLEII
YEINQKFLDEVLRRWPADMDRIRRMSLVEEADQYGEKRINMAHLCIVGSHAINGVAALHSEILPFTTFTNCFRIASKIRQ
TESRPVRIGELWITDLAHLSQLKQYADNTGFLESLRRVKQENKMRAVQWLADVYKIEVNPSSMFDIQVKRIHEYKRQLLN
LMHVITMYNRIKKDPSTPVVPRSVMIGGKAAPGYHMAKQIIRLINVVADVINNDPVVGDKLKLIYLENYRVTLAEKIIPA
ADLSQQISTAGTEASGTGNMKFMLNGALTIGTLDGANVEMAEEMGRENIFIFGMTVQEVEALQAKGYNANDYIQRNPELK
QIIDQIETGFFTPSNPDMFKDVANVLKNHDRCNFFTFQHCLFAMRRLRSLHKVSRGSEQNDPPRWLRMSLYNIASSGKFS
SDRTIKDYCREIWDVPTTLERLPAPFEGPPSAEQVSKPEPAPTAAAAAAPAPAPARAPIAPTKQAATQSSLPHVKHVGSA
TGRQKGA

However, the expected best hit with no limit or rather limiting to 500, is from chunk 29:

$ blastdbcmd -db /mnt/scratch/local/blast/ncbi/nr.29 -entry "gi|754947128|ref|WP_042303394.1|"
>gi|754947128|ref|WP_042303394.1| maltodextrin phosphorylase [Burkholderia kururiensis]
MTTVDLEFDQLNSTVEALRRSISNRLMYGVGKDAVAARPQDWLHAAALAVRDRLVARWMKTTRMQYEQDAKRVYYLSMEF
LIGRTFTNALLALGIYDPMKEALASLGVDMEALTDLEPDAALGNGGLGRLAACFLDSMATLGIPGFGYGIRYEYGMFRQE
IVDGEQVEAPDYWLRAGNPWEFPRPEVQYVVHFGGRTVQREGRIEWIDTQHVNAMAYDTVIPGFATSATNTLRLWSARAT
EEFDLFAFNQGDYQRAVEARNASENVSRLLYPDDSTPAGRELRLRQEYFFVSATMQDLIRRYLRTHSTFGRFSEKVAVHL
NDTHPVLAIPELLRLLVDVHHVPWDKALAHVQEMFSYTNHTLMPEALETWDVELLARLLPRHLEIIFDINAEFLKQVSEH
SDHDVDLIRRISLVDEYGQRRVRMAHLAIVASQKVNGVSKLHSQLMTRDIFADFARMYPERFTNVTNGITPRRWLAQASP
SLSSLIDAQIGTHWRTDLFELAKLRELRNDPAFIEAFREAKRQNKIRLGHRLMHQTGVSFDPDALFDLQVKRIHEYKRQL
LNVLHVIVRYNRIRENPQRDWVPRVVMFAGKAASAYRMAKTIIKLINDVAAKINNDPAVGDRLKVVFVPNYGVSVAELII
PAADLSEQISMAGTEASGTGNMKLALNGALTIGTLDGANIEIRDAVGKDNIFIFGHSSDEVDDLRAGGYRPRQIYEENAE
LHTALDQIRTGFFSPGDPLRFSDIFHTLVDWGDHYMVLADFASFAKTQEEVDRRFRDSRAWDQSAIENVAGMGFFSSDRT
IAEYARDIWRVKPLEMG

I wrote a quick Python script to get the chunk number via repeated calls to blastdbcmd (slow but it works),

#Reads BLAST tabular output, e.g. from
#/home/pc40583/bin/ncbi_blast/2.2.31+/blastp  -query input.fasta -db /mnt/scratch/local/blast/ncbi/nr -outfmt 6 -max_target_seqs 100 -evalue 1e-5 > out.1e-5.max100.txt

import os
import sys

chunked_db = "/mnt/scratch/local/blast/ncbi/nr"
chunks = 40  # currently 0 to 39, so set chunks to 40

for filename in sys.argv[1:]:
    sys.stderr.write("NR chunk checking for results in %s\n" % filename)
    for line in open(filename):
        hit = line.split("\t", 3)[1]
        chunk = None
        for i in range(chunks):
            cmd = 'blastdbcmd -db %s.%02i -entry "%s" > /dev/null 2> /dev/null' % (chunked_db, i, hit)
            if 0 == os.system(cmd):
                chunk = i
                break
        try:
            if chunk is None:
                print("???\t" + line.rstrip("\n"))
            else:
                print("%02i\t%s" % (chunk, line.rstrip("\n")))
        except IOError:
            # Probably a broken pipe, abort
            sys.exit(1)

The output using the BLAST+ 2.2.18 results:

$ head out.1e-5.max100.txt 
nHd.2.3.1.t00019-RA gi|669329196|gb|KFD69381.1| 61.98   121 46  0   1   121 466 586 1e-31    141
nHd.2.3.1.t00019-RA gi|669305262|gb|KFD48812.1| 61.98   121 46  0   1   121 419 539 2e-31    141
nHd.2.3.1.t00019-RA gi|730369345|gb|KHJ41189.1| 61.98   121 46  0   1   121 39  159 2e-31    141
nHd.2.3.1.t00019-RA gi|669226499|emb|CDW52156.1|    61.98   121 46  0   1   121 97  217 2e-31    141
nHd.2.3.1.t00019-RA gi|339246111|ref|XP_003374689.1|    63.11   122 45  0   1   122 154 275 8e-31    139
nHd.2.3.1.t00019-RA gi|658900651|ref|XP_008433149.1|    61.60   125 47  1   1   125 100 223 9e-30    135
nHd.2.3.1.t00019-RA gi|432921347|ref|XP_004080113.1|    60.80   125 48  1   1   125 100 223 3e-29    134
nHd.2.3.1.t00019-RA gi|195034519|ref|XP_001988914.1|    59.68   124 49  1   1   124 100 222 5e-29    133
nHd.2.3.1.t00019-RA gi|617359225|ref|XP_007556854.1|    59.20   125 50  1   1   125 100 223 5e-29    133
nHd.2.3.1.t00019-RA gi|551508704|ref|XP_005805880.1|    59.20   125 50  1   1   125 100 223 6e-29    132
$ python chunk.py out.1e-5.max100.txt | head
NR chunk checking for results in out.1e-5.max100.txt
24  nHd.2.3.1.t00019-RA gi|669329196|gb|KFD69381.1| 61.98   121 46  0   1   121 466 586 1e-31    141
24  nHd.2.3.1.t00019-RA gi|669305262|gb|KFD48812.1| 61.98   121 46  0   1   121 419 539 2e-31    141
27  nHd.2.3.1.t00019-RA gi|730369345|gb|KHJ41189.1| 61.98   121 46  0   1   121 39  159 2e-31    141
24  nHd.2.3.1.t00019-RA gi|669226499|emb|CDW52156.1|    61.98   121 46  0   1   121 97  217 2e-31    141
04  nHd.2.3.1.t00019-RA gi|339246111|ref|XP_003374689.1|    63.11   122 45  0   1   122 154 275 8e-31    139
23  nHd.2.3.1.t00019-RA gi|658900651|ref|XP_008433149.1|    61.60   125 47  1   1   125 100 223 9e-30    135
09  nHd.2.3.1.t00019-RA gi|432921347|ref|XP_004080113.1|    60.80   125 48  1   1   125 100 223 3e-29    134
20  nHd.2.3.1.t00019-RA gi|195034519|ref|XP_001988914.1|    59.68   124 49  1   1   124 100 222 5e-29    133
16  nHd.2.3.1.t00019-RA gi|617359225|ref|XP_007556854.1|    59.20   125 50  1   1   125 100 223 5e-29    133
13  nHd.2.3.1.t00019-RA gi|551508704|ref|XP_005805880.1|    59.20   125 50  1   1   125 100 223 6e-29    132

And from the full / 500 limit output:

$ head out.1e-5.max500.txt 
nHd.2.3.1.t00019-RA gi|754947128|ref|WP_042303394.1|    58.68   121 49  1   4   124 93  212 4e-35    153
nHd.2.3.1.t00019-RA gi|516385675|ref|WP_017775351.1|    58.68   121 49  1   4   124 93  212 4e-35    153
nHd.2.3.1.t00019-RA gi|654766970|ref|WP_028221716.1|    56.20   121 52  1   4   124 93  212 5e-34    149
nHd.2.3.1.t00019-RA gi|737555247|ref|WP_035527941.1|    55.37   121 53  1   4   124 93  212 5e-33    146
nHd.2.3.1.t00019-RA gi|654754649|ref|WP_028209749.1|    53.72   121 55  1   4   124 93  212 6e-33    146
nHd.2.3.1.t00019-RA gi|654776929|ref|WP_028231331.1|    53.72   121 55  1   4   124 93  212 6e-33    146
nHd.2.3.1.t00019-RA gi|754910044|ref|WP_042268382.1|    52.89   121 56  1   4   124 93  212 2e-32    144
nHd.2.3.1.t00019-RA gi|557035281|gb|ESQ12730.1| 55.12   127 53  2   3   125 103 229 2e-32    144
nHd.2.3.1.t00019-RA gi|667791014|ref|XP_007874646.1|    51.16   129 59  2   1   125 134 262 3e-32    144
nHd.2.3.1.t00019-RA gi|654285791|ref|WP_027800859.1|    55.37   121 53  1   4   124 93  212 4e-32    143
$ python chunk.py out.1e-5.max500.txt | head
NR chunk checking for results in out.1e-5.max500.txt
29  nHd.2.3.1.t00019-RA gi|754947128|ref|WP_042303394.1|    58.68   121 49  1   4   124 93  212 4e-35    153
11  nHd.2.3.1.t00019-RA gi|516385675|ref|WP_017775351.1|    58.68   121 49  1   4   124 93  212 4e-35    153
23  nHd.2.3.1.t00019-RA gi|654766970|ref|WP_028221716.1|    56.20   121 52  1   4   124 93  212 5e-34    149
26  nHd.2.3.1.t00019-RA gi|737555247|ref|WP_035527941.1|    55.37   121 53  1   4   124 93  212 5e-33    146
23  nHd.2.3.1.t00019-RA gi|654754649|ref|WP_028209749.1|    53.72   121 55  1   4   124 93  212 6e-33    146
23  nHd.2.3.1.t00019-RA gi|654776929|ref|WP_028231331.1|    53.72   121 55  1   4   124 93  212 6e-33    146
29  nHd.2.3.1.t00019-RA gi|754910044|ref|WP_042268382.1|    52.89   121 56  1   4   124 93  212 2e-32    144
13  nHd.2.3.1.t00019-RA gi|557035281|gb|ESQ12730.1| 55.12   127 53  2   3   125 103 229 2e-32    144
09  nHd.2.3.1.t00019-RA gi|667791014|ref|XP_007874646.1|    51.16   129 59  2   1   125 134 262 3e-32    144
23  nHd.2.3.1.t00019-RA gi|654285791|ref|WP_027800859.1|    55.37   121 53  1   4   124 93  212 4e-32    143

I see no obvious pattern here in terms of the NR database chunk number the hits came from in how the subset from -max_target_seqs 100 is being selected. They both contain hits from chunk 13 for example.

Owner

sujaikumar commented Nov 30, 2015

Issue resolved thanks to prompt reply from @ncbi blast-help@ncbi.nlm.nih.gov

Hello,
Thank you for the report. We don't consider this a bug, but I agree that we should document this possibility better. This can happen because limits, including max target sequences, are applied in an early ungapped phase of the algorithm, as well as later. In some cases a final HSP will improve enough in the later gapped phase to rise to the top hits. In your case, relaxing the limit to 200 appears to have allowed hits that would have been excluded in the ungapped phase at 100 max target sequences to rise.

peterjc commented Nov 30, 2015

I'd agree with the message from the NCBI that this needs to be documented better. However, I still think most users would regard it as a bug rather than a feature, and as you've shown, a "feature" capable of giving some potentially very surprising behaviour. Thanks for raising this Sujai.

So that means max_target_seqs has nothing to do with reporting. It is actually a parameter that modifies the algorithm behavior. This is extremely poor documentation on NCBI's part.

peterjc commented Dec 1, 2015

Sujai kindly CC'd me on his email to the NCBI, so I also got their reply (posted above). I've have replied adding:

Quoting my email: Could I also ask does the same apply to -num_descriptions and -num_alignments, or are those actually applied as a final filter? If the later, this partly explains my confusion and frustration with #5 on http://blastedbio.blogspot.co.uk/2014/12/blast-christmas-wish-list.html

peterjc commented Dec 1, 2015

P.S. Here's the original Tweet from Sujai which also garnered a lot of comments https://twitter.com/sujaik/status/671333856461660160 including confirmation of the surprising behaviour in the NCBI online BLAST server by @drchriscole

harmn commented Dec 3, 2015

The same actually holds true for the e-value theshold (at least in version 2.2.28+), this is from an email from the NCBI User Services:

The cutoffs are applied during an ungapped extension phase, and some
alignments may improve in score during gapped extension. The safest
approach is to relax (increase) the threshold and deal with a larger
result set.

peterjc commented Dec 18, 2015

Thanks @harmn - that's another surprising design choice. I wonder if this was true back in "legacy" BLAST too?

P.S. I put up a short blog post on this bug: http://blastedbio.blogspot.co.uk/2015/12/blast-max-target-sequences-bug.html

tshalev commented Apr 23, 2016

Really interesting - so would it be best to just run everything with default settings and then filter later? I also wonder whether blasting into archival format (11) with only default settings and then using max_target_seqs when formatting to other formats would get you the results you desire...

terrycojones commented Aug 25, 2017

I've just run across this gist - thanks all for the digging and commenting. I saw a really alarming result last night and spent most of today trying to figure out what was happening.

The short summary is that I get a strong hit for a sequence from one database but when I make a slightly bigger database and re-run BLAST the hit does not appear at all. I eventually managed to find the hit in the output by increasing any one of the -num_alignments, -num_descriptions, or -max_target_seqs options (depending on -outfmt).

This seems like a DREADFUL PROBLEM. One of the top hits (the 5th) simply vanishes from the (default) BLAST output just because the database is slightly larger. How many unwitting BLAST users are there out there who are assuming that BLAST always shows them the top results, i.e., that -max_target_seqs and its ilk just affect what is displayed and not the search itself. I've been using BLAST for some years now and am a bit in shock.... how many strong hits have I missed because of this? And what about the wider world of BLAST users? The mind boggles.

Below is my write up.


I have a strange situation. I have two BLAST databases (20170821 and 20170824) that both contain a certain subject sequence LC074724. When I query them both with a given query, Q, in one case (20170824) there is a very strong match and in the other (20170821) no match at all.

I'm trying to figure out why.

BLAST version

$ blastn -version
blastn: 2.6.0+
 Package: blast 2.6.0, build May 11 2017 22:22:40

The subject sequence is the same in both databases

Both databases contain an identical subject sequence:

$ blastdbcmd -entry LC074724 -db 20170824 | md5
fcfbc167c7f81dfd75aad1bf3db6220b

$ blastdbcmd -entry LC074724 -db 20170821 | md5
fcfbc167c7f81dfd75aad1bf3db6220b

Doing a match

There is a good match of Q against LC074724 in 20170824:

$ blastn -outfmt '6 qaccver saccver bitscore' -db 20170824 -query Q.fasta \
  -task blastn -max_hsps 1 | grep LC074724
Q	LC074724.1	2581

Note that the bitscore of the LC074724 match is 2581.

But there is no match of LC074724 at all in 20170821:

$ blastn -outfmt '6 qaccver saccver bitscore' -db 20170821 -query Q.fasta \
-task blastn -max_hsps 1 | grep LC074724

The option -outfmt '6 qaccver saccver bitscore' tells BLAST to print the query accession number and version, the subject accession and version, and the bitscore of the match.

What are the best hits for 20170821

So what are the bitscores of the Q matches in 20170821?

$ blastn -outfmt '6 bitscore' -db 20170821 -query Q.fasta -task blastn -max_hsps 1 | sort -nr | head
3301
2616
2590
2590
2563
2554
2547
2544
2540
2540

The bitscore of Q against LC074724 in 20170824 (2581) would have placed it in 5th position in the top 10 matches of Q, but LC074724 is not matched at all in 20170821! The lowest bitscores matched in 20170821 is 2354.

What's the difference between the databases?

Get the ids of the subjects in each database

$ blastdbcmd -entry all -db 20170824 | egrep '^>' | cut -c2- | sort > 20170824.ids
$ blastdbcmd -entry all -db 20170821 | egrep '^>' | cut -c2- | sort > 20170821.ids
$ wc -l 2017082?.ids
    9754 20170821.ids
    9111 20170824.ids
   18865 total

20170821 has 643 sequences that are not in 20170824:

$ comm -23 20170821.ids 20170824.ids | wc -l
     643

20170824 has no sequences that are not in 20170821:

$ comm -13 20170821.ids 20170824.ids | wc -l
       0

and the two have 9111 sequences in common:

$ comm -12 20170821.ids 20170824.ids | wc -l
    9111

These numbers match with the numbers from wc -l above.

So 20170821 is just 20170824 with an extra 643 sequences.

Maybe LC074724 matches, but with a low bit score?

BLAST only shows the top 500 matched subjects by default. Let's ask it to show more:

$ blastn -outfmt '6 qaccver saccver bitscore' -db 20170821 -query Q.fasta \
  -task blastn -max_hsps 1 -max_target_seqs 10000 > 20170821-10000-matches.txt

And....

$ head 20170821-10000-matches.txt
Q	Q	3301
Q	X234A	2616
Q	CS388973.1	2590
Q	DM059402.1	2590
Q	LC074724.1	2581
Q	KJ843188.1	2576
Q	AB697496.1	2576
Q	AB697499.1	2576
Q	AB697503.1	2576
Q	AB697508.1	2576

Holy shit, there it is! And it's in 5th position, just as it should be (based on the bitscores discussed above).

Just to confirm, taking away the -max_target_seqs option gets no match:

$ blastn -outfmt '6 qaccver saccver bitscore' -db 20170821 -query Q.fasta \
  -task blastn -max_hsps 1 | grep LC074724
$

HOW ON EARTH CAN THAT BE HAPPENING?

Three confusingly named BLAST options

I've always found the names and descriptions of the following three BLAST command-line options very confusing.

From the Formatting options section of blastn -help:

-num_descriptions <Integer, >=0>
  Number of database sequences to show one-line descriptions for
  Not applicable for outfmt > 4
  Default = `500'
   * Incompatible with:  max_target_seqs

-num_alignments <Integer, >=0>
  Number of database sequences to show alignments for
  Default = `250'
   * Incompatible with:  max_target_seqs

And from the Restrict search or results section:

-max_target_seqs <Integer, >=1>
  Maximum number of aligned sequences to keep
  Not applicable for outfmt <= 4
  Default = `500'
   * Incompatible with:  num_descriptions, num_alignments

From the section names, it seems like only the latter option (-max_target_seqs) would have any effect on the search, and that the former two are just about what is displayed.

But... using either just -num_descriptions or -num_alignments with a big value also gets a match:

$ blastn -db 20170821 -query Q.fasta -task blastn -max_hsps 1 -num_descriptions 10000 | grep LC074724
LC074724.1  Hepatitis B virus DNA, complete genome, isolate: p621     2581    0.0
>LC074724.1 Hepatitis B virus DNA, complete genome, isolate: p621
$ blastn -db 20170821 -query Q.fasta -task blastn -max_hsps 1 -num_alignments 10000 | grep LC074724
LC074724.1  Hepatitis B virus DNA, complete genome, isolate: p621     2581    0.0
>LC074724.1 Hepatitis B virus DNA, complete genome, isolate: p621

So BLAST is apparently finding the LC074724 match even without the high -max_target_seqs option, but it's not displaying it unless there is a high -num_alignments or -num_descriptions value.

That makes no sense at all. BLAST is finding the match without the -max_target_seqs option and the match has the 5th highest score, so why is it not being displayed?

I don't understand the difference between -num_alignments and -num_descriptions.

Maybe -num_alignments is per query and -num_descriptions is in the overall result set (i.e., across all queries). BLAST seems to use "alignment" to refer to a match between a query and a single subject (and that match may have multiple high-scoring pairs (HSPs), each with its own bit score). But why is there a -num_descriptions option? I can only think that BLAST is allocating memory to hold subject descriptions and this option is used to allocate some fixed storage for the overall search. So, possibly, if the -num_descriptions limit is reached during the search, BLAST gives up because it knows it cannot store more subject descriptions. But only setting -num_alignments also gets me a match, so I don't know.

Just to confirm, when none of the 3 options are given and no -outfmt option is either, the match is not found:

$ blastn -db 20170821 -query Q.fasta -task blastn -max_hsps 1 | grep LC074724
$

WTF?

I've been using BLAST for the last 4 years or so, and have only today realized that something like this could occur.

It's of course super important.

And despite all the above, I still don't understand what's going on.

@tshalev, you ask "so would it be best to just run everything with default settings and then filter later?"

Unfortunately not. The problem is that with the default settings a very high-scoring match may be prematurely discarded in the ungapped stage. So if you just leave the defaults you'll never see it. You need to pass some increased value for -max_target_seqs (for -outfmt >= 4) or the other options (-outfmt < 4) but there is absolutely no guidance as to what value you should use. It's a terrible situation.

In my case (above) I now find that if I run with -max_target_seqs 506 I get the expected match in the 3rd highest position. If I use -max_target_seqs 505 that match is not present in the output. How can a user know what to set -max_target_seqs to when they don't even know what they might be missing? This problem is so serious it makes BLAST suddenly seem almost useless (or at least terribly unreliable), as you may have false negatives that are actually the very best matches. Maybe the only thing to do is set -max_target_seqs to some extremely high value (like the number of sequences in your database) and then filter the output. That approach is what the NCBI mail is suggesting we do (though presumably with a smaller -max_target_seqs than the entire database size - but how small?). I.e., do not rely on -max_target_seqs to filter things for you because it is applied too early in the BLAST matching procedure.

I hope I don't sound too cranky here. I just can't believe this is the default behavior of BLAST, that the documentation of this option is so poor, that the option's name is really misleading, that the impact is so severe (losing the very best hits!), and that there must be many many thousands of people using BLAST who (understandably) have no inkling or expectation that this would be the default behavior. All this in a program that has been around for so long, has had so much work done on it, etc. And where do people go to to discuss such a thing? Not to an issue tracker, or similar, but to gist.

isaphan commented Nov 2, 2017

@terrycojones thanks for your comment. Did you ever get an answer from blast-help@ncbi.nlm.nih.gov ?

voiDnyx commented Nov 14, 2017

Is this still an issue? I found nothing in the release notes of recent ncbi-blast versions saying something changed. Also is this commonly known today? Because I was totally not expecting this behaviour.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment