NCBI blastp seems to have a bug where it reports different top hits when -max_target_seqs is changed. This is a serious problem because the first 20 hits (for example) should be the same whether -max_target_seqs 100 or -max_target_seqs 500 is used.
The bug is reproducible on the command line when searching NCBI's nr blast database (dated 25-Nov-2015) using NCBI 2.2.28+, 2.2.30+ and 2.2.31+.
At first I thought it was something to do with my local exe/blastdb, but the same problem is also apparent on the NCBI blastp web interface (as of 30-Nov-2015)
To test online, go to http://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE=Proteins
Enter the following FASTA sequence in the query text box:
>nHd.2.3.1.t00019-RA
MSNLGITDPCVDAMNSLGLKLEELQDLEVDAGLGNGGLGRLAACFMDSLATLSIPAIGYGIRYEFGIFNQRVINGEQVEE
RDDWLEFGDPWEKLRQDKKISVYFNGKTYVDKEGRSHWVDTQQIVD
Database nr (default) Leave other fields blank Check that the algorithm parameters say: Max target sequences 100 (default) Expect threshold: 1e-5 (instead of default 100) Leave rest of parameters as default. Click BLAST:
- The first 20 hits are all eukaryotic (top hit Trichuris trichura 8e-36). No bacterial hits in results at all.
If you now change the Max target sequences to 500, and rerun, you see:
- The top hit is Bacteria (Burkholderia kururiensis 2e-40)
This is reproducible on the command line (using the versions of blastp mentioned above):
blastp -query input.fasta -db nr -outfmt 6 -max_target_seqs 100 -evalue 1e-5 >out.1e-5.max100.txt
blastp -query input.fasta -db nr -outfmt 6 -max_target_seqs 500 -evalue 1e-5 >out.1e-5.max500.txt
Can someone else confirm that they have seen this bug? It's possible I am doing something silly, but if not, then this is a serious bug, because max_target_seqs is only supposed to change the number of matches returned, not the TOP hits. See http://www.ncbi.nlm.nih.gov/books/NBK279682/
Screenshots attached:
I've just run across this gist - thanks all for the digging and commenting. I saw a really alarming result last night and spent most of today trying to figure out what was happening.
The short summary is that I get a strong hit for a sequence from one database but when I make a slightly bigger database and re-run BLAST the hit does not appear at all. I eventually managed to find the hit in the output by increasing any one of the
-num_alignments
,-num_descriptions
, or-max_target_seqs
options (depending on-outfmt
).This seems like a DREADFUL PROBLEM. One of the top hits (the 5th) simply vanishes from the (default) BLAST output just because the database is slightly larger. How many unwitting BLAST users are there out there who are assuming that BLAST always shows them the top results, i.e., that
-max_target_seqs
and its ilk just affect what is displayed and not the search itself. I've been using BLAST for some years now and am a bit in shock.... how many strong hits have I missed because of this? And what about the wider world of BLAST users? The mind boggles.Below is my write up.
I have a strange situation. I have two BLAST databases (
20170821
and20170824
) that both contain a certain subject sequenceLC074724
. When I query them both with a given query,Q
, in one case (20170824
) there is a very strong match and in the other (20170821
) no match at all.I'm trying to figure out why.
BLAST version
The subject sequence is the same in both databases
Both databases contain an identical subject sequence:
Doing a match
There is a good match of
Q
againstLC074724
in20170824
:Note that the bitscore of the
LC074724
match is 2581.But there is no match of
LC074724
at all in20170821
:The option
-outfmt '6 qaccver saccver bitscore'
tells BLAST to print the query accession number and version, the subject accession and version, and the bitscore of the match.What are the best hits for 20170821
So what are the bitscores of the
Q
matches in20170821
?The bitscore of
Q
againstLC074724
in20170824
(2581) would have placed it in 5th position in the top 10 matches ofQ
, butLC074724
is not matched at all in20170821
! The lowest bitscores matched in20170821
is 2354.What's the difference between the databases?
Get the ids of the subjects in each database
$ wc -l 2017082?.ids 9754 20170821.ids 9111 20170824.ids 18865 total
20170821
has 643 sequences that are not in20170824
:$ comm -23 20170821.ids 20170824.ids | wc -l 643
20170824
has no sequences that are not in20170821
:$ comm -13 20170821.ids 20170824.ids | wc -l 0
and the two have 9111 sequences in common:
$ comm -12 20170821.ids 20170824.ids | wc -l 9111
These numbers match with the numbers from
wc -l
above.So
20170821
is just20170824
with an extra 643 sequences.Maybe LC074724 matches, but with a low bit score?
BLAST only shows the top 500 matched subjects by default. Let's ask it to show more:
And....
Holy shit, there it is! And it's in 5th position, just as it should be (based on the bitscores discussed above).
Just to confirm, taking away the
-max_target_seqs
option gets no match:HOW ON EARTH CAN THAT BE HAPPENING?
Three confusingly named BLAST options
I've always found the names and descriptions of the following three BLAST command-line options very confusing.
From the Formatting options section of
blastn -help
:And from the Restrict search or results section:
From the section names, it seems like only the latter option (
-max_target_seqs
) would have any effect on the search, and that the former two are just about what is displayed.But... using either just
-num_descriptions
or-num_alignments
with a big value also gets a match:So BLAST is apparently finding the
LC074724
match even without the high-max_target_seqs
option, but it's not displaying it unless there is a high-num_alignments
or-num_descriptions
value.That makes no sense at all. BLAST is finding the match without the
-max_target_seqs
option and the match has the 5th highest score, so why is it not being displayed?I don't understand the difference between
-num_alignments
and-num_descriptions
.Maybe
-num_alignments
is per query and-num_descriptions
is in the overall result set (i.e., across all queries). BLAST seems to use "alignment" to refer to a match between a query and a single subject (and that match may have multiple high-scoring pairs (HSPs), each with its own bit score). But why is there a-num_descriptions
option? I can only think that BLAST is allocating memory to hold subject descriptions and this option is used to allocate some fixed storage for the overall search. So, possibly, if the-num_descriptions
limit is reached during the search, BLAST gives up because it knows it cannot store more subject descriptions. But only setting-num_alignments
also gets me a match, so I don't know.Just to confirm, when none of the 3 options are given and no
-outfmt
option is either, the match is not found:$ blastn -db 20170821 -query Q.fasta -task blastn -max_hsps 1 | grep LC074724 $
WTF?
I've been using BLAST for the last 4 years or so, and have only today realized that something like this could occur.
It's of course super important.
And despite all the above, I still don't understand what's going on.