NCBI blastp seems to have a bug where it reports different top hits when -max_target_seqs is changed. This is a serious problem because the first 20 hits (for example) should be the same whether -max_target_seqs 100 or -max_target_seqs 500 is used.
The bug is reproducible on the command line when searching NCBI's nr blast database (dated 25-Nov-2015) using NCBI 2.2.28+, 2.2.30+ and 2.2.31+.
At first I thought it was something to do with my local exe/blastdb, but the same problem is also apparent on the NCBI blastp web interface (as of 30-Nov-2015)
To test online, go to http://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE=Proteins
Enter the following FASTA sequence in the query text box:
>nHd.2.3.1.t00019-RA
MSNLGITDPCVDAMNSLGLKLEELQDLEVDAGLGNGGLGRLAACFMDSLATLSIPAIGYGIRYEFGIFNQRVINGEQVEE
RDDWLEFGDPWEKLRQDKKISVYFNGKTYVDKEGRSHWVDTQQIVD
Database nr (default) Leave other fields blank Check that the algorithm parameters say: Max target sequences 100 (default) Expect threshold: 1e-5 (instead of default 100) Leave rest of parameters as default. Click BLAST:
- The first 20 hits are all eukaryotic (top hit Trichuris trichura 8e-36). No bacterial hits in results at all.
If you now change the Max target sequences to 500, and rerun, you see:
- The top hit is Bacteria (Burkholderia kururiensis 2e-40)
This is reproducible on the command line (using the versions of blastp mentioned above):
blastp -query input.fasta -db nr -outfmt 6 -max_target_seqs 100 -evalue 1e-5 >out.1e-5.max100.txt
blastp -query input.fasta -db nr -outfmt 6 -max_target_seqs 500 -evalue 1e-5 >out.1e-5.max500.txt
Can someone else confirm that they have seen this bug? It's possible I am doing something silly, but if not, then this is a serious bug, because max_target_seqs is only supposed to change the number of matches returned, not the TOP hits. See http://www.ncbi.nlm.nih.gov/books/NBK279682/
Screenshots attached:
@tshalev, you ask "so would it be best to just run everything with default settings and then filter later?"
Unfortunately not. The problem is that with the default settings a very high-scoring match may be prematurely discarded in the ungapped stage. So if you just leave the defaults you'll never see it. You need to pass some increased value for
-max_target_seqs
(for-outfmt
>= 4) or the other options (-outfmt
< 4) but there is absolutely no guidance as to what value you should use. It's a terrible situation.In my case (above) I now find that if I run with
-max_target_seqs 506
I get the expected match in the 3rd highest position. If I use-max_target_seqs 505
that match is not present in the output. How can a user know what to set-max_target_seqs
to when they don't even know what they might be missing? This problem is so serious it makes BLAST suddenly seem almost useless (or at least terribly unreliable), as you may have false negatives that are actually the very best matches. Maybe the only thing to do is set-max_target_seqs
to some extremely high value (like the number of sequences in your database) and then filter the output. That approach is what the NCBI mail is suggesting we do (though presumably with a smaller-max_target_seqs
than the entire database size - but how small?). I.e., do not rely on-max_target_seqs
to filter things for you because it is applied too early in the BLAST matching procedure.I hope I don't sound too cranky here. I just can't believe this is the default behavior of BLAST, that the documentation of this option is so poor, that the option's name is really misleading, that the impact is so severe (losing the very best hits!), and that there must be many many thousands of people using BLAST who (understandably) have no inkling or expectation that this would be the default behavior. All this in a program that has been around for so long, has had so much work done on it, etc. And where do people go to to discuss such a thing? Not to an issue tracker, or similar, but to gist.