NCBI blastp seems to have a bug where it reports different top hits when -max_target_seqs is changed. This is a serious problem because the first 20 hits (for example) should be the same whether -max_target_seqs 100 or -max_target_seqs 500 is used.
The bug is reproducible on the command line when searching NCBI's nr blast database (dated 25-Nov-2015) using NCBI 2.2.28+, 2.2.30+ and 2.2.31+.
At first I thought it was something to do with my local exe/blastdb, but the same problem is also apparent on the NCBI blastp web interface (as of 30-Nov-2015)
To test online, go to http://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE=Proteins
Enter the following FASTA sequence in the query text box:
>nHd.2.3.1.t00019-RA
MSNLGITDPCVDAMNSLGLKLEELQDLEVDAGLGNGGLGRLAACFMDSLATLSIPAIGYGIRYEFGIFNQRVINGEQVEE
RDDWLEFGDPWEKLRQDKKISVYFNGKTYVDKEGRSHWVDTQQIVD
Database nr (default) Leave other fields blank Check that the algorithm parameters say: Max target sequences 100 (default) Expect threshold: 1e-5 (instead of default 100) Leave rest of parameters as default. Click BLAST:
- The first 20 hits are all eukaryotic (top hit Trichuris trichura 8e-36). No bacterial hits in results at all.
If you now change the Max target sequences to 500, and rerun, you see:
- The top hit is Bacteria (Burkholderia kururiensis 2e-40)
This is reproducible on the command line (using the versions of blastp mentioned above):
blastp -query input.fasta -db nr -outfmt 6 -max_target_seqs 100 -evalue 1e-5 >out.1e-5.max100.txt
blastp -query input.fasta -db nr -outfmt 6 -max_target_seqs 500 -evalue 1e-5 >out.1e-5.max500.txt
Can someone else confirm that they have seen this bug? It's possible I am doing something silly, but if not, then this is a serious bug, because max_target_seqs is only supposed to change the number of matches returned, not the TOP hits. See http://www.ncbi.nlm.nih.gov/books/NBK279682/
Screenshots attached:
Thanks for testing and reproducing, Peter.
Nick Loman points out that this might be intended behaviour (https://twitter.com/pathogenomenick/status/671351106866438145) but it doesn't seem to be based on documentation at http://www.ncbi.nlm.nih.gov/books/NBK279682/ . Although it doesn't explicitly say that the best hits will always be returned on that page, am sure I've seen it somewhere (can someone find out where) that the results are always returned sorted by best e-value.