@sujaikumar
Last active January 29, 2024 08:14
UniRef90 protein blast database with taxon IDs

Goal

  • To create UniRef90 protein databases for NCBI BLAST and DIAMOND
  • To create a tab-delimited taxid mapping file with two columns: sequenceID\tNCBITaxonID

Steps:

Download the uniref90 XML file first (warning: it is ~15 GB and will take a while):

wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/uniref/uniref90/uniref90.xml.gz

Make the protein FASTA file and the taxid list from this XML file. Of the 43,405,259 entries in uniref90.xml.gz (dated 2016-07-06), 851,213 have no taxon ID, so there is no need to put them in the FASTA file (the purpose of this BLAST database is to identify the likely taxa of the query sequences).

To make the fasta file and the tab delimited taxid mapping file:

zcat uniref90.xml.gz | perl -e '
open FA, ">uniref90.fasta";
open TX, ">uniref90.taxlist";
while ($line = <>) {
    if ($line =~ /<entry id="(\S+)"/) {
        $id = $1;
        $tx = 0;
        $sq = "";
        # skip three lines to reach the "common taxon ID" property
        # (relies on the fixed line order of the UniRef XML)
        <>;<>;<>; $line = <>;
        if ($line =~ /"common taxon ID" value="(\d+)"/) {
            $tx = $1;
        }
    }
    if ($line =~ /<sequence/) {
        # the sequence may span multiple lines until the closing tag
        while ($line = <>) {
            chomp $line;
            last if $line =~ /<\/sequence/;
            $sq .= $line;
        }
        # only keep entries that have both a taxon ID and a sequence
        if ($tx and $sq) {
            print FA ">$id\n$sq\n";
            print TX "$id\t$tx\n";
        }
    }
}'
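As a cross-check, the same extraction can be sketched in Python with a streaming XML parser, which does not rely on the fixed line order the Perl one-liner assumes. This is a minimal sketch: the tag and attribute names follow what the Perl script matches, and the sample document below is illustrative, not real UniRef data.

```python
# Sketch: stream-parse UniRef-style XML and emit FASTA + taxid mapping.
# The sample document is illustrative, not real UniRef data.
import io
import xml.etree.ElementTree as ET

def extract(xml_stream, fasta_out, taxlist_out):
    for _, elem in ET.iterparse(xml_stream, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]          # drop any XML namespace
        if tag != "entry":
            continue
        entry_id = elem.get("id")
        taxid = None
        seq = None
        for child in elem.iter():
            ctag = child.tag.rsplit("}", 1)[-1]
            if ctag == "property" and child.get("type") == "common taxon ID":
                taxid = child.get("value")
            elif ctag == "sequence":
                seq = (child.text or "").replace("\n", "")
        if entry_id and taxid and seq:             # skip entries without a taxon ID
            fasta_out.write(f">{entry_id}\n{seq}\n")
            taxlist_out.write(f"{entry_id}\t{taxid}\n")
        elem.clear()                               # keep memory flat while streaming

sample = """<UniRef90 xmlns="http://uniprot.org/uniref">
  <entry id="UniRef90_Q6GZX4">
    <property type="common taxon ID" value="654924"/>
    <representativeMember>
      <sequence length="10">MAFSAEDVLK</sequence>
    </representativeMember>
  </entry>
</UniRef90>"""

fa, tx = io.StringIO(), io.StringIO()
extract(io.StringIO(sample), fa, tx)
print(fa.getvalue(), end="")   # >UniRef90_Q6GZX4 / MAFSAEDVLK
print(tx.getvalue(), end="")   # UniRef90_Q6GZX4<TAB>654924
```

For the full 15 GB file the Perl one-liner will be faster; this version is mainly useful for checking a few entries or adapting the logic.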

Now you can make the NCBI BLAST db like this:

makeblastdb -dbtype prot -in uniref90.fasta

And the DIAMOND db like this:

diamond makedb --in uniref90.fasta --db uniref90

Using the taxon ID list

If, in future, you want to append the NCBI taxid to a BLAST tabular output file that used uniref90 as the database (whether from NCBI BLAST or DIAMOND), you can use this:

# assuming the UniRef90 id is in the second column (typical for ncbi and diamond blast tabular output)
 
diamond view -a blastoutput.daa \
| perl -lne '
  BEGIN{open UT, "<uniref90.taxlist" or die $!; while (<UT>) { $ut{$1}=$2 if /^(\S+)\t(\S+)$/ } }
  {print "$_\t$ut{$1}" if /^\S+\t(\S+)/ and exists $ut{$1}}' \
> blastoutput.daa.taxid
 
# or
 
perl -lne '
  BEGIN{open UT, "<uniref90.taxlist" or die $!; while (<UT>) { $ut{$1}=$2 if /^(\S+)\t(\S+)$/ } }
  {print "$_\t$ut{$1}" if /^\S+\t(\S+)/ and exists $ut{$1}}' \
< ncbiblast.out.txt \
> ncbiblast.out.txt.taxid
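The same join can be sketched in Python, assuming as above that the UniRef90 ID is in the second column of the tabular output. `load_taxlist` and `append_taxids` are illustrative helper names, and the sample rows are made up.

```python
# Sketch: append the subject's taxid to BLAST/DIAMOND tabular lines,
# assuming the UniRef90 ID is in column 2 (as in the Perl one-liners).
def load_taxlist(lines):
    """Parse sequenceID<TAB>taxid lines into a dict."""
    mapping = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 2:
            mapping[parts[0]] = parts[1]
    return mapping

def append_taxids(blast_lines, mapping):
    """Yield tabular lines with the subject's taxid appended, skipping unmapped hits."""
    for line in blast_lines:
        line = line.rstrip("\n")
        cols = line.split("\t")
        if len(cols) >= 2 and cols[1] in mapping:
            yield f"{line}\t{mapping[cols[1]]}"

# Made-up example rows:
taxlist = ["UniRef90_Q6GZX4\t654924\n"]
hits = ["query1\tUniRef90_Q6GZX4\t98.5\t200\n"]
for out in append_taxids(hits, load_taxlist(taxlist)):
    print(out)   # query1<TAB>UniRef90_Q6GZX4<TAB>98.5<TAB>200<TAB>654924
```

Like the Perl version, this drops hits whose subject ID is not in the taxlist; change the condition if you would rather keep them with an empty taxid column.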
@sujaikumar (Author) commented:
Thanks @philippbayer - great tip to use AWS!

@kdm9 commented Apr 18, 2023:

@philippbayer et al

aria2c --file-allocation=none --max-connection-per-server=16 -s 16 https://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref90/uniref90.fasta.gz helped a lot with the download speed from the main UniProt FTP server. The whole ~30 GB file downloaded in about 1 hour instead of >2 days for me.

I also have some python code to make the required accession2taxid.tsv mapping, which handles some quirks of how Diamond parses the fasta file:

pip install "blindschleiche>=0.1.9"
zcat uniref90.fasta.gz | blsl uniref-acc2taxid -s >uniref90.acc2taxid.tsv
# -s option above removes the UniRefXX_ prefixes, required for Diamond, see https://github.com/bbuchfink/diamond/issues/639
# for Ye Olde blast, don't use -s
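The prefix-stripping that the -s flag performs can be sketched in Python. `strip_uniref_prefix` is a hypothetical helper name, and the behaviour shown is inferred from the comment above: DIAMOND wants the bare accession without the UniRefXX_ cluster prefix.

```python
# Sketch of the -s prefix-stripping described above: DIAMOND's taxid mapping
# wants bare accessions, so strip a leading "UniRefXX_" cluster prefix.
# strip_uniref_prefix is a hypothetical helper name for illustration.
import re

def strip_uniref_prefix(seq_id):
    """Remove a leading UniRef100_/UniRef90_/UniRef50_ prefix, if present."""
    return re.sub(r"^UniRef\d+_", "", seq_id)

print(strip_uniref_prefix("UniRef90_Q6GZX4"))  # Q6GZX4
print(strip_uniref_prefix("Q6GZX4"))           # unchanged: Q6GZX4
```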
