asad/UniRef90.md

## UniRef90.md

      
    Raw
  

              UniRef90.md
            
          
    Goal


To create UniRef90 protein databases for NCBI blast and Diamond Blast
To create a tab delimited taxid mapping file with two columns : sequenceID\tNCBITaxonID

Steps:

Download the uniref90 xml file first (warning - this is ~15 GB, will take a while)
wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/uniref/uniref90/uniref90.xml.gz

Make the protein fasta file and the taxlist out of this xml file. Out of 43,405,259 entries in uniref90.xml.gz (dated 2016-07-06), 851,213 have no taxon id, so there's no need to put them in the fasta file (as the purpose of this blast db is to identify the likely taxa of the query sequences).
To make the fasta file and the tab delimited taxid mapping file:
zcat uniref90.xml.gz  | perl -e '
open FA, ">uniref90.fasta";
open TX, ">uniref90.taxlist";
while ($line = <>) {
    if ($line =~ /<entry id="(\S+)"/) {
        $id = $1;
        $tx = 0;
        $sq = "";
        <>;<>;<>; $line = <>;
        if ($line =~ /"common taxon ID" value="(\d+)"/) {
            $tx = $1;
        }
    }
    if ($line =~ /<sequence/) {
        while ($line = <>) {
            chomp $line;
            last if $line =~ /<\/sequence/;
            $sq .= $line;
        }
        if ($tx and $sq) {
            print FA ">$id\n$sq\n";
            print TX "$id\t$tx\n";
        }
    }
}'

Now you can make the NCBI Blast db like this:
makeblastdb -dbtype prot -in uniref90.fasta

And the Diamond blast db like this:
diamond makedb --in uniref90.fasta --db uniref90

Using the taxon ID list

If in future you want to append the NCBI taxid to a blast tabular output file with uniref90 as the database (whether from ncbi-blast or diamond-blast) you can use this:
# assuming the UniRef90 id is in the second column (typical for ncbi and diamond blast tabular output)
 
diamond view -a blastoutput.daa \
| perl -lne '
  BEGIN{open UT, "<uniref90.taxlist" or die $!; while (<UT>) { $ut{$1}=$2 if /^(\S+)\t(\S+)$/ } }
  {print "$_\t$ut{$1}" if /^\S+\t(\S+)/ and exists $ut{$1}}' \
> blastoutput.daa.taxid
 
# or
 
perl -lne '
  BEGIN{open UT, "<uniref90.taxlist" or die $!; while (<UT>) { $ut{$1}=$2 if /^(\S+)\t(\S+)$/ } }
  {print "$_\t$ut{$1}" if /^\S+\t(\S+)/ and exists $ut{$1}}' \
< ncbiblast.out.txt \
> ncbiblast.out.txt.taxid