Skip to content

Instantly share code, notes, and snippets.

@sujaikumar
Last active January 29, 2024 08:14
Show Gist options
  • Star 8 You must be signed in to star a gist
  • Fork 4 You must be signed in to fork a gist
  • Save sujaikumar/9ad04e62449a2d7025b17144de67038b to your computer and use it in GitHub Desktop.
Save sujaikumar/9ad04e62449a2d7025b17144de67038b to your computer and use it in GitHub Desktop.
UniRef90 protein blast database with taxon IDs

Goal

  • To create UniRef90 protein databases for NCBI blast and Diamond Blast
  • To create a tab delimited taxid mapping file with two columns : sequenceID\tNCBITaxonID

Steps:

Download the uniref90 xml file first (warning - this is ~15 GB, will take a while)

wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/uniref/uniref90/uniref90.xml.gz

Make the protein fasta file and the taxlist out of this xml file. Out of 43,405,259 entries in uniref90.xml.gz (dated 2016-07-06), 851,213 have no taxon id, so there's no need to put them in the fasta file (as the purpose of this blast db is to identify the likely taxa of the query sequences).

To make the fasta file and the tab delimited taxid mapping file:

zcat uniref90.xml.gz  | perl -e '
open FA, ">uniref90.fasta";
open TX, ">uniref90.taxlist";
while ($line = <>) {
    if ($line =~ /<entry id="(\S+)"/) {
        $id = $1;
        $tx = 0;
        $sq = "";
        <>;<>;<>; $line = <>;
        if ($line =~ /"common taxon ID" value="(\d+)"/) {
            $tx = $1;
        }
    }
    if ($line =~ /<sequence/) {
        while ($line = <>) {
            chomp $line;
            last if $line =~ /<\/sequence/;
            $sq .= $line;
        }
        if ($tx and $sq) {
            print FA ">$id\n$sq\n";
            print TX "$id\t$tx\n";
        }
    }
}'

Now you can make the NCBI Blast db like this:

makeblastdb -dbtype prot -in uniref90.fasta

And the Diamond blast db like this:

diamond makedb --in uniref90.fasta --db uniref90

Using the taxon ID list

If in future you want to append the NCBI taxid to a blast tabular output file with uniref90 as the database (whether from ncbi-blast or diamond-blast) you can use this:

# assuming the UniRef90 id is in the second column (typical for ncbi and diamond blast tabular output)
 
diamond view -a blastoutput.daa \
| perl -lne '
  BEGIN{open UT, "<uniref90.taxlist" or die $!; while (<UT>) { $ut{$1}=$2 if /^(\S+)\t(\S+)$/ } }
  {print "$_\t$ut{$1}" if /^\S+\t(\S+)/ and exists $ut{$1}}' \
> blastoutput.daa.taxid
 
# or
 
perl -lne '
  BEGIN{open UT, "<uniref90.taxlist" or die $!; while (<UT>) { $ut{$1}=$2 if /^(\S+)\t(\S+)$/ } }
  {print "$_\t$ut{$1}" if /^\S+\t(\S+)/ and exists $ut{$1}}' \
< ncbiblast.out.txt \
> ncbiblast.out.txt.taxid
@TelfordLab
Copy link

TelfordLab commented Dec 19, 2016

Another option to get the taxIDs from the fasta file without going through the xml:
In pyhton/ipython

import gzip
from Bio import SeqIO


taxdmp={}
with open("names.dmp", "r") as dmpfile:
    for line in dmpfile:
        taxdmp[(str(line.split("\t")[2]))] = str(line.split("\t")[0])


with gzip.open("uniref90.fasta.gz", "rt") as uniprot, open("uniref90.taxlist", "w") as outfile:
    for  record in SeqIO.parse(uniprot, "fasta"):
        if "Tax=" in str(record.description):
            outfile.write(str(record.id + "\t" +str(taxdmp.get(record.description.split("Tax=")[1].split("RepID=")[0].rstrip()))+ "\n"))

@EorgeKit
Copy link

EorgeKit commented Oct 2, 2022

Hello @sujaikumar

I managed to run the second command but when I run the building diamond database command I get the following error:

diamond v0.9.24.125 | by Benjamin Buchfink <buchfink@gmail.com>
Licensed under the GNU GPL <https://www.gnu.org/licenses/gpl.txt>
Check http://github.com/bbuchfink/diamond for updates.

#CPU threads: 32
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
Database file: uniref90.fasta
Opening the database file...  [0.002896s]
Loading sequences...  [0.000477s]
Error: Error reading input stream at line 2: Invalid character (<) in sequence

Upon inspecting the unire90.fasa file, this is how it looks .

>UniRef90_A0A5A9P0L4
</representativeMember></entry><entry id="UniRef90_A0A410P257" updated="2022-08-03"><name>Cluster: Glycogen synthase</name><property type="member count" value="2"/><property type="common taxon" value="Candidatus Velamenicoccus archaeovorus"/><property type="common taxon ID" value="1930593"/><representativeMember><dbReference type="UniProtKB ID" id="A0A410P257_9BACT"><property type="UniProtKB accession" value="A0A410P257"/><property type="UniParc ID" value="UPI000FFCD83E"/><property type="UniRef100 ID" value="UniRef100_A0A410P257"/><property type="UniRef50 ID" value="UniRef50_A0A410P257"/><property type="protein name" value="Glycogen synthase"/><property type="source organism" value="Candidatus Velamenicoccus archaeovorus"/><property type="NCBI taxonomy" value="1930593"/><property type="length" value="39677"/></dbReference>
>UniRef90_A0A8J3NBY6
</representativeMember><member><dbReference type="UniParc ID" id="UPI0019403D63"><property type="UniRef100 ID" value="UniRef100_A0A8J3NBY6"/><property type="protein name" value="DUF6528 family protein"/><property type="source organism" value="Actinocatenispora rupis"/><property type="NCBI taxonomy" value="519421"/><property type="length" value="39571"/><property type="isSeed" value="true"/></dbReference></member></entry><entry id="UniRef90_A0A6B0RPA5" updated="2020-06-17"><name>Cluster: Uncharacterized protein</name><property type="member count" value="6"/><property type="common taxon" value="Pecora"/><property type="common taxon ID" value="35500"/><property type="GO Molecular Function" value="GO:0005524"/><property type="GO Molecular Function" value="GO:0004672"/><property type="GO Biological Process" value="GO:0006468"/><representativeMember><dbReference type="UniProtKB ID" id="A0A6B0RPA5_9CETA"><property type="UniProtKB accession" value="A0A6B0RPA5"/><property type="UniParc ID" value="UPI001256766E"/><property type="UniRef100 ID" value="UniRef100_A0A6B0RPA5"/><property type="UniRef50 ID" value="UniRef50_Q8WZ42"/><property type="protein name" value="Uncharacterized protein"/><property type="source organism" value="Bos mutus (wild yak)"/><property type="NCBI taxonomy" value="72004"/><property type="length" value="38354"/><property type="isSeed" value="true"/></dbReference>
>UniRef90_A0A401TRQ8
</representativeMember><member><dbReference type="UniParc ID" id="UPI001CB855D4"><property type="UniRef100 ID" value="UniRef100_A0A401TRQ8"/><property type="protein name" value="titin"/><property type="source organism" value="Chiloscyllium plagiosum"/><property type="NCBI taxonomy" value="36176"/><property type="length" value="38279"/><property type="isSeed" value="true"/></dbReference></member></entry><entry id="UniRef90_A0A672ZWI7" updated="2022-08-03"><name>Cluster: Ig-like domain-containing protein</name><property type="member count" value="2"/><property type="common taxon" value="Sphaeramia orbicularis"/><property type="common taxon ID" value="375764"/><representativeMember><dbReference type="UniProtKB ID" id="A0A672ZWI7_9TELE"><property type="UniProtKB accession" value="A0A672ZWI7"/><property type="UniParc ID" value="UPI00133ED077"/><property type="UniRef100 ID" value="UniRef100_A0A672ZWI7"/><property type="UniRef50 ID" value="UniRef50_A0A6J2WDG0"/><property type="protein name" value="Ig-like domain-containing protein"/><property type="source organism" value="Sphaeramia orbicularis (orbiculate cardinalfish)"/><property type="NCBI taxonomy" value="375764"/><property type="length" value="189"/></dbReference>
>UniRef90_A0A6P7YNV3
</representativeMember></entry><entry id="UniRef90_A0A4U5TZD8" updated="2019-07-31"><name>Cluster: Titin</name><property type="member count" value="1"/><property type="common taxon" value="Collichthys lucidus"/><property type="common taxon ID" value="240159"/><property type="GO Molecular Function" value="GO:0005524"/><property type="GO Molecular Function" value="GO:0004672"/><property type="GO Biological Process" value="GO:0006468"/><property type="GO Cellular Component" value="GO:0016021"/><property type="GO Cellular Component" value="GO:0043229"/><representativeMember><dbReference type="UniProtKB ID" id="A0A4U5TZD8_COLLU"><property type="UniProtKB accession" value="A0A4U5TZD8"/><property type="UniParc ID" value="UPI0010C829F5"/><property type="UniRef100 ID" value="UniRef100_A0A4U5TZD8"/><property type="UniRef50 ID" value="UniRef50_A0A4U5TZD8"/><property type="protein name" value="Titin"/><property type="source organism" value="Collichthys lucidus (Big head croaker) (Sciaena lucida)"/><property type="NCBI taxonomy" value="240159"/><property type="length" value="38162"/><property type="isSeed" value="true"/></dbReference>
>UniRef90_UPI00074FFD9C
</representativeMember></entry><entry id="UniRef90_A0A6P8RG40" updated="2020-12-02"><name>Cluster: titin isoform X1</name><property type="member count" value="3"/><property type="common taxon" value="Geotrypetes seraphini"/><property type="common taxon ID" value="260995"/><property type="GO Molecular Function" value="GO:0005524"/><property type="GO Molecular Function" value="GO:0004672"/><property type="GO Biological Process" value="GO:0006468"/><property type="GO Cellular Component" value="GO:0043229"/><property type="GO Cellular Component" value="GO:0005886"/><representativeMember><dbReference type="UniProtKB ID" id="A0A6P8RG40_GEOSA"><property type="UniProtKB accession" value="A0A6P8RG40"/><property type="UniParc ID" value="UPI001457F9B1"/><property type="UniRef100 ID" value="UniRef100_A0A6P8RG40"/><property type="UniRef50 ID" value="UniRef50_Q8WZ42"/><property type="protein name" value="titin isoform X1"/><property type="source organism" value="Geotrypetes seraphini (Gaboon caecilian) (Caecilia seraphini)"/><property type="NCBI taxonomy" value="260995"/><property type="length" value="38067"/><property type="isSeed" value="true"/></dbReference>

That is for just the first five entrie.Is that how the fasta file should appear or there might be something wrong?

@sujaikumar
Copy link
Author

That is for just the first five entrie.Is that how the fasta file should appear or there might be something wrong?

The fasta file is definitely wrong, unfortunately.

What has happened is that the xml file has changed in format from 2016, so my perl command no longer works! Sorry about that. I will try and update it when I have time. But for now this method won't work

@EorgeKit
Copy link

EorgeKit commented Oct 4, 2022 via email

@philippbayer
Copy link

philippbayer commented Apr 12, 2023

Hi!

Here's how I did a similar thing in 2023 - I use taxonomy-specific subsets of uniprot, not uniref90 (see below why)

First, I am downloading the database in RDF format using the aws s3 client https://registry.opendata.aws/uniprot/

aws s3 cp --no-sign-request s3://aws-open-data-uniprot-rdf/2023-01/uniprot/ . --recursive  --exclude "*" --include "uniprotkb_reviewed_eukaryota_opisthokonta_metazoa*"

aws s3 cp --no-sign-request s3://aws-open-data-uniprot-rdf/2023-01/uniprot/ . --recursive  --exclude "*" --include "uniprotkb_unreviewed_eukaryota_opisthokonta_metazoa*"

Currently, the single 'reviewed' file is 7GB while the 36 'unreviewed' files are about 20GB in size each.

The upside of using aws is that it's far faster than the uniprot FTP; here in Western Australia aws downloads with 50 MB/s, while wget from the uniprot FTP runs with 50 KB/s. The download takes minutes instead of days.
The downside is that the AWS mirror has only files in RDF format, a XML-like format that is usually queried using a separate language, see the example here. I also can't find the 'clustered' uniref50/90 on AWS, only the entire pre-clustered uniref datasets.

But we can also use grep and a bit of sed to pull out the fasta records from the RDF (after gunzip):

grep -B 10 '<rdf:value' uniprotkb_reviewed_eukaryota_opisthokonta_metazoa_33208_0.rdf | grep -e '<rdf:Description' -e '<rdf:value' | grep -v '"#_kb' | sed 's,<rdf:value>,,' | sed 's,</rdf:value>,,' | sed 's,<rdf:Description rdf:about="http://purl.uniprot.org/isoforms/,>', | sed 's,">,,'  | grep -v '<rdf:Description rdf:about="#_' | grep -v '^<' > uniprot_reviewed.fasta

Here's how to get the taxonomy file:

grep -e '<organism' -e '<sequence' uniprotkb_reviewed_eukaryota_opisthokonta_metazoa_33208_0.rdf | sed 's,<organism rdf:resource="http://purl.uniprot.org/taxonomy/,,' | sed 's,<sequence rdf:resource="http://purl.uniprot.org/isoforms/,,' | cut -f  1 -d ' ' | sed 's,"/>,,' | sed 's,",,' | sed '$!N;s/\n/ /' | awk '{print $2 " " $1}' > uniprot_reviewed.taxlist

I'm sure there are less fragile ways but this one works so far for me :)

Edit: there's currently a bug in there, as in that secondary protein models don't receive the correct taxonomy ID. That's because the taxonomy ID is only listed once for the primary protein model. For now I'm just removing the secondary, tertiary etc. sequence by keeping only sequences where the ID ends with '-1'.

@sujaikumar
Copy link
Author

Thanks @philippbayer - great tip to use AWS!

@kdm9
Copy link

kdm9 commented Apr 18, 2023

@philippbayer et al

aria2c --file-allocation=none --max-connection-per-server=16 -s 16 https://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref90/uniref90.fasta.gz helped a lot with the download speed from the main uniprot ftp server. The whole 30G file downloaded in ~1 hour instead of >2 days for me.

I also have some python code to make the required accession2taxid.tsv mapping, which handles some quirks of how Diamond parses the fasta file:

pip install "blindschleiche>=0.1.9"
zcat uniref90.fasta.gz | blsl uniref-acc2taxid -s >uniref90.acc2taxid.tsv
# -s option above removes the UniRefXX_ prefixes, required for Diamond, see https://github.com/bbuchfink/diamond/issues/639
# for Ye Olde blast, don't use -s

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment