
@sujaikumar
Last active January 29, 2024 08:14
UniRef90 protein blast database with taxon IDs

Goal

  • To create UniRef90 protein databases for NCBI BLAST and Diamond
  • To create a tab-delimited taxid mapping file with two columns: sequenceID\tNCBITaxonID

Steps:

Download the uniref90 xml file first (warning - this is ~15 GB, will take a while)

wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/uniref/uniref90/uniref90.xml.gz

Make the protein fasta file and the taxlist out of this xml file. Out of 43,405,259 entries in uniref90.xml.gz (dated 2016-07-06), 851,213 have no taxon id, so there's no need to put them in the fasta file (as the purpose of this blast db is to identify the likely taxa of the query sequences).

To make the fasta file and the tab delimited taxid mapping file:

zcat uniref90.xml.gz | perl -e '
# write the fasta and the sequenceID<tab>taxonID mapping while streaming through the XML
open FA, ">uniref90.fasta";
open TX, ">uniref90.taxlist";
while ($line = <>) {
    if ($line =~ /<entry id="(\S+)"/) {
        $id = $1;
        $tx = 0;
        $sq = "";
        # skip three lines: the "common taxon ID" property is expected on the
        # fourth line after the entry tag in the uniref90.xml layout
        <>;<>;<>; $line = <>;
        if ($line =~ /"common taxon ID" value="(\d+)"/) {
            $tx = $1;
        }
    }
    if ($line =~ /<sequence/) {
        # collect sequence lines until the closing tag
        while ($line = <>) {
            chomp $line;
            last if $line =~ /<\/sequence/;
            $sq .= $line;
        }
        # only keep entries that have a taxon ID
        if ($tx and $sq) {
            print FA ">$id\n$sq\n";
            print TX "$id\t$tx\n";
        }
    }
}'

Now you can make the NCBI Blast db like this:

makeblastdb -dbtype prot -in uniref90.fasta
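
If you want tabular output that you can later join against uniref90.taxlist (see "Using the taxon ID list" below), a search against this db could look something like this (the query and output file names are placeholders; -outfmt 6 puts the UniRef90 subject ID in column 2):

blastp -query proteins.fasta -db uniref90.fasta -outfmt 6 -evalue 1e-5 -out ncbiblast.out.txt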

And the Diamond blast db like this:

diamond makedb --in uniref90.fasta --db uniref90
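
Similarly, a Diamond search that produces the blastoutput.daa archive used in the next section might look like this (for older Diamond versions that write DAA archives, matching the diamond view usage below; the query file name is a placeholder):

diamond blastx --query contigs.fasta --db uniref90 --daa blastoutput.daa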

Using the taxon ID list

If you later want to append the NCBI taxid to a blast tabular output file that used uniref90 as the database (whether from ncbi-blast or diamond-blast), you can use this:

# assuming the UniRef90 id is in the second column (typical for ncbi and diamond blast tabular output)
 
diamond view -a blastoutput.daa \
| perl -lne '
  BEGIN{open UT, "<uniref90.taxlist" or die $!; while (<UT>) { $ut{$1}=$2 if /^(\S+)\t(\S+)$/ } }
  {print "$_\t$ut{$1}" if /^\S+\t(\S+)/ and exists $ut{$1}}' \
> blastoutput.daa.taxid
 
# or
 
perl -lne '
  BEGIN{open UT, "<uniref90.taxlist" or die $!; while (<UT>) { $ut{$1}=$2 if /^(\S+)\t(\S+)$/ } }
  {print "$_\t$ut{$1}" if /^\S+\t(\S+)/ and exists $ut{$1}}' \
< ncbiblast.out.txt \
> ncbiblast.out.txt.taxid
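
Either way, the effect is to append the NCBI taxon ID as one extra tab-separated column (hit lines whose UniRef90 ID is not in the taxlist are dropped). For illustration, with made-up values:

# before: contig1	UniRef90_P12345	75.3	...	1e-50	180
# after:  contig1	UniRef90_P12345	75.3	...	1e-50	180	9606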

@philippbayer commented Apr 12, 2023

Hi!

Here's how I did a similar thing in 2023 - I use taxonomy-specific subsets of uniprot, not uniref90 (see below for why)

First, I download the database in RDF format using the aws s3 client (https://registry.opendata.aws/uniprot/):

aws s3 cp --no-sign-request s3://aws-open-data-uniprot-rdf/2023-01/uniprot/ . --recursive  --exclude "*" --include "uniprotkb_reviewed_eukaryota_opisthokonta_metazoa*"

aws s3 cp --no-sign-request s3://aws-open-data-uniprot-rdf/2023-01/uniprot/ . --recursive  --exclude "*" --include "uniprotkb_unreviewed_eukaryota_opisthokonta_metazoa*"
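
(You can also list the bucket first to see which taxonomy-specific subsets are available; the path is the same one used above.)

aws s3 ls --no-sign-request s3://aws-open-data-uniprot-rdf/2023-01/uniprot/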

Currently, the single 'reviewed' file is 7 GB, while the 36 'unreviewed' files are about 20 GB each.

The upside of using AWS is that it's far faster than the UniProt FTP; here in Western Australia AWS downloads at 50 MB/s, while wget from the UniProt FTP runs at 50 KB/s. The download takes minutes instead of days.
The downside is that the AWS mirror only has files in RDF format, an XML-like format that is usually queried using a separate language, see the example here. I also can't find the 'clustered' uniref50/90 on AWS, only the entire pre-clustered uniref datasets.

But we can also use grep and a bit of sed to pull out the fasta records from the RDF (after gunzip):

grep -B 10 '<rdf:value' uniprotkb_reviewed_eukaryota_opisthokonta_metazoa_33208_0.rdf \
  | grep -e '<rdf:Description' -e '<rdf:value' \
  | grep -v '"#_kb' \
  | sed 's,<rdf:value>,,' \
  | sed 's,</rdf:value>,,' \
  | sed 's,<rdf:Description rdf:about="http://purl.uniprot.org/isoforms/,>,' \
  | sed 's,">,,' \
  | grep -v '<rdf:Description rdf:about="#_' \
  | grep -v '^<' \
  > uniprot_reviewed.fasta
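
Roughly, that pipeline turns each isoform record's URI into a fasta header and keeps its sequence value, something like this (record abbreviated, accession and sequence made up):

# input RDF:    <rdf:Description rdf:about="http://purl.uniprot.org/isoforms/P12345-1"> ... <rdf:value>MSTNPKPQRKT...</rdf:value>
# output fasta: >P12345-1
#               MSTNPKPQRKT...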

Here's how to get the taxonomy file:

grep -e '<organism' -e '<sequence' uniprotkb_reviewed_eukaryota_opisthokonta_metazoa_33208_0.rdf \
  | sed 's,<organism rdf:resource="http://purl.uniprot.org/taxonomy/,,' \
  | sed 's,<sequence rdf:resource="http://purl.uniprot.org/isoforms/,,' \
  | cut -f 1 -d ' ' \
  | sed 's,"/>,,' \
  | sed 's,",,' \
  | sed '$!N;s/\n/ /' \
  | awk '{print $2 " " $1}' \
  > uniprot_reviewed.taxlist
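
This produces a space-separated list with the isoform ID first and the NCBI taxon ID second, e.g. (hypothetical values):

# P12345-1 9606
# O98765-1 10090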

I'm sure there are less fragile ways but this one works so far for me :)

Edit: there's currently a bug in there, in that secondary protein models don't receive the correct taxonomy ID. That's because the taxonomy ID is only listed once, for the primary protein model. For now I'm just dropping the secondary, tertiary, etc. sequences by keeping only sequences whose ID ends with '-1'.
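
In case it helps, a minimal sketch of that filtering step on the fasta extracted above (assumes headers look like >P12345-1; the output file name is a placeholder):

# keep only records whose header ends in -1 (the primary isoform)
awk '/^>/ {keep = ($0 ~ /-1$/)} keep' uniprot_reviewed.fasta > uniprot_reviewed.primary.fasta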

@sujaikumar (Author) commented

Thanks @philippbayer - great tip to use AWS!

@kdm9 commented Apr 18, 2023

@philippbayer et al

The following helped a lot with the download speed from the main uniprot ftp server; the whole 30G file downloaded in ~1 hour instead of >2 days for me:

aria2c --file-allocation=none --max-connection-per-server=16 -s 16 https://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref90/uniref90.fasta.gz

I also have some python code to make the required accession2taxid.tsv mapping, which handles some quirks of how Diamond parses the fasta file:

pip install "blindschleiche>=0.1.9"
zcat uniref90.fasta.gz | blsl uniref-acc2taxid -s >uniref90.acc2taxid.tsv
# -s option above removes the UniRefXX_ prefixes, required for Diamond, see https://github.com/bbuchfink/diamond/issues/639
# for Ye Olde blast, don't use -s
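
If you want Diamond itself to carry the taxonomy (instead of joining afterwards), the mapping can in principle be fed to makedb together with the NCBI taxdump files. A hedged sketch, assuming your Diamond version has --taxonmap/--taxonnodes/--taxonnames and accepts this TSV layout (see the issue linked above):

# nodes.dmp and names.dmp come from NCBI's taxdump archive
diamond makedb --in uniref90.fasta.gz --db uniref90 --taxonmap uniref90.acc2taxid.tsv --taxonnodes nodes.dmp --taxonnames names.dmp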
