@Ken-Kuroki
Last active December 7, 2018 09:48
Get Taxonomy Hierarchy using ETE Toolkit 3 and Apply to GenBank Assembly Summary
# This is roughly 30-fold faster than my original implementation.
# Install ETE3 via pip (package name: ete3) and run ncbi.update_taxonomy_database() first.
from collections import defaultdict
import pandas as pd
from ete3 import NCBITaxa
ncbi = NCBITaxa()

def get_taxonomy_hierarchy(taxid):
    # Note that the resulting dictionary isn't ordered by the hierarchy.
    # Some taxids lack certain ranks, so the dict is wrapped in a defaultdict
    # that returns "" for a missing rank; this step is unnecessary in many cases.
    # Several lineage entries can share the rank "no rank", so only one of them
    # survives under that key.
    lineage = ncbi.get_lineage(taxid)  # look up the lineage once and reuse it
    names = ncbi.get_taxid_translator(lineage)
    ranks = ncbi.get_rank(lineage)
    return defaultdict(str, {ranks[k]: names[k] for k in names})

# Let's assign taxonomic classifications to GenBank entries.
df = pd.read_csv("assembly_summary.txt", skiprows=1, sep="\t")
df["tax_hierarchy"] = df["species_taxid"].apply(get_taxonomy_hierarchy)
for rank in ["species", "genus", "family", "order", "class", "phylum"]:
    df[rank] = df["tax_hierarchy"].apply(lambda x: x[rank])
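
The note in get_taxonomy_hierarchy says the resulting dictionary isn't ordered by the hierarchy. If an ordered result is wanted, one possible approach is sketched below. It is only an illustration, not part of the original gist: the helper name ordered_hierarchy and the RANKS tuple are hypothetical, and the sample lineage, ranks and names are hand-typed stand-ins for what ncbi.get_lineage, ncbi.get_rank and ncbi.get_taxid_translator would return, so the real values depend on the taxonomy database version.

```python
# Hypothetical helper: build a {rank: name} dict in a fixed rank order.
# Plain dicts preserve insertion order in Python 3.7+, so no OrderedDict is needed.

RANKS = ("phylum", "class", "order", "family", "genus", "species")


def ordered_hierarchy(lineage, ranks, names, wanted=RANKS):
    """Return {rank: name} ordered as in `wanted`; missing ranks map to ""."""
    # Keep only the ranks of interest, which also drops the colliding "no rank" entries.
    by_rank = {ranks[t]: names[t] for t in lineage if ranks.get(t) in wanted}
    return {rank: by_rank.get(rank, "") for rank in wanted}


# Hand-typed sample data (assumed values for illustration, not queried from ETE3).
sample_lineage = [1, 131567, 2, 1224, 1236, 91347, 543, 561, 562]
sample_ranks = {
    1: "no rank", 131567: "no rank", 2: "superkingdom", 1224: "phylum",
    1236: "class", 91347: "order", 543: "family", 561: "genus", 562: "species",
}
sample_names = {
    1: "root", 131567: "cellular organisms", 2: "Bacteria",
    1224: "Proteobacteria", 1236: "Gammaproteobacteria",
    91347: "Enterobacterales", 543: "Enterobacteriaceae",
    561: "Escherichia", 562: "Escherichia coli",
}

hierarchy = ordered_hierarchy(sample_lineage, sample_ranks, sample_names)
for rank, name in hierarchy.items():
    print(rank, name, sep="\t")
```

With a real database, the three sample structures would presumably be replaced by the lineage list and the two dicts obtained inside get_taxonomy_hierarchy.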