@ivan-krukov
Last active April 27, 2016 21:53
Get ENSEMBL IDs for a given KEGG pathway

Problem

For a given KEGG pathway, we want to get a list of all the genes. Ensembl IDs are convenient here.

KEGG provides a REST API for some tasks, but it is far from complete. For example, it can map KEGG gene IDs to NCBI IDs, but not to Ensembl IDs.
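
As a rough illustration of the NCBI mapping that the API does offer, the sketch below builds a URL for KEGG's conv operation and parses a two-column response. The helper names and the sample response are assumptions made for this example (the sample is hand-written, not captured from the server), and no request is sent.

```python
KEGG_REST = "http://rest.kegg.jp"


def conv_url(target_db, source):
    """Build a conv URL, e.g. .../conv/ncbi-geneid/hsa:983 (hypothetical helper)."""
    return "/".join([KEGG_REST, "conv", target_db, source])


# Hand-written sample in the tab-separated, two-column layout that conv is
# assumed to return: KEGG gene ID, then "<db>:<external id>".
SAMPLE_CONV_RESPONSE = "hsa:983\tncbi-geneid:983\n"


def parse_conv(text):
    """Map each KEGG gene ID to a list of external IDs with the db prefix removed."""
    mapping = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        kegg_id, external = line.split("\t")
        mapping.setdefault(kegg_id, []).append(external.split(":", 1)[1])
    return mapping


print(conv_url("ncbi-geneid", "hsa:983"))
print(parse_conv(SAMPLE_CONV_RESPONSE))
```

No equivalent target database exists for Ensembl, which is why the steps below fall back to reading each gene entry.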

The implementation performs the following steps:

  1. GET pathway mapping: (e.g. http://rest.kegg.jp/link/genes/hsa04115)
path:hsa04115	hsa:1017
path:hsa04115	hsa:1019
path:hsa04115	hsa:1021
...
  2. For each gene ID, GET the gene entry (e.g. http://rest.kegg.jp/get/hsa:983)
...
DBLINKS     NCBI-ProteinID: NP_001777
            NCBI-GeneID: 983
            OMIM: 116940
            HGNC: 1722
            HPRD: 00302
            Ensembl: ENSG00000170312
            Vega: OTTHUMG00000018290
            UniProt: P06493 I6L9I5
...
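
The hand-off between the two steps can be sketched offline, using the sample lines from step 1. This assumes the two columns are tab-separated, as in the listing; the function names are made up for the example and no request is sent.

```python
KEGG_REST = "http://rest.kegg.jp"

# Sample response from step 1, copied from the listing above.
LINK_RESPONSE = (
    "path:hsa04115\thsa:1017\n"
    "path:hsa04115\thsa:1019\n"
    "path:hsa04115\thsa:1021\n"
)


def gene_ids_from_link_response(text):
    """Return the gene ID (second tab-separated column) of each line."""
    gene_ids = []
    for line in text.splitlines():
        columns = line.split("\t")
        if len(columns) == 2:
            gene_ids.append(columns[1])
    return gene_ids


def entry_url(gene_id):
    """Build the step 2 URL for one gene ID."""
    return KEGG_REST + "/get/" + gene_id


gene_ids = gene_ids_from_link_response(LINK_RESPONSE)
entry_urls = [entry_url(gene_id) for gene_id in gene_ids]
print(gene_ids)
print(entry_urls[0])
```

Each URL in entry_urls is what step 2 would request, one gene at a time.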

For lack of a better API, the data is extracted from the flat-file text with regular expressions.
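
A minimal sketch of that extraction, run against the DBLINKS excerpt shown earlier. This is a stricter variant than a bare "Ensembl: (.*)" search, not the gist's own pattern: it anchors the match to the start of a line, escapes the database name, and splits lines that carry several IDs. The function name dblink_ids is hypothetical.

```python
import re

# DBLINKS excerpt copied from the gene entry listing above.
ENTRY_EXCERPT = """\
DBLINKS     NCBI-ProteinID: NP_001777
            NCBI-GeneID: 983
            OMIM: 116940
            HGNC: 1722
            HPRD: 00302
            Ensembl: ENSG00000170312
            Vega: OTTHUMG00000018290
            UniProt: P06493 I6L9I5
"""


def dblink_ids(entry_text, external_db):
    """Return the IDs listed for external_db as a tuple of strings."""
    # The line must start with optional "DBLINKS" and then indentation, so the
    # database name cannot match in the middle of an unrelated field.
    pattern = r"^(?:DBLINKS)?[ \t]+" + re.escape(external_db) + r":[ \t]*(.*)$"
    ids = []
    for match in re.finditer(pattern, entry_text, flags=re.MULTILINE):
        # A line such as "UniProt: P06493 I6L9I5" holds several IDs.
        ids.extend(match.group(1).split())
    return tuple(ids)


print(dblink_ids(ENTRY_EXCERPT, "Ensembl"))
print(dblink_ids(ENTRY_EXCERPT, "UniProt"))
```

Splitting on whitespace is a design choice: it turns multi-valued lines into separate IDs instead of one space-joined string.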

from requests import get
from tqdm import tqdm
import re


def external_ids_for_kegg_pathway(pathway, organism="hsa", external_db="Ensembl", verbose=True):
    """Map each KEGG gene ID in a pathway to a tuple of its IDs in external_db."""
    header = "http://rest.kegg.jp"
    id_mapping = {}
    # Get the list of genes
    gene_list_request = get(header + "/link/genes/" + organism + pathway)
    if gene_list_request.ok:
        gene_list = re.findall("(" + organism + ":.*)", gene_list_request.text)
        if verbose:
            print("Will retrieve", len(gene_list), "entries...")
        # Show a progress bar only in verbose mode
        iterator = tqdm(gene_list) if verbose else gene_list
        for gene_id in iterator:
            # Get record for this gene_id
            gene_entry_request = get(header + "/get/" + gene_id)
            if gene_entry_request.ok:
                # Capture the rest of every line containing "<external_db>: "
                external_id = re.findall(external_db + ": (.*)", gene_entry_request.text)
                id_mapping[gene_id] = tuple(external_id)
    return id_mapping

def test_ensembl_p53():
    # Requires network access: this queries the live KEGG REST API
    ensembl_ids = external_ids_for_kegg_pathway("04115")
    assert len(ensembl_ids.keys()) == 69
    assert ("ENSG00000141510",) in ensembl_ids.values()
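
The function returns a dict of tuples, while the stated goal is a list of genes. A small sketch of flattening that result follows. The mapping here is a hand-written stand-in for a real return value, "hsa:0" is a made-up placeholder for a gene whose entry has no matching line, and both helper names are hypothetical.

```python
# Hand-written stand-in for a return value of external_ids_for_kegg_pathway.
# "hsa:0" is a made-up placeholder for a gene with no matching DBLINKS line.
id_mapping = {
    "hsa:983": ("ENSG00000170312",),
    "hsa:7157": ("ENSG00000141510",),
    "hsa:0": (),
}


def flatten_ids(mapping):
    """Return the sorted, de-duplicated external IDs across all genes."""
    return sorted({ext_id for ids in mapping.values() for ext_id in ids})


def unmapped_genes(mapping):
    """Return the KEGG gene IDs whose tuple of external IDs is empty."""
    return sorted(gene for gene, ids in mapping.items() if not ids)


print(flatten_ids(id_mapping))
print(unmapped_genes(id_mapping))
```

Genes whose entry request failed are absent from the mapping altogether, so unmapped_genes only reports entries that were fetched but had no match.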