Convert ENSEMBL stable identifiers to gene symbols
import biomart


def get_ensembl_mappings():
    # Set up connection to server
    server = biomart.BiomartServer('http://uswest.ensembl.org/biomart')
    mart = server.datasets['mmusculus_gene_ensembl']

    # List the types of data we want
    attributes = ['ensembl_transcript_id', 'mgi_symbol',
                  'ensembl_gene_id', 'ensembl_peptide_id']

    # Get the mapping between the attributes
    response = mart.search({'attributes': attributes})
    data = response.raw.data.decode('ascii')

    ensembl_to_genesymbol = {}
    # Store the data in a dict
    for line in data.splitlines():
        line = line.split('\t')
        # The entries are in the same order as in the `attributes` variable
        transcript_id = line[0]
        gene_symbol = line[1]
        ensembl_gene = line[2]
        ensembl_peptide = line[3]

        # Some of these keys may be an empty string. If you want, you can
        # avoid having a '' key in your dict by ensuring the
        # transcript/gene/peptide ids have a nonzero length before
        # adding them to the dict
        ensembl_to_genesymbol[transcript_id] = gene_symbol
        ensembl_to_genesymbol[ensembl_gene] = gene_symbol
        ensembl_to_genesymbol[ensembl_peptide] = gene_symbol

    return ensembl_to_genesymbol
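
The code comment above mentions that some IDs can be empty strings. A minimal sketch of the suggested length check follows. It assumes the same four-column, tab-separated text that the function decodes from the BioMart response. The helper name `build_mapping` and the sample IDs are hypothetical and only there for illustration; no server is contacted.

```python
def build_mapping(data):
    """Build an ID -> gene symbol dict from tab-separated text, skipping empty IDs."""
    mapping = {}
    for line in data.splitlines():
        fields = line.split("\t")
        # Skip malformed rows that do not have all four columns
        if len(fields) < 4:
            continue
        transcript_id, gene_symbol, gene_id, peptide_id = fields[:4]
        for key in (transcript_id, gene_id, peptide_id):
            # Only store keys with a nonzero length, so '' never becomes a key
            if key:
                mapping[key] = gene_symbol
    return mapping


# Made-up rows: the second one has no peptide ID
sample = "T1\tSymA\tG1\tP1\nT2\tSymB\tG2\t\n"
mapping = build_mapping(sample)
print(mapping)
```

Inside `get_ensembl_mappings`, the same `if key:` check could wrap the three dict assignments.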

ben-heil (Author) commented Jun 24, 2022

Hi Victor, since get_ensembl_mappings returns a dict, I think the right pandas method to use is Series.map. Without having looked at your data, I think df['gene_id'] = df['gene_id'].map(get_ensembl_mappings()) should work. Just be careful with ids that aren't in the dict: map turns them into NaN, so you'll need to handle those.
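
A minimal sketch of that suggestion, with one way to handle the NaN values. The small `mapping` dict stands in for the result of `get_ensembl_mappings()` so that no server call is needed; its contents, the extra `gene_symbol` column, and the choice to fall back to the original ID are assumptions for illustration, not part of the gist.

```python
import pandas as pd

# Hypothetical stand-in for the dict returned by get_ensembl_mappings()
mapping = {
    "ENSMUSG00000000001": "Gnai3",
    "ENSMUSG00000000028": "Cdc45",
}

df = pd.DataFrame(
    {"gene_id": ["ENSMUSG00000000001", "ENSMUSG00000000028", "NOT_A_REAL_ID"]}
)

# Series.map looks each value up in the dict; missing keys become NaN
df["gene_symbol"] = df["gene_id"].map(mapping)

# Record which ids failed to map before filling them in
unmapped = df["gene_symbol"].isna()

# One option: keep the original id wherever no symbol was found
df["gene_symbol"] = df["gene_symbol"].fillna(df["gene_id"])

print(df)
print("unmapped ids:", df.loc[unmapped, "gene_id"].tolist())
```

Writing to a new column instead of overwriting `gene_id` keeps the original IDs around, which makes the unmapped rows easier to inspect. Dropping them with `dropna` is an alternative if they are not needed.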

victorsanchezarevalo commented Jun 24, 2022

Thanks a lot!
It works
Vic

ben-heil (Author) commented Jun 24, 2022

Glad to hear it!

victorsanchezarevalo commented Jul 29, 2022

Hi Ben,
I need to convert mouse Entrez IDs into mouse gene symbols. Could I use this function with different parameters?
Best
Vic

ben-heil (Author) commented Jul 29, 2022

I haven't tried it, but it should be possible! The most straightforward way would be to add 'entrezgene_id' to the end of the attributes list, create an empty entrez_to_genesymbol dict, and change the ensembl_to_genesymbol lines so they map Entrez IDs to gene symbols instead, e.g.

entrez_id = line[4] 
entrez_to_genesymbol[entrez_id] = gene_symbol

More information on the available attributes can be found here: https://bioconductor.riken.jp/packages/3.4/bioc/vignettes/biomaRt/inst/doc/biomaRt.html
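
The change described above might look like the sketch below. Like the suggestion itself, it is untested against a live BioMart server. It assumes 'entrezgene_id' has been appended as the fifth attribute, so each row of the decoded response has five tab-separated fields. The helper name `build_entrez_mapping` and the sample rows are hypothetical.

```python
def build_entrez_mapping(data):
    """Map Entrez IDs to gene symbols from five-column, tab-separated text."""
    entrez_to_genesymbol = {}
    for line in data.splitlines():
        fields = line.split("\t")
        # Expect transcript, symbol, gene, peptide, entrez (in that order)
        if len(fields) < 5:
            continue
        gene_symbol = fields[1]
        entrez_id = fields[4]
        # Skip rows with no Entrez ID or no symbol
        if entrez_id and gene_symbol:
            entrez_to_genesymbol[entrez_id] = gene_symbol
    return entrez_to_genesymbol


# Made-up rows: the second one has no peptide ID and no Entrez ID
sample = "T1\tSymA\tG1\tP1\t101\nT2\tSymB\tG2\t\t\nT3\tSymC\tG3\tP3\t303\n"
entrez_to_genesymbol = build_entrez_mapping(sample)
print(entrez_to_genesymbol)
```

The keys here are strings, because they come from splitting text. If the Entrez IDs in a DataFrame are stored as integers, converting the column with `astype(str)` before calling `map` is one way to make the lookups match.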

victorsanchezarevalo commented Jul 29, 2022

Great! I will try.
Best
