@callahantiff
Last active July 4, 2020 21:37
GeneMania_DataProcessingPipeline.ipynb
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "GeneMania_DataProcessingPipeline.ipynb",
"provenance": [],
"collapsed_sections": [],
"mount_file_id": "https://gist.github.com/callahantiff/871f6bcbdbd6603d1eb19a38ddc7321f#file-genemania_dataprocessingpipeline-ipynb",
"authorship_tag": "ABX9TyNFzvUgJ5ESgR7n1f2BgIf7",
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/callahantiff/871f6bcbdbd6603d1eb19a38ddc7321f/genemania_dataprocessingpipeline.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "1cBdztLei-TT",
"colab_type": "text"
},
"source": [
"<img align=\"right\" width=\"300\" alt=\"Screen Shot 2019-12-12 at 21 59 22\" src=\"https://user-images.githubusercontent.com/8030363/70771518-9d1f5980-1d2e-11ea-9201-d5aade3fe376.png\">\n",
"\n",
"## Gene Mania Data Processing Pipeline\n",
"**Creation Date:** `06/16/20` \n",
"**Contact Notebook Author:** [`TJCallahan`](https://mail.google.com/mail/u/0/?view=cm&fs=1&tf=1&to=callahantiff@gmail.com) \n",
"\n",
"<br>\n",
"\n",
"### Data \n",
"**Release:** `2017-Mar-14` \n",
"**Downloaded URL:** [`COMBINED.DEFAULT_NETWORKS.BP_COMBINING.txt`](http://genemania.org/data/current/Homo_sapiens.COMBINED/COMBINED.DEFAULT_NETWORKS.BP_COMBINING.txt) \n",
"**PubMed ID:** [`20576703`](https://pubmed.ncbi.nlm.nih.gov/20576703/) \n",
"**Description:** This file contains `3` columns, where each row represents an edge. Within an edge, the first two columns contain a single **[`Ensembl Gene`](https://uswest.ensembl.org/index.html)** identifier and the third column contains a float representing a weight. Please note the following details copied from GeneMANIA's [`Data Archive`](http://pages.genemania.org/data/) page: \n",
"- Each interacting pair of genes will be present exactly once in the file (symmetric interactions are not included) \n",
"- Non-interacting genes are not present \n",
"- No assumptions are made regarding the order of the records in the file or the order of genes in a record\n",
"\n",
"<br>\n",
"\n",
"### Purpose \n",
"The goal of this notebook is to provide a reproducible workflow for downloading gene-gene interaction data from **[`GeneMANIA`](http://pages.genemania.org/)**. This pipeline consists of the following 3 steps: (1) *Download Data* (i.e. data is downloaded into a `Pandas.DataFrame` object directly from the URL referenced above); (2) *Data Processing* (i.e. `GeneMANIA` this workflow provides optional functionality to convert the default provided asymmetric edge list into a symmetric set of edges); and (3) *Data Output* (i.e. the processed edge lis tis output as a tab-delimited `csv` file). \n",
"\n",
"**Data Documentation for use in Publications:** Gene-gene interaction (GGI) data was downloaded from GeneMANIA [**[`PMID:20576703`](https://pubmed.ncbi.nlm.nih.gov/20576703/)**] (release date: 03/14/2017). GeneMANIA provides species-specific networks built using co-expression data, physical interactions, genetic interactions, shared protein domains, co-localization, pathways, computational inference, and others (e.g. phenotype, disease, and chemical relationships from OMIM and Ensembl). These relationships are obtained by processing data from GEO, BioGRID, EMBL-EBI, Pfam, Ensembl, NCBI, MGI, I2D, InParanoid, and Pathway Commons. See GeneMANIA's [help page](http://pages.genemania.org/help/#network-data-sources) for more information. While GeneMANIA provides many different types of networks, we utilized the Homo sapiens Combined network, which includes all of the interaction network types described above merged into a single network by leveraging Gene Ontology Biological Process-based functional enrichment analysis (described in detail [here](http://pages.genemania.org/help/#network-data-sources)). The Homo sapien combined GGI data was downloaded on 06/16/20 and all GGIs were included resulting in a analysis set of 13,959,260 asymmetric GGIs.\n",
"\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "F-34WrDXwgG3",
"colab_type": "text"
},
"source": [
"#### Set-up Environment"
]
},
{
"cell_type": "code",
"metadata": {
"id": "cOLAMYHFiza2",
"colab_type": "code",
"colab": {}
},
"source": [
"# load needed libraries\n",
"import ftplib\n",
"import pandas as pd\n",
"\n",
"from contextlib import closing\n",
"from google.colab import drive\n",
"from tqdm import tqdm"
],
"execution_count": 1,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "8_41IUF8wp2E",
"colab_type": "text"
},
"source": [
"#### STEP 1 - DOWNLOAD DATA\n"
]
},
{
"cell_type": "code",
"metadata": {
"id": "Yf3jIFS4wp_V",
"colab_type": "code",
"colab": {}
},
"source": [
"# GGI URL\n",
"url = 'http://genemania.org/data/current/Homo_sapiens.COMBINED/COMBINED.DEFAULT_NETWORKS.BP_COMBINING.txt'\n",
"\n",
"# load data from URL into Pandas DataFrame\n",
"ggi_raw = pd.read_csv(url, sep='\\t', header=0)\n",
"\n",
"# preview first few rows of the data\n",
"ggi_raw.head(n=10)"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "4ZzseFaIzI-s",
"colab_type": "text"
},
"source": [
"#### STEP 2 - DATA PROCESSING"
]
},
{
"cell_type": "code",
"metadata": {
"id": "uTkPgz3szfzL",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 35
},
"outputId": "7d7121db-6194-4e31-b27e-8f9ffe4418cf"
},
"source": [
"# print unique counts of edges and source/target nodes\n",
"ggi_unq = ggi_raw.drop_duplicates()\n",
"\n",
"edges = len(ggi_unq)\n",
"source_nodes = len(set(list(ggi_unq['Gene_A'])))\n",
"target_nodes = len(set(list(ggi_unq['Gene_B'])))\n",
"\n",
"# print counts\n",
"'There are {edges} unique edges, {source} unique source nodes, and {target} unique target nodes'.format(edges=edges,\n",
" source=source_nodes,\n",
" target=target_nodes)"
],
"execution_count": 3,
"outputs": [
{
"output_type": "execute_result",
"data": {
"application/vnd.google.colaboratory.intrinsic": {
"type": "string"
},
"text/plain": [
"'There are 6979630 unique edges, 19167 unique source nodes, and 19503 unique target nodes'"
]
},
"metadata": {
"tags": []
},
"execution_count": 3
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "20PbFkYCzJJZ",
"colab_type": "code",
"colab": {}
},
"source": [
"# create a symmetric version of data\n",
"source = list(ggi_unq['Gene_A']) + list(ggi_unq['Gene_B'])\n",
"target = list(ggi_unq['Gene_B']) + list(ggi_unq['Gene_A'])\n",
"weight = list(ggi_unq['Weight']) + list(ggi_unq['Weight'])\n",
"\n",
"# convert lists to Pandas DataFrame\n",
"ggi_sym = pd.DataFrame(list(zip(source, target, weight)),\n",
" columns =['Gene_A', 'Gene_B', 'Weight'])\n",
"\n",
"# remove duplicates\n",
"ggi_sym_unq = ggi_sym.drop_duplicates()\n",
"\n",
"edges = len(ggi_sym_unq)\n",
"source_nodes = len(set(list(ggi_sym_unq['Gene_A'])))\n",
"target_nodes = len(set(list(ggi_sym_unq['Gene_B'])))\n",
"\n",
"# print counts\n",
"'There are {edges} unique edges, {source} unique source nodes, and {target} unique target nodes'.format(edges=edges,\n",
" source=source_nodes,\n",
" target=target_nodes)"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "RTKlY-RK5q7w",
"colab_type": "code",
"colab": {}
},
"source": [
"# sort data by Weight to verify file looks correct\n",
"ggi_sym_unq_srt = ggi_sym_unq.sort_values(by=['Weight', 'Gene_A'])\n",
"\n",
"# preview data\n",
"ggi_sym_unq_srt.head(n=10)"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "dscIgsJ2zJT8",
"colab_type": "text"
},
"source": [
"#### STEP 3 - DATA OUTPUT"
]
},
{
"cell_type": "code",
"metadata": {
"id": "y38iY8FOzJar",
"colab_type": "code",
"colab": {}
},
"source": [
"# mount GoogleDrive - you will be prompted to authenticate your GoogleDrive\n",
"# if you get stuck follow instructions here: https://stackoverflow.com/questions/49394737/exporting-data-from-google-colab-to-local-machine\n",
"drive.mount('/drive', force_remount=True)"
],
"execution_count": 7,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "eiXaofTMzqqs",
"colab_type": "code",
"colab": {}
},
"source": [
"# save processed DataFrame locally - edges\n",
"ggi_sym_unq_srt.to_csv('/drive/My Drive/Colab Notebooks/data/GGI_Combined_HomoSapien_16June2020.csv', sep='\\t', header=True, index=False)"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "ooeEttC61qgi",
"colab_type": "code",
"colab": {}
},
"source": [
"# save node list\n",
"unique_genes = ggi_sym_unq_srt['Gene_A'].drop_duplicates()\n",
"unique_genes.to_csv('/drive/My Drive/Colab Notebooks/data/GGI_Combined_HomoSapien_UniqueNodes_04July2020.csv', sep='\\t', header=True, index=False)"
],
"execution_count": 8,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "hPRFFcQrxsbe",
"colab_type": "text"
},
"source": [
"### Node Attributes \n",
"In order to make the edges more interpretable, we also pull some node attribute data from the sources listed in the table below. \n",
"\n",
"Tab | Source | Source Version/Release Date | Source URL | Download Date\n",
"-- | -- | -- | -- | --\n",
"Ensembl_HS.GRCh38.100.Uniprot | Ensembl | 100 | [URL](ftp://ftp.ensembl.org/pub/release-100/tsv/homo_sapiens/Homo_sapiens.GRCh38.100.uniprot.tsv.gz) | 7/4/20\n",
"GOA_human | Gene Ontology Consortium | 6/1/20 | [URL](http://geneontology.org/gene-associations/goa_human.gaf.gz) | 7/4/20\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "KXlZjW7iL1_e",
"colab_type": "text"
},
"source": [
"#### Download Node Data"
]
},
{
"cell_type": "code",
"metadata": {
"id": "HzSdELuGEtF8",
"colab_type": "code",
"colab": {}
},
"source": [
"def gzipped_ftp_url_download(url: str, write_location: str):\n",
" \"\"\"Downloads a gzipped file from an ftp server.\n",
"\n",
" Args:\n",
" url: A string that points to the location of a temp mapping file that needs to be processed.\n",
" write_location: A string that points to a file directory.\n",
"\n",
" Returns:\n",
" write_loc: a String containing the directory and filename where the data was downloaded\n",
" \"\"\"\n",
" \n",
" server = url.replace('ftp://', '').split('/')[0]\n",
" directory = '/'.join(url.replace('ftp://', '').split('/')[1:-1])\n",
" file = url.replace('ftp://', '').split('/')[-1]\n",
" write_loc = write_location + '{filename}'.format(filename=file)\n",
"\n",
" print('Downloading Gzipped data from FTP Server: {}'.format(url))\n",
" with closing(ftplib.FTP(server)) as ftp, open(write_loc, 'wb') as fid:\n",
" ftp.login()\n",
" ftp.cwd(directory)\n",
" ftp.retrbinary('RETR {}'.format(file), fid.write)\n",
"\n",
" fid.close()\n",
"\n",
" return write_loc\n",
"\n",
"def convert_to_dict(data, col_a, col_b):\n",
" \"\"\"Converts a Pandas DataFrame into a dictionary.\n",
"\n",
" Args:\n",
" data: A Pandas DataFrame.\n",
" col_a: A string containing a column name to be used as the dicitonary key.\n",
" col_b: A string containing a column name to be used as the dictionary value.\n",
" \n",
" Returns:\n",
" node_metadata: A dictionary where keys are gene identifiers and values are a set of identifiers.\n",
" \"\"\"\n",
"\n",
" node_metadata = dict()\n",
"\n",
" for idx, row in tqdm(data.iterrows(), total=data.shape[0]):\n",
" if row[col_a] in node_metadata:\n",
" node_metadata[row[col_a]] |= {row[col_b]}\n",
" else:\n",
" node_metadata[row[col_a]] = {row[col_b]}\n",
"\n",
" return node_metadata"
],
"execution_count": 62,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "hjcNcR-ECh9D",
"colab_type": "code",
"colab": {}
},
"source": [
"# Ensembl gene - UniProt\n",
"url = 'ftp://ftp.ensembl.org/pub/release-100/tsv/homo_sapiens/Homo_sapiens.GRCh38.100.uniprot.tsv.gz'\n",
"file_loc = gzipped_ftp_url_download(url, '/drive/My Drive/Colab Notebooks/data/')\n",
"\n",
"# read in data\n",
"ensembl_uniprot = pd.read_csv(file_loc, sep='\\t', header=0, compression='gzip')\n",
"ensembl_uniprot.head(n=5)\n",
"\n",
"# convert to dictionary\n",
"ensembl_uniprot_dict = convert_to_dict(ensembl_uniprot, 'gene_stable_id', 'xref')"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "UISXUIiLCiGm",
"colab_type": "code",
"colab": {}
},
"source": [
"# GOA_Human Annotations - Gene Ontology Consortium\n",
"url= 'http://geneontology.org/gene-associations/goa_human.gaf.gz'\n",
"columns = ['DB', 'DB_Object_ID', 'DB_Object_Symbol', 'Qualifier', 'GO_ID', 'DB:Reference',\n",
" 'Evidence_Code', 'With (or) From', 'Aspect', 'DB_Object_Name', 'DB_Object_Synonym',\n",
" 'DB_Object_Type', 'Taxon', 'Date', 'Assigned_By', 'Annotation Extension', 'Gene Product Form ID']\n",
"\n",
"goa = pd.read_csv(url, sep='\\t', header=None, names=columns, compression='gzip', skiprows=32, low_memory=False)\n",
"goa.head(n=5)\n",
"\n",
"# convert to dictionary\n",
"goa_dict_GO = convert_to_dict(goa, 'DB_Object_ID', 'GO_ID')\n",
"goa_dict_GO_aspect = convert_to_dict(goa, 'GO_ID', 'Aspect')"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "0ent3pmVL5P3",
"colab_type": "text"
},
"source": [
"### Aggregate Node Data\n",
"Join all of the node data into a single file keyed by node."
]
},
{
"cell_type": "code",
"metadata": {
"id": "Tx8RgyyqMCH2",
"colab_type": "code",
"colab": {}
},
"source": [
"# combine results into single data structure\n",
"data = []\n",
"\n",
"for gene in tqdm(list(unique_genes)):\n",
" if gene in ensembl_uniprot_dict.keys():\n",
" proteins = list(ensembl_uniprot_dict[gene])\n",
" # uniprot id\n",
" for protein in proteins:\n",
" # get go annotations\n",
" if protein in goa_dict_GO.keys():\n",
" for go in goa_dict_GO[protein]:\n",
" data += [[gene, protein, go, list(goa_dict_GO_aspect[go])[0]]]"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "Bli2m3S9XLJI",
"colab_type": "code",
"colab": {}
},
"source": [
"# convert list to Pnadas DataFrame\n",
"ensembl_gene_annotations = pd.DataFrame({'ensembl_gene_id': [x[0] for x in data],\n",
" 'uniprot_id': [x[1] for x in data],\n",
" 'go_id': [x[2] for x in data],\n",
" 'go_aspect': [x[3] for x in data]})"
],
"execution_count": 116,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "EakfFZj1UUbH",
"colab_type": "code",
"colab": {}
},
"source": [
"# save output\n",
"ensembl_gene_annotations.to_csv('/drive/My Drive/Colab Notebooks/data/GGI_Combined_HomoSapien_NodeAnnotations_04July2020.csv', sep='\\t', header=True, index=False)"
],
"execution_count": 118,
"outputs": []
}
]
}
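
The symmetrization in Step 2 of the notebook can also be expressed with `pd.concat`. The following is a minimal sketch on a toy edge list, assuming the notebook's column names (`Gene_A`, `Gene_B`, `Weight`). The gene identifiers and weights are made up for illustration, and `symmetrize_edges` is a hypothetical helper that does not appear in the notebook.

```python
import pandas as pd


def symmetrize_edges(edges: pd.DataFrame) -> pd.DataFrame:
    """Return the edge list with every edge present in both directions."""
    # swap the two gene columns to build the reversed copy of each edge
    reversed_edges = edges.rename(columns={'Gene_A': 'Gene_B', 'Gene_B': 'Gene_A'})
    # stack original and reversed edges (same column order), then drop exact duplicates
    both = pd.concat([edges, reversed_edges[edges.columns]], ignore_index=True)
    return both.drop_duplicates().reset_index(drop=True)


# toy asymmetric edge list (identifiers and weights are illustrative only)
toy = pd.DataFrame({'Gene_A': ['ENSG_X', 'ENSG_X'],
                    'Gene_B': ['ENSG_Y', 'ENSG_Z'],
                    'Weight': [0.5, 0.1]})
toy_sym = symmetrize_edges(toy)
print(toy_sym)
```

If each gene pair appears exactly once in the input, as the GeneMANIA data archive notes state, the symmetric edge list should have twice as many rows as the input.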
@callahantiff

Data Exported on 06/16/20 (tab-delimited): GGI_Combined_HomoSapien_16June2020.csv
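
Because the export is tab-delimited despite its `.csv` extension, the separator has to be passed explicitly when reading it back. A minimal sketch, using an in-memory stand-in for the file: the two rows are invented for illustration and only the column names come from the notebook.

```python
import io

import pandas as pd

# in-memory stand-in for GGI_Combined_HomoSapien_16June2020.csv (rows are made up)
fake_file = io.StringIO('Gene_A\tGene_B\tWeight\n'
                        'ENSG_X\tENSG_Y\t0.5\n'
                        'ENSG_Y\tENSG_X\t0.5\n')

# sep='\t' is required; the default comma separator would yield a single column
ggi = pd.read_csv(fake_file, sep='\t', header=0)
print(ggi.dtypes)
```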

@callahantiff

callahantiff commented Jul 4, 2020

Unique Nodes (07/04/20): GGI_Combined_HomoSapien_UniqueNodes_04July2020.csv
Node Attributes (GOA Annotations - 07/04/20): [GGI_Combined_HomoSapien_NodeAnnotations_04July2020.csv](https://drive.google.com/file/d/1-3iiEoaZc0m4pWEW77Q-bf3EQXg4rh4n/view?usp=sharing)
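
One possible use of the node annotation table is to count the distinct GO terms attached to each gene, split by GO aspect. The sketch below assumes the column names written by the notebook (`ensembl_gene_id`, `uniprot_id`, `go_id`, `go_aspect`); all identifiers are placeholders, not real Ensembl, UniProt, or GO ids.

```python
import pandas as pd

# toy stand-in for the node annotation table (all values are illustrative)
annotations = pd.DataFrame({
    'ensembl_gene_id': ['ENSG_X', 'ENSG_X', 'ENSG_X', 'ENSG_Y'],
    'uniprot_id': ['P_1', 'P_1', 'P_2', 'P_3'],
    'go_id': ['GO:A', 'GO:B', 'GO:A', 'GO:C'],
    'go_aspect': ['P', 'F', 'P', 'C'],
})

# number of distinct GO terms per gene and aspect
# (GAF aspect codes: P = biological process, F = molecular function, C = cellular component)
go_counts = (annotations
             .groupby(['ensembl_gene_id', 'go_aspect'])['go_id']
             .nunique()
             .unstack(fill_value=0))
print(go_counts)
```

Using `nunique` rather than a plain row count avoids double-counting a GO term that reaches the same gene through more than one UniProt id.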
