callahantiff/genemania_dataprocessingpipeline.ipynb

## genemania_dataprocessingpipeline.ipynb
{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "GeneMania_DataProcessingPipeline.ipynb",
      "provenance": [],
      "collapsed_sections": [],
      "mount_file_id": "https://gist.github.com/callahantiff/871f6bcbdbd6603d1eb19a38ddc7321f#file-genemania_dataprocessingpipeline-ipynb",
      "authorship_tag": "ABX9TyNFzvUgJ5ESgR7n1f2BgIf7",
      "include_colab_link": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/gist/callahantiff/871f6bcbdbd6603d1eb19a38ddc7321f/genemania_dataprocessingpipeline.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "1cBdztLei-TT",
        "colab_type": "text"
      },
      "source": [
        "<img align=\"right\" width=\"300\" alt=\"Screen Shot 2019-12-12 at 21 59 22\" src=\"https://user-images.githubusercontent.com/8030363/70771518-9d1f5980-1d2e-11ea-9201-d5aade3fe376.png\">\n",
        "\n",
        "## GeneMANIA Data Processing Pipeline\n",
        "**Creation Date:** `06/16/20`  \n",
        "**Contact Notebook Author:** [`TJCallahan`](https://mail.google.com/mail/u/0/?view=cm&fs=1&tf=1&to=callahantiff@gmail.com)  \n",
        "\n",
        "<br>\n",
        "\n",
        "### Data \n",
        "**Release:** `2017-Mar-14`   \n",
        "**Downloaded URL:** [`COMBINED.DEFAULT_NETWORKS.BP_COMBINING.txt`](http://genemania.org/data/current/Homo_sapiens.COMBINED/COMBINED.DEFAULT_NETWORKS.BP_COMBINING.txt)  \n",
        "**PubMed ID:** [`20576703`](https://pubmed.ncbi.nlm.nih.gov/20576703/)   \n",
        "**Description:**  This file contains `3` columns, where each row represents an edge. Within an edge, the first two columns each contain a single **[`Ensembl Gene`](https://uswest.ensembl.org/index.html)** identifier and the third column contains a float representing the edge weight. Please note the following details copied from GeneMANIA's [`Data Archive`](http://pages.genemania.org/data/) page:    \n",
        "- Each interacting pair of genes will be present exactly once in the file (symmetric interactions are not included)  \n",
        "- Non-interacting genes are not present  \n",
        "- No assumptions are made regarding the order of the records in the file or the order of genes in a record\n",
        "\n",
        "<br>\n",
        "\n",
        "### Purpose   \n",
        "The goal of this notebook is to provide a reproducible workflow for downloading and processing gene-gene interaction data from **[`GeneMANIA`](http://pages.genemania.org/)**. This pipeline consists of the following 3 steps: (1) *Download Data* (i.e. data is downloaded into a `Pandas.DataFrame` object directly from the URL referenced above); (2) *Data Processing* (i.e. this workflow provides optional functionality to convert the asymmetric edge list that `GeneMANIA` provides by default into a symmetric set of edges); and (3) *Data Output* (i.e. the processed edge list is output as a tab-delimited `csv` file).  \n",
        "\n",
        "**Data Documentation for use in Publications:** Gene-gene interaction (GGI) data was downloaded from GeneMANIA [**[`PMID:20576703`](https://pubmed.ncbi.nlm.nih.gov/20576703/)**] (release date: 03/14/2017). GeneMANIA provides species-specific networks built using co-expression data, physical interactions, genetic interactions, shared protein domains, co-localization, pathways, computational inference, and others (e.g. phenotype, disease, and chemical relationships from OMIM and Ensembl). These relationships are obtained by processing data from GEO, BioGRID, EMBL-EBI, Pfam, Ensembl, NCBI, MGI, I2D, InParanoid, and Pathway Commons. See GeneMANIA's [help page](http://pages.genemania.org/help/#network-data-sources) for more information. While GeneMANIA provides many different types of networks, we utilized the Homo sapiens Combined network, which includes all of the interaction network types described above merged into a single network by leveraging Gene Ontology Biological Process-based functional enrichment analysis (described in detail [here](http://pages.genemania.org/help/#network-data-sources)). The Homo sapiens combined GGI data was downloaded on 06/16/20 and all GGIs were included, resulting in an analysis set of 13,959,260 symmetric GGIs (6,979,630 unique gene pairs, each represented in both directions).\n",
        "\n",
        " "
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "F-34WrDXwgG3",
        "colab_type": "text"
      },
      "source": [
        "#### Set-up Environment"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "cOLAMYHFiza2",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# load needed libraries\n",
        "import ftplib\n",
        "import pandas as pd\n",
        "\n",
        "from contextlib import closing\n",
        "from google.colab import drive\n",
        "from tqdm import tqdm"
      ],
      "execution_count": 1,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "8_41IUF8wp2E",
        "colab_type": "text"
      },
      "source": [
        "#### STEP 1 - DOWNLOAD DATA\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Yf3jIFS4wp_V",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# GGI URL\n",
        "url = 'http://genemania.org/data/current/Homo_sapiens.COMBINED/COMBINED.DEFAULT_NETWORKS.BP_COMBINING.txt'\n",
        "\n",
        "# load data from URL into Pandas DataFrame\n",
        "ggi_raw = pd.read_csv(url, sep='\\t', header=0)\n",
        "\n",
        "# preview first few rows of the data\n",
        "ggi_raw.head(n=10)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "4ZzseFaIzI-s",
        "colab_type": "text"
      },
      "source": [
        "#### STEP 2 - DATA PROCESSING"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "uTkPgz3szfzL",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 35
        },
        "outputId": "7d7121db-6194-4e31-b27e-8f9ffe4418cf"
      },
      "source": [
        "# print unique counts of edges and source/target nodes\n",
        "ggi_unq = ggi_raw.drop_duplicates()\n",
        "\n",
        "edges = len(ggi_unq)\n",
        "source_nodes = len(set(list(ggi_unq['Gene_A'])))\n",
        "target_nodes = len(set(list(ggi_unq['Gene_B'])))\n",
        "\n",
        "# print counts\n",
        "'There are {edges} unique edges, {source} unique source nodes, and {target} unique target nodes'.format(edges=edges,\n",
        "                                                                                                        source=source_nodes,\n",
        "                                                                                                        target=target_nodes)"
      ],
      "execution_count": 3,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "application/vnd.google.colaboratory.intrinsic": {
              "type": "string"
            },
            "text/plain": [
              "'There are 6979630 unique edges, 19167 unique source nodes, and 19503 unique target nodes'"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 3
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "20PbFkYCzJJZ",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# create a symmetric version of data\n",
        "source = list(ggi_unq['Gene_A']) + list(ggi_unq['Gene_B'])\n",
        "target = list(ggi_unq['Gene_B']) + list(ggi_unq['Gene_A'])\n",
        "weight = list(ggi_unq['Weight']) + list(ggi_unq['Weight'])\n",
        "\n",
        "# convert lists to Pandas DataFrame\n",
        "ggi_sym = pd.DataFrame(list(zip(source, target, weight)),\n",
        "                       columns =['Gene_A', 'Gene_B', 'Weight'])\n",
        "\n",
        "# remove duplicates\n",
        "ggi_sym_unq = ggi_sym.drop_duplicates()\n",
        "\n",
        "edges = len(ggi_sym_unq)\n",
        "source_nodes = len(set(list(ggi_sym_unq['Gene_A'])))\n",
        "target_nodes = len(set(list(ggi_sym_unq['Gene_B'])))\n",
        "\n",
        "# print counts\n",
        "'There are {edges} unique edges, {source} unique source nodes, and {target} unique target nodes'.format(edges=edges,\n",
        "                                                                                                        source=source_nodes,\n",
        "                                                                                                        target=target_nodes)"
      ],
      "execution_count": null,
      "outputs": []
    },
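    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text"
      },
      "source": [
        "The symmetrization step above can also be sketched on a toy edge list. This is a minimal illustration rather than part of the pipeline: the gene identifiers (`G1`, `G2`, `G3`), the weights, and the `toy_*` variable names are hypothetical, and `pd.concat` is swapped in for the list concatenation used above.\n",
        "\n",
        "```python\n",
        "import pandas as pd\n",
        "\n",
        "# toy edge list with hypothetical gene identifiers; each pair appears once\n",
        "toy_edges = pd.DataFrame({'Gene_A': ['G1', 'G1', 'G2'],\n",
        "                          'Gene_B': ['G2', 'G3', 'G3'],\n",
        "                          'Weight': [0.5, 0.2, 0.1]})\n",
        "\n",
        "# swap the gene columns to obtain the reverse direction of every edge\n",
        "toy_reversed = toy_edges.rename(columns={'Gene_A': 'Gene_B', 'Gene_B': 'Gene_A'})\n",
        "\n",
        "# put the columns back in the original order, stack both directions, and drop exact duplicates\n",
        "toy_reversed = toy_reversed[list(toy_edges.columns)]\n",
        "toy_sym = pd.concat([toy_edges, toy_reversed], ignore_index=True).drop_duplicates()\n",
        "```\n",
        "\n",
        "Each of the 3 toy pairs ends up in both directions, so `toy_sym` holds 6 rows and every gene appears in `Gene_A`."
      ]
    },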
    {
      "cell_type": "code",
      "metadata": {
        "id": "RTKlY-RK5q7w",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# sort data by Weight to verify file looks correct\n",
        "ggi_sym_unq_srt = ggi_sym_unq.sort_values(by=['Weight', 'Gene_A'])\n",
        "\n",
        "# preview data\n",
        "ggi_sym_unq_srt.head(n=10)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "dscIgsJ2zJT8",
        "colab_type": "text"
      },
      "source": [
        "#### STEP 3 - DATA OUTPUT"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "y38iY8FOzJar",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# mount GoogleDrive - you will be prompted to authenticate your GoogleDrive\n",
        "# if you get stuck follow instructions here: https://stackoverflow.com/questions/49394737/exporting-data-from-google-colab-to-local-machine\n",
        "drive.mount('/drive', force_remount=True)"
      ],
      "execution_count": 7,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "eiXaofTMzqqs",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# save processed edge list to the mounted Google Drive\n",
        "ggi_sym_unq_srt.to_csv('/drive/My Drive/Colab Notebooks/data/GGI_Combined_HomoSapien_16June2020.csv', sep='\\t', header=True, index=False)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "ooeEttC61qgi",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# save node list (Gene_A contains every gene because the edge list is symmetric)\n",
        "unique_genes = ggi_sym_unq_srt['Gene_A'].drop_duplicates()\n",
        "unique_genes.to_csv('/drive/My Drive/Colab Notebooks/data/GGI_Combined_HomoSapien_UniqueNodes_04July2020.csv', sep='\\t', header=True, index=False)"
      ],
      "execution_count": 8,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "hPRFFcQrxsbe",
        "colab_type": "text"
      },
      "source": [
        "### Node Attributes  \n",
        "In order to make the edges more interpretable, we also pull some node attribute data from the sources listed in the table below. \n",
        "\n",
        "Tab | Source | Source Version/Release Date | Source URL | Download Date\n",
        "-- | -- | -- | -- | --\n",
        "Ensembl_HS.GRCh38.100.Uniprot | Ensembl | 100 | [URL](ftp://ftp.ensembl.org/pub/release-100/tsv/homo_sapiens/Homo_sapiens.GRCh38.100.uniprot.tsv.gz) | 7/4/20\n",
        "GOA_human | Gene Ontology Consortium | 6/1/20 | [URL](http://geneontology.org/gene-associations/goa_human.gaf.gz) | 7/4/20\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "KXlZjW7iL1_e",
        "colab_type": "text"
      },
      "source": [
        "#### Download Node Data"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "HzSdELuGEtF8",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "def gzipped_ftp_url_download(url: str, write_location: str) -> str:\n",
        "    \"\"\"Downloads a gzipped file from an FTP server.\n",
        "\n",
        "    Args:\n",
        "        url: A string containing the FTP URL of the gzipped file to download.\n",
        "        write_location: A string that points to a file directory (must end with '/').\n",
        "\n",
        "    Returns:\n",
        "        write_loc: A string containing the directory and filename where the data was downloaded.\n",
        "    \"\"\"\n",
        "\n",
        "    # split the URL into server, directory, and filename\n",
        "    url_parts = url.replace('ftp://', '').split('/')\n",
        "    server = url_parts[0]\n",
        "    directory = '/'.join(url_parts[1:-1])\n",
        "    file = url_parts[-1]\n",
        "    write_loc = write_location + file\n",
        "\n",
        "    print('Downloading Gzipped data from FTP Server: {}'.format(url))\n",
        "    # the FTP connection and the output file are both closed when the with block exits\n",
        "    with closing(ftplib.FTP(server)) as ftp, open(write_loc, 'wb') as fid:\n",
        "        ftp.login()\n",
        "        ftp.cwd(directory)\n",
        "        ftp.retrbinary('RETR {}'.format(file), fid.write)\n",
        "\n",
        "    return write_loc\n",
        "\n",
        "def convert_to_dict(data, col_a, col_b):\n",
        "    \"\"\"Converts two columns of a Pandas DataFrame into a dictionary of sets.\n",
        "\n",
        "    Args:\n",
        "        data: A Pandas DataFrame.\n",
        "        col_a: A string containing a column name to be used as the dictionary key.\n",
        "        col_b: A string containing a column name to be used as the dictionary value.\n",
        "\n",
        "    Returns:\n",
        "        node_metadata: A dictionary where keys are the values in col_a and values are the set of col_b\n",
        "            entries observed with each key.\n",
        "    \"\"\"\n",
        "\n",
        "    node_metadata = dict()\n",
        "\n",
        "    for _, row in tqdm(data.iterrows(), total=data.shape[0]):\n",
        "        # create the set the first time a key is seen, then add the value\n",
        "        node_metadata.setdefault(row[col_a], set()).add(row[col_b])\n",
        "\n",
        "    return node_metadata"
      ],
      "execution_count": 62,
      "outputs": []
    },
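    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text"
      },
      "source": [
        "For large tables, the row-by-row loop in `convert_to_dict` can be slow. A minimal sketch of a grouped alternative follows. It is an illustration only: `convert_to_dict_grouped`, `toy_map`, and the identifiers in it are hypothetical names, and unlike the loop, `groupby` drops rows whose key is missing by default.\n",
        "\n",
        "```python\n",
        "import pandas as pd\n",
        "\n",
        "def convert_to_dict_grouped(data, col_a, col_b):\n",
        "    # group rows by the key column and collect the values of each group into a set\n",
        "    return data.groupby(col_a)[col_b].agg(set).to_dict()\n",
        "\n",
        "# toy mapping table with hypothetical gene and protein identifiers\n",
        "toy_map = pd.DataFrame({'gene_stable_id': ['ENSG_TOY_1', 'ENSG_TOY_1', 'ENSG_TOY_2'],\n",
        "                        'xref': ['P_TOY_A', 'P_TOY_B', 'P_TOY_C']})\n",
        "\n",
        "toy_dict = convert_to_dict_grouped(toy_map, 'gene_stable_id', 'xref')\n",
        "```\n",
        "\n",
        "Here `toy_dict` maps `ENSG_TOY_1` to a set of two protein identifiers and `ENSG_TOY_2` to a set of one, the same shape that `convert_to_dict` returns."
      ]
    },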
    {
      "cell_type": "code",
      "metadata": {
        "id": "hjcNcR-ECh9D",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# Ensembl gene - UniProt\n",
        "url = 'ftp://ftp.ensembl.org/pub/release-100/tsv/homo_sapiens/Homo_sapiens.GRCh38.100.uniprot.tsv.gz'\n",
        "file_loc = gzipped_ftp_url_download(url, '/drive/My Drive/Colab Notebooks/data/')\n",
        "\n",
        "# read in data\n",
        "ensembl_uniprot = pd.read_csv(file_loc, sep='\\t', header=0, compression='gzip')\n",
        "ensembl_uniprot.head(n=5)\n",
        "\n",
        "# convert to dictionary\n",
        "ensembl_uniprot_dict = convert_to_dict(ensembl_uniprot, 'gene_stable_id', 'xref')"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "UISXUIiLCiGm",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# GOA_Human Annotations - Gene Ontology Consortium\n",
        "url = 'http://geneontology.org/gene-associations/goa_human.gaf.gz'\n",
        "columns = ['DB', 'DB_Object_ID', 'DB_Object_Symbol', 'Qualifier', 'GO_ID', 'DB:Reference',\n",
        "           'Evidence_Code', 'With (or) From', 'Aspect', 'DB_Object_Name', 'DB_Object_Synonym',\n",
        "           'DB_Object_Type', 'Taxon', 'Date', 'Assigned_By', 'Annotation Extension', 'Gene Product Form ID']\n",
        "\n",
        "# skiprows=32 skips the GAF header lines (prefixed with '!') of the release used here; other releases may need a different value\n",
        "goa = pd.read_csv(url, sep='\\t', header=None, names=columns, compression='gzip', skiprows=32, low_memory=False)\n",
        "goa.head(n=5)\n",
        "\n",
        "# convert to dictionary\n",
        "goa_dict_GO = convert_to_dict(goa, 'DB_Object_ID', 'GO_ID')\n",
        "goa_dict_GO_aspect = convert_to_dict(goa, 'GO_ID', 'Aspect')"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "0ent3pmVL5P3",
        "colab_type": "text"
      },
      "source": [
        "### Aggregate Node Data\n",
        "Join all of the node data into a single file keyed by node."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Tx8RgyyqMCH2",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# combine results into single data structure\n",
        "data = []\n",
        "\n",
        "for gene in tqdm(list(unique_genes)):\n",
        "  if gene in ensembl_uniprot_dict:\n",
        "    # uniprot ids mapped to the gene\n",
        "    for protein in ensembl_uniprot_dict[gene]:\n",
        "      # get go annotations\n",
        "      if protein in goa_dict_GO:\n",
        "        for go in goa_dict_GO[protein]:\n",
        "          # each GO id belongs to a single aspect, so take the first element of the set\n",
        "          data.append([gene, protein, go, next(iter(goa_dict_GO_aspect[go]))])"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Bli2m3S9XLJI",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# convert list to Pandas DataFrame\n",
        "ensembl_gene_annotations = pd.DataFrame(data, columns=['ensembl_gene_id', 'uniprot_id', 'go_id', 'go_aspect'])"
      ],
      "execution_count": 116,
      "outputs": []
    },
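    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text"
      },
      "source": [
        "The nested dictionary lookups above amount to a gene to protein to GO term join. A minimal sketch of the same idea with `DataFrame.merge` is shown on toy data below. It is not the method used to build the output file: the `toy_*` tables and all identifiers in them are hypothetical, and only the column names are borrowed from the Ensembl and GOA tables read in above.\n",
        "\n",
        "```python\n",
        "import pandas as pd\n",
        "\n",
        "# toy stand-ins (hypothetical identifiers) for the Ensembl-UniProt and GOA tables\n",
        "toy_gene_protein = pd.DataFrame({'gene_stable_id': ['ENSG_TOY_1', 'ENSG_TOY_1', 'ENSG_TOY_2'],\n",
        "                                 'xref': ['P_TOY_A', 'P_TOY_B', 'P_TOY_C']})\n",
        "toy_protein_go = pd.DataFrame({'DB_Object_ID': ['P_TOY_A', 'P_TOY_A', 'P_TOY_C'],\n",
        "                               'GO_ID': ['GO:TOY1', 'GO:TOY2', 'GO:TOY1'],\n",
        "                               'Aspect': ['P', 'F', 'P']})\n",
        "\n",
        "# an inner join keeps only proteins that have at least one GO annotation\n",
        "toy_joined = toy_gene_protein.merge(toy_protein_go, left_on='xref', right_on='DB_Object_ID', how='inner')\n",
        "\n",
        "# keep and rename the four output columns, then drop repeated annotations\n",
        "toy_annotations = toy_joined[['gene_stable_id', 'xref', 'GO_ID', 'Aspect']].rename(\n",
        "    columns={'gene_stable_id': 'ensembl_gene_id', 'xref': 'uniprot_id', 'GO_ID': 'go_id', 'Aspect': 'go_aspect'}\n",
        ").drop_duplicates().reset_index(drop=True)\n",
        "```\n",
        "\n",
        "`P_TOY_B` has no GO annotation in the toy table, so it is absent from `toy_annotations`, just as unannotated proteins are skipped by the loop above."
      ]
    },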
    {
      "cell_type": "code",
      "metadata": {
        "id": "EakfFZj1UUbH",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# save output\n",
        "ensembl_gene_annotations.to_csv('/drive/My Drive/Colab Notebooks/data/GGI_Combined_HomoSapien_NodeAnnotations_04July2020.csv', sep='\\t', header=True, index=False)"
      ],
      "execution_count": 118,
      "outputs": []
    }
  ]
}