Skip to content

Instantly share code, notes, and snippets.

@ljmartin
Created January 17, 2020 05:54
Show Gist options
  • Save ljmartin/90b7058ddc33a8d43fe05fa20e6f4194 to your computer and use it in GitHub Desktop.
Save ljmartin/90b7058ddc33a8d43fe05fa20e6f4194 to your computer and use it in GitHub Desktop.
Display the source blob
Display the rendered blob
Raw
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import bq_helper\n",
"from bq_helper import BigQueryHelper\n",
"\n",
"#you will need to set credentials to use google's BigQuery\n",
"#see: https://cloud.google.com/docs/authentication/getting-started"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# How to download activity data from ChEMBL from the cloud\n",
"\n",
"Google performs the queries and returns only the relevant info. Saves ~15GB since you don't need a local copy of the chembl database. Takes about 10-15mins. Must have bq_helper, see https://github.com/SohierDane/BigQuery_Helper"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Set up a BigQuery helper instance:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"ebi_chembl = bq_helper.BigQueryHelper(active_project=\"patents-public-data\",\n",
" dataset_name=\"ebi_chembl\")\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### View available tables as sanity check:\n",
"Just showing first ten here"
]
},
{
"cell_type": "code",
"execution_count": 147,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['action_type',\n",
" 'action_type_23',\n",
" 'action_type_24',\n",
" 'activities',\n",
" 'activities_23',\n",
" 'activities_24',\n",
" 'activity_properties',\n",
" 'activity_properties_24',\n",
" 'activity_smid',\n",
" 'activity_smid_24']"
]
},
"execution_count": 147,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# View table names under the ebi_chembl data table\n",
"bq_assistant = BigQueryHelper(\"patents-public-data\", \"ebi_chembl\")\n",
"bq_assistant.list_tables()[:10]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### View the available columns in a given table:\n",
"\n",
"This helped with knowing what to ask for in the queries below. Example here prints the columns for the 'target_dictionary' table."
]
},
{
"cell_type": "code",
"execution_count": 117,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['tid', 'target_type', 'pref_name', 'tax_id', 'organism', 'chembl_id',\n",
" 'species_group_flag'],\n",
" dtype='object')"
]
},
"execution_count": 117,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"\n",
"# View the first three rows of the tissue_dictionary_23 data table\n",
"b =bq_assistant.head(\"target_dictionary\", num_rows=3)\n",
"b.columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Save tables for targets, activities, assays, molecules, and structures\n",
"\n",
"Uncomment the 'estimate_query' line to check how big the download will be (in GB) before actually saving it.\n",
"There is a (large) limit of 5TB per month apparently. "
]
},
{
"cell_type": "code",
"execution_count": 118,
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"#Target dictionary:\n",
"query1 = \"\"\"\n",
"SELECT \n",
" chembl_id, organism, pref_name, tid\n",
"FROM\n",
" `patents-public-data.ebi_chembl.target_dictionary`\n",
"WHERE\n",
" organism in ('Homo sapiens')\n",
"\"\"\"\n",
"\n",
"#bq_assistant.estimate_query_size(query1)\n",
"target_dict = ebi_chembl.query_to_pandas_safe(query1)"
]
},
{
"cell_type": "code",
"execution_count": 90,
"metadata": {},
"outputs": [],
"source": [
"##Activities:\n",
"\n",
"query1 = \"\"\"\n",
"SELECT\n",
" assay_id, molregno, activity_comment, pchembl_value\n",
"FROM\n",
" `patents-public-data.ebi_chembl.activities`\n",
"WHERE\n",
" activity_comment in ('Active', 'active', 'Partial agonist', 'Agonist', 'Antagonist') or pchembl_value>\"5\"\n",
" \n",
"\"\"\"\n",
"\n",
"#bq_assistant.estimate_query_size(query1)\n",
"activities = ebi_chembl.query_to_pandas_safe(query1)"
]
},
{
"cell_type": "code",
"execution_count": 108,
"metadata": {},
"outputs": [],
"source": [
"\n",
"##Assays\n",
"\n",
"query1 = \"\"\"\n",
"SELECT\n",
" chembl_id, assay_id, assay_organism, tid\n",
"FROM\n",
" `patents-public-data.ebi_chembl.assays`\n",
"WHERE\n",
" assay_organism in ('Homo sapiens')\n",
"\n",
"\"\"\"\n",
"\n",
"#bq_assistant.estimate_query_size(query1)\n",
"assays = ebi_chembl.query_to_pandas_safe(query1)\n",
" "
]
},
{
"cell_type": "code",
"execution_count": 94,
"metadata": {},
"outputs": [],
"source": [
"##Molecule dictionary:\n",
"\n",
"query1 = \"\"\"\n",
"SELECT\n",
" molregno, chembl_id\n",
"FROM\n",
" `patents-public-data.ebi_chembl.molecule_dictionary`\n",
"\n",
"\"\"\"\n",
"#bq_assistant.estimate_query_size(query1)\n",
"molecule_dict = ebi_chembl.query_to_pandas_safe(query1)"
]
},
{
"cell_type": "code",
"execution_count": 95,
"metadata": {},
"outputs": [],
"source": [
"##Compound structures:\n",
"\n",
"query1 = \"\"\"\n",
"SELECT\n",
" molregno, canonical_smiles\n",
"FROM\n",
" `patents-public-data.ebi_chembl.compound_structures`\n",
"\n",
"\"\"\"\n",
"bq_assistant.estimate_query_size(query1)\n",
"compound_structures = ebi_chembl.query_to_pandas_safe(query1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Perform inner joins as you would in SQL:"
]
},
{
"cell_type": "code",
"execution_count": 124,
"metadata": {},
"outputs": [],
"source": [
"df = pd.merge(compound_structures, molecule_dict, on='molregno')\n"
]
},
{
"cell_type": "code",
"execution_count": 131,
"metadata": {},
"outputs": [],
"source": [
"df2 = pd.merge(df, activities, on='molregno')\n"
]
},
{
"cell_type": "code",
"execution_count": 140,
"metadata": {},
"outputs": [],
"source": [
"assays = assays.drop('chembl_id', axis=1)\n",
"df3 = pd.merge(df2, assays, on='assay_id')"
]
},
{
"cell_type": "code",
"execution_count": 143,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"target_dict = target_dict.drop('chembl_id', axis=1)\n",
"df4 = pd.merge(df3, target_dict, on='tid')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Print out activites:\n",
"The columns here contain enough data to parse and build a virtual screening or do any other analysis."
]
},
{
"cell_type": "code",
"execution_count": 146,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>molregno</th>\n",
" <th>canonical_smiles</th>\n",
" <th>chembl_id</th>\n",
" <th>assay_id</th>\n",
" <th>activity_comment</th>\n",
" <th>pchembl_value</th>\n",
" <th>assay_organism</th>\n",
" <th>tid</th>\n",
" <th>organism</th>\n",
" <th>pref_name</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>97</td>\n",
" <td>COc1cc2nc(nc(N)c2cc1OC)N3CCN(CC3)C(=O)c4occc4</td>\n",
" <td>CHEMBL2</td>\n",
" <td>774675</td>\n",
" <td>None</td>\n",
" <td>5.39</td>\n",
" <td>Homo sapiens</td>\n",
" <td>13000</td>\n",
" <td>Homo sapiens</td>\n",
" <td>Matrix metalloproteinase-1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>5985</td>\n",
" <td>CCc1c2CN3C(=O)C4=C(C=C3c2nc5ccc(OC(=O)N6CCC(CC...</td>\n",
" <td>CHEMBL481</td>\n",
" <td>774675</td>\n",
" <td>None</td>\n",
" <td>5.22</td>\n",
" <td>Homo sapiens</td>\n",
" <td>13000</td>\n",
" <td>Homo sapiens</td>\n",
" <td>Matrix metalloproteinase-1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>33373</td>\n",
" <td>COc1cc2nc(nc(N)c2cc1OC)N3CCN(CC3)C(=O)C4COc5cc...</td>\n",
" <td>CHEMBL707</td>\n",
" <td>774675</td>\n",
" <td>None</td>\n",
" <td>5.24</td>\n",
" <td>Homo sapiens</td>\n",
" <td>13000</td>\n",
" <td>Homo sapiens</td>\n",
" <td>Matrix metalloproteinase-1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>69926</td>\n",
" <td>CC[C@H](N[C@@H](CC(C)C)C(=O)N[C@@H](Cc1ccccc1)...</td>\n",
" <td>CHEMBL47907</td>\n",
" <td>101742</td>\n",
" <td>None</td>\n",
" <td>6.64</td>\n",
" <td>Homo sapiens</td>\n",
" <td>13000</td>\n",
" <td>Homo sapiens</td>\n",
" <td>Matrix metalloproteinase-1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>156907</td>\n",
" <td>CNC(=O)[C@H](Cc1ccccc1)NC(=O)[C@H](CC(C)C)N[C@...</td>\n",
" <td>CHEMBL330729</td>\n",
" <td>101742</td>\n",
" <td>None</td>\n",
" <td>6.82</td>\n",
" <td>Homo sapiens</td>\n",
" <td>13000</td>\n",
" <td>Homo sapiens</td>\n",
" <td>Matrix metalloproteinase-1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" molregno canonical_smiles chembl_id \\\n",
"0 97 COc1cc2nc(nc(N)c2cc1OC)N3CCN(CC3)C(=O)c4occc4 CHEMBL2 \n",
"1 5985 CCc1c2CN3C(=O)C4=C(C=C3c2nc5ccc(OC(=O)N6CCC(CC... CHEMBL481 \n",
"2 33373 COc1cc2nc(nc(N)c2cc1OC)N3CCN(CC3)C(=O)C4COc5cc... CHEMBL707 \n",
"3 69926 CC[C@H](N[C@@H](CC(C)C)C(=O)N[C@@H](Cc1ccccc1)... CHEMBL47907 \n",
"4 156907 CNC(=O)[C@H](Cc1ccccc1)NC(=O)[C@H](CC(C)C)N[C@... CHEMBL330729 \n",
"\n",
" assay_id activity_comment pchembl_value assay_organism tid organism \\\n",
"0 774675 None 5.39 Homo sapiens 13000 Homo sapiens \n",
"1 774675 None 5.22 Homo sapiens 13000 Homo sapiens \n",
"2 774675 None 5.24 Homo sapiens 13000 Homo sapiens \n",
"3 101742 None 6.64 Homo sapiens 13000 Homo sapiens \n",
"4 101742 None 6.82 Homo sapiens 13000 Homo sapiens \n",
"\n",
" pref_name \n",
"0 Matrix metalloproteinase-1 \n",
"1 Matrix metalloproteinase-1 \n",
"2 Matrix metalloproteinase-1 \n",
"3 Matrix metalloproteinase-1 \n",
"4 Matrix metalloproteinase-1 "
]
},
"execution_count": 146,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df4.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.1"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment