ljmartin/chembl_assay_download.ipynb

## chembl_assay_download.ipynb

      
Display the source blob

    
Display the rendered blob

    
    Raw
  

              chembl_assay_download.ipynb
            
          
      Loading

      Sorry, something went wrong. Reload?
      Sorry, we cannot display this file.
      Sorry, this file is invalid so it cannot be displayed.
      
          Viewer requires iframe.
      
    
## gistfile1.txt
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import bq_helper\n",
    "from bq_helper import BigQueryHelper\n",
    "\n",
    "#you will need to set credentials to use google's BigQuery\n",
    "#see: https://cloud.google.com/docs/authentication/getting-started"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# How to download activity data from ChEMBL from the cloud\n",
    "\n",
    "Google performs the queries and returns only the relevant info. Saves ~15GB since you don't need a local copy of the chembl database. Takes about 10-15mins. Must have bq_helper, see https://github.com/SohierDane/BigQuery_Helper"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Set up a BigQuery helper instance:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "ebi_chembl = bq_helper.BigQueryHelper(active_project=\"patents-public-data\",\n",
    "                                   dataset_name=\"ebi_chembl\")\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### View available tables as sanity check:\n",
    "Just showing first ten here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 147,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['action_type',\n",
       " 'action_type_23',\n",
       " 'action_type_24',\n",
       " 'activities',\n",
       " 'activities_23',\n",
       " 'activities_24',\n",
       " 'activity_properties',\n",
       " 'activity_properties_24',\n",
       " 'activity_smid',\n",
       " 'activity_smid_24']"
      ]
     },
     "execution_count": 147,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# View table names under the ebi_chembl data table\n",
    "bq_assistant = BigQueryHelper(\"patents-public-data\", \"ebi_chembl\")\n",
    "bq_assistant.list_tables()[:10]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### View the available columns in a given table:\n",
    "\n",
    "This helped with knowing what to ask for in the queries below. Example here prints the columns for the 'target_dictionary' table."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 117,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['tid', 'target_type', 'pref_name', 'tax_id', 'organism', 'chembl_id',\n",
       "       'species_group_flag'],\n",
       "      dtype='object')"
      ]
     },
     "execution_count": 117,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "\n",
    "# View the first three rows of the tissue_dictionary_23 data table\n",
    "b =bq_assistant.head(\"target_dictionary\", num_rows=3)\n",
    "b.columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Save tables for targets, activities, assays, molecules, and structures\n",
    "\n",
    "Uncomment the 'estimate_query' line to check how big the download will be (in GB) before actually saving it.\n",
    "There is a (large) limit of 5TB per month apparently. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 118,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "#Target dictionary:\n",
    "query1 = \"\"\"\n",
    "SELECT \n",
    "    chembl_id, organism, pref_name, tid\n",
    "FROM\n",
    "  `patents-public-data.ebi_chembl.target_dictionary`\n",
    "WHERE\n",
    "    organism in ('Homo sapiens')\n",
    "\"\"\"\n",
    "\n",
    "#bq_assistant.estimate_query_size(query1)\n",
    "target_dict = ebi_chembl.query_to_pandas_safe(query1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 90,
   "metadata": {},
   "outputs": [],
   "source": [
    "##Activities:\n",
    "\n",
    "query1 = \"\"\"\n",
    "SELECT\n",
    "    assay_id, molregno, activity_comment, pchembl_value\n",
    "FROM\n",
    "  `patents-public-data.ebi_chembl.activities`\n",
    "WHERE\n",
    "    activity_comment in ('Active', 'active', 'Partial agonist', 'Agonist', 'Antagonist') or pchembl_value>\"5\"\n",
    "    \n",
    "\"\"\"\n",
    "\n",
    "#bq_assistant.estimate_query_size(query1)\n",
    "activities = ebi_chembl.query_to_pandas_safe(query1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 108,
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "##Assays\n",
    "\n",
    "query1 = \"\"\"\n",
    "SELECT\n",
    "    chembl_id, assay_id, assay_organism, tid\n",
    "FROM\n",
    "  `patents-public-data.ebi_chembl.assays`\n",
    "WHERE\n",
    "    assay_organism in ('Homo sapiens')\n",
    "\n",
    "\"\"\"\n",
    "\n",
    "#bq_assistant.estimate_query_size(query1)\n",
    "assays = ebi_chembl.query_to_pandas_safe(query1)\n",
    "      "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 94,
   "metadata": {},
   "outputs": [],
   "source": [
    "##Molecule dictionary:\n",
    "\n",
    "query1 = \"\"\"\n",
    "SELECT\n",
    "    molregno, chembl_id\n",
    "FROM\n",
    "  `patents-public-data.ebi_chembl.molecule_dictionary`\n",
    "\n",
    "\"\"\"\n",
    "#bq_assistant.estimate_query_size(query1)\n",
    "molecule_dict = ebi_chembl.query_to_pandas_safe(query1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 95,
   "metadata": {},
   "outputs": [],
   "source": [
    "##Compound structures:\n",
    "\n",
    "query1 = \"\"\"\n",
    "SELECT\n",
    "    molregno, canonical_smiles\n",
    "FROM\n",
    "  `patents-public-data.ebi_chembl.compound_structures`\n",
    "\n",
    "\"\"\"\n",
    "bq_assistant.estimate_query_size(query1)\n",
    "compound_structures = ebi_chembl.query_to_pandas_safe(query1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Perform inner joins as you would in SQL:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 124,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.merge(compound_structures, molecule_dict, on='molregno')\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 131,
   "metadata": {},
   "outputs": [],
   "source": [
    "df2 = pd.merge(df, activities, on='molregno')\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 140,
   "metadata": {},
   "outputs": [],
   "source": [
    "assays = assays.drop('chembl_id', axis=1)\n",
    "df3 = pd.merge(df2, assays, on='assay_id')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 143,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "target_dict = target_dict.drop('chembl_id', axis=1)\n",
    "df4 = pd.merge(df3, target_dict, on='tid')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Print out activites:\n",
    "The columns here contain enough data to parse and build a virtual screening or do any other analysis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 146,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>molregno</th>\n",
       "      <th>canonical_smiles</th>\n",
       "      <th>chembl_id</th>\n",
       "      <th>assay_id</th>\n",
       "      <th>activity_comment</th>\n",
       "      <th>pchembl_value</th>\n",
       "      <th>assay_organism</th>\n",
       "      <th>tid</th>\n",
       "      <th>organism</th>\n",
       "      <th>pref_name</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>97</td>\n",
       "      <td>COc1cc2nc(nc(N)c2cc1OC)N3CCN(CC3)C(=O)c4occc4</td>\n",
       "      <td>CHEMBL2</td>\n",
       "      <td>774675</td>\n",
       "      <td>None</td>\n",
       "      <td>5.39</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>13000</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>Matrix metalloproteinase-1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>5985</td>\n",
       "      <td>CCc1c2CN3C(=O)C4=C(C=C3c2nc5ccc(OC(=O)N6CCC(CC...</td>\n",
       "      <td>CHEMBL481</td>\n",
       "      <td>774675</td>\n",
       "      <td>None</td>\n",
       "      <td>5.22</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>13000</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>Matrix metalloproteinase-1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>33373</td>\n",
       "      <td>COc1cc2nc(nc(N)c2cc1OC)N3CCN(CC3)C(=O)C4COc5cc...</td>\n",
       "      <td>CHEMBL707</td>\n",
       "      <td>774675</td>\n",
       "      <td>None</td>\n",
       "      <td>5.24</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>13000</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>Matrix metalloproteinase-1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>69926</td>\n",
       "      <td>CC[C@H](N[C@@H](CC(C)C)C(=O)N[C@@H](Cc1ccccc1)...</td>\n",
       "      <td>CHEMBL47907</td>\n",
       "      <td>101742</td>\n",
       "      <td>None</td>\n",
       "      <td>6.64</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>13000</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>Matrix metalloproteinase-1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>156907</td>\n",
       "      <td>CNC(=O)[C@H](Cc1ccccc1)NC(=O)[C@H](CC(C)C)N[C@...</td>\n",
       "      <td>CHEMBL330729</td>\n",
       "      <td>101742</td>\n",
       "      <td>None</td>\n",
       "      <td>6.82</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>13000</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>Matrix metalloproteinase-1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  molregno                                   canonical_smiles     chembl_id  \\\n",
       "0       97      COc1cc2nc(nc(N)c2cc1OC)N3CCN(CC3)C(=O)c4occc4       CHEMBL2   \n",
       "1     5985  CCc1c2CN3C(=O)C4=C(C=C3c2nc5ccc(OC(=O)N6CCC(CC...     CHEMBL481   \n",
       "2    33373  COc1cc2nc(nc(N)c2cc1OC)N3CCN(CC3)C(=O)C4COc5cc...     CHEMBL707   \n",
       "3    69926  CC[C@H](N[C@@H](CC(C)C)C(=O)N[C@@H](Cc1ccccc1)...   CHEMBL47907   \n",
       "4   156907  CNC(=O)[C@H](Cc1ccccc1)NC(=O)[C@H](CC(C)C)N[C@...  CHEMBL330729   \n",
       "\n",
       "  assay_id activity_comment pchembl_value assay_organism    tid      organism  \\\n",
       "0   774675             None          5.39   Homo sapiens  13000  Homo sapiens   \n",
       "1   774675             None          5.22   Homo sapiens  13000  Homo sapiens   \n",
       "2   774675             None          5.24   Homo sapiens  13000  Homo sapiens   \n",
       "3   101742             None          6.64   Homo sapiens  13000  Homo sapiens   \n",
       "4   101742             None          6.82   Homo sapiens  13000  Homo sapiens   \n",
       "\n",
       "                    pref_name  \n",
       "0  Matrix metalloproteinase-1  \n",
       "1  Matrix metalloproteinase-1  \n",
       "2  Matrix metalloproteinase-1  \n",
       "3  Matrix metalloproteinase-1  \n",
       "4  Matrix metalloproteinase-1  "
      ]
     },
     "execution_count": 146,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df4.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
	{
	"cells": [
	{
	"cell_type": "code",
	"execution_count": 1,
	"metadata": {},
	"outputs": [],
	"source": [
	"import pandas as pd\n",
	"import bq_helper\n",
	"from bq_helper import BigQueryHelper\n",
	"\n",
	"#you will need to set credentials to use google's BigQuery\n",
	"#see: https://cloud.google.com/docs/authentication/getting-started"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"# How to download activity data from ChEMBL from the cloud\n",
	"\n",
	"Google performs the queries and returns only the relevant info. Saves ~15GB since you don't need a local copy of the chembl database. Takes about 10-15mins. Must have bq_helper, see https://github.com/SohierDane/BigQuery_Helper"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"### Set up a BigQuery helper instance:"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 8,
	"metadata": {},
	"outputs": [],
	"source": [
	"ebi_chembl = bq_helper.BigQueryHelper(active_project=\"patents-public-data\",\n",
	" dataset_name=\"ebi_chembl\")\n",
	"\n"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"### View available tables as sanity check:\n",
	"Just showing first ten here"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 147,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"['action_type',\n",
	" 'action_type_23',\n",
	" 'action_type_24',\n",
	" 'activities',\n",
	" 'activities_23',\n",
	" 'activities_24',\n",
	" 'activity_properties',\n",
	" 'activity_properties_24',\n",
	" 'activity_smid',\n",
	" 'activity_smid_24']"
	]
	},
	"execution_count": 147,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"# View table names under the ebi_chembl data table\n",
	"bq_assistant = BigQueryHelper(\"patents-public-data\", \"ebi_chembl\")\n",
	"bq_assistant.list_tables()[:10]"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"### View the available columns in a given table:\n",
	"\n",
	"This helped with knowing what to ask for in the queries below. Example here prints the columns for the 'target_dictionary' table."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 117,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"Index(['tid', 'target_type', 'pref_name', 'tax_id', 'organism', 'chembl_id',\n",
	" 'species_group_flag'],\n",
	" dtype='object')"
	]
	},
	"execution_count": 117,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"\n",
	"# View the first three rows of the tissue_dictionary_23 data table\n",
	"b =bq_assistant.head(\"target_dictionary\", num_rows=3)\n",
	"b.columns"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"# Save tables for targets, activities, assays, molecules, and structures\n",
	"\n",
	"Uncomment the 'estimate_query' line to check how big the download will be (in GB) before actually saving it.\n",
	"There is a (large) limit of 5TB per month apparently. "
	]
	},
	{
	"cell_type": "code",
	"execution_count": 118,
	"metadata": {
	"scrolled": false
	},
	"outputs": [],
	"source": [
	"#Target dictionary:\n",
	"query1 = \"\"\"\n",
	"SELECT \n",
	" chembl_id, organism, pref_name, tid\n",
	"FROM\n",
	" `patents-public-data.ebi_chembl.target_dictionary`\n",
	"WHERE\n",
	" organism in ('Homo sapiens')\n",
	"\"\"\"\n",
	"\n",
	"#bq_assistant.estimate_query_size(query1)\n",
	"target_dict = ebi_chembl.query_to_pandas_safe(query1)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 90,
	"metadata": {},
	"outputs": [],
	"source": [
	"##Activities:\n",
	"\n",
	"query1 = \"\"\"\n",
	"SELECT\n",
	" assay_id, molregno, activity_comment, pchembl_value\n",
	"FROM\n",
	" `patents-public-data.ebi_chembl.activities`\n",
	"WHERE\n",
	" activity_comment in ('Active', 'active', 'Partial agonist', 'Agonist', 'Antagonist') or pchembl_value>\"5\"\n",
	" \n",
	"\"\"\"\n",
	"\n",
	"#bq_assistant.estimate_query_size(query1)\n",
	"activities = ebi_chembl.query_to_pandas_safe(query1)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 108,
	"metadata": {},
	"outputs": [],
	"source": [
	"\n",
	"##Assays\n",
	"\n",
	"query1 = \"\"\"\n",
	"SELECT\n",
	" chembl_id, assay_id, assay_organism, tid\n",
	"FROM\n",
	" `patents-public-data.ebi_chembl.assays`\n",
	"WHERE\n",
	" assay_organism in ('Homo sapiens')\n",
	"\n",
	"\"\"\"\n",
	"\n",
	"#bq_assistant.estimate_query_size(query1)\n",
	"assays = ebi_chembl.query_to_pandas_safe(query1)\n",
	" "
	]
	},
	{
	"cell_type": "code",
	"execution_count": 94,
	"metadata": {},
	"outputs": [],
	"source": [
	"##Molecule dictionary:\n",
	"\n",
	"query1 = \"\"\"\n",
	"SELECT\n",
	" molregno, chembl_id\n",
	"FROM\n",
	" `patents-public-data.ebi_chembl.molecule_dictionary`\n",
	"\n",
	"\"\"\"\n",
	"#bq_assistant.estimate_query_size(query1)\n",
	"molecule_dict = ebi_chembl.query_to_pandas_safe(query1)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 95,
	"metadata": {},
	"outputs": [],
	"source": [
	"##Compound structures:\n",
	"\n",
	"query1 = \"\"\"\n",
	"SELECT\n",
	" molregno, canonical_smiles\n",
	"FROM\n",
	" `patents-public-data.ebi_chembl.compound_structures`\n",
	"\n",
	"\"\"\"\n",
	"bq_assistant.estimate_query_size(query1)\n",
	"compound_structures = ebi_chembl.query_to_pandas_safe(query1)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"# Perform inner joins as you would in SQL:"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 124,
	"metadata": {},
	"outputs": [],
	"source": [
	"df = pd.merge(compound_structures, molecule_dict, on='molregno')\n"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 131,
	"metadata": {},
	"outputs": [],
	"source": [
	"df2 = pd.merge(df, activities, on='molregno')\n"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 140,
	"metadata": {},
	"outputs": [],
	"source": [
	"assays = assays.drop('chembl_id', axis=1)\n",
	"df3 = pd.merge(df2, assays, on='assay_id')"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 143,
	"metadata": {
	"scrolled": true
	},
	"outputs": [],
	"source": [
	"target_dict = target_dict.drop('chembl_id', axis=1)\n",
	"df4 = pd.merge(df3, target_dict, on='tid')"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"# Print out activites:\n",
	"The columns here contain enough data to parse and build a virtual screening or do any other analysis."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 146,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/html": [
	"<div>\n",
	"<style scoped>\n",
	" .dataframe tbody tr th:only-of-type {\n",
	" vertical-align: middle;\n",
	" }\n",
	"\n",
	" .dataframe tbody tr th {\n",
	" vertical-align: top;\n",
	" }\n",
	"\n",
	" .dataframe thead th {\n",
	" text-align: right;\n",
	" }\n",
	"</style>\n",
	"<table border=\"1\" class=\"dataframe\">\n",
	" <thead>\n",
	" <tr style=\"text-align: right;\">\n",
	" <th></th>\n",
	" <th>molregno</th>\n",
	" <th>canonical_smiles</th>\n",
	" <th>chembl_id</th>\n",
	" <th>assay_id</th>\n",
	" <th>activity_comment</th>\n",
	" <th>pchembl_value</th>\n",
	" <th>assay_organism</th>\n",
	" <th>tid</th>\n",
	" <th>organism</th>\n",
	" <th>pref_name</th>\n",
	" </tr>\n",
	" </thead>\n",
	" <tbody>\n",
	" <tr>\n",
	" <th>0</th>\n",
	" <td>97</td>\n",
	" <td>COc1cc2nc(nc(N)c2cc1OC)N3CCN(CC3)C(=O)c4occc4</td>\n",
	" <td>CHEMBL2</td>\n",
	" <td>774675</td>\n",
	" <td>None</td>\n",
	" <td>5.39</td>\n",
	" <td>Homo sapiens</td>\n",
	" <td>13000</td>\n",
	" <td>Homo sapiens</td>\n",
	" <td>Matrix metalloproteinase-1</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>1</th>\n",
	" <td>5985</td>\n",
	" <td>CCc1c2CN3C(=O)C4=C(C=C3c2nc5ccc(OC(=O)N6CCC(CC...</td>\n",
	" <td>CHEMBL481</td>\n",
	" <td>774675</td>\n",
	" <td>None</td>\n",
	" <td>5.22</td>\n",
	" <td>Homo sapiens</td>\n",
	" <td>13000</td>\n",
	" <td>Homo sapiens</td>\n",
	" <td>Matrix metalloproteinase-1</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>2</th>\n",
	" <td>33373</td>\n",
	" <td>COc1cc2nc(nc(N)c2cc1OC)N3CCN(CC3)C(=O)C4COc5cc...</td>\n",
	" <td>CHEMBL707</td>\n",
	" <td>774675</td>\n",
	" <td>None</td>\n",
	" <td>5.24</td>\n",
	" <td>Homo sapiens</td>\n",
	" <td>13000</td>\n",
	" <td>Homo sapiens</td>\n",
	" <td>Matrix metalloproteinase-1</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>3</th>\n",
	" <td>69926</td>\n",
	" <td>CC[C@H](N[C@@H](CC(C)C)C(=O)N[C@@H](Cc1ccccc1)...</td>\n",
	" <td>CHEMBL47907</td>\n",
	" <td>101742</td>\n",
	" <td>None</td>\n",
	" <td>6.64</td>\n",
	" <td>Homo sapiens</td>\n",
	" <td>13000</td>\n",
	" <td>Homo sapiens</td>\n",
	" <td>Matrix metalloproteinase-1</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>4</th>\n",
	" <td>156907</td>\n",
	" <td>CNC(=O)[C@H](Cc1ccccc1)NC(=O)[C@H](CC(C)C)N[C@...</td>\n",
	" <td>CHEMBL330729</td>\n",
	" <td>101742</td>\n",
	" <td>None</td>\n",
	" <td>6.82</td>\n",
	" <td>Homo sapiens</td>\n",
	" <td>13000</td>\n",
	" <td>Homo sapiens</td>\n",
	" <td>Matrix metalloproteinase-1</td>\n",
	" </tr>\n",
	" </tbody>\n",
	"</table>\n",
	"</div>"
	],
	"text/plain": [
	" molregno canonical_smiles chembl_id \\\n",
	"0 97 COc1cc2nc(nc(N)c2cc1OC)N3CCN(CC3)C(=O)c4occc4 CHEMBL2 \n",
	"1 5985 CCc1c2CN3C(=O)C4=C(C=C3c2nc5ccc(OC(=O)N6CCC(CC... CHEMBL481 \n",
	"2 33373 COc1cc2nc(nc(N)c2cc1OC)N3CCN(CC3)C(=O)C4COc5cc... CHEMBL707 \n",
	"3 69926 CC[C@H](N[C@@H](CC(C)C)C(=O)N[C@@H](Cc1ccccc1)... CHEMBL47907 \n",
	"4 156907 CNC(=O)[C@H](Cc1ccccc1)NC(=O)[C@H](CC(C)C)N[C@... CHEMBL330729 \n",
	"\n",
	" assay_id activity_comment pchembl_value assay_organism tid organism \\\n",
	"0 774675 None 5.39 Homo sapiens 13000 Homo sapiens \n",
	"1 774675 None 5.22 Homo sapiens 13000 Homo sapiens \n",
	"2 774675 None 5.24 Homo sapiens 13000 Homo sapiens \n",
	"3 101742 None 6.64 Homo sapiens 13000 Homo sapiens \n",
	"4 101742 None 6.82 Homo sapiens 13000 Homo sapiens \n",
	"\n",
	" pref_name \n",
	"0 Matrix metalloproteinase-1 \n",
	"1 Matrix metalloproteinase-1 \n",
	"2 Matrix metalloproteinase-1 \n",
	"3 Matrix metalloproteinase-1 \n",
	"4 Matrix metalloproteinase-1 "
	]
	},
	"execution_count": 146,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"df4.head()"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": []
	}
	],
	"metadata": {
	"kernelspec": {
	"display_name": "Python 3",
	"language": "python",
	"name": "python3"
	},
	"language_info": {
	"codemirror_mode": {
	"name": "ipython",
	"version": 3
	},
	"file_extension": ".py",
	"mimetype": "text/x-python",
	"name": "python",
	"nbconvert_exporter": "python",
	"pygments_lexer": "ipython3",
	"version": "3.8.1"
	}
	},
	"nbformat": 4,
	"nbformat_minor": 4
	}