@iwatobipen
Created April 8, 2022 13:34
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Find UniProt IDs in ChEMBL targets"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ultimately, we are interested in getting [activity data from ChEMBL](/chembl-27/query_local_chembl-27.ipynb). To do that, we need to account for three components:\n",
"\n",
"* The compound being measured\n",
"* The target the compound binds to\n",
"* The assay where this measurement took place\n",
"\n",
"So, to find all activity data stored in ChEMBL that refers to kinases, we have to query for those assays annotated with a certain target.\n",
"\n",
"Each of those three components has a unique ChEMBL ID, but so far we have only obtained UniProt IDs in the `human-kinases` notebook. We need a way to connect UniProt IDs to ChEMBL target IDs. Fortunately, ChEMBL maintains such a map in its FTP releases. We will parse that file and convert it into a dataframe for easy manipulation."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from pathlib import Path\n",
"import urllib.request\n",
"\n",
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"REPO = (Path(_dh[-1]) / \"..\").resolve()\n",
"DATA = REPO / 'data'"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"CHEMBL_VERSION = 30\n",
"url = fr\"ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_{CHEMBL_VERSION}/chembl_uniprot_mapping.txt\""
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>UniprotID</th>\n",
" <th>chembl_targets</th>\n",
" <th>description</th>\n",
" <th>type</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>P21266</td>\n",
" <td>CHEMBL2242</td>\n",
" <td>Glutathione S-transferase Mu 3</td>\n",
" <td>SINGLE PROTEIN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>O00519</td>\n",
" <td>CHEMBL2243</td>\n",
" <td>Anandamide amidohydrolase</td>\n",
" <td>SINGLE PROTEIN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>P19217</td>\n",
" <td>CHEMBL2244</td>\n",
" <td>Estrogen sulfotransferase</td>\n",
" <td>SINGLE PROTEIN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>P97292</td>\n",
" <td>CHEMBL2245</td>\n",
" <td>Histamine H2 receptor</td>\n",
" <td>SINGLE PROTEIN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>P17342</td>\n",
" <td>CHEMBL2247</td>\n",
" <td>Atrial natriuretic peptide receptor C</td>\n",
" <td>SINGLE PROTEIN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13526</th>\n",
" <td>A0A0J5PZ55</td>\n",
" <td>CHEMBL4630893</td>\n",
" <td>Dihydroorotate dehydrogenase (quinone), mitoch...</td>\n",
" <td>SINGLE PROTEIN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13527</th>\n",
" <td>Q8NI17</td>\n",
" <td>CHEMBL4630894</td>\n",
" <td>Interleukin-31 receptor subunit alpha</td>\n",
" <td>SINGLE PROTEIN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13528</th>\n",
" <td>P43630</td>\n",
" <td>CHEMBL4630895</td>\n",
" <td>Killer cell immunoglobulin-like receptor 3DL2</td>\n",
" <td>SINGLE PROTEIN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13529</th>\n",
" <td>A0A0H3HP34</td>\n",
" <td>CHEMBL4630896</td>\n",
" <td>Enoyl-[acyl-carrier-protein] reductase [NADH]</td>\n",
" <td>SINGLE PROTEIN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13530</th>\n",
" <td>P28887</td>\n",
" <td>CHEMBL4630897</td>\n",
" <td>RNA-directed RNA polymerase L</td>\n",
" <td>SINGLE PROTEIN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>13531 rows × 4 columns</p>\n",
"</div>"
],
"text/plain": [
" UniprotID chembl_targets \\\n",
"0 P21266 CHEMBL2242 \n",
"1 O00519 CHEMBL2243 \n",
"2 P19217 CHEMBL2244 \n",
"3 P97292 CHEMBL2245 \n",
"4 P17342 CHEMBL2247 \n",
"... ... ... \n",
"13526 A0A0J5PZ55 CHEMBL4630893 \n",
"13527 Q8NI17 CHEMBL4630894 \n",
"13528 P43630 CHEMBL4630895 \n",
"13529 A0A0H3HP34 CHEMBL4630896 \n",
"13530 P28887 CHEMBL4630897 \n",
"\n",
" description type \n",
"0 Glutathione S-transferase Mu 3 SINGLE PROTEIN \n",
"1 Anandamide amidohydrolase SINGLE PROTEIN \n",
"2 Estrogen sulfotransferase SINGLE PROTEIN \n",
"3 Histamine H2 receptor SINGLE PROTEIN \n",
"4 Atrial natriuretic peptide receptor C SINGLE PROTEIN \n",
"... ... ... \n",
"13526 Dihydroorotate dehydrogenase (quinone), mitoch... SINGLE PROTEIN \n",
"13527 Interleukin-31 receptor subunit alpha SINGLE PROTEIN \n",
"13528 Killer cell immunoglobulin-like receptor 3DL2 SINGLE PROTEIN \n",
"13529 Enoyl-[acyl-carrier-protein] reductase [NADH] SINGLE PROTEIN \n",
"13530 RNA-directed RNA polymerase L SINGLE PROTEIN \n",
"\n",
"[13531 rows x 4 columns]"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"with urllib.request.urlopen(url) as response:\n",
" uniprot_map = pd.read_csv(response, sep=\"\\t\", skiprows=[0], names=[\"UniprotID\", \"chembl_targets\", \"description\", \"type\"])\n",
"uniprot_map"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We join this new information to the aggregated list of human kinases from `human-kinases` (all of them, regardless of the source):"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>UniprotID</th>\n",
" <th>Name</th>\n",
" <th>kinhub</th>\n",
" <th>klifs</th>\n",
" <th>pkinfam</th>\n",
" <th>reviewed_uniprot</th>\n",
" <th>dunbrack_msa</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>A0A0B4J2F2</td>\n",
" <td>SIK1B</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>A4QPH2</td>\n",
" <td>PI4KAP2</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>B5MCJ9</td>\n",
" <td>TRIM66</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>O00141</td>\n",
" <td>SGK1</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>O00238</td>\n",
" <td>BMPR1B|BMR1B</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>547</th>\n",
" <td>Q9Y616</td>\n",
" <td>IRAK3</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>548</th>\n",
" <td>Q9Y6E0</td>\n",
" <td>STK24</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>549</th>\n",
" <td>Q9Y6M4</td>\n",
" <td>CSNK1G3|KC1G3</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>550</th>\n",
" <td>Q9Y6R4</td>\n",
" <td>M3K4|MAP3K4</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>551</th>\n",
" <td>Q9Y6S9</td>\n",
" <td>RPS6KL1|RPKL1</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>552 rows × 7 columns</p>\n",
"</div>"
],
"text/plain": [
" UniprotID Name kinhub klifs pkinfam reviewed_uniprot \\\n",
"0 A0A0B4J2F2 SIK1B False False True True \n",
"1 A4QPH2 PI4KAP2 False True False False \n",
"2 B5MCJ9 TRIM66 True False False False \n",
"3 O00141 SGK1 True True True True \n",
"4 O00238 BMPR1B|BMR1B True True True True \n",
".. ... ... ... ... ... ... \n",
"547 Q9Y616 IRAK3 True True True True \n",
"548 Q9Y6E0 STK24 True True True True \n",
"549 Q9Y6M4 CSNK1G3|KC1G3 True True True True \n",
"550 Q9Y6R4 M3K4|MAP3K4 True True True True \n",
"551 Q9Y6S9 RPS6KL1|RPKL1 True True True True \n",
"\n",
" dunbrack_msa \n",
"0 True \n",
"1 False \n",
"2 False \n",
"3 True \n",
"4 True \n",
".. ... \n",
"547 True \n",
"548 True \n",
"549 True \n",
"550 True \n",
"551 True \n",
"\n",
"[552 rows x 7 columns]"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"kinases = pd.read_csv(DATA / \"human_kinases.aggregated.csv\", index_col=0)\n",
"kinases"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We are only interested in those kinases present in these datasets:\n",
"\n",
"* KinHub\n",
"* KLIFS\n",
"* PKinFam\n",
"* Dunbrack's MSA"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>UniprotID</th>\n",
" <th>Name</th>\n",
" <th>kinhub</th>\n",
" <th>klifs</th>\n",
" <th>pkinfam</th>\n",
" <th>reviewed_uniprot</th>\n",
" <th>dunbrack_msa</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>A0A0B4J2F2</td>\n",
" <td>SIK1B</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>A4QPH2</td>\n",
" <td>PI4KAP2</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>B5MCJ9</td>\n",
" <td>TRIM66</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>O00141</td>\n",
" <td>SGK1</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>O00238</td>\n",
" <td>BMPR1B|BMR1B</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>547</th>\n",
" <td>Q9Y616</td>\n",
" <td>IRAK3</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>548</th>\n",
" <td>Q9Y6E0</td>\n",
" <td>STK24</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>549</th>\n",
" <td>Q9Y6M4</td>\n",
" <td>CSNK1G3|KC1G3</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>550</th>\n",
" <td>Q9Y6R4</td>\n",
" <td>M3K4|MAP3K4</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>551</th>\n",
" <td>Q9Y6S9</td>\n",
" <td>RPS6KL1|RPKL1</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>552 rows × 7 columns</p>\n",
"</div>"
],
"text/plain": [
" UniprotID Name kinhub klifs pkinfam reviewed_uniprot \\\n",
"0 A0A0B4J2F2 SIK1B False False True True \n",
"1 A4QPH2 PI4KAP2 False True False False \n",
"2 B5MCJ9 TRIM66 True False False False \n",
"3 O00141 SGK1 True True True True \n",
"4 O00238 BMPR1B|BMR1B True True True True \n",
".. ... ... ... ... ... ... \n",
"547 Q9Y616 IRAK3 True True True True \n",
"548 Q9Y6E0 STK24 True True True True \n",
"549 Q9Y6M4 CSNK1G3|KC1G3 True True True True \n",
"550 Q9Y6R4 M3K4|MAP3K4 True True True True \n",
"551 Q9Y6S9 RPS6KL1|RPKL1 True True True True \n",
"\n",
" dunbrack_msa \n",
"0 True \n",
"1 False \n",
"2 False \n",
"3 True \n",
"4 True \n",
".. ... \n",
"547 True \n",
"548 True \n",
"549 True \n",
"550 True \n",
"551 True \n",
"\n",
"[552 rows x 7 columns]"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"kinases_subset = kinases[kinases[[\"kinhub\", \"klifs\", \"pkinfam\", \"dunbrack_msa\"]].sum(axis=1) > 0].copy()\n",
"kinases_subset"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We would also like to preserve the provenance of the Uniprot assignment, so we will group the provenance columns in a single one now."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>UniprotID</th>\n",
" <th>Name</th>\n",
" <th>kinhub</th>\n",
" <th>klifs</th>\n",
" <th>pkinfam</th>\n",
" <th>reviewed_uniprot</th>\n",
" <th>dunbrack_msa</th>\n",
" <th>origin</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>A0A0B4J2F2</td>\n",
" <td>SIK1B</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>pkinfam|reviewed_uniprot|dunbrack_msa</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>A4QPH2</td>\n",
" <td>PI4KAP2</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>klifs</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>B5MCJ9</td>\n",
" <td>TRIM66</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>kinhub</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>O00141</td>\n",
" <td>SGK1</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>O00238</td>\n",
" <td>BMPR1B|BMR1B</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>547</th>\n",
" <td>Q9Y616</td>\n",
" <td>IRAK3</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>548</th>\n",
" <td>Q9Y6E0</td>\n",
" <td>STK24</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>549</th>\n",
" <td>Q9Y6M4</td>\n",
" <td>CSNK1G3|KC1G3</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>550</th>\n",
" <td>Q9Y6R4</td>\n",
" <td>M3K4|MAP3K4</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>551</th>\n",
" <td>Q9Y6S9</td>\n",
" <td>RPS6KL1|RPKL1</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>552 rows × 8 columns</p>\n",
"</div>"
],
"text/plain": [
" UniprotID Name kinhub klifs pkinfam reviewed_uniprot \\\n",
"0 A0A0B4J2F2 SIK1B False False True True \n",
"1 A4QPH2 PI4KAP2 False True False False \n",
"2 B5MCJ9 TRIM66 True False False False \n",
"3 O00141 SGK1 True True True True \n",
"4 O00238 BMPR1B|BMR1B True True True True \n",
".. ... ... ... ... ... ... \n",
"547 Q9Y616 IRAK3 True True True True \n",
"548 Q9Y6E0 STK24 True True True True \n",
"549 Q9Y6M4 CSNK1G3|KC1G3 True True True True \n",
"550 Q9Y6R4 M3K4|MAP3K4 True True True True \n",
"551 Q9Y6S9 RPS6KL1|RPKL1 True True True True \n",
"\n",
" dunbrack_msa origin \n",
"0 True pkinfam|reviewed_uniprot|dunbrack_msa \n",
"1 False klifs \n",
"2 False kinhub \n",
"3 True kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack... \n",
"4 True kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack... \n",
".. ... ... \n",
"547 True kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack... \n",
"548 True kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack... \n",
"549 True kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack... \n",
"550 True kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack... \n",
"551 True kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack... \n",
"\n",
"[552 rows x 8 columns]"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"kinases_subset[\"origin\"] = kinases_subset.apply(lambda s: '|'.join([k for k in [\"kinhub\", \"klifs\", \"pkinfam\", \"reviewed_uniprot\", \"dunbrack_msa\"] if s[k]]), axis=1)\n",
"kinases_subset"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can now merge the needed columns based on the `UniprotID` key."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>UniprotID</th>\n",
" <th>Name</th>\n",
" <th>chembl_targets</th>\n",
" <th>type</th>\n",
" <th>origin</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>A4QPH2</td>\n",
" <td>PI4KAP2</td>\n",
" <td>CHEMBL4105789</td>\n",
" <td>SINGLE PROTEIN</td>\n",
" <td>klifs</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>A4QPH2</td>\n",
" <td>PI4KAP2</td>\n",
" <td>CHEMBL3038509</td>\n",
" <td>PROTEIN COMPLEX</td>\n",
" <td>klifs</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>O00141</td>\n",
" <td>SGK1</td>\n",
" <td>CHEMBL2343</td>\n",
" <td>SINGLE PROTEIN</td>\n",
" <td>kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>O00238</td>\n",
" <td>BMPR1B|BMR1B</td>\n",
" <td>CHEMBL5476</td>\n",
" <td>SINGLE PROTEIN</td>\n",
" <td>kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>O00311</td>\n",
" <td>CDC7</td>\n",
" <td>CHEMBL2111377</td>\n",
" <td>PROTEIN COMPLEX</td>\n",
" <td>kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1014</th>\n",
" <td>Q9Y616</td>\n",
" <td>IRAK3</td>\n",
" <td>CHEMBL4748234</td>\n",
" <td>PROTEIN-PROTEIN INTERACTION</td>\n",
" <td>kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1015</th>\n",
" <td>Q9Y616</td>\n",
" <td>IRAK3</td>\n",
" <td>CHEMBL4742326</td>\n",
" <td>PROTEIN-PROTEIN INTERACTION</td>\n",
" <td>kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1016</th>\n",
" <td>Q9Y6E0</td>\n",
" <td>STK24</td>\n",
" <td>CHEMBL5082</td>\n",
" <td>SINGLE PROTEIN</td>\n",
" <td>kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1017</th>\n",
" <td>Q9Y6M4</td>\n",
" <td>CSNK1G3|KC1G3</td>\n",
" <td>CHEMBL5084</td>\n",
" <td>SINGLE PROTEIN</td>\n",
" <td>kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1018</th>\n",
" <td>Q9Y6R4</td>\n",
" <td>M3K4|MAP3K4</td>\n",
" <td>CHEMBL4853</td>\n",
" <td>SINGLE PROTEIN</td>\n",
" <td>kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>1019 rows × 5 columns</p>\n",
"</div>"
],
"text/plain": [
" UniprotID Name chembl_targets type \\\n",
"0 A4QPH2 PI4KAP2 CHEMBL4105789 SINGLE PROTEIN \n",
"1 A4QPH2 PI4KAP2 CHEMBL3038509 PROTEIN COMPLEX \n",
"2 O00141 SGK1 CHEMBL2343 SINGLE PROTEIN \n",
"3 O00238 BMPR1B|BMR1B CHEMBL5476 SINGLE PROTEIN \n",
"4 O00311 CDC7 CHEMBL2111377 PROTEIN COMPLEX \n",
"... ... ... ... ... \n",
"1014 Q9Y616 IRAK3 CHEMBL4748234 PROTEIN-PROTEIN INTERACTION \n",
"1015 Q9Y616 IRAK3 CHEMBL4742326 PROTEIN-PROTEIN INTERACTION \n",
"1016 Q9Y6E0 STK24 CHEMBL5082 SINGLE PROTEIN \n",
"1017 Q9Y6M4 CSNK1G3|KC1G3 CHEMBL5084 SINGLE PROTEIN \n",
"1018 Q9Y6R4 M3K4|MAP3K4 CHEMBL4853 SINGLE PROTEIN \n",
"\n",
" origin \n",
"0 klifs \n",
"1 klifs \n",
"2 kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack... \n",
"3 kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack... \n",
"4 kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack... \n",
"... ... \n",
"1014 kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack... \n",
"1015 kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack... \n",
"1016 kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack... \n",
"1017 kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack... \n",
"1018 kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack... \n",
"\n",
"[1019 rows x 5 columns]"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"merged = pd.merge(kinases_subset[[\"UniprotID\", \"Name\", \"origin\"]], uniprot_map[[\"UniprotID\", \"chembl_targets\", \"type\"]], how=\"inner\", on='UniprotID')[[\"UniprotID\", \"Name\", \"chembl_targets\", \"type\", \"origin\"]]\n",
"merged"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"~~How is this possible? 969 targets (ChEMBL 28)?!~~\n",
"\n",
"Apparently, there is no 1:1 correspondence between UniProt IDs and ChEMBL target IDs! Some UniProt IDs are included in several ChEMBL targets:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"P11802 17\n",
"P24941 15\n",
"P35968 11\n",
"Q00534 11\n",
"P06493 11\n",
" ..\n",
"O75962 1\n",
"Q59H18 1\n",
"Q5JZY3 1\n",
"Q5MAI5 1\n",
"Q9Y6R4 1\n",
"Name: UniprotID, Length: 496, dtype: int64"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"merged.UniprotID.value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>UniprotID</th>\n",
" <th>Name</th>\n",
" <th>chembl_targets</th>\n",
" <th>type</th>\n",
" <th>origin</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>221</th>\n",
" <td>P11802</td>\n",
" <td>CDK4</td>\n",
" <td>CHEMBL331</td>\n",
" <td>SINGLE PROTEIN</td>\n",
" <td>kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>222</th>\n",
" <td>P11802</td>\n",
" <td>CDK4</td>\n",
" <td>CHEMBL2095942</td>\n",
" <td>PROTEIN COMPLEX</td>\n",
" <td>kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>223</th>\n",
" <td>P11802</td>\n",
" <td>CDK4</td>\n",
" <td>CHEMBL2111326</td>\n",
" <td>SELECTIVITY GROUP</td>\n",
" <td>kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>224</th>\n",
" <td>P11802</td>\n",
" <td>CDK4</td>\n",
" <td>CHEMBL1907601</td>\n",
" <td>PROTEIN COMPLEX</td>\n",
" <td>kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>225</th>\n",
" <td>P11802</td>\n",
" <td>CDK4</td>\n",
" <td>CHEMBL3301385</td>\n",
" <td>PROTEIN COMPLEX</td>\n",
" <td>kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>226</th>\n",
" <td>P11802</td>\n",
" <td>CDK4</td>\n",
" <td>CHEMBL4523686</td>\n",
" <td>PROTEIN-PROTEIN INTERACTION</td>\n",
" <td>kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>227</th>\n",
" <td>P11802</td>\n",
" <td>CDK4</td>\n",
" <td>CHEMBL4523715</td>\n",
" <td>PROTEIN-PROTEIN INTERACTION</td>\n",
" <td>kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>228</th>\n",
" <td>P11802</td>\n",
" <td>CDK4</td>\n",
" <td>CHEMBL4523732</td>\n",
" <td>PROTEIN-PROTEIN INTERACTION</td>\n",
" <td>kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>229</th>\n",
" <td>P11802</td>\n",
" <td>CDK4</td>\n",
" <td>CHEMBL3038472</td>\n",
" <td>PROTEIN COMPLEX</td>\n",
" <td>kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>230</th>\n",
" <td>P11802</td>\n",
" <td>CDK4</td>\n",
" <td>CHEMBL4523963</td>\n",
" <td>SELECTIVITY GROUP</td>\n",
" <td>kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>231</th>\n",
" <td>P11802</td>\n",
" <td>CDK4</td>\n",
" <td>CHEMBL3559691</td>\n",
" <td>PROTEIN FAMILY</td>\n",
" <td>kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>232</th>\n",
" <td>P11802</td>\n",
" <td>CDK4</td>\n",
" <td>CHEMBL3038517</td>\n",
" <td>PROTEIN FAMILY</td>\n",
" <td>kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>233</th>\n",
" <td>P11802</td>\n",
" <td>CDK4</td>\n",
" <td>CHEMBL4106184</td>\n",
" <td>PROTEIN FAMILY</td>\n",
" <td>kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>234</th>\n",
" <td>P11802</td>\n",
" <td>CDK4</td>\n",
" <td>CHEMBL4630748</td>\n",
" <td>PROTEIN-PROTEIN INTERACTION</td>\n",
" <td>kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>235</th>\n",
" <td>P11802</td>\n",
" <td>CDK4</td>\n",
" <td>CHEMBL3885548</td>\n",
" <td>PROTEIN COMPLEX</td>\n",
" <td>kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>236</th>\n",
" <td>P11802</td>\n",
" <td>CDK4</td>\n",
" <td>CHEMBL3885553</td>\n",
" <td>PROTEIN FAMILY</td>\n",
" <td>kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>237</th>\n",
" <td>P11802</td>\n",
" <td>CDK4</td>\n",
" <td>CHEMBL3885554</td>\n",
" <td>PROTEIN COMPLEX</td>\n",
" <td>kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" UniprotID Name chembl_targets type \\\n",
"221 P11802 CDK4 CHEMBL331 SINGLE PROTEIN \n",
"222 P11802 CDK4 CHEMBL2095942 PROTEIN COMPLEX \n",
"223 P11802 CDK4 CHEMBL2111326 SELECTIVITY GROUP \n",
"224 P11802 CDK4 CHEMBL1907601 PROTEIN COMPLEX \n",
"225 P11802 CDK4 CHEMBL3301385 PROTEIN COMPLEX \n",
"226 P11802 CDK4 CHEMBL4523686 PROTEIN-PROTEIN INTERACTION \n",
"227 P11802 CDK4 CHEMBL4523715 PROTEIN-PROTEIN INTERACTION \n",
"228 P11802 CDK4 CHEMBL4523732 PROTEIN-PROTEIN INTERACTION \n",
"229 P11802 CDK4 CHEMBL3038472 PROTEIN COMPLEX \n",
"230 P11802 CDK4 CHEMBL4523963 SELECTIVITY GROUP \n",
"231 P11802 CDK4 CHEMBL3559691 PROTEIN FAMILY \n",
"232 P11802 CDK4 CHEMBL3038517 PROTEIN FAMILY \n",
"233 P11802 CDK4 CHEMBL4106184 PROTEIN FAMILY \n",
"234 P11802 CDK4 CHEMBL4630748 PROTEIN-PROTEIN INTERACTION \n",
"235 P11802 CDK4 CHEMBL3885548 PROTEIN COMPLEX \n",
"236 P11802 CDK4 CHEMBL3885553 PROTEIN FAMILY \n",
"237 P11802 CDK4 CHEMBL3885554 PROTEIN COMPLEX \n",
"\n",
" origin \n",
"221 kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack... \n",
"222 kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack... \n",
"223 kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack... \n",
"224 kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack... \n",
"225 kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack... \n",
"226 kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack... \n",
"227 kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack... \n",
"228 kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack... \n",
"229 kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack... \n",
"230 kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack... \n",
"231 kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack... \n",
"232 kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack... \n",
"233 kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack... \n",
"234 kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack... \n",
"235 kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack... \n",
"236 kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack... \n",
"237 kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack... "
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"merged[merged.UniprotID == \"P11802\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"... and some ChEMBL targets include several kinases (e.g. chimeric proteins):"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>UniprotID</th>\n",
" <th>Name</th>\n",
" <th>chembl_targets</th>\n",
" <th>type</th>\n",
" <th>origin</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>113</th>\n",
" <td>P00519</td>\n",
" <td>ABL1</td>\n",
" <td>CHEMBL2096618</td>\n",
" <td>CHIMERIC PROTEIN</td>\n",
" <td>kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>209</th>\n",
" <td>P11274</td>\n",
" <td>BCR</td>\n",
" <td>CHEMBL2096618</td>\n",
" <td>CHIMERIC PROTEIN</td>\n",
" <td>kinhub|klifs</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" UniprotID Name chembl_targets type \\\n",
"113 P00519 ABL1 CHEMBL2096618 CHIMERIC PROTEIN \n",
"209 P11274 BCR CHEMBL2096618 CHIMERIC PROTEIN \n",
"\n",
" origin \n",
"113 kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack... \n",
"209 kinhub|klifs "
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"merged[merged.chembl_targets == \"CHEMBL2096618\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"These one-to-many mappings arise because ChEMBL targets come in different `type` values, and only some of them describe a single protein:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"SINGLE PROTEIN 494\n",
"PROTEIN FAMILY 247\n",
"PROTEIN COMPLEX 121\n",
"PROTEIN-PROTEIN INTERACTION 108\n",
"SELECTIVITY GROUP 22\n",
"CHIMERIC PROTEIN 16\n",
"PROTEIN COMPLEX GROUP 11\n",
"Name: type, dtype: int64"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"merged.type.value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we keep only the `SINGLE PROTEIN` type:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>UniprotID</th>\n",
" <th>Name</th>\n",
" <th>chembl_targets</th>\n",
" <th>type</th>\n",
" <th>origin</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>A4QPH2</td>\n",
" <td>PI4KAP2</td>\n",
" <td>CHEMBL4105789</td>\n",
" <td>SINGLE PROTEIN</td>\n",
" <td>klifs</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>O00141</td>\n",
" <td>SGK1</td>\n",
" <td>CHEMBL2343</td>\n",
" <td>SINGLE PROTEIN</td>\n",
" <td>kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>O00238</td>\n",
" <td>BMPR1B|BMR1B</td>\n",
" <td>CHEMBL5476</td>\n",
" <td>SINGLE PROTEIN</td>\n",
" <td>kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>O00311</td>\n",
" <td>CDC7</td>\n",
" <td>CHEMBL5443</td>\n",
" <td>SINGLE PROTEIN</td>\n",
" <td>kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>O00329</td>\n",
" <td>PIK3CD</td>\n",
" <td>CHEMBL3130</td>\n",
" <td>SINGLE PROTEIN</td>\n",
" <td>klifs|pkinfam</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1012</th>\n",
" <td>Q9Y5S2</td>\n",
" <td>MRCKB|CDC42BPB</td>\n",
" <td>CHEMBL5052</td>\n",
" <td>SINGLE PROTEIN</td>\n",
" <td>kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1013</th>\n",
" <td>Q9Y616</td>\n",
" <td>IRAK3</td>\n",
" <td>CHEMBL5081</td>\n",
" <td>SINGLE PROTEIN</td>\n",
" <td>kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1016</th>\n",
" <td>Q9Y6E0</td>\n",
" <td>STK24</td>\n",
" <td>CHEMBL5082</td>\n",
" <td>SINGLE PROTEIN</td>\n",
" <td>kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1017</th>\n",
" <td>Q9Y6M4</td>\n",
" <td>CSNK1G3|KC1G3</td>\n",
" <td>CHEMBL5084</td>\n",
" <td>SINGLE PROTEIN</td>\n",
" <td>kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1018</th>\n",
" <td>Q9Y6R4</td>\n",
" <td>M3K4|MAP3K4</td>\n",
" <td>CHEMBL4853</td>\n",
" <td>SINGLE PROTEIN</td>\n",
" <td>kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>494 rows × 5 columns</p>\n",
"</div>"
],
"text/plain": [
" UniprotID Name chembl_targets type \\\n",
"0 A4QPH2 PI4KAP2 CHEMBL4105789 SINGLE PROTEIN \n",
"2 O00141 SGK1 CHEMBL2343 SINGLE PROTEIN \n",
"3 O00238 BMPR1B|BMR1B CHEMBL5476 SINGLE PROTEIN \n",
"5 O00311 CDC7 CHEMBL5443 SINGLE PROTEIN \n",
"7 O00329 PIK3CD CHEMBL3130 SINGLE PROTEIN \n",
"... ... ... ... ... \n",
"1012 Q9Y5S2 MRCKB|CDC42BPB CHEMBL5052 SINGLE PROTEIN \n",
"1013 Q9Y616 IRAK3 CHEMBL5081 SINGLE PROTEIN \n",
"1016 Q9Y6E0 STK24 CHEMBL5082 SINGLE PROTEIN \n",
"1017 Q9Y6M4 CSNK1G3|KC1G3 CHEMBL5084 SINGLE PROTEIN \n",
"1018 Q9Y6R4 M3K4|MAP3K4 CHEMBL4853 SINGLE PROTEIN \n",
"\n",
" origin \n",
"0 klifs \n",
"2 kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack... \n",
"3 kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack... \n",
"5 kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack... \n",
"7 klifs|pkinfam \n",
"... ... \n",
"1012 kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack... \n",
"1013 kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack... \n",
"1016 kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack... \n",
"1017 kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack... \n",
"1018 kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack... \n",
"\n",
"[494 rows x 5 columns]"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"merged[merged.type == \"SINGLE PROTEIN\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"... we end up with 494 records in the output above (the exact number depends on the ChEMBL release), which is more manageable.\n",
"\n",
"For that reason, we will only save the records with `type == \"SINGLE PROTEIN\"`."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"# Keep only single-protein targets and write them to a release-specific CSV file\n",
"merged[merged[\"type\"] == \"SINGLE PROTEIN\"].to_csv(DATA / f\"human_kinases_and_chembl_targets.chembl_{CHEMBL_VERSION}.csv\", index=False)"
]
}
],
"metadata": {
"interpreter": {
"hash": "0300e5a8eaa00f483f78a74a63703ce4e5f45ef77276e605540748ee734ff974"
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.9"
},
"widgets": {
"application/vnd.jupyter.widget-state+json": {
"state": {},
"version_major": 2,
"version_minor": 0
}
}
},
"nbformat": 4,
"nbformat_minor": 4
}