@DSuveges
Created July 18, 2023 15:00
GCS/Updating STRING.ipynb
{
"cells": [
{
"metadata": {},
"id": "56e12f81",
"cell_type": "markdown",
"source": "# Updating STRING\n\nThe STRING database has been updated to v.12.0, which is now available on a release specific website before it got fully released.\n\n- New URL: https://version-12-0.string-db.org/ (v.12.0)\n- Old URL: https://string-db.org/ (v.11.5)\n\nThe data is avilable through API, UI and downloadable files. \n\n**The aim of this investigation**\n\n- Identify if the data has changed. \n- Identify if the data differs from what we have on production.\n- Identify relevant downloadable files and columns to extract data from.\n- Prototype parser to generate dataset.\n\n\n## Conclusions\n\n- There are two websites, with two API urls. \n- However the version endpoint of the API shows the same version (v.11.5).\n- However the version is the same, the returned data is different. :D \n- The OT Platform STRING data is not consistent with either version.\n- Interaction scores shown on the Platform is consistent with `gs://open-targets-data-releases/22.11/input/interactions-inputs/9606.protein.links.full_w_homology.v11.5.txt.gz` \n- Investigation of the original ticket ([#1509](https://github.com/opentargets/issues/issues/1509)) shows the actual platform dataset is based on v.11.0. \n- When updated to v.11.5, the data is good and consistent with the API.\n- I could updated the file to v.12.0, and is consistent with the new API. File is uploaded to the bucket.\n"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Testing out the two versions of APIs:"
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2023-07-18T09:35:19.795313Z",
"start_time": "2023-07-18T09:35:19.608673Z"
},
"trusted": false
},
"id": "71587ca0",
"cell_type": "code",
"source": "import pandas as pd\nfrom requests import get, post\n\nold_api_url = 'https://string-db.org/api'\nnew_api_url = 'https://version-12-0.string-db.org/api'\n\n# Get release version:\nversion = 'json/version'\n\nnew_version = get(f'{new_api_url}/{version}').json()[0]['string_version']\nold_version = get(f'{old_api_url}/{version}').json()[0]['string_version']\n\nprint(f'New version: {new_version}')\nprint(f'Old version: {old_version}')",
"execution_count": 20,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "New version: 11.5\nOld version: 11.5\n"
}
]
},
{
"metadata": {},
"id": "ca71e070",
"cell_type": "markdown",
"source": "Retrieve the top 100 interaction partners for TP53:"
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2023-07-18T09:13:49.860195Z",
"start_time": "2023-07-18T09:13:49.512999Z"
},
"trusted": false
},
"id": "b241cf5c",
"cell_type": "code",
"source": "top_count = 100\nprotein = 'TP53'\nnew_url = f'{new_api_url}/json/interaction_partners?identifiers={protein}&limit={top_count}'\n\nnew_data = (\n pd.read_json(new_url)\n .drop(['preferredName_A', 'ncbiTaxonId'], axis=1)\n .rename(\n columns={\n col: f'{col}_new'\n for col in 'preferredName_B score nscore fscore pscore ascore escore dscore tscore'.split(' ')\n }\n )\n)\n\nold_data = (\n pd.read_json(f'{old_api_url}/json/interaction_partners?identifiers={protein}&limit={top_count}')\n .drop(['preferredName_A', 'ncbiTaxonId'], axis=1)\n .rename(\n columns={\n col: f'{col}_old'\n for col in 'preferredName_B score nscore fscore pscore ascore escore dscore tscore'.split(' ')\n }\n )\n)\n\nnew_data.head()",
"execution_count": 12,
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>stringId_A</th>\n <th>stringId_B</th>\n <th>preferredName_B_new</th>\n <th>score_new</th>\n <th>nscore_new</th>\n <th>fscore_new</th>\n <th>pscore_new</th>\n <th>ascore_new</th>\n <th>escore_new</th>\n <th>dscore_new</th>\n <th>tscore_new</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>9606.ENSP00000269305</td>\n <td>9606.ENSP00000340989</td>\n <td>SFN</td>\n <td>0.999</td>\n <td>0</td>\n <td>0</td>\n <td>0.0</td>\n <td>0.000</td>\n <td>0.981</td>\n <td>0.75</td>\n <td>0.859</td>\n </tr>\n <tr>\n <th>1</th>\n <td>9606.ENSP00000269305</td>\n <td>9606.ENSP00000263253</td>\n <td>EP300</td>\n <td>0.999</td>\n <td>0</td>\n <td>0</td>\n <td>0.0</td>\n <td>0.049</td>\n <td>0.999</td>\n <td>0.90</td>\n <td>0.998</td>\n </tr>\n <tr>\n <th>2</th>\n <td>9606.ENSP00000269305</td>\n <td>9606.ENSP00000437955</td>\n <td>HIF1A</td>\n <td>0.999</td>\n <td>0</td>\n <td>0</td>\n <td>0.0</td>\n <td>0.000</td>\n <td>0.847</td>\n <td>0.00</td>\n <td>0.994</td>\n </tr>\n <tr>\n <th>3</th>\n <td>9606.ENSP00000269305</td>\n <td>9606.ENSP00000362649</td>\n <td>HDAC1</td>\n <td>0.999</td>\n <td>0</td>\n <td>0</td>\n <td>0.0</td>\n <td>0.109</td>\n <td>0.924</td>\n <td>0.50</td>\n <td>0.994</td>\n </tr>\n <tr>\n <th>4</th>\n <td>9606.ENSP00000269305</td>\n <td>9606.ENSP00000335153</td>\n <td>HSP90AA1</td>\n <td>0.999</td>\n <td>0</td>\n <td>0</td>\n <td>0.0</td>\n <td>0.000</td>\n <td>0.903</td>\n <td>0.00</td>\n <td>0.995</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " stringId_A stringId_B preferredName_B_new score_new \\\n0 9606.ENSP00000269305 9606.ENSP00000340989 SFN 0.999 \n1 9606.ENSP00000269305 9606.ENSP00000263253 EP300 0.999 \n2 9606.ENSP00000269305 9606.ENSP00000437955 HIF1A 0.999 \n3 9606.ENSP00000269305 9606.ENSP00000362649 HDAC1 0.999 \n4 9606.ENSP00000269305 9606.ENSP00000335153 HSP90AA1 0.999 \n\n nscore_new fscore_new pscore_new ascore_new escore_new dscore_new \\\n0 0 0 0.0 0.000 0.981 0.75 \n1 0 0 0.0 0.049 0.999 0.90 \n2 0 0 0.0 0.000 0.847 0.00 \n3 0 0 0.0 0.109 0.924 0.50 \n4 0 0 0.0 0.000 0.903 0.00 \n\n tscore_new \n0 0.859 \n1 0.998 \n2 0.994 \n3 0.994 \n4 0.995 "
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2023-07-18T09:20:19.167532Z",
"start_time": "2023-07-18T09:20:19.148396Z"
},
"trusted": false
},
"id": "e88a5e7b",
"cell_type": "code",
"source": "merged = (\n new_data\n .merge(\n old_data, on=['stringId_A', 'stringId_B'], how='inner'\n )\n)\n\nprint(len(merged))\nmerged.query('score_new != score_old')[['stringId_A', 'stringId_B', 'score_new', 'score_old']].head()",
"execution_count": 15,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "59\n"
},
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>stringId_A</th>\n <th>stringId_B</th>\n <th>score_new</th>\n <th>score_old</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>7</th>\n <td>9606.ENSP00000269305</td>\n <td>9606.ENSP00000372023</td>\n <td>0.999</td>\n <td>0.998</td>\n </tr>\n <tr>\n <th>8</th>\n <td>9606.ENSP00000269305</td>\n <td>9606.ENSP00000381185</td>\n <td>0.999</td>\n <td>0.995</td>\n </tr>\n <tr>\n <th>21</th>\n <td>9606.ENSP00000269305</td>\n <td>9606.ENSP00000321239</td>\n <td>0.998</td>\n <td>0.999</td>\n </tr>\n <tr>\n <th>22</th>\n <td>9606.ENSP00000269305</td>\n <td>9606.ENSP00000387699</td>\n <td>0.998</td>\n <td>0.994</td>\n </tr>\n <tr>\n <th>23</th>\n <td>9606.ENSP00000269305</td>\n <td>9606.ENSP00000347184</td>\n <td>0.998</td>\n <td>0.997</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " stringId_A stringId_B score_new score_old\n7 9606.ENSP00000269305 9606.ENSP00000372023 0.999 0.998\n8 9606.ENSP00000269305 9606.ENSP00000381185 0.999 0.995\n21 9606.ENSP00000269305 9606.ENSP00000321239 0.998 0.999\n22 9606.ENSP00000269305 9606.ENSP00000387699 0.998 0.994\n23 9606.ENSP00000269305 9606.ENSP00000347184 0.998 0.997"
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2023-07-18T09:23:40.396159Z",
"start_time": "2023-07-18T09:23:40.391726Z"
},
"trusted": false
},
"id": "bafc07ff",
"cell_type": "code",
"source": "new_url",
"execution_count": 18,
"outputs": [
{
"data": {
"text/plain": "'https://version-12-0.string-db.org/api/json/interaction_partners?identifiers=TP53&limit=100'"
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2023-07-18T09:20:57.186437Z",
"start_time": "2023-07-18T09:20:57.100738Z"
},
"trusted": false
},
"id": "02fa1f79",
"cell_type": "code",
"source": "get(f'{old_api_url}/{version}').json()",
"execution_count": 17,
"outputs": [
{
"data": {
"text/plain": "[{'string_version': '11.5',\n 'stable_address': 'https://version-11-5.string-db.org'}]"
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2023-07-18T09:29:44.647608Z",
"start_time": "2023-07-18T09:29:44.383483Z"
},
"trusted": false
},
"id": "8ef061c2",
"cell_type": "code",
"source": "import json \nprint(json.dumps(get(f'{old_api_url}/{version}').json(), indent=2))",
"execution_count": 19,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "[\n {\n \"string_version\": \"11.5\",\n \"stable_address\": \"https://version-11-5.string-db.org\"\n }\n]\n"
}
]
},
{
"metadata": {},
"id": "d679994d",
"cell_type": "markdown",
"source": "## Extract string interactions from OT platform"
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2023-07-18T09:46:39.585219Z",
"start_time": "2023-07-18T09:46:34.372824Z"
},
"trusted": false
},
"id": "306f5951",
"cell_type": "code",
"source": "query = '''query InteractionsSectionQuery($ensgId: String!, $sourceDatabase: String, $index: Int = 0, $size: Int = 10) {\n target(ensemblId: $ensgId) {\n id\n interactions(\n sourceDatabase: $sourceDatabase\n page: {index: $index, size: $size}\n ) {\n rows {\n intA\n intB\n targetB {\n approvedSymbol\n id\n }\n score\n evidences {\n evidenceScore\n interactionDetectionMethodShortName\n }\n }\n }\n }\n}\n'''\n\n# params = {\n# \"index\": 0,\n# \"size\": 10000,\n# \"ensgId\": \"ENSG00000171388\",\n# \"sourceDatabase\": \"string\"\n# }\n\n\nfull_query = {\n \"operationName\": \"InteractionsSectionQuery\",\n \"variables\": {\n \"index\": 0,\n \"size\": 10000,\n \"ensgId\": \"ENSG00000141510\",\n \"sourceDatabase\": \"string\"\n },\n \"query\": query\n}\n\nplatform_api_url = 'https://api.platform.opentargets.org/api/v4/graphql'\ndata = post(platform_api_url, json=full_query)\n\n",
"execution_count": 35,
"outputs": []
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2023-07-18T09:46:39.645278Z",
"start_time": "2023-07-18T09:46:39.589141Z"
},
"trusted": false
},
"id": "9c197097",
"cell_type": "code",
"source": "df_platform = pd.DataFrame([\n {\n 'stringId_A': f\"9606.{row['intA']}\",\n 'stringId_B': f\"9606.{row['intB']}\",\n 'score_platform': row['score']\n }\n for row in data.json()['data']['target']['interactions']['rows']\n])\n\ndf_platform.head()",
"execution_count": 36,
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>stringId_A</th>\n <th>stringId_B</th>\n <th>score_platform</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>9606.ENSP00000269305</td>\n <td>9606.ENSP00000384849</td>\n <td>0.999</td>\n </tr>\n <tr>\n <th>1</th>\n <td>9606.ENSP00000269305</td>\n <td>9606.ENSP00000278616</td>\n <td>0.999</td>\n </tr>\n <tr>\n <th>2</th>\n <td>9606.ENSP00000269305</td>\n <td>9606.ENSP00000262367</td>\n <td>0.999</td>\n </tr>\n <tr>\n <th>3</th>\n <td>9606.ENSP00000269305</td>\n <td>9606.ENSP00000258149</td>\n <td>0.999</td>\n </tr>\n <tr>\n <th>4</th>\n <td>9606.ENSP00000269305</td>\n <td>9606.ENSP00000356150</td>\n <td>0.999</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " stringId_A stringId_B score_platform\n0 9606.ENSP00000269305 9606.ENSP00000384849 0.999\n1 9606.ENSP00000269305 9606.ENSP00000278616 0.999\n2 9606.ENSP00000269305 9606.ENSP00000262367 0.999\n3 9606.ENSP00000269305 9606.ENSP00000258149 0.999\n4 9606.ENSP00000269305 9606.ENSP00000356150 0.999"
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2023-07-18T09:48:17.391476Z",
"start_time": "2023-07-18T09:48:17.375655Z"
},
"trusted": false
},
"id": "07b63454",
"cell_type": "code",
"source": "merged = (\n new_data\n .merge(\n old_data, on=['stringId_A', 'stringId_B'], how='outer'\n )\n .merge(\n df_platform, on=['stringId_A', 'stringId_B'], how='left'\n )\n [['stringId_A', 'stringId_B', 'score_old', 'score_new', 'score_platform']]\n)\n\nlen(merged)",
"execution_count": 40,
"outputs": [
{
"data": {
"text/plain": "141"
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2023-07-18T09:52:52.744108Z",
"start_time": "2023-07-18T09:52:48.868389Z"
},
"trusted": false
},
"id": "9735b399",
"cell_type": "code",
"source": "from pyspark.sql import SparkSession, functions as f, types as t\nspark = SparkSession.builder.getOrCreate()\nmerged.iteritems = merged.items\n\ndf = spark.createDataFrame(merged).persist()\ndf.show()",
"execution_count": 47,
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": " \r"
},
{
"name": "stdout",
"output_type": "stream",
"text": "+--------------------+--------------------+---------+---------+--------------+\n| stringId_A| stringId_B|score_old|score_new|score_platform|\n+--------------------+--------------------+---------+---------+--------------+\n|9606.ENSP00000269305|9606.ENSP00000340989| 0.999| 0.999| 0.986|\n|9606.ENSP00000269305|9606.ENSP00000263253| 0.999| 0.999| 0.999|\n|9606.ENSP00000269305|9606.ENSP00000437955| 0.999| 0.999| 0.977|\n|9606.ENSP00000269305|9606.ENSP00000362649| 0.999| 0.999| 0.99|\n|9606.ENSP00000269305|9606.ENSP00000335153| 0.999| 0.999| 0.99|\n|9606.ENSP00000269305|9606.ENSP00000278616| 0.999| 0.999| 0.999|\n|9606.ENSP00000269305|9606.ENSP00000356150| 0.999| 0.999| 0.999|\n|9606.ENSP00000269305|9606.ENSP00000372023| 0.998| 0.999| 0.998|\n|9606.ENSP00000269305|9606.ENSP00000381185| 0.995| 0.999| 0.984|\n|9606.ENSP00000269305|9606.ENSP00000365230| NaN| 0.999| NaN|\n|9606.ENSP00000269305|9606.ENSP00000354218| NaN| 0.999| NaN|\n|9606.ENSP00000269305|9606.ENSP00000266000| NaN| 0.999| NaN|\n|9606.ENSP00000269305|9606.ENSP00000212015| 0.999| 0.999| 0.996|\n|9606.ENSP00000269305|9606.ENSP00000418960| 0.999| 0.999| 0.996|\n|9606.ENSP00000269305|9606.ENSP00000384849| 0.999| 0.999| 0.999|\n|9606.ENSP00000269305|9606.ENSP00000341957| 0.999| 0.999| 0.998|\n|9606.ENSP00000269305|9606.ENSP00000497594| NaN| 0.999| NaN|\n|9606.ENSP00000269305|9606.ENSP00000343535| 0.999| 0.999| 0.987|\n|9606.ENSP00000269305|9606.ENSP00000262367| 0.999| 0.999| 0.999|\n|9606.ENSP00000269305|9606.ENSP00000258149| 0.999| 0.999| 0.999|\n+--------------------+--------------------+---------+---------+--------------+\nonly showing top 20 rows\n\n"
}
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2023-07-18T09:52:41.622954Z",
"start_time": "2023-07-18T09:52:39.301406Z"
},
"trusted": false
},
"id": "860f587f",
"cell_type": "code",
"source": "merged.iteritems = merged.items\nspark.createDataFrame(merged)",
"execution_count": 46,
"outputs": [
{
"data": {
"text/plain": "DataFrame[stringId_A: string, stringId_B: string, score_old: double, score_new: double, score_platform: double]"
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2023-07-18T10:20:16.135172Z",
"end_time": "2023-07-18T10:21:01.145258Z"
},
"trusted": true
},
"id": "9d64eba6",
"cell_type": "code",
"source": "merged_full = (\n df\n .join(\n (\n spark.read.csv('gs://open-targets-data-releases/22.11/input/interactions-inputs/9606.protein.links.full_w_homology.v11.5.txt.gz', sep=' ', header=True)\n .select(\n f.col('protein1').alias('stringId_A'),\n f.col('protein2').alias('stringId_B'),\n (f.col('combined_score') / 1000).alias('score_flatfile')\n )\n ), \n on=['stringId_A', 'stringId_B'], \n how='inner'\n )\n .persist()\n)\n\nmerged_full.show()",
"execution_count": 55,
"outputs": [
{
"output_type": "stream",
"text": "[Stage 9:> (0 + 1) / 1]\r",
"name": "stderr"
},
{
"output_type": "stream",
"text": "+--------------------+--------------------+---------+---------+--------------+--------------+\n| stringId_A| stringId_B|score_old|score_new|score_platform|score_flatfile|\n+--------------------+--------------------+---------+---------+--------------+--------------+\n|9606.ENSP00000269305|9606.ENSP00000340989| 0.999| 0.999| 0.986| 0.986|\n|9606.ENSP00000269305|9606.ENSP00000263253| 0.999| 0.999| 0.999| 0.999|\n|9606.ENSP00000269305|9606.ENSP00000437955| 0.999| 0.999| 0.977| 0.977|\n|9606.ENSP00000269305|9606.ENSP00000362649| 0.999| 0.999| 0.99| 0.99|\n|9606.ENSP00000269305|9606.ENSP00000335153| 0.999| 0.999| 0.99| 0.99|\n|9606.ENSP00000269305|9606.ENSP00000278616| 0.999| 0.999| 0.999| 0.999|\n|9606.ENSP00000269305|9606.ENSP00000356150| 0.999| 0.999| 0.999| 0.999|\n|9606.ENSP00000269305|9606.ENSP00000372023| 0.998| 0.999| 0.998| 0.998|\n|9606.ENSP00000269305|9606.ENSP00000381185| 0.995| 0.999| 0.984| 0.984|\n|9606.ENSP00000269305|9606.ENSP00000212015| 0.999| 0.999| 0.996| 0.996|\n|9606.ENSP00000269305|9606.ENSP00000418960| 0.999| 0.999| 0.996| 0.996|\n|9606.ENSP00000269305|9606.ENSP00000384849| 0.999| 0.999| 0.999| 0.999|\n|9606.ENSP00000269305|9606.ENSP00000341957| 0.999| 0.999| 0.998| 0.998|\n|9606.ENSP00000269305|9606.ENSP00000343535| 0.999| 0.999| 0.987| 0.987|\n|9606.ENSP00000269305|9606.ENSP00000262367| 0.999| 0.999| 0.999| 0.999|\n|9606.ENSP00000269305|9606.ENSP00000258149| 0.999| 0.999| 0.999| 0.999|\n|9606.ENSP00000269305|9606.ENSP00000254719| 0.999| 0.999| 0.994| 0.994|\n|9606.ENSP00000269305|9606.ENSP00000418915| 0.999| 0.999| 0.997| 0.997|\n|9606.ENSP00000269305|9606.ENSP00000371475| 0.999| 0.999| 0.997| 0.997|\n|9606.ENSP00000269305|9606.ENSP00000361021| 0.999| 0.999| 0.997| 0.997|\n+--------------------+--------------------+---------+---------+--------------+--------------+\nonly showing top 20 rows\n\n",
"name": "stdout"
},
{
"output_type": "stream",
"text": "\r \r",
"name": "stderr"
}
]
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2023-07-18T10:36:53.517627Z",
"end_time": "2023-07-18T10:36:54.212544Z"
},
"trusted": true
},
"cell_type": "code",
"source": "flat_file = spark.read.csv('gs://open-targets-data-releases/22.11/input/interactions-inputs/9606.protein.links.full_w_homology.v11.5.txt.gz', sep=' ', header=True).persist()\nflat_file.printSchema()\nflat_file.show(1, False, True)",
"execution_count": 58,
"outputs": [
{
"output_type": "stream",
"text": "root\n |-- protein1: string (nullable = true)\n |-- protein2: string (nullable = true)\n |-- neighborhood: string (nullable = true)\n |-- fusion: string (nullable = true)\n |-- cooccurence: string (nullable = true)\n |-- coexpression: string (nullable = true)\n |-- experimental: string (nullable = true)\n |-- database: string (nullable = true)\n |-- textmining: string (nullable = true)\n |-- combined_score: string (nullable = true)\n |-- homology: string (nullable = true)\n\n-RECORD 0------------------------------\n protein1 | 9606.ENSP00000000233 \n protein2 | 9606.ENSP00000272298 \n neighborhood | 0 \n fusion | 0 \n cooccurence | 332 \n coexpression | 62 \n experimental | 181 \n database | 0 \n textmining | 125 \n combined_score | 490 \n homology | 0 \nonly showing top 1 row\n\n",
"name": "stdout"
},
{
"output_type": "stream",
"text": "23/07/18 10:36:54 WARN CacheManager: Asked to cache already cached data.\n",
"name": "stderr"
}
]
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2023-07-18T10:40:05.985717Z",
"end_time": "2023-07-18T10:40:11.996900Z"
},
"trusted": true
},
"cell_type": "code",
"source": "# Our flat-file has 11M interactions.\n(\n flat_file\n .select(\n f.round(f.col('combined_score')/10, 0).alias('rounded_score')\n )\n .groupby('rounded_score')\n .count()\n .orderBy('rounded_score')\n .show()\n)",
"execution_count": 62,
"outputs": [
{
"output_type": "stream",
"text": "\r[Stage 22:> (0 + 1) / 1]\r",
"name": "stderr"
},
{
"output_type": "stream",
"text": "+-------------+-------+\n|rounded_score| count|\n+-------------+-------+\n| 15.0| 614010|\n| 16.0|1124478|\n| 17.0|1013366|\n| 18.0| 806410|\n| 19.0| 974788|\n| 20.0| 652406|\n| 21.0| 592954|\n| 22.0| 500618|\n| 23.0| 430446|\n| 24.0| 348882|\n| 25.0| 310602|\n| 26.0| 271642|\n| 27.0| 294110|\n| 28.0| 232934|\n| 29.0| 224254|\n| 30.0| 194680|\n| 31.0| 188520|\n| 32.0| 151210|\n| 33.0| 144332|\n| 34.0| 118062|\n+-------------+-------+\nonly showing top 20 rows\n\n",
"name": "stdout"
},
{
"output_type": "stream",
"text": "\r \r",
"name": "stderr"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## Reproduce effort from previous update\n\nFrom [issue](https://github.com/opentargets/issues/issues/1509). Workbook is [here](https://github.com/DSuveges/random_notebooks/blob/master/issue-1509_manually_patch_STRING/Manually+fixing+string+v11.5+release.ipynb)\n"
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2023-07-18T11:32:37.310583Z",
"end_time": "2023-07-18T11:33:39.509520Z"
},
"trusted": true
},
"cell_type": "code",
"source": "# Reading 'detailed' dataset:\ndetailed_url = 'https://stringdb-static.org/download/protein.links.detailed.v11.5/9606.protein.links.detailed.v11.5.txt.gz'\ndetailed_df = pd.read_csv(detailed_url, sep=' ', header=0, compression='infer')\nprint(f'Number of pairs in the \"detailed\" dataset: {len(detailed_df)}')\nprint(detailed_df.head())\n\n# Reading 'full' dataset:\nfull_url = 'https://stringdb-static.org/download/protein.links.full.v11.5/9606.protein.links.full.v11.5.txt.gz'\nfull_df = (\n pd.read_csv(full_url, sep=' ', header=0, compression='infer')\n [['protein1', 'protein2', 'homology']]\n)\n\nprint(f'Number of pairs in the \"full\" dataset: {len(full_df)}')\nprint(full_df.head())\n\n## Joining the two dataset:\nmerged_df = (\n detailed_df\n .merge(full_df, on=['protein1', 'protein2'], how='left')\n)\n\n# Number of pairs in the merged dataset: 11_759_454 <- 11_759_455\nprint(f'Number of pairs in the merged dataset: {len(merged_df)}')\n",
"execution_count": 64,
"outputs": [
{
"output_type": "stream",
"text": "Number of pairs in the \"detailed\" dataset: 11938498\n protein1 protein2 neighborhood fusion \\\n0 9606.ENSP00000000233 9606.ENSP00000379496 0 0 \n1 9606.ENSP00000000233 9606.ENSP00000314067 0 0 \n2 9606.ENSP00000000233 9606.ENSP00000263116 0 0 \n3 9606.ENSP00000000233 9606.ENSP00000361263 0 0 \n4 9606.ENSP00000000233 9606.ENSP00000409666 0 0 \n\n cooccurence coexpression experimental database textmining \\\n0 0 54 0 0 144 \n1 0 0 180 0 61 \n2 0 62 152 0 101 \n3 0 0 161 0 64 \n4 0 82 213 0 72 \n\n combined_score \n0 155 \n1 197 \n2 222 \n3 181 \n4 270 \nNumber of pairs in the \"full\" dataset: 11938498\n protein1 protein2 homology\n0 9606.ENSP00000000233 9606.ENSP00000379496 0\n1 9606.ENSP00000000233 9606.ENSP00000314067 0\n2 9606.ENSP00000000233 9606.ENSP00000263116 0\n3 9606.ENSP00000000233 9606.ENSP00000361263 0\n4 9606.ENSP00000000233 9606.ENSP00000409666 0\nNumber of pairs in the merged dataset: 11938498\n",
"name": "stdout"
}
]
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2023-07-18T11:37:29.294583Z",
"end_time": "2023-07-18T11:50:02.476947Z"
},
"trusted": true
},
"cell_type": "code",
"source": "# Converting to spark\nmerged_df.iteritems = merged_df.items\n\nmerged_full_2 = (\n merged_full\n .join(\n (\n spark.createDataFrame(merged_df)\n .select(\n f.col('protein1').alias('stringId_A'),\n f.col('protein2').alias('stringId_B'),\n (f.col('combined_score') / 1000).alias('fixed_score')\n )\n ), \n on=['stringId_A', 'stringId_B'], \n how='inner'\n )\n .persist()\n)\n\nmerged_full_2.show()",
"execution_count": 66,
"outputs": [
{
"output_type": "stream",
"text": "23/07/18 11:49:51 WARN TaskSetManager: Stage 26 contains a task of very large size (69545 KiB). The maximum recommended task size is 1000 KiB.\n \r",
"name": "stderr"
},
{
"output_type": "stream",
"text": "+--------------------+--------------------+---------+---------+--------------+--------------+-----------+\n| stringId_A| stringId_B|score_old|score_new|score_platform|score_flatfile|fixed_score|\n+--------------------+--------------------+---------+---------+--------------+--------------+-----------+\n|9606.ENSP00000269305|9606.ENSP00000274031| 0.996| NaN| 0.993| 0.993| 0.996|\n|9606.ENSP00000269305|9606.ENSP00000432083| 0.998| NaN| 0.733| 0.733| 0.998|\n|9606.ENSP00000269305|9606.ENSP00000340989| 0.999| 0.999| 0.986| 0.986| 0.999|\n|9606.ENSP00000269305|9606.ENSP00000367545| NaN| 0.985| 0.948| 0.948| 0.958|\n|9606.ENSP00000269305|9606.ENSP00000225174| NaN| 0.993| 0.851| 0.851| 0.992|\n|9606.ENSP00000269305|9606.ENSP00000307684| NaN| 0.983| 0.95| 0.95| 0.981|\n|9606.ENSP00000269305|9606.ENSP00000344352| NaN| 0.982| 0.862| 0.862| 0.974|\n|9606.ENSP00000269305|9606.ENSP00000430432| 0.999| 0.99| 0.986| 0.986| 0.999|\n|9606.ENSP00000269305|9606.ENSP00000341957| 0.999| 0.999| 0.998| 0.998| 0.999|\n|9606.ENSP00000269305|9606.ENSP00000451300| 0.996| NaN| 0.796| 0.796| 0.996|\n|9606.ENSP00000269305|9606.ENSP00000356438| 0.994| NaN| 0.915| 0.915| 0.994|\n|9606.ENSP00000269305|9606.ENSP00000244050| 0.996| 0.992| 0.988| 0.988| 0.996|\n|9606.ENSP00000269305|9606.ENSP00000357113| NaN| 0.988| 0.779| 0.779| 0.985|\n|9606.ENSP00000269305|9606.ENSP00000296930| 0.999| 0.997| 0.991| 0.991| 0.999|\n|9606.ENSP00000269305|9606.ENSP00000347184| 0.997| 0.998| 0.716| 0.716| 0.997|\n|9606.ENSP00000269305|9606.ENSP00000232165| 0.999| NaN| 0.957| 0.957| 0.999|\n|9606.ENSP00000269305|9606.ENSP00000254719| 0.999| 0.999| 0.994| 0.994| 0.999|\n|9606.ENSP00000269305|9606.ENSP00000272317| 0.998| NaN| 0.955| 0.955| 0.998|\n|9606.ENSP00000269305|9606.ENSP00000302961| 0.998| 0.997| 0.881| 0.881| 0.998|\n|9606.ENSP00000269305|9606.ENSP00000360025| NaN| 0.991| 0.989| 0.989| 0.992|\n+--------------------+--------------------+---------+---------+--------------+--------------+-----------+\nonly 
showing top 20 rows\n\n",
"name": "stdout"
}
]
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2023-07-18T11:50:02.489997Z",
"end_time": "2023-07-18T11:50:02.509019Z"
}
},
"cell_type": "markdown",
"source": "## Denerating data for v.12.0"
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2023-07-18T13:00:45.485080Z",
"end_time": "2023-07-18T13:01:52.429656Z"
},
"trusted": true
},
"cell_type": "code",
"source": "# Reading 'detailed' dataset:\ndetailed_url = 'https://stringdb-downloads.org/download/protein.links.detailed.v12.0/9606.protein.links.detailed.v12.0.txt.gz'\ndetailed_df = pd.read_csv(detailed_url, sep=' ', header=0, compression='infer')\nprint(f'Number of pairs in the \"detailed\" dataset: {len(detailed_df)}')\nprint(detailed_df.head())\n\n# Reading 'full' dataset:\nfull_url = 'https://stringdb-downloads.org/download/protein.links.full.v12.0/9606.protein.links.full.v12.0.txt.gz'\nfull_df = (\n pd.read_csv(full_url, sep=' ', header=0, compression='infer')\n [['protein1', 'protein2', 'homology']]\n)\n\nprint(f'Number of pairs in the \"full\" dataset: {len(full_df)}')\nprint(full_df.head())\n\n## Joining the two dataset:\nmerged_df = (\n detailed_df\n .merge(full_df, on=['protein1', 'protein2'], how='left')\n)\n\n# Number of pairs in the merged dataset: 11_759_454 <- 11_759_455\nprint(f'Number of pairs in the merged dataset: {len(merged_df)}')\nmerged_df.head()",
"execution_count": 69,
"outputs": [
{
"output_type": "stream",
"text": "Number of pairs in the \"detailed\" dataset: 13715404\n protein1 protein2 neighborhood fusion \\\n0 9606.ENSP00000000233 9606.ENSP00000356607 0 0 \n1 9606.ENSP00000000233 9606.ENSP00000427567 0 0 \n2 9606.ENSP00000000233 9606.ENSP00000253413 0 0 \n3 9606.ENSP00000000233 9606.ENSP00000493357 0 0 \n4 9606.ENSP00000000233 9606.ENSP00000324127 0 0 \n\n cooccurence coexpression experimental database textmining \\\n0 0 45 134 0 0 \n1 0 0 128 0 0 \n2 0 118 49 0 0 \n3 0 56 53 0 433 \n4 0 0 46 0 153 \n\n combined_score \n0 173 \n1 154 \n2 151 \n3 471 \n4 201 \nNumber of pairs in the \"full\" dataset: 13715404\n protein1 protein2 homology\n0 9606.ENSP00000000233 9606.ENSP00000356607 0\n1 9606.ENSP00000000233 9606.ENSP00000427567 0\n2 9606.ENSP00000000233 9606.ENSP00000253413 0\n3 9606.ENSP00000000233 9606.ENSP00000493357 0\n4 9606.ENSP00000000233 9606.ENSP00000324127 0\nNumber of pairs in the merged dataset: 13715404\n",
"name": "stdout"
},
{
"output_type": "execute_result",
"execution_count": 69,
"data": {
"text/plain": " protein1 protein2 neighborhood fusion \\\n0 9606.ENSP00000000233 9606.ENSP00000356607 0 0 \n1 9606.ENSP00000000233 9606.ENSP00000427567 0 0 \n2 9606.ENSP00000000233 9606.ENSP00000253413 0 0 \n3 9606.ENSP00000000233 9606.ENSP00000493357 0 0 \n4 9606.ENSP00000000233 9606.ENSP00000324127 0 0 \n\n cooccurence coexpression experimental database textmining \\\n0 0 45 134 0 0 \n1 0 0 128 0 0 \n2 0 118 49 0 0 \n3 0 56 53 0 433 \n4 0 0 46 0 153 \n\n combined_score homology \n0 173 0 \n1 154 0 \n2 151 0 \n3 471 0 \n4 201 0 ",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>protein1</th>\n <th>protein2</th>\n <th>neighborhood</th>\n <th>fusion</th>\n <th>cooccurence</th>\n <th>coexpression</th>\n <th>experimental</th>\n <th>database</th>\n <th>textmining</th>\n <th>combined_score</th>\n <th>homology</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>9606.ENSP00000000233</td>\n <td>9606.ENSP00000356607</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>45</td>\n <td>134</td>\n <td>0</td>\n <td>0</td>\n <td>173</td>\n <td>0</td>\n </tr>\n <tr>\n <th>1</th>\n <td>9606.ENSP00000000233</td>\n <td>9606.ENSP00000427567</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>128</td>\n <td>0</td>\n <td>0</td>\n <td>154</td>\n <td>0</td>\n </tr>\n <tr>\n <th>2</th>\n <td>9606.ENSP00000000233</td>\n <td>9606.ENSP00000253413</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>118</td>\n <td>49</td>\n <td>0</td>\n <td>0</td>\n <td>151</td>\n <td>0</td>\n </tr>\n <tr>\n <th>3</th>\n <td>9606.ENSP00000000233</td>\n <td>9606.ENSP00000493357</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>56</td>\n <td>53</td>\n <td>0</td>\n <td>433</td>\n <td>471</td>\n <td>0</td>\n </tr>\n <tr>\n <th>4</th>\n <td>9606.ENSP00000000233</td>\n <td>9606.ENSP00000324127</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>46</td>\n <td>0</td>\n <td>153</td>\n <td>201</td>\n <td>0</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"metadata": {}
}
]
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2023-07-18T13:02:39.134347Z",
"end_time": "2023-07-18T13:02:39.310186Z"
},
"trusted": true
},
"cell_type": "code",
"source": "top_count = 100\nprotein = 'TP53'\nnew_url = f'{new_api_url}/json/interaction_partners?identifiers={protein}&limit={top_count}'\n\nnew_data = (\n pd.read_json(new_url)\n .drop(['preferredName_A', 'ncbiTaxonId'], axis=1)\n .rename(\n columns={\n col: f'{col}_new'\n for col in 'preferredName_B score nscore fscore pscore ascore escore dscore tscore'.split(' ')\n }\n )\n)\n\nnew_data.head()",
"execution_count": 71,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 71,
"data": {
"text/plain": " stringId_A stringId_B preferredName_B_new score_new \\\n0 9606.ENSP00000269305 9606.ENSP00000340989 SFN 0.999 \n1 9606.ENSP00000269305 9606.ENSP00000263253 EP300 0.999 \n2 9606.ENSP00000269305 9606.ENSP00000437955 HIF1A 0.999 \n3 9606.ENSP00000269305 9606.ENSP00000362649 HDAC1 0.999 \n4 9606.ENSP00000269305 9606.ENSP00000335153 HSP90AA1 0.999 \n\n nscore_new fscore_new pscore_new ascore_new escore_new dscore_new \\\n0 0 0 0.0 0.000 0.981 0.75 \n1 0 0 0.0 0.049 0.999 0.90 \n2 0 0 0.0 0.000 0.847 0.00 \n3 0 0 0.0 0.109 0.924 0.50 \n4 0 0 0.0 0.000 0.903 0.00 \n\n tscore_new \n0 0.859 \n1 0.998 \n2 0.994 \n3 0.994 \n4 0.995 ",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>stringId_A</th>\n <th>stringId_B</th>\n <th>preferredName_B_new</th>\n <th>score_new</th>\n <th>nscore_new</th>\n <th>fscore_new</th>\n <th>pscore_new</th>\n <th>ascore_new</th>\n <th>escore_new</th>\n <th>dscore_new</th>\n <th>tscore_new</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>9606.ENSP00000269305</td>\n <td>9606.ENSP00000340989</td>\n <td>SFN</td>\n <td>0.999</td>\n <td>0</td>\n <td>0</td>\n <td>0.0</td>\n <td>0.000</td>\n <td>0.981</td>\n <td>0.75</td>\n <td>0.859</td>\n </tr>\n <tr>\n <th>1</th>\n <td>9606.ENSP00000269305</td>\n <td>9606.ENSP00000263253</td>\n <td>EP300</td>\n <td>0.999</td>\n <td>0</td>\n <td>0</td>\n <td>0.0</td>\n <td>0.049</td>\n <td>0.999</td>\n <td>0.90</td>\n <td>0.998</td>\n </tr>\n <tr>\n <th>2</th>\n <td>9606.ENSP00000269305</td>\n <td>9606.ENSP00000437955</td>\n <td>HIF1A</td>\n <td>0.999</td>\n <td>0</td>\n <td>0</td>\n <td>0.0</td>\n <td>0.000</td>\n <td>0.847</td>\n <td>0.00</td>\n <td>0.994</td>\n </tr>\n <tr>\n <th>3</th>\n <td>9606.ENSP00000269305</td>\n <td>9606.ENSP00000362649</td>\n <td>HDAC1</td>\n <td>0.999</td>\n <td>0</td>\n <td>0</td>\n <td>0.0</td>\n <td>0.109</td>\n <td>0.924</td>\n <td>0.50</td>\n <td>0.994</td>\n </tr>\n <tr>\n <th>4</th>\n <td>9606.ENSP00000269305</td>\n <td>9606.ENSP00000335153</td>\n <td>HSP90AA1</td>\n <td>0.999</td>\n <td>0</td>\n <td>0</td>\n <td>0.0</td>\n <td>0.000</td>\n <td>0.903</td>\n <td>0.00</td>\n <td>0.995</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"metadata": {}
}
]
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2023-07-18T13:13:40.337398Z",
"end_time": "2023-07-18T13:13:50.146265Z"
},
"trusted": true
},
"cell_type": "code",
"source": "m = (\n new_data\n .merge(\n merged_df.rename(columns={'protein1': 'stringId_A', 'protein2': 'stringId_B'}),\n on=['stringId_A', 'stringId_B'], \n how='inner'\n )\n)\n\nm[['stringId_A', 'stringId_B', 'score_new', 'combined_score']]",
"execution_count": 73,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 73,
"data": {
"text/plain": " stringId_A stringId_B score_new combined_score\n0 9606.ENSP00000269305 9606.ENSP00000340989 0.999 999\n1 9606.ENSP00000269305 9606.ENSP00000263253 0.999 999\n2 9606.ENSP00000269305 9606.ENSP00000437955 0.999 999\n3 9606.ENSP00000269305 9606.ENSP00000362649 0.999 999\n4 9606.ENSP00000269305 9606.ENSP00000335153 0.999 999\n.. ... ... ... ...\n95 9606.ENSP00000269305 9606.ENSP00000380024 0.982 982\n96 9606.ENSP00000269305 9606.ENSP00000361186 0.982 982\n97 9606.ENSP00000269305 9606.ENSP00000394560 0.982 982\n98 9606.ENSP00000269305 9606.ENSP00000402107 0.982 982\n99 9606.ENSP00000269305 9606.ENSP00000300093 0.981 981\n\n[100 rows x 4 columns]",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>stringId_A</th>\n <th>stringId_B</th>\n <th>score_new</th>\n <th>combined_score</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>9606.ENSP00000269305</td>\n <td>9606.ENSP00000340989</td>\n <td>0.999</td>\n <td>999</td>\n </tr>\n <tr>\n <th>1</th>\n <td>9606.ENSP00000269305</td>\n <td>9606.ENSP00000263253</td>\n <td>0.999</td>\n <td>999</td>\n </tr>\n <tr>\n <th>2</th>\n <td>9606.ENSP00000269305</td>\n <td>9606.ENSP00000437955</td>\n <td>0.999</td>\n <td>999</td>\n </tr>\n <tr>\n <th>3</th>\n <td>9606.ENSP00000269305</td>\n <td>9606.ENSP00000362649</td>\n <td>0.999</td>\n <td>999</td>\n </tr>\n <tr>\n <th>4</th>\n <td>9606.ENSP00000269305</td>\n <td>9606.ENSP00000335153</td>\n <td>0.999</td>\n <td>999</td>\n </tr>\n <tr>\n <th>...</th>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n </tr>\n <tr>\n <th>95</th>\n <td>9606.ENSP00000269305</td>\n <td>9606.ENSP00000380024</td>\n <td>0.982</td>\n <td>982</td>\n </tr>\n <tr>\n <th>96</th>\n <td>9606.ENSP00000269305</td>\n <td>9606.ENSP00000361186</td>\n <td>0.982</td>\n <td>982</td>\n </tr>\n <tr>\n <th>97</th>\n <td>9606.ENSP00000269305</td>\n <td>9606.ENSP00000394560</td>\n <td>0.982</td>\n <td>982</td>\n </tr>\n <tr>\n <th>98</th>\n <td>9606.ENSP00000269305</td>\n <td>9606.ENSP00000402107</td>\n <td>0.982</td>\n <td>982</td>\n </tr>\n <tr>\n <th>99</th>\n <td>9606.ENSP00000269305</td>\n <td>9606.ENSP00000300093</td>\n <td>0.981</td>\n <td>981</td>\n </tr>\n </tbody>\n</table>\n<p>100 rows × 4 columns</p>\n</div>"
},
"metadata": {}
}
]
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2023-07-18T13:14:33.433119Z",
"end_time": "2023-07-18T13:14:33.446500Z"
},
"trusted": true
},
"cell_type": "code",
"source": "m[['stringId_A', 'stringId_B', 'escore_new', 'experimental']]",
"execution_count": 74,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 74,
"data": {
"text/plain": " stringId_A stringId_B escore_new experimental\n0 9606.ENSP00000269305 9606.ENSP00000340989 0.981 981\n1 9606.ENSP00000269305 9606.ENSP00000263253 0.999 999\n2 9606.ENSP00000269305 9606.ENSP00000437955 0.847 847\n3 9606.ENSP00000269305 9606.ENSP00000362649 0.924 924\n4 9606.ENSP00000269305 9606.ENSP00000335153 0.903 903\n.. ... ... ... ...\n95 9606.ENSP00000269305 9606.ENSP00000380024 0.874 874\n96 9606.ENSP00000269305 9606.ENSP00000361186 0.510 510\n97 9606.ENSP00000269305 9606.ENSP00000394560 0.641 641\n98 9606.ENSP00000269305 9606.ENSP00000402107 0.488 488\n99 9606.ENSP00000269305 9606.ENSP00000300093 0.737 737\n\n[100 rows x 4 columns]",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>stringId_A</th>\n <th>stringId_B</th>\n <th>escore_new</th>\n <th>experimental</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>9606.ENSP00000269305</td>\n <td>9606.ENSP00000340989</td>\n <td>0.981</td>\n <td>981</td>\n </tr>\n <tr>\n <th>1</th>\n <td>9606.ENSP00000269305</td>\n <td>9606.ENSP00000263253</td>\n <td>0.999</td>\n <td>999</td>\n </tr>\n <tr>\n <th>2</th>\n <td>9606.ENSP00000269305</td>\n <td>9606.ENSP00000437955</td>\n <td>0.847</td>\n <td>847</td>\n </tr>\n <tr>\n <th>3</th>\n <td>9606.ENSP00000269305</td>\n <td>9606.ENSP00000362649</td>\n <td>0.924</td>\n <td>924</td>\n </tr>\n <tr>\n <th>4</th>\n <td>9606.ENSP00000269305</td>\n <td>9606.ENSP00000335153</td>\n <td>0.903</td>\n <td>903</td>\n </tr>\n <tr>\n <th>...</th>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n </tr>\n <tr>\n <th>95</th>\n <td>9606.ENSP00000269305</td>\n <td>9606.ENSP00000380024</td>\n <td>0.874</td>\n <td>874</td>\n </tr>\n <tr>\n <th>96</th>\n <td>9606.ENSP00000269305</td>\n <td>9606.ENSP00000361186</td>\n <td>0.510</td>\n <td>510</td>\n </tr>\n <tr>\n <th>97</th>\n <td>9606.ENSP00000269305</td>\n <td>9606.ENSP00000394560</td>\n <td>0.641</td>\n <td>641</td>\n </tr>\n <tr>\n <th>98</th>\n <td>9606.ENSP00000269305</td>\n <td>9606.ENSP00000402107</td>\n <td>0.488</td>\n <td>488</td>\n </tr>\n <tr>\n <th>99</th>\n <td>9606.ENSP00000269305</td>\n <td>9606.ENSP00000300093</td>\n <td>0.737</td>\n <td>737</td>\n </tr>\n </tbody>\n</table>\n<p>100 rows × 4 columns</p>\n</div>"
},
"metadata": {}
}
]
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2023-07-18T13:14:53.649691Z",
"end_time": "2023-07-18T13:14:53.662937Z"
},
"trusted": true
},
"cell_type": "code",
"source": "m[['stringId_A', 'stringId_B', 'dscore_new', 'database']]",
"execution_count": 75,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 75,
"data": {
"text/plain": " stringId_A stringId_B dscore_new database\n0 9606.ENSP00000269305 9606.ENSP00000340989 0.75 750\n1 9606.ENSP00000269305 9606.ENSP00000263253 0.90 900\n2 9606.ENSP00000269305 9606.ENSP00000437955 0.00 0\n3 9606.ENSP00000269305 9606.ENSP00000362649 0.50 500\n4 9606.ENSP00000269305 9606.ENSP00000335153 0.00 0\n.. ... ... ... ...\n95 9606.ENSP00000269305 9606.ENSP00000380024 0.00 0\n96 9606.ENSP00000269305 9606.ENSP00000361186 0.75 750\n97 9606.ENSP00000269305 9606.ENSP00000394560 0.90 900\n98 9606.ENSP00000269305 9606.ENSP00000402107 0.70 700\n99 9606.ENSP00000269305 9606.ENSP00000300093 0.00 0\n\n[100 rows x 4 columns]",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>stringId_A</th>\n <th>stringId_B</th>\n <th>dscore_new</th>\n <th>database</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>9606.ENSP00000269305</td>\n <td>9606.ENSP00000340989</td>\n <td>0.75</td>\n <td>750</td>\n </tr>\n <tr>\n <th>1</th>\n <td>9606.ENSP00000269305</td>\n <td>9606.ENSP00000263253</td>\n <td>0.90</td>\n <td>900</td>\n </tr>\n <tr>\n <th>2</th>\n <td>9606.ENSP00000269305</td>\n <td>9606.ENSP00000437955</td>\n <td>0.00</td>\n <td>0</td>\n </tr>\n <tr>\n <th>3</th>\n <td>9606.ENSP00000269305</td>\n <td>9606.ENSP00000362649</td>\n <td>0.50</td>\n <td>500</td>\n </tr>\n <tr>\n <th>4</th>\n <td>9606.ENSP00000269305</td>\n <td>9606.ENSP00000335153</td>\n <td>0.00</td>\n <td>0</td>\n </tr>\n <tr>\n <th>...</th>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n </tr>\n <tr>\n <th>95</th>\n <td>9606.ENSP00000269305</td>\n <td>9606.ENSP00000380024</td>\n <td>0.00</td>\n <td>0</td>\n </tr>\n <tr>\n <th>96</th>\n <td>9606.ENSP00000269305</td>\n <td>9606.ENSP00000361186</td>\n <td>0.75</td>\n <td>750</td>\n </tr>\n <tr>\n <th>97</th>\n <td>9606.ENSP00000269305</td>\n <td>9606.ENSP00000394560</td>\n <td>0.90</td>\n <td>900</td>\n </tr>\n <tr>\n <th>98</th>\n <td>9606.ENSP00000269305</td>\n <td>9606.ENSP00000402107</td>\n <td>0.70</td>\n <td>700</td>\n </tr>\n <tr>\n <th>99</th>\n <td>9606.ENSP00000269305</td>\n <td>9606.ENSP00000300093</td>\n <td>0.00</td>\n <td>0</td>\n </tr>\n </tbody>\n</table>\n<p>100 rows × 4 columns</p>\n</div>"
},
"metadata": {}
}
]
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2023-07-18T13:16:53.025832Z",
"end_time": "2023-07-18T13:16:53.039187Z"
},
"trusted": true
},
"cell_type": "code",
"source": "m[['stringId_A', 'stringId_B', 'pscore_new', 'homology']].query('pscore_new != 0')",
"execution_count": 77,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 77,
"data": {
"text/plain": " stringId_A stringId_B pscore_new homology\n86 9606.ENSP00000269305 9606.ENSP00000367545 0.068 876",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>stringId_A</th>\n <th>stringId_B</th>\n <th>pscore_new</th>\n <th>homology</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>86</th>\n <td>9606.ENSP00000269305</td>\n <td>9606.ENSP00000367545</td>\n <td>0.068</td>\n <td>876</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"metadata": {}
}
]
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2023-07-18T13:18:05.153261Z",
"end_time": "2023-07-18T13:22:33.139845Z"
},
"trusted": true
},
"cell_type": "code",
"source": "merged_df.to_csv('gs://ot-team/dsuveges/9606.protein.links.full_w_homology.v12.0.txt.gz', sep=' ', index=False, compression='infer')",
"execution_count": 78,
"outputs": []
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2023-07-18T13:38:04.932295Z",
"end_time": "2023-07-18T13:38:11.921046Z"
},
"trusted": true
},
"cell_type": "code",
"source": "%%bash\n\nfile_name='gs://ot-team/dsuveges/9606.protein.links.full_w_homology.v12.0.txt.gz'\n\ncat <(gsutil cat ${file_name} | zcat | head -n1) \\\n <(gsutil cat ${file_name} | zcat | grep \"9606.ENSP00000269305 9606.ENSP00000340989\" | head -n1 ) \\\n | column -t",
"execution_count": 80,
"outputs": [
{
"output_type": "stream",
"text": "protein1 protein2 neighborhood fusion cooccurence coexpression experimental database textmining combined_score homology\n9606.ENSP00000269305 9606.ENSP00000340989 0 0 0 0 981 750 859 999 0\n",
"name": "stdout"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "### Flat file vs API reponse:\n\n**Flat file:**\n\n```\nprotein1 protein2 neighborhood fusion cooccurence coexpression experimental database textmining combined_score homology\n9606.ENSP00000269305 9606.ENSP00000340989 0 0 0 0 981 750 859 999 0\n```\n\n\n**API reponse:**\n\n- [URL]()\n\n```json\n{\n \"stringId_A\": \"9606.ENSP00000269305\",\n \"stringId_B\": \"9606.ENSP00000340989\",\n \"preferredName_A\": \"TP53\",\n \"preferredName_B\": \"SFN\",\n \"ncbiTaxonId\": 9606,\n \"score\": 0.999,\n \"nscore\": 0,\n \"fscore\": 0,\n \"pscore\": 0,\n \"ascore\": 0,\n \"escore\": 0.981,\n \"dscore\": 0.75,\n \"tscore\": 0.859\n}\n```"
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2023-07-18T13:41:57.831015Z",
"end_time": "2023-07-18T13:42:10.800736Z"
},
"trusted": true
},
"cell_type": "code",
"source": "%%bash\n\ngsutil cat gs://ot-team/dsuveges/9606.protein.links.full_w_homology.v12.0.txt.gz | zcat | wc -l\ngsutil cat gs://open-targets-data-releases/22.11/input/interactions-inputs/9606.protein.links.full_w_homology.v11.5.txt.gz | zcat | wc -l\n\n",
"execution_count": 81,
"outputs": [
{
"output_type": "stream",
"text": "13715405\n11759455\n",
"name": "stdout"
}
]
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2023-07-18T14:57:17.613495Z",
"end_time": "2023-07-18T14:57:17.946933Z"
},
"trusted": true
},
"cell_type": "code",
"source": "merged_df.query(\"homology != 0\").head()",
"execution_count": 83,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 83,
"data": {
"text/plain": " protein1 protein2 neighborhood fusion \\\n26 9606.ENSP00000000233 9606.ENSP00000310226 0 0 \n44 9606.ENSP00000000233 9606.ENSP00000377769 0 0 \n109 9606.ENSP00000000233 9606.ENSP00000385432 0 0 \n131 9606.ENSP00000000233 9606.ENSP00000331748 0 0 \n173 9606.ENSP00000000233 9606.ENSP00000300935 0 0 \n\n cooccurence coexpression experimental database textmining \\\n26 0 109 110 500 154 \n44 0 0 162 0 77 \n109 0 124 0 0 75 \n131 0 74 98 0 59 \n173 0 117 334 0 126 \n\n combined_score homology \n26 648 655 \n44 217 792 \n109 178 790 \n131 158 642 \n173 460 656 ",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>protein1</th>\n <th>protein2</th>\n <th>neighborhood</th>\n <th>fusion</th>\n <th>cooccurence</th>\n <th>coexpression</th>\n <th>experimental</th>\n <th>database</th>\n <th>textmining</th>\n <th>combined_score</th>\n <th>homology</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>26</th>\n <td>9606.ENSP00000000233</td>\n <td>9606.ENSP00000310226</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>109</td>\n <td>110</td>\n <td>500</td>\n <td>154</td>\n <td>648</td>\n <td>655</td>\n </tr>\n <tr>\n <th>44</th>\n <td>9606.ENSP00000000233</td>\n <td>9606.ENSP00000377769</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>162</td>\n <td>0</td>\n <td>77</td>\n <td>217</td>\n <td>792</td>\n </tr>\n <tr>\n <th>109</th>\n <td>9606.ENSP00000000233</td>\n <td>9606.ENSP00000385432</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>124</td>\n <td>0</td>\n <td>0</td>\n <td>75</td>\n <td>178</td>\n <td>790</td>\n </tr>\n <tr>\n <th>131</th>\n <td>9606.ENSP00000000233</td>\n <td>9606.ENSP00000331748</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>74</td>\n <td>98</td>\n <td>0</td>\n <td>59</td>\n <td>158</td>\n <td>642</td>\n </tr>\n <tr>\n <th>173</th>\n <td>9606.ENSP00000000233</td>\n <td>9606.ENSP00000300935</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>117</td>\n <td>334</td>\n <td>0</td>\n <td>126</td>\n <td>460</td>\n <td>656</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"metadata": {}
}
]
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "",
"execution_count": null,
"outputs": []
}
],
"metadata": {
"kernelspec": {
"name": "python3",
"display_name": "Python 3",
"language": "python"
},
"language_info": {
"name": "python",
"version": "3.10.8",
"mimetype": "text/x-python",
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"pygments_lexer": "ipython3",
"nbconvert_exporter": "python",
"file_extension": ".py"
},
"gist": {
"id": "",
"data": {
"description": "GCS/Updating STRING.ipynb",
"public": true
}
}
},
"nbformat": 4,
"nbformat_minor": 5
}