DSuveges/Updating STRING.ipynb

## Updating STRING.ipynb
{
  "cells": [
    {
      "metadata": {},
      "id": "56e12f81",
      "cell_type": "markdown",
      "source": "# Updating STRING\n\nThe STRING database has been updated to v.12.0, which is now available on a release specific website before it got fully released.\n\n- New URL: https://version-12-0.string-db.org/ (v.12.0)\n- Old URL: https://string-db.org/ (v.11.5)\n\nThe data is avilable through API, UI and downloadable files. \n\n**The aim of this investigation**\n\n- Identify if the data has changed. \n- Identify if the data differs from what we have on production.\n- Identify relevant downloadable files and columns to extract data from.\n- Prototype parser to generate dataset.\n\n\n## Conclusions\n\n- There are two websites, with two API urls. \n- However the version endpoint of the API shows the same version (v.11.5).\n- However the version is the same, the returned data is different. :D \n- The OT Platform STRING data is not consistent with either version.\n- Interaction scores shown on the Platform is consistent with `gs://open-targets-data-releases/22.11/input/interactions-inputs/9606.protein.links.full_w_homology.v11.5.txt.gz` \n- Investigation of the original ticket ([#1509](https://github.com/opentargets/issues/issues/1509)) shows the actual platform dataset is based on v.11.0. \n- When updated to v.11.5, the data is good and consistent with the API.\n- I could updated the file to v.12.0, and is consistent with the new API. File is uploaded to the bucket.\n"
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "Testing out the two versions of APIs:"
    },
    {
      "metadata": {
        "ExecuteTime": {
          "end_time": "2023-07-18T09:35:19.795313Z",
          "start_time": "2023-07-18T09:35:19.608673Z"
        },
        "trusted": false
      },
      "id": "71587ca0",
      "cell_type": "code",
      "source": "import pandas as pd\nfrom requests import get, post\n\nold_api_url = 'https://string-db.org/api'\nnew_api_url = 'https://version-12-0.string-db.org/api'\n\n# Get release version:\nversion = 'json/version'\n\nnew_version = get(f'{new_api_url}/{version}').json()[0]['string_version']\nold_version = get(f'{old_api_url}/{version}').json()[0]['string_version']\n\nprint(f'New version: {new_version}')\nprint(f'Old version: {old_version}')",
      "execution_count": 20,
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": "New version: 11.5\nOld version: 11.5\n"
        }
      ]
    },
    {
      "metadata": {},
      "id": "ca71e070",
      "cell_type": "markdown",
      "source": "Retrieve the top 100 interaction partners for TP53:"
    },
    {
      "metadata": {
        "ExecuteTime": {
          "end_time": "2023-07-18T09:13:49.860195Z",
          "start_time": "2023-07-18T09:13:49.512999Z"
        },
        "trusted": false
      },
      "id": "b241cf5c",
      "cell_type": "code",
      "source": "top_count = 100\nprotein = 'TP53'\nnew_url = f'{new_api_url}/json/interaction_partners?identifiers={protein}&limit={top_count}'\n\nnew_data = (\n    pd.read_json(new_url)\n    .drop(['preferredName_A', 'ncbiTaxonId'], axis=1)\n    .rename(\n        columns={\n            col: f'{col}_new'\n            for col in 'preferredName_B score nscore fscore pscore ascore escore dscore tscore'.split(' ')\n        }\n    )\n)\n\nold_data = (\n    pd.read_json(f'{old_api_url}/json/interaction_partners?identifiers={protein}&limit={top_count}')\n    .drop(['preferredName_A', 'ncbiTaxonId'], axis=1)\n    .rename(\n        columns={\n            col: f'{col}_old'\n            for col in 'preferredName_B score nscore fscore pscore ascore escore dscore tscore'.split(' ')\n        }\n    )\n)\n\nnew_data.head()",
      "execution_count": 12,
      "outputs": [
        {
          "data": {
            "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>stringId_A</th>\n      <th>stringId_B</th>\n      <th>preferredName_B_new</th>\n      <th>score_new</th>\n      <th>nscore_new</th>\n      <th>fscore_new</th>\n      <th>pscore_new</th>\n      <th>ascore_new</th>\n      <th>escore_new</th>\n      <th>dscore_new</th>\n      <th>tscore_new</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>9606.ENSP00000269305</td>\n      <td>9606.ENSP00000340989</td>\n      <td>SFN</td>\n      <td>0.999</td>\n      <td>0</td>\n      <td>0</td>\n      <td>0.0</td>\n      <td>0.000</td>\n      <td>0.981</td>\n      <td>0.75</td>\n      <td>0.859</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>9606.ENSP00000269305</td>\n      <td>9606.ENSP00000263253</td>\n      <td>EP300</td>\n      <td>0.999</td>\n      <td>0</td>\n      <td>0</td>\n      <td>0.0</td>\n      <td>0.049</td>\n      <td>0.999</td>\n      <td>0.90</td>\n      <td>0.998</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>9606.ENSP00000269305</td>\n      <td>9606.ENSP00000437955</td>\n      <td>HIF1A</td>\n      <td>0.999</td>\n      <td>0</td>\n      <td>0</td>\n      <td>0.0</td>\n      <td>0.000</td>\n      <td>0.847</td>\n      <td>0.00</td>\n      <td>0.994</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>9606.ENSP00000269305</td>\n      <td>9606.ENSP00000362649</td>\n      <td>HDAC1</td>\n      <td>0.999</td>\n      <td>0</td>\n      <td>0</td>\n      <td>0.0</td>\n      <td>0.109</td>\n      <td>0.924</td>\n      <td>0.50</td>\n      <td>0.994</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>9606.ENSP00000269305</td>\n      
<td>9606.ENSP00000335153</td>\n      <td>HSP90AA1</td>\n      <td>0.999</td>\n      <td>0</td>\n      <td>0</td>\n      <td>0.0</td>\n      <td>0.000</td>\n      <td>0.903</td>\n      <td>0.00</td>\n      <td>0.995</td>\n    </tr>\n  </tbody>\n</table>\n</div>",
            "text/plain": "             stringId_A            stringId_B preferredName_B_new  score_new  \\\n0  9606.ENSP00000269305  9606.ENSP00000340989                 SFN      0.999   \n1  9606.ENSP00000269305  9606.ENSP00000263253               EP300      0.999   \n2  9606.ENSP00000269305  9606.ENSP00000437955               HIF1A      0.999   \n3  9606.ENSP00000269305  9606.ENSP00000362649               HDAC1      0.999   \n4  9606.ENSP00000269305  9606.ENSP00000335153            HSP90AA1      0.999   \n\n   nscore_new  fscore_new  pscore_new  ascore_new  escore_new  dscore_new  \\\n0           0           0         0.0       0.000       0.981        0.75   \n1           0           0         0.0       0.049       0.999        0.90   \n2           0           0         0.0       0.000       0.847        0.00   \n3           0           0         0.0       0.109       0.924        0.50   \n4           0           0         0.0       0.000       0.903        0.00   \n\n   tscore_new  \n0       0.859  \n1       0.998  \n2       0.994  \n3       0.994  \n4       0.995  "
          },
          "execution_count": 12,
          "metadata": {},
          "output_type": "execute_result"
        }
      ]
    },
    {
      "metadata": {
        "ExecuteTime": {
          "end_time": "2023-07-18T09:20:19.167532Z",
          "start_time": "2023-07-18T09:20:19.148396Z"
        },
        "trusted": false
      },
      "id": "e88a5e7b",
      "cell_type": "code",
      "source": "merged = (\n    new_data\n    .merge(\n        old_data, on=['stringId_A', 'stringId_B'], how='inner'\n    )\n)\n\nprint(len(merged))\nmerged.query('score_new != score_old')[['stringId_A', 'stringId_B', 'score_new', 'score_old']].head()",
      "execution_count": 15,
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": "59\n"
        },
        {
          "data": {
            "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>stringId_A</th>\n      <th>stringId_B</th>\n      <th>score_new</th>\n      <th>score_old</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>7</th>\n      <td>9606.ENSP00000269305</td>\n      <td>9606.ENSP00000372023</td>\n      <td>0.999</td>\n      <td>0.998</td>\n    </tr>\n    <tr>\n      <th>8</th>\n      <td>9606.ENSP00000269305</td>\n      <td>9606.ENSP00000381185</td>\n      <td>0.999</td>\n      <td>0.995</td>\n    </tr>\n    <tr>\n      <th>21</th>\n      <td>9606.ENSP00000269305</td>\n      <td>9606.ENSP00000321239</td>\n      <td>0.998</td>\n      <td>0.999</td>\n    </tr>\n    <tr>\n      <th>22</th>\n      <td>9606.ENSP00000269305</td>\n      <td>9606.ENSP00000387699</td>\n      <td>0.998</td>\n      <td>0.994</td>\n    </tr>\n    <tr>\n      <th>23</th>\n      <td>9606.ENSP00000269305</td>\n      <td>9606.ENSP00000347184</td>\n      <td>0.998</td>\n      <td>0.997</td>\n    </tr>\n  </tbody>\n</table>\n</div>",
            "text/plain": "              stringId_A            stringId_B  score_new  score_old\n7   9606.ENSP00000269305  9606.ENSP00000372023      0.999      0.998\n8   9606.ENSP00000269305  9606.ENSP00000381185      0.999      0.995\n21  9606.ENSP00000269305  9606.ENSP00000321239      0.998      0.999\n22  9606.ENSP00000269305  9606.ENSP00000387699      0.998      0.994\n23  9606.ENSP00000269305  9606.ENSP00000347184      0.998      0.997"
          },
          "execution_count": 15,
          "metadata": {},
          "output_type": "execute_result"
        }
      ]
    },
    {
      "metadata": {
        "ExecuteTime": {
          "end_time": "2023-07-18T09:23:40.396159Z",
          "start_time": "2023-07-18T09:23:40.391726Z"
        },
        "trusted": false
      },
      "id": "bafc07ff",
      "cell_type": "code",
      "source": "new_url",
      "execution_count": 18,
      "outputs": [
        {
          "data": {
            "text/plain": "'https://version-12-0.string-db.org/api/json/interaction_partners?identifiers=TP53&limit=100'"
          },
          "execution_count": 18,
          "metadata": {},
          "output_type": "execute_result"
        }
      ]
    },
    {
      "metadata": {
        "ExecuteTime": {
          "end_time": "2023-07-18T09:20:57.186437Z",
          "start_time": "2023-07-18T09:20:57.100738Z"
        },
        "trusted": false
      },
      "id": "02fa1f79",
      "cell_type": "code",
      "source": "get(f'{old_api_url}/{version}').json()",
      "execution_count": 17,
      "outputs": [
        {
          "data": {
            "text/plain": "[{'string_version': '11.5',\n  'stable_address': 'https://version-11-5.string-db.org'}]"
          },
          "execution_count": 17,
          "metadata": {},
          "output_type": "execute_result"
        }
      ]
    },
    {
      "metadata": {
        "ExecuteTime": {
          "end_time": "2023-07-18T09:29:44.647608Z",
          "start_time": "2023-07-18T09:29:44.383483Z"
        },
        "trusted": false
      },
      "id": "8ef061c2",
      "cell_type": "code",
      "source": "import json \nprint(json.dumps(get(f'{old_api_url}/{version}').json(), indent=2))",
      "execution_count": 19,
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": "[\n  {\n    \"string_version\": \"11.5\",\n    \"stable_address\": \"https://version-11-5.string-db.org\"\n  }\n]\n"
        }
      ]
    },
    {
      "metadata": {},
      "id": "d679994d",
      "cell_type": "markdown",
      "source": "## Extract string interactions from OT platform"
    },
    {
      "metadata": {
        "ExecuteTime": {
          "end_time": "2023-07-18T09:46:39.585219Z",
          "start_time": "2023-07-18T09:46:34.372824Z"
        },
        "trusted": false
      },
      "id": "306f5951",
      "cell_type": "code",
      "source": "query = '''query InteractionsSectionQuery($ensgId: String!, $sourceDatabase: String, $index: Int = 0, $size: Int = 10) {\n  target(ensemblId: $ensgId) {\n    id\n    interactions(\n      sourceDatabase: $sourceDatabase\n      page: {index: $index, size: $size}\n    ) {\n      rows {\n        intA\n        intB\n        targetB {\n          approvedSymbol\n          id\n        }\n        score\n        evidences {\n          evidenceScore\n          interactionDetectionMethodShortName\n        }\n      }\n    }\n  }\n}\n'''\n\n# params = {\n#   \"index\": 0,\n#   \"size\": 10000,\n#   \"ensgId\": \"ENSG00000171388\",\n#   \"sourceDatabase\": \"string\"\n# }\n\n\nfull_query = {\n  \"operationName\": \"InteractionsSectionQuery\",\n  \"variables\": {\n    \"index\": 0,\n    \"size\": 10000,\n    \"ensgId\": \"ENSG00000141510\",\n    \"sourceDatabase\": \"string\"\n  },\n  \"query\": query\n}\n\nplatform_api_url = 'https://api.platform.opentargets.org/api/v4/graphql'\ndata = post(platform_api_url, json=full_query)\n\n",
      "execution_count": 35,
      "outputs": []
    },
    {
      "metadata": {
        "ExecuteTime": {
          "end_time": "2023-07-18T09:46:39.645278Z",
          "start_time": "2023-07-18T09:46:39.589141Z"
        },
        "trusted": false
      },
      "id": "9c197097",
      "cell_type": "code",
      "source": "df_platform = pd.DataFrame([\n    {\n        'stringId_A': f\"9606.{row['intA']}\",\n        'stringId_B': f\"9606.{row['intB']}\",\n        'score_platform': row['score']\n    }\n    for row in data.json()['data']['target']['interactions']['rows']\n])\n\ndf_platform.head()",
      "execution_count": 36,
      "outputs": [
        {
          "data": {
            "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>stringId_A</th>\n      <th>stringId_B</th>\n      <th>score_platform</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>9606.ENSP00000269305</td>\n      <td>9606.ENSP00000384849</td>\n      <td>0.999</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>9606.ENSP00000269305</td>\n      <td>9606.ENSP00000278616</td>\n      <td>0.999</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>9606.ENSP00000269305</td>\n      <td>9606.ENSP00000262367</td>\n      <td>0.999</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>9606.ENSP00000269305</td>\n      <td>9606.ENSP00000258149</td>\n      <td>0.999</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>9606.ENSP00000269305</td>\n      <td>9606.ENSP00000356150</td>\n      <td>0.999</td>\n    </tr>\n  </tbody>\n</table>\n</div>",
            "text/plain": "             stringId_A            stringId_B  score_platform\n0  9606.ENSP00000269305  9606.ENSP00000384849           0.999\n1  9606.ENSP00000269305  9606.ENSP00000278616           0.999\n2  9606.ENSP00000269305  9606.ENSP00000262367           0.999\n3  9606.ENSP00000269305  9606.ENSP00000258149           0.999\n4  9606.ENSP00000269305  9606.ENSP00000356150           0.999"
          },
          "execution_count": 36,
          "metadata": {},
          "output_type": "execute_result"
        }
      ]
    },
    {
      "metadata": {
        "ExecuteTime": {
          "end_time": "2023-07-18T09:48:17.391476Z",
          "start_time": "2023-07-18T09:48:17.375655Z"
        },
        "trusted": false
      },
      "id": "07b63454",
      "cell_type": "code",
      "source": "merged = (\n    new_data\n    .merge(\n        old_data, on=['stringId_A', 'stringId_B'], how='outer'\n    )\n    .merge(\n        df_platform, on=['stringId_A', 'stringId_B'], how='left'\n    )\n    [['stringId_A', 'stringId_B', 'score_old', 'score_new', 'score_platform']]\n)\n\nlen(merged)",
      "execution_count": 40,
      "outputs": [
        {
          "data": {
            "text/plain": "141"
          },
          "execution_count": 40,
          "metadata": {},
          "output_type": "execute_result"
        }
      ]
    },
    {
      "metadata": {
        "ExecuteTime": {
          "end_time": "2023-07-18T09:52:52.744108Z",
          "start_time": "2023-07-18T09:52:48.868389Z"
        },
        "trusted": false
      },
      "id": "9735b399",
      "cell_type": "code",
      "source": "from pyspark.sql import SparkSession, functions as f, types as t\nspark = SparkSession.builder.getOrCreate()\nmerged.iteritems = merged.items\n\ndf = spark.createDataFrame(merged).persist()\ndf.show()",
      "execution_count": 47,
      "outputs": [
        {
          "name": "stderr",
          "output_type": "stream",
          "text": "                                                                                \r"
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": "+--------------------+--------------------+---------+---------+--------------+\n|          stringId_A|          stringId_B|score_old|score_new|score_platform|\n+--------------------+--------------------+---------+---------+--------------+\n|9606.ENSP00000269305|9606.ENSP00000340989|    0.999|    0.999|         0.986|\n|9606.ENSP00000269305|9606.ENSP00000263253|    0.999|    0.999|         0.999|\n|9606.ENSP00000269305|9606.ENSP00000437955|    0.999|    0.999|         0.977|\n|9606.ENSP00000269305|9606.ENSP00000362649|    0.999|    0.999|          0.99|\n|9606.ENSP00000269305|9606.ENSP00000335153|    0.999|    0.999|          0.99|\n|9606.ENSP00000269305|9606.ENSP00000278616|    0.999|    0.999|         0.999|\n|9606.ENSP00000269305|9606.ENSP00000356150|    0.999|    0.999|         0.999|\n|9606.ENSP00000269305|9606.ENSP00000372023|    0.998|    0.999|         0.998|\n|9606.ENSP00000269305|9606.ENSP00000381185|    0.995|    0.999|         0.984|\n|9606.ENSP00000269305|9606.ENSP00000365230|      NaN|    0.999|           NaN|\n|9606.ENSP00000269305|9606.ENSP00000354218|      NaN|    0.999|           NaN|\n|9606.ENSP00000269305|9606.ENSP00000266000|      NaN|    0.999|           NaN|\n|9606.ENSP00000269305|9606.ENSP00000212015|    0.999|    0.999|         0.996|\n|9606.ENSP00000269305|9606.ENSP00000418960|    0.999|    0.999|         0.996|\n|9606.ENSP00000269305|9606.ENSP00000384849|    0.999|    0.999|         0.999|\n|9606.ENSP00000269305|9606.ENSP00000341957|    0.999|    0.999|         0.998|\n|9606.ENSP00000269305|9606.ENSP00000497594|      NaN|    0.999|           NaN|\n|9606.ENSP00000269305|9606.ENSP00000343535|    0.999|    0.999|         0.987|\n|9606.ENSP00000269305|9606.ENSP00000262367|    0.999|    0.999|         0.999|\n|9606.ENSP00000269305|9606.ENSP00000258149|    0.999|    0.999|         0.999|\n+--------------------+--------------------+---------+---------+--------------+\nonly showing top 20 rows\n\n"
        }
      ]
    },
    {
      "metadata": {
        "ExecuteTime": {
          "end_time": "2023-07-18T09:52:41.622954Z",
          "start_time": "2023-07-18T09:52:39.301406Z"
        },
        "trusted": false
      },
      "id": "860f587f",
      "cell_type": "code",
      "source": "merged.iteritems = merged.items\nspark.createDataFrame(merged)",
      "execution_count": 46,
      "outputs": [
        {
          "data": {
            "text/plain": "DataFrame[stringId_A: string, stringId_B: string, score_old: double, score_new: double, score_platform: double]"
          },
          "execution_count": 46,
          "metadata": {},
          "output_type": "execute_result"
        }
      ]
    },
    {
      "metadata": {
        "ExecuteTime": {
          "start_time": "2023-07-18T10:20:16.135172Z",
          "end_time": "2023-07-18T10:21:01.145258Z"
        },
        "trusted": true
      },
      "id": "9d64eba6",
      "cell_type": "code",
      "source": "merged_full = (\n    df\n    .join(\n        (\n            spark.read.csv('gs://open-targets-data-releases/22.11/input/interactions-inputs/9606.protein.links.full_w_homology.v11.5.txt.gz', sep=' ', header=True)\n            .select(\n                f.col('protein1').alias('stringId_A'),\n                f.col('protein2').alias('stringId_B'),\n                (f.col('combined_score') / 1000).alias('score_flatfile')\n            )\n        ), \n        on=['stringId_A', 'stringId_B'], \n        how='inner'\n    )\n    .persist()\n)\n\nmerged_full.show()",
      "execution_count": 55,
      "outputs": [
        {
          "output_type": "stream",
          "text": "[Stage 9:>                                                          (0 + 1) / 1]\r",
          "name": "stderr"
        },
        {
          "output_type": "stream",
          "text": "+--------------------+--------------------+---------+---------+--------------+--------------+\n|          stringId_A|          stringId_B|score_old|score_new|score_platform|score_flatfile|\n+--------------------+--------------------+---------+---------+--------------+--------------+\n|9606.ENSP00000269305|9606.ENSP00000340989|    0.999|    0.999|         0.986|         0.986|\n|9606.ENSP00000269305|9606.ENSP00000263253|    0.999|    0.999|         0.999|         0.999|\n|9606.ENSP00000269305|9606.ENSP00000437955|    0.999|    0.999|         0.977|         0.977|\n|9606.ENSP00000269305|9606.ENSP00000362649|    0.999|    0.999|          0.99|          0.99|\n|9606.ENSP00000269305|9606.ENSP00000335153|    0.999|    0.999|          0.99|          0.99|\n|9606.ENSP00000269305|9606.ENSP00000278616|    0.999|    0.999|         0.999|         0.999|\n|9606.ENSP00000269305|9606.ENSP00000356150|    0.999|    0.999|         0.999|         0.999|\n|9606.ENSP00000269305|9606.ENSP00000372023|    0.998|    0.999|         0.998|         0.998|\n|9606.ENSP00000269305|9606.ENSP00000381185|    0.995|    0.999|         0.984|         0.984|\n|9606.ENSP00000269305|9606.ENSP00000212015|    0.999|    0.999|         0.996|         0.996|\n|9606.ENSP00000269305|9606.ENSP00000418960|    0.999|    0.999|         0.996|         0.996|\n|9606.ENSP00000269305|9606.ENSP00000384849|    0.999|    0.999|         0.999|         0.999|\n|9606.ENSP00000269305|9606.ENSP00000341957|    0.999|    0.999|         0.998|         0.998|\n|9606.ENSP00000269305|9606.ENSP00000343535|    0.999|    0.999|         0.987|         0.987|\n|9606.ENSP00000269305|9606.ENSP00000262367|    0.999|    0.999|         0.999|         0.999|\n|9606.ENSP00000269305|9606.ENSP00000258149|    0.999|    0.999|         0.999|         0.999|\n|9606.ENSP00000269305|9606.ENSP00000254719|    0.999|    0.999|         0.994|         0.994|\n|9606.ENSP00000269305|9606.ENSP00000418915|    0.999|    0.999|         0.997|   
      0.997|\n|9606.ENSP00000269305|9606.ENSP00000371475|    0.999|    0.999|         0.997|         0.997|\n|9606.ENSP00000269305|9606.ENSP00000361021|    0.999|    0.999|         0.997|         0.997|\n+--------------------+--------------------+---------+---------+--------------+--------------+\nonly showing top 20 rows\n\n",
          "name": "stdout"
        },
        {
          "output_type": "stream",
          "text": "\r                                                                                \r",
          "name": "stderr"
        }
      ]
    },
    {
      "metadata": {
        "ExecuteTime": {
          "start_time": "2023-07-18T10:36:53.517627Z",
          "end_time": "2023-07-18T10:36:54.212544Z"
        },
        "trusted": true
      },
      "cell_type": "code",
      "source": "flat_file = spark.read.csv('gs://open-targets-data-releases/22.11/input/interactions-inputs/9606.protein.links.full_w_homology.v11.5.txt.gz', sep=' ', header=True).persist()\nflat_file.printSchema()\nflat_file.show(1, False, True)",
      "execution_count": 58,
      "outputs": [
        {
          "output_type": "stream",
          "text": "root\n |-- protein1: string (nullable = true)\n |-- protein2: string (nullable = true)\n |-- neighborhood: string (nullable = true)\n |-- fusion: string (nullable = true)\n |-- cooccurence: string (nullable = true)\n |-- coexpression: string (nullable = true)\n |-- experimental: string (nullable = true)\n |-- database: string (nullable = true)\n |-- textmining: string (nullable = true)\n |-- combined_score: string (nullable = true)\n |-- homology: string (nullable = true)\n\n-RECORD 0------------------------------\n protein1       | 9606.ENSP00000000233 \n protein2       | 9606.ENSP00000272298 \n neighborhood   | 0                    \n fusion         | 0                    \n cooccurence    | 332                  \n coexpression   | 62                   \n experimental   | 181                  \n database       | 0                    \n textmining     | 125                  \n combined_score | 490                  \n homology       | 0                    \nonly showing top 1 row\n\n",
          "name": "stdout"
        },
        {
          "output_type": "stream",
          "text": "23/07/18 10:36:54 WARN CacheManager: Asked to cache already cached data.\n",
          "name": "stderr"
        }
      ]
    },
    {
      "metadata": {
        "ExecuteTime": {
          "start_time": "2023-07-18T10:40:05.985717Z",
          "end_time": "2023-07-18T10:40:11.996900Z"
        },
        "trusted": true
      },
      "cell_type": "code",
      "source": "# Our flat-file has 11M interactions.\n(\n    flat_file\n    .select(\n        f.round(f.col('combined_score')/10, 0).alias('rounded_score')\n    )\n    .groupby('rounded_score')\n    .count()\n    .orderBy('rounded_score')\n    .show()\n)",
      "execution_count": 62,
      "outputs": [
        {
          "output_type": "stream",
          "text": "\r[Stage 22:>                                                         (0 + 1) / 1]\r",
          "name": "stderr"
        },
        {
          "output_type": "stream",
          "text": "+-------------+-------+\n|rounded_score|  count|\n+-------------+-------+\n|         15.0| 614010|\n|         16.0|1124478|\n|         17.0|1013366|\n|         18.0| 806410|\n|         19.0| 974788|\n|         20.0| 652406|\n|         21.0| 592954|\n|         22.0| 500618|\n|         23.0| 430446|\n|         24.0| 348882|\n|         25.0| 310602|\n|         26.0| 271642|\n|         27.0| 294110|\n|         28.0| 232934|\n|         29.0| 224254|\n|         30.0| 194680|\n|         31.0| 188520|\n|         32.0| 151210|\n|         33.0| 144332|\n|         34.0| 118062|\n+-------------+-------+\nonly showing top 20 rows\n\n",
          "name": "stdout"
        },
        {
          "output_type": "stream",
          "text": "\r                                                                                \r",
          "name": "stderr"
        }
      ]
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "## Reproduce effort from previous update\n\nFrom [issue](https://github.com/opentargets/issues/issues/1509). Workbook is [here](https://github.com/DSuveges/random_notebooks/blob/master/issue-1509_manually_patch_STRING/Manually+fixing+string+v11.5+release.ipynb)\n"
    },
    {
      "metadata": {
        "ExecuteTime": {
          "start_time": "2023-07-18T11:32:37.310583Z",
          "end_time": "2023-07-18T11:33:39.509520Z"
        },
        "trusted": true
      },
      "cell_type": "code",
      "source": "# Reading 'detailed' dataset:\ndetailed_url = 'https://stringdb-static.org/download/protein.links.detailed.v11.5/9606.protein.links.detailed.v11.5.txt.gz'\ndetailed_df = pd.read_csv(detailed_url, sep=' ', header=0, compression='infer')\nprint(f'Number of pairs in the \"detailed\" dataset: {len(detailed_df)}')\nprint(detailed_df.head())\n\n# Reading 'full' dataset:\nfull_url = 'https://stringdb-static.org/download/protein.links.full.v11.5/9606.protein.links.full.v11.5.txt.gz'\nfull_df = (\n    pd.read_csv(full_url, sep=' ', header=0, compression='infer')\n    [['protein1', 'protein2', 'homology']]\n)\n\nprint(f'Number of pairs in the \"full\" dataset: {len(full_df)}')\nprint(full_df.head())\n\n## Joining the two dataset:\nmerged_df = (\n    detailed_df\n    .merge(full_df, on=['protein1', 'protein2'], how='left')\n)\n\n# Number of pairs in the merged dataset: 11_759_454 <- 11_759_455\nprint(f'Number of pairs in the merged dataset: {len(merged_df)}')\n",
      "execution_count": 64,
      "outputs": [
        {
          "output_type": "stream",
          "text": "Number of pairs in the \"detailed\" dataset: 11938498\n               protein1              protein2  neighborhood  fusion  \\\n0  9606.ENSP00000000233  9606.ENSP00000379496             0       0   \n1  9606.ENSP00000000233  9606.ENSP00000314067             0       0   \n2  9606.ENSP00000000233  9606.ENSP00000263116             0       0   \n3  9606.ENSP00000000233  9606.ENSP00000361263             0       0   \n4  9606.ENSP00000000233  9606.ENSP00000409666             0       0   \n\n   cooccurence  coexpression  experimental  database  textmining  \\\n0            0            54             0         0         144   \n1            0             0           180         0          61   \n2            0            62           152         0         101   \n3            0             0           161         0          64   \n4            0            82           213         0          72   \n\n   combined_score  \n0             155  \n1             197  \n2             222  \n3             181  \n4             270  \nNumber of pairs in the \"full\" dataset: 11938498\n               protein1              protein2  homology\n0  9606.ENSP00000000233  9606.ENSP00000379496         0\n1  9606.ENSP00000000233  9606.ENSP00000314067         0\n2  9606.ENSP00000000233  9606.ENSP00000263116         0\n3  9606.ENSP00000000233  9606.ENSP00000361263         0\n4  9606.ENSP00000000233  9606.ENSP00000409666         0\nNumber of pairs in the merged dataset: 11938498\n",
          "name": "stdout"
        }
      ]
    },
    {
      "metadata": {
        "ExecuteTime": {
          "start_time": "2023-07-18T11:37:29.294583Z",
          "end_time": "2023-07-18T11:50:02.476947Z"
        },
        "trusted": true
      },
      "cell_type": "code",
      "source": "# Converting to spark\nmerged_df.iteritems = merged_df.items\n\nmerged_full_2 = (\n    merged_full\n    .join(\n        (\n            spark.createDataFrame(merged_df)\n            .select(\n                f.col('protein1').alias('stringId_A'),\n                f.col('protein2').alias('stringId_B'),\n                (f.col('combined_score') / 1000).alias('fixed_score')\n            )\n        ), \n        on=['stringId_A', 'stringId_B'], \n        how='inner'\n    )\n    .persist()\n)\n\nmerged_full_2.show()",
      "execution_count": 66,
      "outputs": [
        {
          "output_type": "stream",
          "text": "23/07/18 11:49:51 WARN TaskSetManager: Stage 26 contains a task of very large size (69545 KiB). The maximum recommended task size is 1000 KiB.\n                                                                                \r",
          "name": "stderr"
        },
        {
          "output_type": "stream",
          "text": "+--------------------+--------------------+---------+---------+--------------+--------------+-----------+\n|          stringId_A|          stringId_B|score_old|score_new|score_platform|score_flatfile|fixed_score|\n+--------------------+--------------------+---------+---------+--------------+--------------+-----------+\n|9606.ENSP00000269305|9606.ENSP00000274031|    0.996|      NaN|         0.993|         0.993|      0.996|\n|9606.ENSP00000269305|9606.ENSP00000432083|    0.998|      NaN|         0.733|         0.733|      0.998|\n|9606.ENSP00000269305|9606.ENSP00000340989|    0.999|    0.999|         0.986|         0.986|      0.999|\n|9606.ENSP00000269305|9606.ENSP00000367545|      NaN|    0.985|         0.948|         0.948|      0.958|\n|9606.ENSP00000269305|9606.ENSP00000225174|      NaN|    0.993|         0.851|         0.851|      0.992|\n|9606.ENSP00000269305|9606.ENSP00000307684|      NaN|    0.983|          0.95|          0.95|      0.981|\n|9606.ENSP00000269305|9606.ENSP00000344352|      NaN|    0.982|         0.862|         0.862|      0.974|\n|9606.ENSP00000269305|9606.ENSP00000430432|    0.999|     0.99|         0.986|         0.986|      0.999|\n|9606.ENSP00000269305|9606.ENSP00000341957|    0.999|    0.999|         0.998|         0.998|      0.999|\n|9606.ENSP00000269305|9606.ENSP00000451300|    0.996|      NaN|         0.796|         0.796|      0.996|\n|9606.ENSP00000269305|9606.ENSP00000356438|    0.994|      NaN|         0.915|         0.915|      0.994|\n|9606.ENSP00000269305|9606.ENSP00000244050|    0.996|    0.992|         0.988|         0.988|      0.996|\n|9606.ENSP00000269305|9606.ENSP00000357113|      NaN|    0.988|         0.779|         0.779|      0.985|\n|9606.ENSP00000269305|9606.ENSP00000296930|    0.999|    0.997|         0.991|         0.991|      0.999|\n|9606.ENSP00000269305|9606.ENSP00000347184|    0.997|    0.998|         0.716|         0.716|      0.997|\n|9606.ENSP00000269305|9606.ENSP00000232165|    0.999|  
    NaN|         0.957|         0.957|      0.999|\n|9606.ENSP00000269305|9606.ENSP00000254719|    0.999|    0.999|         0.994|         0.994|      0.999|\n|9606.ENSP00000269305|9606.ENSP00000272317|    0.998|      NaN|         0.955|         0.955|      0.998|\n|9606.ENSP00000269305|9606.ENSP00000302961|    0.998|    0.997|         0.881|         0.881|      0.998|\n|9606.ENSP00000269305|9606.ENSP00000360025|      NaN|    0.991|         0.989|         0.989|      0.992|\n+--------------------+--------------------+---------+---------+--------------+--------------+-----------+\nonly showing top 20 rows\n\n",
          "name": "stdout"
        }
      ]
    },
    {
      "metadata": {
        "ExecuteTime": {
          "start_time": "2023-07-18T11:50:02.489997Z",
          "end_time": "2023-07-18T11:50:02.509019Z"
        }
      },
      "cell_type": "markdown",
      "source": "## Generating data for v.12.0"
    },
    {
      "metadata": {
        "ExecuteTime": {
          "start_time": "2023-07-18T13:00:45.485080Z",
          "end_time": "2023-07-18T13:01:52.429656Z"
        },
        "trusted": true
      },
      "cell_type": "code",
      "source": "# Reading 'detailed' dataset:\ndetailed_url = 'https://stringdb-downloads.org/download/protein.links.detailed.v12.0/9606.protein.links.detailed.v12.0.txt.gz'\ndetailed_df = pd.read_csv(detailed_url, sep=' ', header=0, compression='infer')\nprint(f'Number of pairs in the \"detailed\" dataset: {len(detailed_df)}')\nprint(detailed_df.head())\n\n# Reading 'full' dataset, keeping only the pair identifiers and the homology column:\nfull_url = 'https://stringdb-downloads.org/download/protein.links.full.v12.0/9606.protein.links.full.v12.0.txt.gz'\nfull_df = (\n    pd.read_csv(full_url, sep=' ', header=0, compression='infer')\n    [['protein1', 'protein2', 'homology']]\n)\n\nprint(f'Number of pairs in the \"full\" dataset: {len(full_df)}')\nprint(full_df.head())\n\n# Joining the two datasets:\nmerged_df = (\n    detailed_df\n    .merge(full_df, on=['protein1', 'protein2'], how='left')\n)\n\n# The left join should preserve the row count of the 'detailed' dataset\n# (13_715_404 pairs in v.12.0; the v.11.5 file had 11_759_454 pairs):\nprint(f'Number of pairs in the merged dataset: {len(merged_df)}')\nmerged_df.head()",
      "execution_count": 69,
      "outputs": [
        {
          "output_type": "stream",
          "text": "Number of pairs in the \"detailed\" dataset: 13715404\n               protein1              protein2  neighborhood  fusion  \\\n0  9606.ENSP00000000233  9606.ENSP00000356607             0       0   \n1  9606.ENSP00000000233  9606.ENSP00000427567             0       0   \n2  9606.ENSP00000000233  9606.ENSP00000253413             0       0   \n3  9606.ENSP00000000233  9606.ENSP00000493357             0       0   \n4  9606.ENSP00000000233  9606.ENSP00000324127             0       0   \n\n   cooccurence  coexpression  experimental  database  textmining  \\\n0            0            45           134         0           0   \n1            0             0           128         0           0   \n2            0           118            49         0           0   \n3            0            56            53         0         433   \n4            0             0            46         0         153   \n\n   combined_score  \n0             173  \n1             154  \n2             151  \n3             471  \n4             201  \nNumber of pairs in the \"full\" dataset: 13715404\n               protein1              protein2  homology\n0  9606.ENSP00000000233  9606.ENSP00000356607         0\n1  9606.ENSP00000000233  9606.ENSP00000427567         0\n2  9606.ENSP00000000233  9606.ENSP00000253413         0\n3  9606.ENSP00000000233  9606.ENSP00000493357         0\n4  9606.ENSP00000000233  9606.ENSP00000324127         0\nNumber of pairs in the merged dataset: 13715404\n",
          "name": "stdout"
        },
        {
          "output_type": "execute_result",
          "execution_count": 69,
          "data": {
            "text/plain": "               protein1              protein2  neighborhood  fusion  \\\n0  9606.ENSP00000000233  9606.ENSP00000356607             0       0   \n1  9606.ENSP00000000233  9606.ENSP00000427567             0       0   \n2  9606.ENSP00000000233  9606.ENSP00000253413             0       0   \n3  9606.ENSP00000000233  9606.ENSP00000493357             0       0   \n4  9606.ENSP00000000233  9606.ENSP00000324127             0       0   \n\n   cooccurence  coexpression  experimental  database  textmining  \\\n0            0            45           134         0           0   \n1            0             0           128         0           0   \n2            0           118            49         0           0   \n3            0            56            53         0         433   \n4            0             0            46         0         153   \n\n   combined_score  homology  \n0             173         0  \n1             154         0  \n2             151         0  \n3             471         0  \n4             201         0  ",
            "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>protein1</th>\n      <th>protein2</th>\n      <th>neighborhood</th>\n      <th>fusion</th>\n      <th>cooccurence</th>\n      <th>coexpression</th>\n      <th>experimental</th>\n      <th>database</th>\n      <th>textmining</th>\n      <th>combined_score</th>\n      <th>homology</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>9606.ENSP00000000233</td>\n      <td>9606.ENSP00000356607</td>\n      <td>0</td>\n      <td>0</td>\n      <td>0</td>\n      <td>45</td>\n      <td>134</td>\n      <td>0</td>\n      <td>0</td>\n      <td>173</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>9606.ENSP00000000233</td>\n      <td>9606.ENSP00000427567</td>\n      <td>0</td>\n      <td>0</td>\n      <td>0</td>\n      <td>0</td>\n      <td>128</td>\n      <td>0</td>\n      <td>0</td>\n      <td>154</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>9606.ENSP00000000233</td>\n      <td>9606.ENSP00000253413</td>\n      <td>0</td>\n      <td>0</td>\n      <td>0</td>\n      <td>118</td>\n      <td>49</td>\n      <td>0</td>\n      <td>0</td>\n      <td>151</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>9606.ENSP00000000233</td>\n      <td>9606.ENSP00000493357</td>\n      <td>0</td>\n      <td>0</td>\n      <td>0</td>\n      <td>56</td>\n      <td>53</td>\n      <td>0</td>\n      <td>433</td>\n      <td>471</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>9606.ENSP00000000233</td>\n      <td>9606.ENSP00000324127</td>\n      <td>0</td>\n      <td>0</td>\n      <td>0</td>\n      
<td>0</td>\n      <td>46</td>\n      <td>0</td>\n      <td>153</td>\n      <td>201</td>\n      <td>0</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
          },
          "metadata": {}
        }
      ]
    },
    {
      "metadata": {
        "ExecuteTime": {
          "start_time": "2023-07-18T13:02:39.134347Z",
          "end_time": "2023-07-18T13:02:39.310186Z"
        },
        "trusted": true
      },
      "cell_type": "code",
      "source": "# Fetching the top interaction partners of TP53 from the v.12.0 API:\ntop_count = 100\nprotein = 'TP53'\nnew_url = f'{new_api_url}/json/interaction_partners?identifiers={protein}&limit={top_count}'\n\n# Suffixing the API columns with '_new' to tell them apart from the flat file columns:\nnew_data = (\n    pd.read_json(new_url)\n    .drop(['preferredName_A', 'ncbiTaxonId'], axis=1)\n    .rename(\n        columns={\n            col: f'{col}_new'\n            for col in 'preferredName_B score nscore fscore pscore ascore escore dscore tscore'.split(' ')\n        }\n    )\n)\n\nnew_data.head()",
      "execution_count": 71,
      "outputs": [
        {
          "output_type": "execute_result",
          "execution_count": 71,
          "data": {
            "text/plain": "             stringId_A            stringId_B preferredName_B_new  score_new  \\\n0  9606.ENSP00000269305  9606.ENSP00000340989                 SFN      0.999   \n1  9606.ENSP00000269305  9606.ENSP00000263253               EP300      0.999   \n2  9606.ENSP00000269305  9606.ENSP00000437955               HIF1A      0.999   \n3  9606.ENSP00000269305  9606.ENSP00000362649               HDAC1      0.999   \n4  9606.ENSP00000269305  9606.ENSP00000335153            HSP90AA1      0.999   \n\n   nscore_new  fscore_new  pscore_new  ascore_new  escore_new  dscore_new  \\\n0           0           0         0.0       0.000       0.981        0.75   \n1           0           0         0.0       0.049       0.999        0.90   \n2           0           0         0.0       0.000       0.847        0.00   \n3           0           0         0.0       0.109       0.924        0.50   \n4           0           0         0.0       0.000       0.903        0.00   \n\n   tscore_new  \n0       0.859  \n1       0.998  \n2       0.994  \n3       0.994  \n4       0.995  ",
            "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>stringId_A</th>\n      <th>stringId_B</th>\n      <th>preferredName_B_new</th>\n      <th>score_new</th>\n      <th>nscore_new</th>\n      <th>fscore_new</th>\n      <th>pscore_new</th>\n      <th>ascore_new</th>\n      <th>escore_new</th>\n      <th>dscore_new</th>\n      <th>tscore_new</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>9606.ENSP00000269305</td>\n      <td>9606.ENSP00000340989</td>\n      <td>SFN</td>\n      <td>0.999</td>\n      <td>0</td>\n      <td>0</td>\n      <td>0.0</td>\n      <td>0.000</td>\n      <td>0.981</td>\n      <td>0.75</td>\n      <td>0.859</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>9606.ENSP00000269305</td>\n      <td>9606.ENSP00000263253</td>\n      <td>EP300</td>\n      <td>0.999</td>\n      <td>0</td>\n      <td>0</td>\n      <td>0.0</td>\n      <td>0.049</td>\n      <td>0.999</td>\n      <td>0.90</td>\n      <td>0.998</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>9606.ENSP00000269305</td>\n      <td>9606.ENSP00000437955</td>\n      <td>HIF1A</td>\n      <td>0.999</td>\n      <td>0</td>\n      <td>0</td>\n      <td>0.0</td>\n      <td>0.000</td>\n      <td>0.847</td>\n      <td>0.00</td>\n      <td>0.994</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>9606.ENSP00000269305</td>\n      <td>9606.ENSP00000362649</td>\n      <td>HDAC1</td>\n      <td>0.999</td>\n      <td>0</td>\n      <td>0</td>\n      <td>0.0</td>\n      <td>0.109</td>\n      <td>0.924</td>\n      <td>0.50</td>\n      <td>0.994</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>9606.ENSP00000269305</td>\n      
<td>9606.ENSP00000335153</td>\n      <td>HSP90AA1</td>\n      <td>0.999</td>\n      <td>0</td>\n      <td>0</td>\n      <td>0.0</td>\n      <td>0.000</td>\n      <td>0.903</td>\n      <td>0.00</td>\n      <td>0.995</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
          },
          "metadata": {}
        }
      ]
    },
    {
      "metadata": {
        "ExecuteTime": {
          "start_time": "2023-07-18T13:13:40.337398Z",
          "end_time": "2023-07-18T13:13:50.146265Z"
        },
        "trusted": true
      },
      "cell_type": "code",
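      "source": "# Joining the API response with the flat file data on the protein pair:\nm = (\n    new_data\n    .merge(\n        merged_df.rename(columns={'protein1': 'stringId_A', 'protein2': 'stringId_B'}),\n        on=['stringId_A', 'stringId_B'], \n        how='inner'\n    )\n)\n\n# Combined score: API 'score' (0-1 scale) vs flat file 'combined_score' (0-1000 scale):\nm[['stringId_A', 'stringId_B', 'score_new', 'combined_score']]",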
      "execution_count": 73,
      "outputs": [
        {
          "output_type": "execute_result",
          "execution_count": 73,
          "data": {
            "text/plain": "              stringId_A            stringId_B  score_new  combined_score\n0   9606.ENSP00000269305  9606.ENSP00000340989      0.999             999\n1   9606.ENSP00000269305  9606.ENSP00000263253      0.999             999\n2   9606.ENSP00000269305  9606.ENSP00000437955      0.999             999\n3   9606.ENSP00000269305  9606.ENSP00000362649      0.999             999\n4   9606.ENSP00000269305  9606.ENSP00000335153      0.999             999\n..                   ...                   ...        ...             ...\n95  9606.ENSP00000269305  9606.ENSP00000380024      0.982             982\n96  9606.ENSP00000269305  9606.ENSP00000361186      0.982             982\n97  9606.ENSP00000269305  9606.ENSP00000394560      0.982             982\n98  9606.ENSP00000269305  9606.ENSP00000402107      0.982             982\n99  9606.ENSP00000269305  9606.ENSP00000300093      0.981             981\n\n[100 rows x 4 columns]",
            "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>stringId_A</th>\n      <th>stringId_B</th>\n      <th>score_new</th>\n      <th>combined_score</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>9606.ENSP00000269305</td>\n      <td>9606.ENSP00000340989</td>\n      <td>0.999</td>\n      <td>999</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>9606.ENSP00000269305</td>\n      <td>9606.ENSP00000263253</td>\n      <td>0.999</td>\n      <td>999</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>9606.ENSP00000269305</td>\n      <td>9606.ENSP00000437955</td>\n      <td>0.999</td>\n      <td>999</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>9606.ENSP00000269305</td>\n      <td>9606.ENSP00000362649</td>\n      <td>0.999</td>\n      <td>999</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>9606.ENSP00000269305</td>\n      <td>9606.ENSP00000335153</td>\n      <td>0.999</td>\n      <td>999</td>\n    </tr>\n    <tr>\n      <th>...</th>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n    </tr>\n    <tr>\n      <th>95</th>\n      <td>9606.ENSP00000269305</td>\n      <td>9606.ENSP00000380024</td>\n      <td>0.982</td>\n      <td>982</td>\n    </tr>\n    <tr>\n      <th>96</th>\n      <td>9606.ENSP00000269305</td>\n      <td>9606.ENSP00000361186</td>\n      <td>0.982</td>\n      <td>982</td>\n    </tr>\n    <tr>\n      <th>97</th>\n      <td>9606.ENSP00000269305</td>\n      <td>9606.ENSP00000394560</td>\n      <td>0.982</td>\n      <td>982</td>\n    </tr>\n    <tr>\n      <th>98</th>\n      <td>9606.ENSP00000269305</td>\n      <td>9606.ENSP00000402107</td>\n    
  <td>0.982</td>\n      <td>982</td>\n    </tr>\n    <tr>\n      <th>99</th>\n      <td>9606.ENSP00000269305</td>\n      <td>9606.ENSP00000300093</td>\n      <td>0.981</td>\n      <td>981</td>\n    </tr>\n  </tbody>\n</table>\n<p>100 rows × 4 columns</p>\n</div>"
          },
          "metadata": {}
        }
      ]
    },
    {
      "metadata": {
        "ExecuteTime": {
          "start_time": "2023-07-18T13:14:33.433119Z",
          "end_time": "2023-07-18T13:14:33.446500Z"
        },
        "trusted": true
      },
      "cell_type": "code",
      "source": "# Experimental evidence: API 'escore' (0-1 scale) vs flat file 'experimental' (0-1000 scale):\nm[['stringId_A', 'stringId_B', 'escore_new', 'experimental']]",
      "execution_count": 74,
      "outputs": [
        {
          "output_type": "execute_result",
          "execution_count": 74,
          "data": {
            "text/plain": "              stringId_A            stringId_B  escore_new  experimental\n0   9606.ENSP00000269305  9606.ENSP00000340989       0.981           981\n1   9606.ENSP00000269305  9606.ENSP00000263253       0.999           999\n2   9606.ENSP00000269305  9606.ENSP00000437955       0.847           847\n3   9606.ENSP00000269305  9606.ENSP00000362649       0.924           924\n4   9606.ENSP00000269305  9606.ENSP00000335153       0.903           903\n..                   ...                   ...         ...           ...\n95  9606.ENSP00000269305  9606.ENSP00000380024       0.874           874\n96  9606.ENSP00000269305  9606.ENSP00000361186       0.510           510\n97  9606.ENSP00000269305  9606.ENSP00000394560       0.641           641\n98  9606.ENSP00000269305  9606.ENSP00000402107       0.488           488\n99  9606.ENSP00000269305  9606.ENSP00000300093       0.737           737\n\n[100 rows x 4 columns]",
            "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>stringId_A</th>\n      <th>stringId_B</th>\n      <th>escore_new</th>\n      <th>experimental</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>9606.ENSP00000269305</td>\n      <td>9606.ENSP00000340989</td>\n      <td>0.981</td>\n      <td>981</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>9606.ENSP00000269305</td>\n      <td>9606.ENSP00000263253</td>\n      <td>0.999</td>\n      <td>999</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>9606.ENSP00000269305</td>\n      <td>9606.ENSP00000437955</td>\n      <td>0.847</td>\n      <td>847</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>9606.ENSP00000269305</td>\n      <td>9606.ENSP00000362649</td>\n      <td>0.924</td>\n      <td>924</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>9606.ENSP00000269305</td>\n      <td>9606.ENSP00000335153</td>\n      <td>0.903</td>\n      <td>903</td>\n    </tr>\n    <tr>\n      <th>...</th>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n    </tr>\n    <tr>\n      <th>95</th>\n      <td>9606.ENSP00000269305</td>\n      <td>9606.ENSP00000380024</td>\n      <td>0.874</td>\n      <td>874</td>\n    </tr>\n    <tr>\n      <th>96</th>\n      <td>9606.ENSP00000269305</td>\n      <td>9606.ENSP00000361186</td>\n      <td>0.510</td>\n      <td>510</td>\n    </tr>\n    <tr>\n      <th>97</th>\n      <td>9606.ENSP00000269305</td>\n      <td>9606.ENSP00000394560</td>\n      <td>0.641</td>\n      <td>641</td>\n    </tr>\n    <tr>\n      <th>98</th>\n      <td>9606.ENSP00000269305</td>\n      <td>9606.ENSP00000402107</td>\n     
 <td>0.488</td>\n      <td>488</td>\n    </tr>\n    <tr>\n      <th>99</th>\n      <td>9606.ENSP00000269305</td>\n      <td>9606.ENSP00000300093</td>\n      <td>0.737</td>\n      <td>737</td>\n    </tr>\n  </tbody>\n</table>\n<p>100 rows × 4 columns</p>\n</div>"
          },
          "metadata": {}
        }
      ]
    },
    {
      "metadata": {
        "ExecuteTime": {
          "start_time": "2023-07-18T13:14:53.649691Z",
          "end_time": "2023-07-18T13:14:53.662937Z"
        },
        "trusted": true
      },
      "cell_type": "code",
      "source": "# Database evidence: API 'dscore' (0-1 scale) vs flat file 'database' (0-1000 scale):\nm[['stringId_A', 'stringId_B', 'dscore_new', 'database']]",
      "execution_count": 75,
      "outputs": [
        {
          "output_type": "execute_result",
          "execution_count": 75,
          "data": {
            "text/plain": "              stringId_A            stringId_B  dscore_new  database\n0   9606.ENSP00000269305  9606.ENSP00000340989        0.75       750\n1   9606.ENSP00000269305  9606.ENSP00000263253        0.90       900\n2   9606.ENSP00000269305  9606.ENSP00000437955        0.00         0\n3   9606.ENSP00000269305  9606.ENSP00000362649        0.50       500\n4   9606.ENSP00000269305  9606.ENSP00000335153        0.00         0\n..                   ...                   ...         ...       ...\n95  9606.ENSP00000269305  9606.ENSP00000380024        0.00         0\n96  9606.ENSP00000269305  9606.ENSP00000361186        0.75       750\n97  9606.ENSP00000269305  9606.ENSP00000394560        0.90       900\n98  9606.ENSP00000269305  9606.ENSP00000402107        0.70       700\n99  9606.ENSP00000269305  9606.ENSP00000300093        0.00         0\n\n[100 rows x 4 columns]",
            "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>stringId_A</th>\n      <th>stringId_B</th>\n      <th>dscore_new</th>\n      <th>database</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>9606.ENSP00000269305</td>\n      <td>9606.ENSP00000340989</td>\n      <td>0.75</td>\n      <td>750</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>9606.ENSP00000269305</td>\n      <td>9606.ENSP00000263253</td>\n      <td>0.90</td>\n      <td>900</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>9606.ENSP00000269305</td>\n      <td>9606.ENSP00000437955</td>\n      <td>0.00</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>9606.ENSP00000269305</td>\n      <td>9606.ENSP00000362649</td>\n      <td>0.50</td>\n      <td>500</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>9606.ENSP00000269305</td>\n      <td>9606.ENSP00000335153</td>\n      <td>0.00</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>...</th>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n    </tr>\n    <tr>\n      <th>95</th>\n      <td>9606.ENSP00000269305</td>\n      <td>9606.ENSP00000380024</td>\n      <td>0.00</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>96</th>\n      <td>9606.ENSP00000269305</td>\n      <td>9606.ENSP00000361186</td>\n      <td>0.75</td>\n      <td>750</td>\n    </tr>\n    <tr>\n      <th>97</th>\n      <td>9606.ENSP00000269305</td>\n      <td>9606.ENSP00000394560</td>\n      <td>0.90</td>\n      <td>900</td>\n    </tr>\n    <tr>\n      <th>98</th>\n      <td>9606.ENSP00000269305</td>\n      <td>9606.ENSP00000402107</td>\n      <td>0.70</td>\n  
    <td>700</td>\n    </tr>\n    <tr>\n      <th>99</th>\n      <td>9606.ENSP00000269305</td>\n      <td>9606.ENSP00000300093</td>\n      <td>0.00</td>\n      <td>0</td>\n    </tr>\n  </tbody>\n</table>\n<p>100 rows × 4 columns</p>\n</div>"
          },
          "metadata": {}
        }
      ]
    },
    {
      "metadata": {
        "ExecuteTime": {
          "start_time": "2023-07-18T13:16:53.025832Z",
          "end_time": "2023-07-18T13:16:53.039187Z"
        },
        "trusted": true
      },
      "cell_type": "code",
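      "source": "# Checking whether the API 'pscore' corresponds to the flat file 'homology' column,\n# looking only at pairs with a non-zero 'pscore':\nm[['stringId_A', 'stringId_B', 'pscore_new', 'homology']].query('pscore_new != 0')",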
      "execution_count": 77,
      "outputs": [
        {
          "output_type": "execute_result",
          "execution_count": 77,
          "data": {
            "text/plain": "              stringId_A            stringId_B  pscore_new  homology\n86  9606.ENSP00000269305  9606.ENSP00000367545       0.068       876",
            "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>stringId_A</th>\n      <th>stringId_B</th>\n      <th>pscore_new</th>\n      <th>homology</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>86</th>\n      <td>9606.ENSP00000269305</td>\n      <td>9606.ENSP00000367545</td>\n      <td>0.068</td>\n      <td>876</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
          },
          "metadata": {}
        }
      ]
    },
    {
      "metadata": {
        "ExecuteTime": {
          "start_time": "2023-07-18T13:18:05.153261Z",
          "end_time": "2023-07-18T13:22:33.139845Z"
        },
        "trusted": true
      },
      "cell_type": "code",
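      "source": "# Saving the merged dataset to the bucket (space separated; gzip compression is inferred from the extension):\nmerged_df.to_csv('gs://ot-team/dsuveges/9606.protein.links.full_w_homology.v12.0.txt.gz', sep=' ', index=False, compression='infer')",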
      "execution_count": 78,
      "outputs": []
    },
    {
      "metadata": {
        "ExecuteTime": {
          "start_time": "2023-07-18T13:38:04.932295Z",
          "end_time": "2023-07-18T13:38:11.921046Z"
        },
        "trusted": true
      },
      "cell_type": "code",
      "source": "%%bash\n\nfile_name='gs://ot-team/dsuveges/9606.protein.links.full_w_homology.v12.0.txt.gz'\n\n# Printing the header and the TP53 - SFN row of the uploaded file as an aligned table:\ncat <(gsutil cat \"${file_name}\" | zcat | head -n1) \\\n    <(gsutil cat \"${file_name}\" | zcat | grep \"9606.ENSP00000269305 9606.ENSP00000340989\" | head -n1 ) \\\n    | column -t",
      "execution_count": 80,
      "outputs": [
        {
          "output_type": "stream",
          "text": "protein1              protein2              neighborhood  fusion  cooccurence  coexpression  experimental  database  textmining  combined_score  homology\n9606.ENSP00000269305  9606.ENSP00000340989  0             0       0            0             981           750       859         999             0\n",
          "name": "stdout"
        }
      ]
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "### Flat file vs API response:\n\n**Flat file:**\n\n```\nprotein1              protein2              neighborhood  fusion  cooccurence  coexpression  experimental  database  textmining  combined_score  homology\n9606.ENSP00000269305  9606.ENSP00000340989  0             0       0            0             981           750       859         999             0\n```\n\n\n**API response:**\n\n- [URL]()\n\n```json\n{\n    \"stringId_A\": \"9606.ENSP00000269305\",\n    \"stringId_B\": \"9606.ENSP00000340989\",\n    \"preferredName_A\": \"TP53\",\n    \"preferredName_B\": \"SFN\",\n    \"ncbiTaxonId\": 9606,\n    \"score\": 0.999,\n    \"nscore\": 0,\n    \"fscore\": 0,\n    \"pscore\": 0,\n    \"ascore\": 0,\n    \"escore\": 0.981,\n    \"dscore\": 0.75,\n    \"tscore\": 0.859\n}\n```"
    },
    {
      "metadata": {
        "ExecuteTime": {
          "start_time": "2023-07-18T13:41:57.831015Z",
          "end_time": "2023-07-18T13:42:10.800736Z"
        },
        "trusted": true
      },
      "cell_type": "code",
      "source": "%%bash\n\n# Line counts (including the header) of the new v.12.0 file and the v.11.5 file from the 22.11 release:\ngsutil cat gs://ot-team/dsuveges/9606.protein.links.full_w_homology.v12.0.txt.gz | zcat | wc -l\ngsutil cat gs://open-targets-data-releases/22.11/input/interactions-inputs/9606.protein.links.full_w_homology.v11.5.txt.gz | zcat | wc -l",
      "execution_count": 81,
      "outputs": [
        {
          "output_type": "stream",
          "text": "13715405\n11759455\n",
          "name": "stdout"
        }
      ]
    },
    {
      "metadata": {
        "ExecuteTime": {
          "start_time": "2023-07-18T14:57:17.613495Z",
          "end_time": "2023-07-18T14:57:17.946933Z"
        },
        "trusted": true
      },
      "cell_type": "code",
      "source": "# Showing a few pairs with a non-zero homology score:\nmerged_df.query(\"homology != 0\").head()",
      "execution_count": 83,
      "outputs": [
        {
          "output_type": "execute_result",
          "execution_count": 83,
          "data": {
            "text/plain": "                 protein1              protein2  neighborhood  fusion  \\\n26   9606.ENSP00000000233  9606.ENSP00000310226             0       0   \n44   9606.ENSP00000000233  9606.ENSP00000377769             0       0   \n109  9606.ENSP00000000233  9606.ENSP00000385432             0       0   \n131  9606.ENSP00000000233  9606.ENSP00000331748             0       0   \n173  9606.ENSP00000000233  9606.ENSP00000300935             0       0   \n\n     cooccurence  coexpression  experimental  database  textmining  \\\n26             0           109           110       500         154   \n44             0             0           162         0          77   \n109            0           124             0         0          75   \n131            0            74            98         0          59   \n173            0           117           334         0         126   \n\n     combined_score  homology  \n26              648       655  \n44              217       792  \n109             178       790  \n131             158       642  \n173             460       656  ",
            "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>protein1</th>\n      <th>protein2</th>\n      <th>neighborhood</th>\n      <th>fusion</th>\n      <th>cooccurence</th>\n      <th>coexpression</th>\n      <th>experimental</th>\n      <th>database</th>\n      <th>textmining</th>\n      <th>combined_score</th>\n      <th>homology</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>26</th>\n      <td>9606.ENSP00000000233</td>\n      <td>9606.ENSP00000310226</td>\n      <td>0</td>\n      <td>0</td>\n      <td>0</td>\n      <td>109</td>\n      <td>110</td>\n      <td>500</td>\n      <td>154</td>\n      <td>648</td>\n      <td>655</td>\n    </tr>\n    <tr>\n      <th>44</th>\n      <td>9606.ENSP00000000233</td>\n      <td>9606.ENSP00000377769</td>\n      <td>0</td>\n      <td>0</td>\n      <td>0</td>\n      <td>0</td>\n      <td>162</td>\n      <td>0</td>\n      <td>77</td>\n      <td>217</td>\n      <td>792</td>\n    </tr>\n    <tr>\n      <th>109</th>\n      <td>9606.ENSP00000000233</td>\n      <td>9606.ENSP00000385432</td>\n      <td>0</td>\n      <td>0</td>\n      <td>0</td>\n      <td>124</td>\n      <td>0</td>\n      <td>0</td>\n      <td>75</td>\n      <td>178</td>\n      <td>790</td>\n    </tr>\n    <tr>\n      <th>131</th>\n      <td>9606.ENSP00000000233</td>\n      <td>9606.ENSP00000331748</td>\n      <td>0</td>\n      <td>0</td>\n      <td>0</td>\n      <td>74</td>\n      <td>98</td>\n      <td>0</td>\n      <td>59</td>\n      <td>158</td>\n      <td>642</td>\n    </tr>\n    <tr>\n      <th>173</th>\n      <td>9606.ENSP00000000233</td>\n      <td>9606.ENSP00000300935</td>\n      <td>0</td>\n      <td>0</td>\n      
<td>0</td>\n      <td>117</td>\n      <td>334</td>\n      <td>0</td>\n      <td>126</td>\n      <td>460</td>\n      <td>656</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
          },
          "metadata": {}
        }
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3",
      "language": "python"
    },
    "language_info": {
      "name": "python",
      "version": "3.10.8",
      "mimetype": "text/x-python",
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "pygments_lexer": "ipython3",
      "nbconvert_exporter": "python",
      "file_extension": ".py"
    },
    "gist": {
      "id": "",
      "data": {
        "description": "GCS/Updating STRING.ipynb",
        "public": true
      }
    }
  },
  "nbformat": 4,
  "nbformat_minor": 5
}