DSuveges/Issue-3174 - Reproducing cooccurrences from matches.ipynb

## Issue-3174 - Reproducing cooccurrences from matches.ipynb
{
  "cells": [
    {
      "metadata": {},
      "id": "292a5c20",
      "cell_type": "markdown",
      "source": "As we have grounded entities in the matches dataset, we can just use that dataset to generate all cooccurrences. All should be alright."
    },
    {
      "metadata": {
        "ExecuteTime": {
          "start_time": "2023-12-12T07:13:07.365019Z",
          "end_time": "2023-12-12T07:13:18.684065Z"
        },
        "trusted": true
      },
      "id": "62acc5c5",
      "cell_type": "code",
      "source": "from pyspark.sql import SparkSession, functions as f, types as t\nfrom pyspark.sql.window import Window\n\nspark = SparkSession.builder.getOrCreate()\n\nmatches_path = 'gs://open-targets-data-releases/23.12/output/etl/parquet/literature/matches'\npmids = [\n    '35101074', '34886853', '35119481'\n]",
      "execution_count": 1,
      "outputs": [
        {
          "output_type": "stream",
          "text": "Setting default log level to \"WARN\".\nTo adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n23/12/12 07:13:11 INFO SparkEnv: Registering MapOutputTracker\n23/12/12 07:13:11 INFO SparkEnv: Registering BlockManagerMaster\n23/12/12 07:13:11 INFO SparkEnv: Registering BlockManagerMasterHeartbeat\n23/12/12 07:13:11 INFO SparkEnv: Registering OutputCommitCoordinator\n23/12/12 07:13:17 WARN GhfsStorageStatistics: Detected potential high latency for operation op_get_file_status. latencyMs=211; previousMaxLatencyMs=0; operationCount=1; context=gs://dataproc-temp-europe-west1-426265110888-ymkbpaze/64dcfdf8-46d3-4b5c-aad4-0a12ee0ba91a/spark-job-history\n23/12/12 07:13:17 WARN GhfsStorageStatistics: Detected potential high latency for operation op_mkdirs. latencyMs=235; previousMaxLatencyMs=0; operationCount=1; context=gs://dataproc-temp-europe-west1-426265110888-ymkbpaze/64dcfdf8-46d3-4b5c-aad4-0a12ee0ba91a/spark-job-history\n",
          "name": "stderr"
        }
      ]
    },
    {
      "metadata": {
        "ExecuteTime": {
          "end_time": "2023-12-11T21:40:22.988501Z",
          "start_time": "2023-12-11T21:38:24.896127Z"
        },
        "trusted": false
      },
      "id": "c42c2e8b",
      "cell_type": "code",
      "source": "matches = (\n    spark.read.parquet(matches_path)\n    .filter(f.col('pmid').isin(pmids))\n    .persist()\n)\nmatches.count()",
      "execution_count": 2,
      "outputs": [
        {
          "name": "stderr",
          "output_type": "stream",
          "text": "                                                                                \r"
        },
        {
          "data": {
            "text/plain": "410"
          },
          "execution_count": 2,
          "metadata": {},
          "output_type": "execute_result"
        }
      ]
    },
    {
      "metadata": {
        "ExecuteTime": {
          "start_time": "2023-12-11T22:41:04.761179Z",
          "end_time": "2023-12-11T22:41:05.374964Z"
        },
        "trusted": true
      },
      "cell_type": "code",
      "source": "matches.unpersist()\n",
      "execution_count": 36,
      "outputs": [
        {
          "output_type": "execute_result",
          "execution_count": 36,
          "data": {
            "text/plain": "DataFrame[pmid: string, pmcid: string, pubDate: string, date: date, year: int, month: int, day: int, organisms: array<string>, section: string, text: string, trace_source: string, endInSentence: bigint, label: string, labelN: string, sectionEnd: bigint, sectionStart: bigint, startInSentence: bigint, type: string, keywordId: string, isMapped: boolean]"
          },
          "metadata": {}
        }
      ]
    },
    {
      "metadata": {
        "ExecuteTime": {
          "end_time": "2023-12-11T21:43:31.270071Z",
          "start_time": "2023-12-11T21:43:30.581728Z"
        },
        "trusted": false
      },
      "id": "2ba87e88",
      "cell_type": "code",
      "source": "matches.printSchema()\nmatches.show(1, False, True)\nmatches.select('text', 'label', 'sectionEnd', 'sectionStart').show()",
      "execution_count": 4,
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": "root\n |-- pmid: string (nullable = true)\n |-- pmcid: string (nullable = true)\n |-- pubDate: string (nullable = true)\n |-- date: date (nullable = true)\n |-- year: integer (nullable = true)\n |-- month: integer (nullable = true)\n |-- day: integer (nullable = true)\n |-- organisms: array (nullable = true)\n |    |-- element: string (containsNull = true)\n |-- section: string (nullable = true)\n |-- text: string (nullable = true)\n |-- trace_source: string (nullable = true)\n |-- endInSentence: long (nullable = true)\n |-- label: string (nullable = true)\n |-- labelN: string (nullable = true)\n |-- sectionEnd: long (nullable = true)\n |-- sectionStart: long (nullable = true)\n |-- startInSentence: long (nullable = true)\n |-- type: string (nullable = true)\n |-- keywordId: string (nullable = true)\n |-- isMapped: boolean (nullable = true)\n\n-RECORD 0--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------\n pmid            | 34886853                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                \n pmcid           | PMC8656033              
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                \n pubDate         | 2021-12-01                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              \n date            | 2021-12-01                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              \n year            | 2021                                                                                                                                                                                                                                                                                                                                                                                                             
                                                                                                                       \n month           | 12                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      \n day             | 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       \n organisms       | [human]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 \n section         | abstract                                                                                                                                                                                                                                     
                                                                                                                                                                                                                                                                                           \n text            | Comparing the diagnosis results of preoperative MRI scan, serum tumor markers, and postoperative pathological examination using single factor comparison, we determined the MRI scan results, the comprehensive matching rate between serum tumor markers (squamous cell carcinoma antigen (SCCA), carbohydrate antigen 125 (CA125)) and postoperative pathological results, and the differences of sensitivity, specificity, and accuracy in the prediction of lymph node metastasis and para-uterine infiltration of cervical cancer. \n trace_source    |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         \n endInSentence   | 288                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     \n label           | SCCA                                                                     
                                                                                                                                                                                                                                                                                                                                                                                                                                                               \n labelN          | scca                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    \n sectionEnd      | 1088                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    \n sectionStart    | 569                                                                                                                                                                                                                                                                                                                                                                                                                                                               
                                                                      \n startInSentence | 284                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     \n type            | GP                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      \n keywordId       | ENSG00000057149                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         \n isMapped        | true                                                                                                                                                                                                                                                                                          
                                                                                                                                                                                                                                          \nonly showing top 1 row\n\n"
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": "+--------------------+--------------------+----------+------------+\n|                text|               label|sectionEnd|sectionStart|\n+--------------------+--------------------+----------+------------+\n|Comparing the dia...|                SCCA|      1088|         569|\n|SCCA was first di...|                SCCA|      null|        null|\n|Serum SCCA and CA...|                SCCA|      null|        null|\n|Among them, SCCA ...|                SCCA|      null|        null|\n|Among them, SCCA ...|                SCCA|      null|        null|\n|Among them, SCCA ...|                SCCA|      null|        null|\n|In summary, the s...|                SCCA|      null|        null|\n|Adults with confi...|portal vein throm...|      null|        null|\n|Recommendations f...|             albumin|      null|        null|\n|SCCA in periphera...|squamous cell car...|      null|        null|\n|Serum tumor marke...|squamous cell cancer|      null|        null|\n|SCCA is a highly ...|squamous cell car...|      null|        null|\n|Lenvatinib is a p...|                 KIT|      null|        null|\n|For proteins with...|            TGF-beta|      null|        null|\n|Molecular testing...|Pitt-Hopkins synd...|      null|        null|\n|Lenvatinib, a mul...|                PD-1|       336|         138|\n|Programmed cell d...|                PD-1|      null|        null|\n|Additionally, PD-...|                PD-1|      null|        null|\n|PD-L2, another li...|                PD-1|      null|        null|\n|Pembrolizumab is ...|                PD-1|      null|        null|\n+--------------------+--------------------+----------+------------+\nonly showing top 20 rows\n\n"
        }
      ]
    },
    {
      "metadata": {
        "ExecuteTime": {
          "start_time": "2023-12-11T22:43:57.702Z"
        },
        "trusted": true
      },
      "id": "5b84cb20",
      "cell_type": "code",
      "source": "matches_fields = [\n    'endInSentence', 'label', 'labelN', 'startInSentence',\n    'type', 'keywordId', 'pmid', 'text'\n]\n\n\n# Resulting pair: \"GT-DS\" -> gene/protein to disease/syndrome cooccurrence\n(\n    spark.read.parquet(matches_path)\n    .filter(f.col('type') == 'GP')\n    .alias('left')\n    .join(\n        (\n            spark.read.parquet(matches_path)\n            .select('pmid', 'text', *matches_fields)\n            .filter(f.col('type') == 'DS')\n            .alias('right')            \n        ),\n        on=[\n            (f.col('left.pmid') == f.col('right.pmid')) &\n            (f.col('left.text') == f.col('right.text'))\n        ],\n        how='inner'\n        \n    )\n    .select(\n        # Publication data:\n        f.col('left.pmid').alias('pmid'),\n        f.col('left.pmcid').alias('pmcid'), \n        f.col('left.pubDate').alias('pubDate'),\n        f.col('left.year').alias('year'), \n        f.col('left.month').alias('month'), \n        f.col('left.day').alias('day'), \n        f.col('left.organisms').alias('organisms'), \n        # Sentence data:\n        f.col('left.section').alias('section'), \n        f.col('left.text').alias('text'),\n        # Disease data:\n        f.col('left.startInSentence').alias('start1'), \n        f.col('left.endInSentence').alias('end1'), \n        f.col('left.label').alias('label1'), \n        f.col('left.type').alias('type1'), \n        f.col('left.keywordId').alias('keywordId1'),\n        # Disease data:\n        f.col('right.startInSentence').alias('start2'), \n        f.col('right.endInSentence').alias('end2'), \n        f.col('right.label').alias('label2'), \n        f.col('right.type').alias('type2'), \n        f.col('right.keywordId').alias('keywordId2'),\n        # Cooccurrence data:\n        f.concat_ws('-', f.col('left.type'), f.col('right.type')),\n    )\n    .write.mode('overwrite').parquet('gs://ot-team/dsuveges/cooccurrence_prototype_2023.23.11')\n)\n\n",
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "text": "23/12/11 22:44:03 WARN GhfsStorageStatistics: Detected potential high latency for operation op_delete. latencyMs=104; previousMaxLatencyMs=0; operationCount=1; context=gs://ot-team/dsuveges/cooccurrence_prototype_2023.23.11\n23/12/11 23:27:01 WARN GhfsStorageStatistics: Detected potential high latency for operation op_delete. latencyMs=581; previousMaxLatencyMs=104; operationCount=2; context=gs://ot-team/dsuveges/cooccurrence_prototype_2023.23.11/_temporary\n23/12/11 23:27:01 WARN GhfsStorageStatistics: Detected potential high latency for operation stream_write_close_operations. latencyMs=120; previousMaxLatencyMs=0; operationCount=1; context=gs://ot-team/dsuveges/cooccurrence_prototype_2023.23.11/_SUCCESS\n",
          "name": "stderr"
        }
      ]
    },
    {
      "metadata": {
        "ExecuteTime": {
          "start_time": "2023-12-11T22:00:28.460728Z",
          "end_time": "2023-12-11T22:00:29.776916Z"
        },
        "trusted": true
      },
      "cell_type": "code",
      "source": "(\n    cooccurrences\n    .groupby('pmid1')\n    .count()\n    .show(truncate=False)\n)\n",
      "execution_count": 18,
      "outputs": [
        {
          "output_type": "stream",
          "text": "[Stage 59:=====================================================>(513 + 9) / 522]\r",
          "name": "stderr"
        },
        {
          "output_type": "stream",
          "text": "+--------+-----+\n|pmid1   |count|\n+--------+-----+\n|35119481|26   |\n|34886853|7    |\n|35101074|6    |\n+--------+-----+\n\n",
          "name": "stdout"
        },
        {
          "output_type": "stream",
          "text": "\r                                                                                \r",
          "name": "stderr"
        }
      ]
    },
    {
      "metadata": {
        "ExecuteTime": {
          "start_time": "2023-12-11T21:56:16.440542Z",
          "end_time": "2023-12-11T21:56:16.667568Z"
        },
        "trusted": true
      },
      "cell_type": "code",
      "source": "(\n    matches\n    .select(*[f.col(column).alias(f'{column}1') if column in matches_fields else column for column in matches.columns])\n    .filter(\n        f.col('type') == 'GP'\n    )\n    .alias('left')\n    .show()\n)",
      "execution_count": 13,
      "outputs": [
        {
          "output_type": "stream",
          "text": "+--------+----------+----------+----------+----+-----+---+--------------------+--------+--------------------+------------+--------------+--------------------+--------------------+----------+------------+----------------+-----+---------------+--------+\n|    pmid|     pmcid|   pubDate|      date|year|month|day|           organisms| section|                text|trace_source|endInSentence1|              label1|             labelN1|sectionEnd|sectionStart|startInSentence1|type1|     keywordId1|isMapped|\n+--------+----------+----------+----------+----+-----+---+--------------------+--------+--------------------+------------+--------------+--------------------+--------------------+----------+------------+----------------+-----+---------------+--------+\n|34886853|PMC8656033|2021-12-01|2021-12-01|2021|   12|  1|             [human]|abstract|Comparing the dia...|            |           288|                SCCA|                scca|      1088|         569|             284|   GP|ENSG00000057149|    true|\n|34886853|PMC8656033|2021-12-01|2021-12-01|2021|   12|  1|             [human]|   intro|SCCA was first di...|            |             4|                SCCA|                scca|      null|        null|               0|   GP|ENSG00000057149|    true|\n|34886853|PMC8656033|2021-12-01|2021-12-01|2021|   12|  1|             [human]|   intro|Serum SCCA and CA...|            |            10|                SCCA|                scca|      null|        null|               6|   GP|ENSG00000057149|    true|\n|34886853|PMC8656033|2021-12-01|2021-12-01|2021|   12|  1|             [human]| discuss|Among them, SCCA ...|            |            16|                SCCA|                scca|      null|        null|              12|   GP|ENSG00000057149|    true|\n|34886853|PMC8656033|2021-12-01|2021-12-01|2021|   12|  1|             [human]| discuss|Among them, SCCA ...|            |            16|                SCCA|                scca|      null|        null|       
       12|   GP|ENSG00000057149|    true|\n|34886853|PMC8656033|2021-12-01|2021-12-01|2021|   12|  1|             [human]| discuss|Among them, SCCA ...|            |           107|                SCCA|                scca|      null|        null|             103|   GP|ENSG00000057149|    true|\n|34886853|PMC8656033|2021-12-01|2021-12-01|2021|   12|  1|             [human]|   concl|In summary, the s...|            |            26|                SCCA|                scca|      null|        null|              22|   GP|ENSG00000057149|    true|\n|35119481|PMC8940827|2022-04-01|2022-04-01|2022|    4|  1|                null|   intro|Recommendations f...|            |           301|             albumin|             albumin|      null|        null|             294|   GP|ENSG00000163631|    true|\n|35119481|PMC8940827|2022-04-01|2022-04-01|2022|    4|  1|                null|   intro|Lenvatinib is a p...|            |           165|                 KIT|                 kit|      null|        null|             162|   GP|ENSG00000157404|    true|\n|35101074|PMC8802438|2022-01-01|2022-01-01|2022|    1|  1|[mouse, human, ze...| results|For proteins with...|            |           207|            TGF-beta|             tgfbeta|      null|        null|             199|   GP|ENSG00000105329|    true|\n|35119481|PMC8940827|2022-04-01|2022-04-01|2022|    4|  1|                null|abstract|Lenvatinib, a mul...|            |            62|                PD-1|                 1pd|       336|         138|              58|   GP|ENSG00000265681|    true|\n|35119481|PMC8940827|2022-04-01|2022-04-01|2022|    4|  1|                null|   intro|Programmed cell d...|            |            37|                PD-1|                 1pd|      null|        null|              33|   GP|ENSG00000265681|    true|\n|35119481|PMC8940827|2022-04-01|2022-04-01|2022|    4|  1|                null|   intro|Additionally, PD-...|            |            18|                PD-1|                 1pd|      
null|        null|              14|   GP|ENSG00000265681|    true|\n|35119481|PMC8940827|2022-04-01|2022-04-01|2022|    4|  1|                null|   intro|PD-L2, another li...|            |            29|                PD-1|                 1pd|      null|        null|              25|   GP|ENSG00000265681|    true|\n|35119481|PMC8940827|2022-04-01|2022-04-01|2022|    4|  1|                null|   intro|Pembrolizumab is ...|            |            61|                PD-1|                 1pd|      null|        null|              57|   GP|ENSG00000265681|    true|\n|35101074|PMC8802438|2022-01-01|2022-01-01|2022|    1|  1|[mouse, human, ze...| results|This is accompani...|            |           240|Laminin subunit g...|1gammalamininsubunit|      null|        null|             217|   GP|ENSG00000135862|    true|\n|35101074|PMC8802438|2022-01-01|2022-01-01|2022|    1|  1|[mouse, human, ze...|   title|Novel insights in...|            |            25|               PORCN|               porcn|      null|        null|              20|   GP|ENSG00000102312|    true|\n|35101074|PMC8802438|2022-01-01|2022-01-01|2022|    1|  1|[mouse, human, ze...|abstract|Goltz syndrome (G...|            |           130|               PORCN|               porcn|       151|          10|             125|   GP|ENSG00000102312|    true|\n|35101074|PMC8802438|2022-01-01|2022-01-01|2022|    1|  1|[mouse, human, ze...|abstract|We report two cas...|            |           212|               PORCN|               porcn|      null|        null|             207|   GP|ENSG00000102312|    true|\n|35101074|PMC8802438|2022-01-01|2022-01-01|2022|    1|  1|[mouse, human, ze...|abstract|Genotyping reveal...|            |            33|               PORCN|               porcn|      null|        null|              28|   GP|ENSG00000102312|    
true|\n+--------+----------+----------+----------+----+-----+---+--------------------+--------+--------------------+------------+--------------+--------------------+--------------------+----------+------------+----------------+-----+---------------+--------+\nonly showing top 20 rows\n\n",
          "name": "stdout"
        }
      ]
    },
    {
      "metadata": {
        "ExecuteTime": {
          "start_time": "2023-12-11T22:12:41.804318Z",
          "end_time": "2023-12-11T22:12:42.924591Z"
        },
        "trusted": true
      },
      "cell_type": "code",
      "source": "(\n    matches\n    .groupBy('pmid', 'text')\n    .agg(\n        f.size(f.filter(f.collect_list('type'), lambda x: x == 'DS')).alias('disaseCount'),\n        f.size(f.filter(f.collect_list('type'), lambda x: x == 'GP')).alias('targetCount')\n\n    )\n    .withColumn('expectedCooc', f.col('disaseCount') * f.col('targetCount'))\n    .filter(f.col('expectedCooc') != 0)\n    .groupBy('pmid')\n    .agg(\n        f.sum(f.col('expectedCooc')).alias('allCoocCount')\n    )\n    .show()\n)",
      "execution_count": 27,
      "outputs": [
        {
          "output_type": "stream",
          "text": "+--------+------------+\n|    pmid|allCoocCount|\n+--------+------------+\n|34886853|           7|\n|35101074|           6|\n|35119481|          26|\n+--------+------------+\n\n",
          "name": "stdout"
        }
      ]
    },
    {
      "metadata": {
        "ExecuteTime": {
          "start_time": "2023-12-12T07:13:30.107730Z",
          "end_time": "2023-12-12T07:13:48.809086Z"
        },
        "trusted": true
      },
      "cell_type": "code",
      "source": "(\n    spark.read.parquet('gs://open-targets-data-releases/23.12/output/etl/parquet/literature/cooccurrences')\n    .filter(f.col('type') == 'GP-DS')\n    .count()\n)",
      "execution_count": 2,
      "outputs": [
        {
          "output_type": "stream",
          "text": "                                                                                \r",
          "name": "stderr"
        },
        {
          "output_type": "execute_result",
          "execution_count": 2,
          "data": {
            "text/plain": "39556102"
          },
          "metadata": {}
        }
      ]
    },
    {
      "metadata": {
        "ExecuteTime": {
          "start_time": "2023-12-12T07:14:00.521926Z",
          "end_time": "2023-12-12T07:14:07.322547Z"
        },
        "trusted": true
      },
      "cell_type": "code",
      "source": "(\n    spark.read.parquet('gs://ot-team/dsuveges/cooccurrence_prototype_2023.23.11')\n    .count()\n)",
      "execution_count": 3,
      "outputs": [
        {
          "output_type": "stream",
          "text": "                                                                                \r",
          "name": "stderr"
        },
        {
          "output_type": "execute_result",
          "execution_count": 3,
          "data": {
            "text/plain": "59223038"
          },
          "metadata": {}
        }
      ]
    },
    {
      "metadata": {
        "ExecuteTime": {
          "start_time": "2023-12-12T07:15:08.900489Z",
          "end_time": "2023-12-12T07:15:38.341819Z"
        },
        "trusted": true
      },
      "cell_type": "code",
      "source": "print(\n    spark.read.parquet('gs://ot-team/dsuveges/cooccurrence_prototype_2023.23.11')\n    .select('keywordId1', 'keywordId2', 'pmid')\n    .distinct()\n    .count()\n)\n\nprint(\n    spark.read.parquet('gs://ot-team/dsuveges/cooccurrence_prototype_2023.23.11')\n    .select('keywordId1', 'keywordId2')\n    .distinct()\n    .count()\n)",
      "execution_count": 4,
      "outputs": [
        {
          "output_type": "stream",
          "text": "                                                                                \r",
          "name": "stderr"
        },
        {
          "output_type": "stream",
          "text": "21252765\n",
          "name": "stdout"
        },
        {
          "output_type": "stream",
          "text": "[Stage 18:======================================================> (33 + 1) / 34]\r",
          "name": "stderr"
        },
        {
          "output_type": "stream",
          "text": "2033207\n",
          "name": "stdout"
        },
        {
          "output_type": "stream",
          "text": "23/12/12 07:21:47 WARN YarnAllocator: Container from a bad node: container_1702299062935_0004_01_000007 on host: ds-genetics-etl-test-m.c.open-targets-eu-dev.internal. Exit status: 143. Diagnostics: [2023-12-12 07:21:47.161]Container killed on request. Exit code is 143\n[2023-12-12 07:21:47.161]Container exited with a non-zero exit code 143. \n[2023-12-12 07:21:47.162]Killed by external signal\n.\n23/12/12 07:21:47 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Requesting driver to remove executor 7 for reason Container from a bad node: container_1702299062935_0004_01_000007 on host: ds-genetics-etl-test-m.c.open-targets-eu-dev.internal. Exit status: 143. Diagnostics: [2023-12-12 07:21:47.161]Container killed on request. Exit code is 143\n[2023-12-12 07:21:47.161]Container exited with a non-zero exit code 143. \n[2023-12-12 07:21:47.162]Killed by external signal\n.\n23/12/12 07:21:47 ERROR YarnScheduler: Lost executor 7 on ds-genetics-etl-test-m.c.open-targets-eu-dev.internal: Container from a bad node: container_1702299062935_0004_01_000007 on host: ds-genetics-etl-test-m.c.open-targets-eu-dev.internal. Exit status: 143. Diagnostics: [2023-12-12 07:21:47.161]Container killed on request. Exit code is 143\n[2023-12-12 07:21:47.161]Container exited with a non-zero exit code 143. \n[2023-12-12 07:21:47.162]Killed by external signal\n.\n",
          "name": "stderr"
        }
      ]
    },
    {
      "metadata": {
        "trusted": true
      },
      "cell_type": "code",
      "source": "",
      "execution_count": null,
      "outputs": []
    }
  ],
  "metadata": {
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3",
      "language": "python"
    },
    "language_info": {
      "name": "python",
      "version": "3.10.8",
      "mimetype": "text/x-python",
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "pygments_lexer": "ipython3",
      "nbconvert_exporter": "python",
      "file_extension": ".py"
    },
    "gist": {
      "id": "",
      "data": {
        "description": "Reproducing cooccurrences from matches",
        "public": false
      }
    }
  },
  "nbformat": 4,
  "nbformat_minor": 5
}
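
The per-publication sanity check in the notebook (expected pairs = GP mentions times DS mentions in each sentence, summed per pmid) can be sketched without Spark. This is a minimal illustration on made-up rows, not the notebook's own code; `toy_matches` and `expected_cooccurrences` are hypothetical names introduced here, and a sentence is assumed to be identified by its `(pmid, text)` pair, as in the notebook's join.

```python
from collections import Counter, defaultdict

# Hypothetical toy rows standing in for the matches dataset: (pmid, text, type).
toy_matches = [
    ("1", "sentence A", "GP"),
    ("1", "sentence A", "DS"),
    ("1", "sentence A", "DS"),
    ("1", "sentence B", "GP"),  # no disease in this sentence, contributes 0
    ("2", "sentence C", "GP"),
    ("2", "sentence C", "GP"),
    ("2", "sentence C", "DS"),
]


def expected_cooccurrences(rows):
    """Sum, per pmid, the GP count times the DS count of each sentence."""
    # Count entity types per sentence, keyed by (pmid, text).
    per_sentence = defaultdict(Counter)
    for pmid, text, entity_type in rows:
        per_sentence[(pmid, text)][entity_type] += 1

    # Every GP mention pairs with every DS mention in the same sentence.
    per_pmid = Counter()
    for (pmid, _text), counts in per_sentence.items():
        per_pmid[pmid] += counts["GP"] * counts["DS"]

    # Drop publications without any GP-DS pair. The notebook filters out
    # zero-product sentences before summing, which yields the same totals.
    return {pmid: n for pmid, n in per_pmid.items() if n > 0}


expected = expected_cooccurrences(toy_matches)
print(expected)
```

An inner join of GP rows with DS rows on `pmid` and `text` produces one row per (GP mention, DS mention) pair within a sentence, so its per-pmid row count should match these totals.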