DSuveges/community-1306 - missing publication from literature.ipynb

## community-1306 - missing publication from literature.ipynb
{
  "cells": [
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "# Missing publication from literature\n\nUser [reported](https://community.opentargets.org/t/missing-paper-from-text-mining/1306) an obvious publication was missing from our literature dataset. We need to find out where did our piplines leaked.\n\n"
    },
    {
      "metadata": {
        "ExecuteTime": {
          "start_time": "2023-12-07T12:34:54.249195Z",
          "end_time": "2023-12-07T12:34:54.255707Z"
        },
        "trusted": true
      },
      "cell_type": "code",
      "source": "from pyspark.sql import SparkSession, functions as f, types as t\n\nspark = SparkSession.builder.getOrCreate()\n\npmid = '35101074'\npmcid = 'PMC8802438'\n\nmatches_path = 'gs://open-targets-data-releases/23.12/output/etl/parquet/literature/matches/'\ncoocurrences_path = 'gs://open-targets-data-releases/23.12/output/etl/parquet/literature/cooccurrences'\nfailed_matches_path = 'gs://open-targets-data-releases/23.12/output/etl/parquet/literature/failedMatches/'\nraw_full_text_path = 'gs://otar025-epmc/ml02/fulltext'\n\n",
      "execution_count": 11,
      "outputs": []
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "Is the paper in the matches dataset?"
    },
    {
      "metadata": {
        "ExecuteTime": {
          "start_time": "2023-12-07T12:31:27.034659Z",
          "end_time": "2023-12-07T12:32:08.662741Z"
        },
        "trusted": true
      },
      "cell_type": "code",
      "source": "(\n    spark.read.parquet(matches_path)\n    .filter(\n        (f.col('pmid') == pmid) | \n        (f.col('pmcid') == pmcid)\n    )\n    .show(1, False, True)\n)",
      "execution_count": 8,
      "outputs": [
        {
          "output_type": "stream",
          "text": "[Stage 31:======================================================>(99 + 1) / 100]\r",
          "name": "stderr"
        },
        {
          "output_type": "stream",
          "text": "-RECORD 0----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------\n pmid            | 35101074                                                                                                                                                                                                                                        \n pmcid           | PMC8802438                                                                                                                                                                                                                                      \n pubDate         | 2022-01-01                                                                                                                                                                                                                                      \n date            | 2022-01-01                                                                                                                                                                                                                                      \n year            | 2022                                                                                                                                                                                                                                            \n month           | 1                                                                                                                                                                                                                                               \n day             | 1                                                                                                                                      
                                                                                                         \n organisms       | [mouse, human, zebrafish, rabbit, Human]                                                                                                                                                                                                        \n section         | results                                                                                                                                                                                                                                         \n text            | For proteins with decreased abundance in the secretome, proteomaps-based in silico studies (based on the dysregulation of ECM proteins) hint toward perturbed signalling pathways including PI3K-AKT-, TGF-beta- and Hippo-signalling (Fig. 2). \n trace_source    |                                                                                                                                                                                                                                                 \n endInSentence   | 207                                                                                                                                                                                                                                             \n label           | TGF-beta                                                                                                                                                                                                                                        \n labelN          | tgfbeta                                                                                                                                                                                                                                         \n sectionEnd      | null                                           
                                                                                                                                                                                                 \n sectionStart    | null                                                                                                                                                                                                                                            \n startInSentence | 199                                                                                                                                                                                                                                             \n type            | GP                                                                                                                                                                                                                                              \n keywordId       | ENSG00000105329                                                                                                                                                                                                                                 \n isMapped        | true                                                                                                                                                                                                                                            \nonly showing top 1 row\n\n",
          "name": "stdout"
        },
        {
          "output_type": "stream",
          "text": "\r                                                                                \r",
          "name": "stderr"
        }
      ]
    },
    {
      "metadata": {
        "ExecuteTime": {
          "start_time": "2023-12-07T12:32:12.509411Z",
          "end_time": "2023-12-07T12:32:27.272344Z"
        },
        "trusted": true
      },
      "cell_type": "code",
      "source": "matches = (\n    spark.read.parquet(matches_path)\n    .filter(\n        (f.col('pmid') == pmid) | \n        (f.col('pmcid') == pmcid)\n    )\n    .select(\n        'pmid', 'pmcid', 'label', 'type', 'keywordId'\n    )\n    .distinct()\n    .persist()\n)\n\nprint(matches.count())\nmatches.show(10, truncate=False)",
      "execution_count": 9,
      "outputs": [
        {
          "output_type": "stream",
          "text": "[Stage 33:=====================================================>(521 + 1) / 522]\r",
          "name": "stderr"
        },
        {
          "output_type": "stream",
          "text": "93\n+--------+----------+----------------------------------------------------------------+----+---------------+\n|pmid    |pmcid     |label                                                           |type|keywordId      |\n+--------+----------+----------------------------------------------------------------+----+---------------+\n|35101074|PMC8802438|penicillin                                                      |CD  |CHEMBL1223     |\n|35101074|PMC8802438|connective tissue disorder                                      |DS  |EFO_1001986    |\n|35101074|PMC8802438|FDH                                                             |DS  |MONDO_0010592  |\n|35101074|PMC8802438|Heat shock cognate 71 kDa protein                               |GP  |ENSG00000109971|\n|35101074|PMC8802438|Ectonucleotide pyrophosphatase/phosphodiesterase family member 2|GP  |ENSG00000136960|\n|35101074|PMC8802438|Alpha-actinin-4                                                 |GP  |ENSG00000130402|\n|35101074|PMC8802438|amino acid                                                      |CD  |CHEMBL1201498  |\n|35101074|PMC8802438|astigmatism                                                     |DS  |HP_0000483     |\n|35101074|PMC8802438|epilepsy                                                        |DS  |EFO_0000474    |\n|35101074|PMC8802438|PERK                                                            |GP  |ENSG00000172071|\n+--------+----------+----------------------------------------------------------------+----+---------------+\nonly showing top 10 rows\n\n",
          "name": "stdout"
        },
        {
          "output_type": "stream",
          "text": "\r                                                                                \r",
          "name": "stderr"
        }
      ]
    },
    {
      "metadata": {
        "ExecuteTime": {
          "start_time": "2023-12-07T12:32:31.408990Z",
          "end_time": "2023-12-07T12:32:31.764468Z"
        },
        "trusted": true
      },
      "cell_type": "code",
      "source": "matches.groupby('type').count().show()",
      "execution_count": 10,
      "outputs": [
        {
          "output_type": "stream",
          "text": "+----+-----+\n|type|count|\n+----+-----+\n|  DS|   33|\n|  GP|   47|\n|  CD|   13|\n+----+-----+\n\n",
          "name": "stdout"
        }
      ]
    },
    {
      "metadata": {
        "ExecuteTime": {
          "start_time": "2023-12-07T12:38:53.321392Z",
          "end_time": "2023-12-07T12:39:03.768942Z"
        },
        "trusted": true
      },
      "cell_type": "code",
      "source": "cooc = (\n    spark.read.parquet(coocurrences_path)\n    .filter(\n        ((f.col('pmid') == pmid) | \n        (f.col('pmcid') == pmcid))\n    )\n    .persist()\n)\n\ncooc.count()",
      "execution_count": 19,
      "outputs": [
        {
          "output_type": "stream",
          "text": "                                                                                \r",
          "name": "stderr"
        },
        {
          "output_type": "execute_result",
          "execution_count": 19,
          "data": {
            "text/plain": "7"
          },
          "metadata": {}
        }
      ]
    },
    {
      "metadata": {
        "ExecuteTime": {
          "start_time": "2023-12-07T12:40:04.804022Z",
          "end_time": "2023-12-07T12:40:04.990647Z"
        },
        "trusted": true
      },
      "cell_type": "code",
      "source": "cooc.select('label1', 'keywordId1', 'label2', 'keywordId2', 'section', 'type').show(truncate=False)",
      "execution_count": 21,
      "outputs": [
        {
          "output_type": "stream",
          "text": "+---------+---------------+-------------------+-------------+--------+-----+\n|label1   |keywordId1     |label2             |keywordId2   |section |type |\n+---------+---------------+-------------------+-------------+--------+-----+\n|NF-kappaB|ENSG00000109320|glucose            |CHEMBL1222250|results |GP-CD|\n|PORCN    |ENSG00000102312|amino acid         |CHEMBL1201498|intro   |GP-CD|\n|PORCN    |ENSG00000102312|amino acid         |CHEMBL1201498|discuss |GP-CD|\n|GS       |EFO_0007285    |amino acid         |CHEMBL1201498|intro   |DS-CD|\n|PORCN    |ENSG00000102312|developmental delay|HP_0001263   |abstract|GP-DS|\n|PORCN    |ENSG00000102312|Goltz syndrome     |MONDO_0010592|abstract|GP-DS|\n|PORCN    |ENSG00000102312|GS                 |EFO_0007285  |intro   |GP-DS|\n+---------+---------------+-------------------+-------------+--------+-----+\n\n",
          "name": "stdout"
        }
      ]
    },
    {
      "metadata": {
        "ExecuteTime": {
          "start_time": "2023-12-07T12:42:57.925936Z",
          "end_time": "2023-12-07T12:43:13.913757Z"
        },
        "trusted": true
      },
      "cell_type": "code",
      "source": "(\n    spark.read.parquet(matches_path)\n    .filter(\n        (f.col('pmid') == pmid) | \n        (f.col('pmcid') == pmcid)\n    )\n    .filter(\n        (f.col('type') == 'DS') &\n        (f.col('section') == 'abstract')\n    )\n    .select('label', 'keywordId', 'section')\n    .show(50, truncate=False)\n)",
      "execution_count": 25,
      "outputs": [
        {
          "output_type": "stream",
          "text": "[Stage 109:====================================================>(396 + 1) / 397]\r",
          "name": "stderr"
        },
        {
          "output_type": "stream",
          "text": "+-------------------+-------------+--------+\n|label              |keywordId    |section |\n+-------------------+-------------+--------+\n|microcephaly       |HP_0000252   |abstract|\n|microcephaly       |HP_0000252   |abstract|\n|Goltz syndrome     |MONDO_0010592|abstract|\n|cerebral atrophy   |HP_0002059   |abstract|\n|GS                 |EFO_0007285  |abstract|\n|GS                 |EFO_0007285  |abstract|\n|GS                 |EFO_0007285  |abstract|\n|GS                 |EFO_0007285  |abstract|\n|GS                 |EFO_0007285  |abstract|\n|developmental delay|HP_0001263   |abstract|\n|developmental delay|HP_0001263   |abstract|\n|epilepsy           |EFO_0000474  |abstract|\n|epilepsy           |EFO_0000474  |abstract|\n+-------------------+-------------+--------+\n\n",
          "name": "stdout"
        },
        {
          "output_type": "stream",
          "text": "\r                                                                                \r",
          "name": "stderr"
        }
      ]
    },
    {
      "metadata": {
        "ExecuteTime": {
          "start_time": "2023-12-07T12:44:31.073787Z",
          "end_time": "2023-12-07T12:44:38.847942Z"
        },
        "trusted": true
      },
      "cell_type": "code",
      "source": "(\n    spark.read.parquet(matches_path)\n    .filter(\n        (f.col('pmid') == pmid) | \n        (f.col('pmcid') == pmcid)\n    )\n    .filter(\n        (f.col('label') == 'epilepsy') &\n        (f.col('section') == 'abstract')\n    )\n    .select('label', 'keywordId', 'text')\n    .show(50, truncate=False)\n)",
      "execution_count": 26,
      "outputs": [
        {
          "output_type": "stream",
          "text": "[Stage 115:====================================================>(395 + 2) / 397]\r",
          "name": "stderr"
        },
        {
          "output_type": "stream",
          "text": "+--------+-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n|label   |keywordId  |text                                                                                                                                                                                                                                                            |\n+--------+-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n|epilepsy|EFO_0000474|We report two cases: one girl suffering from typical skin and skeletal abnormalities, developmental delay, microcephaly, thin corpus callosum, periventricular gliosis and drug-resistant epilepsy caused by a PORCN nonsense-mutation (c.283C > T, p.Arg95Ter).|\n|epilepsy|EFO_0000474|The other patient is a boy with a supernumerary nipple and skeletal anomalies but also, developmental delay, microcephaly, cerebral atrophy with delayed myelination and drug-resistant epilepsy as predominant features.                                       |\n+--------+-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n\n",
          "name": "stdout"
        },
        {
          "output_type": "stream",
          "text": "\r[Stage 115:====================================================>(396 + 1) / 397]\r\r                                                                                \r",
          "name": "stderr"
        }
      ]
    },
    {
      "metadata": {
        "ExecuteTime": {
          "start_time": "2023-12-07T12:45:13.873730Z",
          "end_time": "2023-12-07T12:45:23.703238Z"
        },
        "trusted": true
      },
      "cell_type": "code",
      "source": "(\n    spark.read.parquet(matches_path)\n    .filter(\n        (f.col('pmid') == pmid) | \n        (f.col('pmcid') == pmcid)\n    )\n    .filter(\n        (f.col('label') == 'PORCN') &\n        (f.col('section') == 'abstract')\n    )\n    .select('label', 'keywordId', 'text')\n    .show(50, truncate=False)\n)",
      "execution_count": 27,
      "outputs": [
        {
          "output_type": "stream",
          "text": "[Stage 121:====================================================>(396 + 1) / 397]\r",
          "name": "stderr"
        },
        {
          "output_type": "stream",
          "text": "+-----+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n|label|keywordId      |text                                                                                                                                                                                                                                                                       |\n+-----+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n|PORCN|ENSG00000102312|Goltz syndrome (GS) is a X-linked disorder defined by defects of mesodermal- and ectodermal-derived structures and caused by PORCN mutations.                                                                                                                              |\n|PORCN|ENSG00000102312|We report two cases: one girl suffering from typical skin and skeletal abnormalities, developmental delay, microcephaly, thin corpus callosum, periventricular gliosis and drug-resistant epilepsy caused by a PORCN nonsense-mutation (c.283C > T, p.Arg95Ter).           |\n|PORCN|ENSG00000102312|Genotyping revealed a novel PORCN missense-mutation (c.847G > C, p.Asp283His) absent in the Genome Aggregation Database (gnomAD) but also identified in his asymptomatic mother.                                                                                           
|\n|PORCN|ENSG00000102312|Given that non-random X-chromosome inactivation was excluded in the mother, fibroblasts of the index had been analyzed for PORCN protein-abundance and -distribution, vulnerability against additional ER-stress burden as well as for protein secretion revealing changes.|\n+-----+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n\n",
          "name": "stdout"
        },
        {
          "output_type": "stream",
          "text": "\r                                                                                \r",
          "name": "stderr"
        }
      ]
    },
    {
      "metadata": {
        "ExecuteTime": {
          "start_time": "2023-12-07T13:03:22.487874Z",
          "end_time": "2023-12-07T13:07:26.952044Z"
        },
        "trusted": true
      },
      "cell_type": "code",
      "source": "raw_input = (\n    spark.read.json('gs://otar025-epmc/ml02/fulltext/', recursiveFileLookup=True)\n    .filter(f.col('pmcid') == pmcid)\n    .select(\n        'pmid',\n        'pmcid',\n        'timestamp',\n        f.input_file_name().alias('filename'),\n        f.explode(f.col('sentences')).alias('sentence')\n    )\n    .persist()\n)",
      "execution_count": 35,
      "outputs": [
        {
          "output_type": "stream",
          "text": "23/12/07 13:03:38 WARN GhfsStorageStatistics: Detected potential high latency for operation stream_write_operations. latencyMs=337; previousMaxLatencyMs=5; operationCount=9502; context=gs://dataproc-temp-europe-west1-426265110888-ymkbpaze/64dcfdf8-46d3-4b5c-aad4-0a12ee0ba91a/spark-job-history/application_1701951553986_0001.inprogress\n                                                                                \r",
          "name": "stderr"
        }
      ]
    },
    {
      "metadata": {
        "ExecuteTime": {
          "start_time": "2023-12-07T13:07:26.955089Z",
          "end_time": "2023-12-07T13:07:26.960390Z"
        },
        "trusted": true
      },
      "cell_type": "code",
      "source": "raw_input.printSchema()",
      "execution_count": 36,
      "outputs": [
        {
          "output_type": "stream",
          "text": "root\n |-- pmid: string (nullable = true)\n |-- pmcid: string (nullable = true)\n |-- timestamp: string (nullable = true)\n |-- filename: string (nullable = false)\n |-- sentence: struct (nullable = true)\n |    |-- co-occurrence: array (nullable = true)\n |    |    |-- element: struct (containsNull = true)\n |    |    |    |-- association: long (nullable = true)\n |    |    |    |-- end1: long (nullable = true)\n |    |    |    |-- end2: long (nullable = true)\n |    |    |    |-- label1: string (nullable = true)\n |    |    |    |-- label2: string (nullable = true)\n |    |    |    |-- sentEvidenceScore: double (nullable = true)\n |    |    |    |-- start1: long (nullable = true)\n |    |    |    |-- start2: long (nullable = true)\n |    |    |    |-- type: string (nullable = true)\n |    |-- matches: array (nullable = true)\n |    |    |-- element: struct (containsNull = true)\n |    |    |    |-- endInSentence: long (nullable = true)\n |    |    |    |-- label: string (nullable = true)\n |    |    |    |-- startInSentence: long (nullable = true)\n |    |    |    |-- type: string (nullable = true)\n |    |-- section: string (nullable = true)\n |    |-- sectionEnd: long (nullable = true)\n |    |-- sectionStart: long (nullable = true)\n |    |-- text: string (nullable = true)\n\n",
          "name": "stdout"
        }
      ]
    },
    {
      "metadata": {
        "ExecuteTime": {
          "start_time": "2023-12-07T13:07:48.678518Z",
          "end_time": "2023-12-07T13:09:44.489341Z"
        },
        "trusted": true
      },
      "cell_type": "code",
      "source": "raw_input.count()",
      "execution_count": 37,
      "outputs": [
        {
          "output_type": "stream",
          "text": "                                                                                \r",
          "name": "stderr"
        },
        {
          "output_type": "execute_result",
          "execution_count": 37,
          "data": {
            "text/plain": "278"
          },
          "metadata": {}
        }
      ]
    },
    {
      "metadata": {
        "ExecuteTime": {
          "start_time": "2023-12-07T13:12:04.868310Z",
          "end_time": "2023-12-07T13:12:05.981381Z"
        },
        "trusted": true
      },
      "cell_type": "code",
      "source": "(\n    raw_input\n    .select(\n        'pmid',\n        f.explode(f.col('sentence.co-occurrence')).alias('coocurrence'),\n        f.col('sentence.text').alias('text'),\n         f.col('sentence.section').alias('section'),\n    )\n    .filter(f.col('section') == 'ABSTRACT')\n    .count()\n#     .show()\n)",
      "execution_count": 40,
      "outputs": [
        {
          "output_type": "stream",
          "text": "                                                                                \r",
          "name": "stderr"
        },
        {
          "output_type": "execute_result",
          "execution_count": 40,
          "data": {
            "text/plain": "4"
          },
          "metadata": {}
        }
      ]
    },
    {
      "metadata": {
        "ExecuteTime": {
          "start_time": "2023-12-07T14:13:56.802160Z",
          "end_time": "2023-12-07T14:13:58.144476Z"
        },
        "trusted": true
      },
      "cell_type": "code",
      "source": "(\n    raw_input\n    .select(\n        'pmid','filename',\n        f.explode(f.col('sentence.co-occurrence')).alias('coocurrence'),\n        f.col('sentence.text').alias('text'),\n        f.col('sentence.section').alias('section'),\n    )\n    .filter(f.col('section') == 'ABSTRACT')\n    .select(\n        '*',\n        f.col('coocurrence.label1').alias('label1'),\n        f.col('coocurrence.label2').alias('label2')\n    )\n    .filter(f.col('text').startswith(sentence))\n    .drop('coocurrence', 'filename')\n    .distinct()\n#     .count()\n    .show(1000)\n)",
      "execution_count": 52,
      "outputs": [
        {
          "output_type": "stream",
          "text": "[Stage 204:=========================================>        (2274 + 50) / 2727]\r",
          "name": "stderr"
        },
        {
          "output_type": "stream",
          "text": "+--------+--------------------+--------+------+-------------------+\n|    pmid|                text| section|label1|             label2|\n+--------+--------------------+--------+------+-------------------+\n|35101074|We report two cas...|ABSTRACT| PORCN|developmental delay|\n+--------+--------------------+--------+------+-------------------+\n\n",
          "name": "stdout"
        },
        {
          "output_type": "stream",
          "text": "\r                                                                                \r",
          "name": "stderr"
        }
      ]
    },
    {
      "metadata": {
        "ExecuteTime": {
          "start_time": "2023-12-07T14:40:13.084209Z",
          "end_time": "2023-12-07T14:40:13.087609Z"
        },
        "trusted": true
      },
      "cell_type": "code",
      "source": "path_to_file = 'gs://otar025-epmc/ml02/fulltext/2023_05_30/NMP_patch-29-05-2023-201.jsonl'",
      "execution_count": 53,
      "outputs": []
    },
    {
      "metadata": {
        "ExecuteTime": {
          "start_time": "2023-12-07T14:12:46.722009Z",
          "end_time": "2023-12-07T14:12:47.952739Z"
        },
        "trusted": true
      },
      "cell_type": "code",
      "source": "sentence = 'We report two cases'\n\n(\n    raw_input\n    .select(\n        'pmid','filename',\n        f.explode(f.col('sentence.matches')).alias('matches'),\n        f.col('sentence.text').alias('text'),\n        f.col('sentence.section').alias('section'),\n    )\n    .filter(f.col('text').startswith(sentence))\n    .select(\n        '*',\n        f.col('matches.label').alias('label'),\n        f.col('matches.type').alias('type')\n    )\n    .drop('matches', 'filename')\n    .distinct()\n#     .count()\n    .show(1000)\n)",
      "execution_count": 50,
      "outputs": [
        {
          "output_type": "stream",
          "text": "\r[Stage 195:==============================>                   (1690 + 34) / 2727]\r\r[Stage 195:=========================================>        (2245 + 33) / 2727]\r",
          "name": "stderr"
        },
        {
          "output_type": "stream",
          "text": "+--------+--------------------+--------+-------------------+----+\n|    pmid|                text| section|              label|type|\n+--------+--------------------+--------+-------------------+----+\n|35101074|We report two cas...|ABSTRACT|    periventricular|  DS|\n|35101074|We report two cas...|ABSTRACT|developmental delay|  DS|\n|35101074|We report two cas...|ABSTRACT|       microcephaly|  DS|\n|35101074|We report two cas...|ABSTRACT|              PORCN|  GP|\n|35101074|We report two cas...|ABSTRACT|           epilepsy|  DS|\n+--------+--------------------+--------+-------------------+----+\n\n",
          "name": "stdout"
        },
        {
          "output_type": "stream",
          "text": "\r[Stage 195:==================================================>(2705 + 8) / 2727]\r\r                                                                                \r",
          "name": "stderr"
        }
      ]
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "### Are there always one cooccurrence from one sentence?"
    },
    {
      "metadata": {
        "ExecuteTime": {
          "start_time": "2023-12-07T15:16:57.965064Z",
          "end_time": "2023-12-07T15:16:58.791155Z"
        },
        "trusted": true
      },
      "cell_type": "code",
      "source": "cooc_count = (\n    spark.read.json(path_to_file)\n    .withColumn('sentence', f.explode(f.col('sentences')))\n    .select(\n        'pmid',\n        f.col('sentence.text').alias('text'),\n        f.col('sentence.section').alias('section'),\n        f.size(\n            f.filter(\n                f.col('sentence.matches'),\n                lambda col: col.type == 'GP'\n            )\n        ).alias('match_gp'),\n        f.size(\n            f.filter(\n                f.col('sentence.matches'),\n                lambda col: col.type == 'DS'\n            )\n        ).alias('match_ds'),\n        f.size(\n            f.filter(\n                f.col('sentence.co-occurrence'),\n                lambda col: col.type == 'GP-DS'\n            )\n        ).alias('coocc_ds_gp')\n    )\n    .filter(f.col('coocc_ds_gp')>0)\n    .persist()\n)\n\ncooc_count.show()",
      "execution_count": 68,
      "outputs": [
        {
          "output_type": "stream",
          "text": "23/12/07 15:16:58 WARN CacheManager: Asked to cache already cached data.\n",
          "name": "stderr"
        },
        {
          "output_type": "stream",
          "text": "+--------+--------------------+--------+--------+--------+-----------+\n|    pmid|                text| section|match_gp|match_ds|coocc_ds_gp|\n+--------+--------------------+--------+--------+--------+-----------+\n|34886853|Clinical value of...|   title|       1|       2|          2|\n|34886853|Comparing the dia...|ABSTRACT|       4|       4|          9|\n|34886853|The levels of SCC...|ABSTRACT|       1|       1|          1|\n|34886853|SCCA was first di...|   INTRO|       1|       1|          1|\n|34886853|It has been confi...|   INTRO|       1|       2|          2|\n|34886853|Serum SCCA and CA...|   INTRO|       2|       2|          4|\n|34886853|This study intend...|   INTRO|       1|       2|          2|\n|34886853|Inclusion criteri...| METHODS|       1|       2|          1|\n|34886853|Relationship betw...| RESULTS|       1|       1|          1|\n|34886853|The positive rate...| RESULTS|       1|       4|          4|\n|34886853|Among them, SCCA ...| DISCUSS|       2|       2|          4|\n|34886853|Among them, SCCA ...| DISCUSS|       3|       1|          2|\n|34886853|CA125 protein is ...| DISCUSS|       1|       3|          3|\n|34886853|This study found ...| DISCUSS|       1|       3|          1|\n|34886853|In summary, the s...|   CONCL|       2|       5|         10|\n|34889997|Table 2 shows tha...| RESULTS|       1|       1|          1|\n|34889997|Likewise, LH leve...| RESULTS|       1|       1|          1|\n|34889997|Besides, AMH was ...| DISCUSS|       2|       1|          2|\n|34889997|Serum AMH has bee...| DISCUSS|       1|       1|          1|\n|35119481|Lenvatinib, a mul...|ABSTRACT|       1|       1|          1|\n+--------+--------------------+--------+--------+--------+-----------+\nonly showing top 20 rows\n\n",
          "name": "stdout"
        }
      ]
    },
    {
      "metadata": {
        "ExecuteTime": {
          "start_time": "2023-12-07T15:35:58.602676Z",
          "end_time": "2023-12-07T15:35:58.811195Z"
        },
        "trusted": true
      },
      "cell_type": "code",
      "source": "# Sentences where the co-occurrence count differs from match_gp * match_ds:\nmismatched = cooc_count.filter(\n    f.col('coocc_ds_gp') != (f.col('match_gp') * f.col('match_ds'))\n)\n\nprint(cooc_count.count())\nprint(mismatched.count())\nmismatched.show()",
      "execution_count": 84,
      "outputs": [
        {
          "output_type": "stream",
          "text": "2612\n816\n+--------+--------------------+--------+--------+--------+-----------+\n|    pmid|                text| section|match_gp|match_ds|coocc_ds_gp|\n+--------+--------------------+--------+--------+--------+-----------+\n|34886853|Comparing the dia...|ABSTRACT|       4|       4|          9|\n|34886853|Inclusion criteri...| METHODS|       1|       2|          1|\n|34886853|Among them, SCCA ...| DISCUSS|       3|       1|          2|\n|34886853|This study found ...| DISCUSS|       1|       3|          1|\n|35119481|Angiogenesis play...|   INTRO|       4|       2|          5|\n|35119481|In patients with ...|   INTRO|       3|       1|          1|\n|35119481|Recommendations f...|   INTRO|       2|       2|          3|\n|35119507|Our study suggest...|ABSTRACT|       3|       1|          1|\n|35119507|The secretion of ...| RESULTS|       4|       1|          1|\n|35119507|Mouse brain tissu...| RESULTS|       3|       1|          1|\n|35089541|This study showed...|ABSTRACT|       5|       2|          9|\n|35089541|Several studies, ...|   INTRO|       4|       2|          7|\n|35089541|Patient were grou...| METHODS|       1|       8|          6|\n|35089541|Regarding EPC OC ...| RESULTS|       4|       1|          3|\n|35089541|Our study showed ...| DISCUSS|       3|       2|          5|\n|35090502|All these data su...|ABSTRACT|       2|       1|          1|\n|35090502|In this study, we...|   INTRO|       2|       2|          2|\n|35090502|NQO1 overexpressi...| RESULTS|       4|       1|          1|\n|35090502|Therefore, to fur...| RESULTS|       3|       1|          1|\n|35090502|Previous studies ...| DISCUSS|       1|       2|          1|\n+--------+--------------------+--------+--------+--------+-----------+\nonly showing top 20 rows\n\n",
          "name": "stdout"
        }
      ]
    },
    {
      "metadata": {
        "ExecuteTime": {
          "start_time": "2023-12-07T15:24:05.219318Z",
          "end_time": "2023-12-07T15:24:05.534337Z"
        },
        "trusted": true
      },
      "cell_type": "code",
      "source": "# Drop the trivial sentences with exactly one GP and one DS match:\nmulti_match = cooc_count.filter(\n    ~((f.col('match_gp') == 1) & (f.col('match_ds') == 1))\n)\ncooc_product = f.col('match_gp') * f.col('match_ds')\n\n# Number of non-trivial sentences:\nprint(multi_match.count())\n# ...of which the co-occurrence count differs from the GP x DS product:\nprint(multi_match.filter(f.col('coocc_ds_gp') != cooc_product).count())\n# ...of which the co-occurrence count exceeds the GP x DS product:\nprint(multi_match.filter(f.col('coocc_ds_gp') > cooc_product).count())",
      "execution_count": 75,
      "outputs": [
        {
          "output_type": "stream",
          "text": "1634\n816\n0\n",
          "name": "stdout"
        }
      ]
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "Investigating this strange case, where 4 GP and 4 DS matches yield only 9 co-occurrences instead of 4 × 4 = 16:\n\n```\n+--------+--------------------+--------+--------+--------+-----------+\n|    pmid|                text| section|match_gp|match_ds|coocc_ds_gp|\n+--------+--------------------+--------+--------+--------+-----------+\n|34886853|Comparing the dia...|ABSTRACT|       4|       4|          9|\n+--------+--------------------+--------+--------+--------+-----------+\n```"
    },
    {
      "metadata": {
        "ExecuteTime": {
          "start_time": "2023-12-07T15:33:02.882964Z",
          "end_time": "2023-12-07T15:33:03.680864Z"
        },
        "trusted": true
      },
      "cell_type": "code",
      "source": "(\n    spark.read.json(path_to_file)\n    .filter(f.col('pmid') == '34886853')\n    .withColumn('sentence', f.explode(f.col('sentences')))\n    .filter(f.col('sentence.text').startswith('Comparing the dia'))\n    .select(\n        'sentence.text'\n    )\n    .show(1000, truncate=False)\n)",
      "execution_count": 83,
      "outputs": [
        {
          "output_type": "stream",
          "text": "+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n|text                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |\n+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n|Comparing the diagnosis results of preoperative MRI scan, serum tumor markers, and postoperative pathological examination using single factor comparison, we determined the MRI scan results, the comprehensive matching rate between serum tumor markers (squamous cell carcinoma antigen (SCCA), carbohydrate antigen 125 (CA125)) and postoperative pathological results, and the differences of sensitivity, 
specificity, and accuracy in the prediction of lymph node metastasis and para-uterine infiltration of cervical cancer.|\n+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n\n",
          "name": "stdout"
        }
      ]
    },
    {
      "metadata": {
        "ExecuteTime": {
          "start_time": "2023-12-07T15:29:43.857506Z",
          "end_time": "2023-12-07T15:29:44.646582Z"
        },
        "trusted": true
      },
      "cell_type": "code",
      "source": "(\n    spark.read.json(path_to_file)\n    .filter(f.col('pmid') == '34886853')\n    .withColumn('sentence', f.explode(f.col('sentences')))\n    .filter(f.col('sentence.text').startswith('Comparing the dia'))\n    .select(\n        'pmid', 'pmcid', f.explode('sentence.matches').alias('match')\n    )\n    .select(\n        'pmid', 'pmcid', \n        f.col('match.label').alias('label'),\n        f.col('match.type').alias('type')\n    )\n    .show(1000, truncate=False)\n)\n",
      "execution_count": 80,
      "outputs": [
        {
          "output_type": "stream",
          "text": "+--------+----------+-------------------------------+----+\n|pmid    |pmcid     |label                          |type|\n+--------+----------+-------------------------------+----+\n|34886853|PMC8656033|tumor                          |DS  |\n|34886853|PMC8656033|tumor                          |DS  |\n|34886853|PMC8656033|squamous cell carcinoma antigen|GP  |\n|34886853|PMC8656033|SCCA                           |GP  |\n|34886853|PMC8656033|carbohydrate antigen 125       |GP  |\n|34886853|PMC8656033|CA125                          |GP  |\n|34886853|PMC8656033|lymph node metastasis          |DS  |\n|34886853|PMC8656033|cervical cancer                |DS  |\n+--------+----------+-------------------------------+----+\n\n",
          "name": "stdout"
        }
      ]
    },
    {
      "metadata": {
        "ExecuteTime": {
          "start_time": "2023-12-07T15:31:29.387935Z",
          "end_time": "2023-12-07T15:31:30.127261Z"
        },
        "trusted": true
      },
      "cell_type": "code",
      "source": "(\n    spark.read.json(path_to_file)\n    .filter(f.col('pmid') == '34886853')\n    .withColumn('sentence', f.explode(f.col('sentences')))\n    .filter(f.col('sentence.text').startswith('Comparing the dia'))\n    .select(\n        'pmid', 'pmcid', f.explode('sentence.co-occurrence').alias('co')\n    )\n    .select(\n        'pmid', 'pmcid', \n        f.col('co.label1').alias('label1'),\n        f.col('co.label2').alias('label2'),\n        f.col('co.type').alias('type')\n    )\n    .show(1000, truncate=False)\n)",
      "execution_count": 82,
      "outputs": [
        {
          "output_type": "stream",
          "text": "+--------+----------+-------------------------------+---------------------+-----+\n|pmid    |pmcid     |label1                         |label2               |type |\n+--------+----------+-------------------------------+---------------------+-----+\n|34886853|PMC8656033|squamous cell carcinoma antigen|tumor                |GP-DS|\n|34886853|PMC8656033|squamous cell carcinoma antigen|lymph node metastasis|GP-DS|\n|34886853|PMC8656033|squamous cell carcinoma antigen|cervical cancer      |GP-DS|\n|34886853|PMC8656033|SCCA                           |lymph node metastasis|GP-DS|\n|34886853|PMC8656033|SCCA                           |cervical cancer      |GP-DS|\n|34886853|PMC8656033|carbohydrate antigen 125       |lymph node metastasis|GP-DS|\n|34886853|PMC8656033|carbohydrate antigen 125       |cervical cancer      |GP-DS|\n|34886853|PMC8656033|CA125                          |lymph node metastasis|GP-DS|\n|34886853|PMC8656033|CA125                          |cervical cancer      |GP-DS|\n+--------+----------+-------------------------------+---------------------+-----+\n\n",
          "name": "stdout"
        }
      ]
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "Doing the same for another sentence where the counts do not add up:\n```\n35119481|Angiogenesis play\n```"
    },
    {
      "metadata": {
        "ExecuteTime": {
          "start_time": "2023-12-07T15:41:25.075331Z",
          "end_time": "2023-12-07T15:41:26.688174Z"
        },
        "trusted": true
      },
      "cell_type": "code",
      "source": "def examine_sentence(pmid: str, sentence_start: str) -> None:\n    \"\"\"Show the GP/DS matches and the GP-DS co-occurrences of a sentence.\"\"\"\n    # Select the sentence of the publication by the start of its text:\n    sentence = (\n        spark.read.json(path_to_file)\n        .filter(f.col('pmid') == pmid)\n        .withColumn('sentence', f.explode(f.col('sentences')))\n        .filter(f.col('sentence.text').startswith(sentence_start))\n    )\n    # Gene/protein (GP) and disease (DS) matches of the sentence:\n    (\n        sentence\n        .select(\n            'pmid', 'pmcid', f.explode('sentence.matches').alias('match')\n        )\n        .select(\n            'pmid', 'pmcid',\n            f.col('match.label').alias('label'),\n            f.col('match.type').alias('type')\n        )\n        .filter(f.col('type').isin(['DS', 'GP']))\n        .show(1000, truncate=False)\n    )\n    # GP-DS co-occurrences of the same sentence:\n    (\n        sentence\n        .select(\n            'pmid', 'pmcid', f.explode('sentence.co-occurrence').alias('co')\n        )\n        .select(\n            'pmid', 'pmcid',\n            f.col('co.label1').alias('label1'),\n            f.col('co.label2').alias('label2'),\n            f.col('co.type').alias('type')\n        )\n        .filter(f.col('type') == 'GP-DS')\n        .show(1000, truncate=False)\n    )\n\n\nexamine_sentence('35119481', 'Angiogenesis play')",
      "execution_count": 88,
      "outputs": [
        {
          "output_type": "stream",
          "text": "+--------+----------+----------------------------------+----+\n|pmid    |pmcid     |label                             |type|\n+--------+----------+----------------------------------+----+\n|35119481|PMC8940827|tumor                             |DS  |\n|35119481|PMC8940827|vascular endothelial growth factor|GP  |\n|35119481|PMC8940827|VEGF                              |GP  |\n|35119481|PMC8940827|fibroblast growth factor          |GP  |\n|35119481|PMC8940827|FGF                               |GP  |\n|35119481|PMC8940827|HCC                               |DS  |\n+--------+----------+----------------------------------+----+\n\n+--------+----------+----------------------------------+------+-----+\n|pmid    |pmcid     |label1                            |label2|type |\n+--------+----------+----------------------------------+------+-----+\n|35119481|PMC8940827|vascular endothelial growth factor|tumor |GP-DS|\n|35119481|PMC8940827|vascular endothelial growth factor|HCC   |GP-DS|\n|35119481|PMC8940827|VEGF                              |HCC   |GP-DS|\n|35119481|PMC8940827|fibroblast growth factor          |HCC   |GP-DS|\n|35119481|PMC8940827|FGF                               |HCC   |GP-DS|\n+--------+----------+----------------------------------+------+-----+\n\n",
          "name": "stdout"
        }
      ]
    },
    {
      "metadata": {
        "ExecuteTime": {
          "start_time": "2023-12-07T15:41:54.007820Z",
          "end_time": "2023-12-07T15:41:56.075049Z"
        },
        "trusted": true
      },
      "cell_type": "code",
      "source": "# |35089541|Patient were grou...| METHODS|       1|       8|          6|\nexamine_sentence('35089541', 'Patient were grou')",
      "execution_count": 89,
      "outputs": [
        {
          "output_type": "stream",
          "text": "+--------+----------+------------------------+----+\n|pmid    |pmcid     |label                   |type|\n+--------+----------+------------------------+----+\n|35089541|PMC9098612|hypertension            |DS  |\n|35089541|PMC9098612|type 2 diabetes mellitus|DS  |\n|35089541|PMC9098612|T2DM                    |DS  |\n|35089541|PMC9098612|hemoglobin              |GP  |\n|35089541|PMC9098612|hypercholesterolaemia   |DS  |\n|35089541|PMC9098612|obesity                 |DS  |\n|35089541|PMC9098612|premature               |DS  |\n|35089541|PMC9098612|CAD                     |DS  |\n|35089541|PMC9098612|CAD                     |DS  |\n+--------+----------+------------------------+----+\n\n+--------+----------+----------+---------------------+-----+\n|pmid    |pmcid     |label1    |label2               |type |\n+--------+----------+----------+---------------------+-----+\n|35089541|PMC9098612|hemoglobin|hypertension         |GP-DS|\n|35089541|PMC9098612|hemoglobin|hypercholesterolaemia|GP-DS|\n|35089541|PMC9098612|hemoglobin|obesity              |GP-DS|\n|35089541|PMC9098612|hemoglobin|premature            |GP-DS|\n|35089541|PMC9098612|hemoglobin|CAD                  |GP-DS|\n|35089541|PMC9098612|hemoglobin|CAD                  |GP-DS|\n+--------+----------+----------+---------------------+-----+\n\n",
          "name": "stdout"
        }
      ]
    },
    {
      "metadata": {
        "ExecuteTime": {
          "start_time": "2023-12-07T15:43:09.679950Z",
          "end_time": "2023-12-07T15:43:11.123397Z"
        },
        "trusted": true
      },
      "cell_type": "code",
      "source": "# |35090502|Therefore, to fur...| RESULTS|       3|       1|          1|\nexamine_sentence('35090502', 'Therefore, to fur')",
      "execution_count": 90,
      "outputs": [
        {
          "output_type": "stream",
          "text": "+--------+----------+-----------------+----+\n|pmid    |pmcid     |label            |type|\n+--------+----------+-----------------+----+\n|35090502|PMC8796493|NQO1             |GP  |\n|35090502|PMC8796493|diabetes mellitus|DS  |\n|35090502|PMC8796493|NQO1             |GP  |\n|35090502|PMC8796493|NQO1             |GP  |\n+--------+----------+-----------------+----+\n\n+--------+----------+------+-----------------+-----+\n|pmid    |pmcid     |label1|label2           |type |\n+--------+----------+------+-----------------+-----+\n|35090502|PMC8796493|NQO1  |diabetes mellitus|GP-DS|\n+--------+----------+------+-----------------+-----+\n\n",
          "name": "stdout"
        }
      ]
    },
    {
      "metadata": {
        "ExecuteTime": {
          "start_time": "2023-12-07T15:44:57.729230Z",
          "end_time": "2023-12-07T15:44:59.163844Z"
        },
        "trusted": true
      },
      "cell_type": "code",
      "source": "# |35090502|All these data su...|ABSTRACT|       2|       1|          1|\nexamine_sentence('35090502', 'All these data su')",
      "execution_count": 93,
      "outputs": [
        {
          "output_type": "stream",
          "text": "+--------+----------+-----+----+\n|pmid    |pmcid     |label|type|\n+--------+----------+-----+----+\n|35090502|PMC8796493|NQO1 |GP  |\n|35090502|PMC8796493|DN   |DS  |\n|35090502|PMC8796493|Sirt1|GP  |\n+--------+----------+-----+----+\n\n+--------+----------+------+------+-----+\n|pmid    |pmcid     |label1|label2|type |\n+--------+----------+------+------+-----+\n|35090502|PMC8796493|NQO1  |DN    |GP-DS|\n+--------+----------+------+------+-----+\n\n",
          "name": "stdout"
        }
      ]
    },
    {
      "metadata": {
        "trusted": true
      },
      "cell_type": "code",
      "source": "",
      "execution_count": null,
      "outputs": []
    }
  ],
  "metadata": {
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3",
      "language": "python"
    },
    "language_info": {
      "name": "python",
      "version": "3.10.8",
      "mimetype": "text/x-python",
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "pygments_lexer": "ipython3",
      "nbconvert_exporter": "python",
      "file_extension": ".py"
    },
    "gist": {
      "id": "741a152efcff6c6a695737a366e52c33",
      "data": {
        "description": "missing publication in literature",
        "public": true
      }
    },
    "_draft": {
      "nbviewer_url": "https://gist.github.com/DSuveges/741a152efcff6c6a695737a366e52c33"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 5
}