missing publication in literature
{
"cells": [
{
"metadata": {},
"cell_type": "markdown",
"source": "# Missing publication from literature\n\nA user [reported](https://community.opentargets.org/t/missing-paper-from-text-mining/1306) that an obvious publication was missing from our literature dataset. We need to find out where our pipelines lost it.\n\n"
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2023-12-07T12:34:54.249195Z",
"end_time": "2023-12-07T12:34:54.255707Z"
},
"trusted": true
},
"cell_type": "code",
"source": "from pyspark.sql import SparkSession, functions as f, types as t\n\nspark = SparkSession.builder.getOrCreate()\n\npmid = '35101074'\npmcid = 'PMC8802438'\n\nmatches_path = 'gs://open-targets-data-releases/23.12/output/etl/parquet/literature/matches/'\ncoocurrences_path = 'gs://open-targets-data-releases/23.12/output/etl/parquet/literature/cooccurrences'\nfailed_matches_path = 'gs://open-targets-data-releases/23.12/output/etl/parquet/literature/failedMatches/'\nraw_full_text_path = 'gs://otar025-epmc/ml02/fulltext'\n\n",
"execution_count": 11,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Is the paper in the matches dataset?"
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2023-12-07T12:31:27.034659Z",
"end_time": "2023-12-07T12:32:08.662741Z"
},
"trusted": true
},
"cell_type": "code",
"source": "(\n spark.read.parquet(matches_path)\n .filter(\n (f.col('pmid') == pmid) |\n (f.col('pmcid') == pmcid)\n )\n .show(1, truncate=False, vertical=True)\n)",
"execution_count": 8,
"outputs": [
{
"output_type": "stream",
"text": "[Stage 31:======================================================>(99 + 1) / 100]\r",
"name": "stderr"
},
{
"output_type": "stream",
"text": "-RECORD 0----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------\n pmid | 35101074 \n pmcid | PMC8802438 \n pubDate | 2022-01-01 \n date | 2022-01-01 \n year | 2022 \n month | 1 \n day | 1 \n organisms | [mouse, human, zebrafish, rabbit, Human] \n section | results \n text | For proteins with decreased abundance in the secretome, proteomaps-based in silico studies (based on the dysregulation of ECM proteins) hint toward perturbed signalling pathways including PI3K-AKT-, TGF-beta- and Hippo-signalling (Fig. 2). \n trace_source | \n endInSentence | 207 \n label | TGF-beta \n labelN | tgfbeta \n sectionEnd | null \n sectionStart | null \n startInSentence | 199 \n type | GP \n keywordId | ENSG00000105329 \n isMapped | true \nonly showing top 1 row\n\n",
"name": "stdout"
},
{
"output_type": "stream",
"text": "\r \r",
"name": "stderr"
}
]
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2023-12-07T12:32:12.509411Z",
"end_time": "2023-12-07T12:32:27.272344Z"
},
"trusted": true
},
"cell_type": "code",
"source": "matches = (\n spark.read.parquet(matches_path)\n .filter(\n (f.col('pmid') == pmid) | \n (f.col('pmcid') == pmcid)\n )\n .select(\n 'pmid', 'pmcid', 'label', 'type', 'keywordId'\n )\n .distinct()\n .persist()\n)\n\nprint(matches.count())\nmatches.show(10, truncate=False)",
"execution_count": 9,
"outputs": [
{
"output_type": "stream",
"text": "[Stage 33:=====================================================>(521 + 1) / 522]\r",
"name": "stderr"
},
{
"output_type": "stream",
"text": "93\n+--------+----------+----------------------------------------------------------------+----+---------------+\n|pmid |pmcid |label |type|keywordId |\n+--------+----------+----------------------------------------------------------------+----+---------------+\n|35101074|PMC8802438|penicillin |CD |CHEMBL1223 |\n|35101074|PMC8802438|connective tissue disorder |DS |EFO_1001986 |\n|35101074|PMC8802438|FDH |DS |MONDO_0010592 |\n|35101074|PMC8802438|Heat shock cognate 71 kDa protein |GP |ENSG00000109971|\n|35101074|PMC8802438|Ectonucleotide pyrophosphatase/phosphodiesterase family member 2|GP |ENSG00000136960|\n|35101074|PMC8802438|Alpha-actinin-4 |GP |ENSG00000130402|\n|35101074|PMC8802438|amino acid |CD |CHEMBL1201498 |\n|35101074|PMC8802438|astigmatism |DS |HP_0000483 |\n|35101074|PMC8802438|epilepsy |DS |EFO_0000474 |\n|35101074|PMC8802438|PERK |GP |ENSG00000172071|\n+--------+----------+----------------------------------------------------------------+----+---------------+\nonly showing top 10 rows\n\n",
"name": "stdout"
},
{
"output_type": "stream",
"text": "\r \r",
"name": "stderr"
}
]
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2023-12-07T12:32:31.408990Z",
"end_time": "2023-12-07T12:32:31.764468Z"
},
"trusted": true
},
"cell_type": "code",
"source": "matches.groupby('type').count().show()",
"execution_count": 10,
"outputs": [
{
"output_type": "stream",
"text": "+----+-----+\n|type|count|\n+----+-----+\n| DS| 33|\n| GP| 47|\n| CD| 13|\n+----+-----+\n\n",
"name": "stdout"
}
]
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2023-12-07T12:38:53.321392Z",
"end_time": "2023-12-07T12:39:03.768942Z"
},
"trusted": true
},
"cell_type": "code",
"source": "cooc = (\n spark.read.parquet(coocurrences_path)\n .filter(\n ((f.col('pmid') == pmid) | \n (f.col('pmcid') == pmcid))\n )\n .persist()\n)\n\ncooc.count()",
"execution_count": 19,
"outputs": [
{
"output_type": "stream",
"text": " \r",
"name": "stderr"
},
{
"output_type": "execute_result",
"execution_count": 19,
"data": {
"text/plain": "7"
},
"metadata": {}
}
]
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2023-12-07T12:40:04.804022Z",
"end_time": "2023-12-07T12:40:04.990647Z"
},
"trusted": true
},
"cell_type": "code",
"source": "cooc.select('label1', 'keywordId1', 'label2', 'keywordId2', 'section', 'type').show(truncate=False)",
"execution_count": 21,
"outputs": [
{
"output_type": "stream",
"text": "+---------+---------------+-------------------+-------------+--------+-----+\n|label1 |keywordId1 |label2 |keywordId2 |section |type |\n+---------+---------------+-------------------+-------------+--------+-----+\n|NF-kappaB|ENSG00000109320|glucose |CHEMBL1222250|results |GP-CD|\n|PORCN |ENSG00000102312|amino acid |CHEMBL1201498|intro |GP-CD|\n|PORCN |ENSG00000102312|amino acid |CHEMBL1201498|discuss |GP-CD|\n|GS |EFO_0007285 |amino acid |CHEMBL1201498|intro |DS-CD|\n|PORCN |ENSG00000102312|developmental delay|HP_0001263 |abstract|GP-DS|\n|PORCN |ENSG00000102312|Goltz syndrome |MONDO_0010592|abstract|GP-DS|\n|PORCN |ENSG00000102312|GS |EFO_0007285 |intro |GP-DS|\n+---------+---------------+-------------------+-------------+--------+-----+\n\n",
"name": "stdout"
}
]
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2023-12-07T12:42:57.925936Z",
"end_time": "2023-12-07T12:43:13.913757Z"
},
"trusted": true
},
"cell_type": "code",
"source": "(\n spark.read.parquet(matches_path)\n .filter(\n (f.col('pmid') == pmid) | \n (f.col('pmcid') == pmcid)\n )\n .filter(\n (f.col('type') == 'DS') &\n (f.col('section') == 'abstract')\n )\n .select('label', 'keywordId', 'section')\n .show(50, truncate=False)\n)",
"execution_count": 25,
"outputs": [
{
"output_type": "stream",
"text": "[Stage 109:====================================================>(396 + 1) / 397]\r",
"name": "stderr"
},
{
"output_type": "stream",
"text": "+-------------------+-------------+--------+\n|label |keywordId |section |\n+-------------------+-------------+--------+\n|microcephaly |HP_0000252 |abstract|\n|microcephaly |HP_0000252 |abstract|\n|Goltz syndrome |MONDO_0010592|abstract|\n|cerebral atrophy |HP_0002059 |abstract|\n|GS |EFO_0007285 |abstract|\n|GS |EFO_0007285 |abstract|\n|GS |EFO_0007285 |abstract|\n|GS |EFO_0007285 |abstract|\n|GS |EFO_0007285 |abstract|\n|developmental delay|HP_0001263 |abstract|\n|developmental delay|HP_0001263 |abstract|\n|epilepsy |EFO_0000474 |abstract|\n|epilepsy |EFO_0000474 |abstract|\n+-------------------+-------------+--------+\n\n",
"name": "stdout"
},
{
"output_type": "stream",
"text": "\r \r",
"name": "stderr"
}
]
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2023-12-07T12:44:31.073787Z",
"end_time": "2023-12-07T12:44:38.847942Z"
},
"trusted": true
},
"cell_type": "code",
"source": "(\n spark.read.parquet(matches_path)\n .filter(\n (f.col('pmid') == pmid) | \n (f.col('pmcid') == pmcid)\n )\n .filter(\n (f.col('label') == 'epilepsy') &\n (f.col('section') == 'abstract')\n )\n .select('label', 'keywordId', 'text')\n .show(50, truncate=False)\n)",
"execution_count": 26,
"outputs": [
{
"output_type": "stream",
"text": "[Stage 115:====================================================>(395 + 2) / 397]\r",
"name": "stderr"
},
{
"output_type": "stream",
"text": "+--------+-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n|label |keywordId |text |\n+--------+-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n|epilepsy|EFO_0000474|We report two cases: one girl suffering from typical skin and skeletal abnormalities, developmental delay, microcephaly, thin corpus callosum, periventricular gliosis and drug-resistant epilepsy caused by a PORCN nonsense-mutation (c.283C > T, p.Arg95Ter).|\n|epilepsy|EFO_0000474|The other patient is a boy with a supernumerary nipple and skeletal anomalies but also, developmental delay, microcephaly, cerebral atrophy with delayed myelination and drug-resistant epilepsy as predominant features. |\n+--------+-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n\n",
"name": "stdout"
},
{
"output_type": "stream",
"text": "\r[Stage 115:====================================================>(396 + 1) / 397]\r\r \r",
"name": "stderr"
}
]
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2023-12-07T12:45:13.873730Z",
"end_time": "2023-12-07T12:45:23.703238Z"
},
"trusted": true
},
"cell_type": "code",
"source": "(\n spark.read.parquet(matches_path)\n .filter(\n (f.col('pmid') == pmid) | \n (f.col('pmcid') == pmcid)\n )\n .filter(\n (f.col('label') == 'PORCN') &\n (f.col('section') == 'abstract')\n )\n .select('label', 'keywordId', 'text')\n .show(50, truncate=False)\n)",
"execution_count": 27,
"outputs": [
{
"output_type": "stream",
"text": "[Stage 121:====================================================>(396 + 1) / 397]\r",
"name": "stderr"
},
{
"output_type": "stream",
"text": "+-----+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n|label|keywordId |text |\n+-----+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n|PORCN|ENSG00000102312|Goltz syndrome (GS) is a X-linked disorder defined by defects of mesodermal- and ectodermal-derived structures and caused by PORCN mutations. |\n|PORCN|ENSG00000102312|We report two cases: one girl suffering from typical skin and skeletal abnormalities, developmental delay, microcephaly, thin corpus callosum, periventricular gliosis and drug-resistant epilepsy caused by a PORCN nonsense-mutation (c.283C > T, p.Arg95Ter). |\n|PORCN|ENSG00000102312|Genotyping revealed a novel PORCN missense-mutation (c.847G > C, p.Asp283His) absent in the Genome Aggregation Database (gnomAD) but also identified in his asymptomatic mother. |\n|PORCN|ENSG00000102312|Given that non-random X-chromosome inactivation was excluded in the mother, fibroblasts of the index had been analyzed for PORCN protein-abundance and -distribution, vulnerability against additional ER-stress burden as well as for protein secretion revealing changes.|\n+-----+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n\n",
"name": "stdout"
},
{
"output_type": "stream",
"text": "\r \r",
"name": "stderr"
}
]
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2023-12-07T13:03:22.487874Z",
"end_time": "2023-12-07T13:07:26.952044Z"
},
"trusted": true
},
"cell_type": "code",
"source": "# Read the raw full-text input and keep only the sentences of the publication of interest\nraw_input = (\n spark.read.json(raw_full_text_path, recursiveFileLookup=True)\n .filter(f.col('pmcid') == pmcid)\n .select(\n 'pmid',\n 'pmcid',\n 'timestamp',\n f.input_file_name().alias('filename'),\n f.explode(f.col('sentences')).alias('sentence')\n )\n .persist()\n)",
"execution_count": 35,
"outputs": [
{
"output_type": "stream",
"text": "23/12/07 13:03:38 WARN GhfsStorageStatistics: Detected potential high latency for operation stream_write_operations. latencyMs=337; previousMaxLatencyMs=5; operationCount=9502; context=gs://dataproc-temp-europe-west1-426265110888-ymkbpaze/64dcfdf8-46d3-4b5c-aad4-0a12ee0ba91a/spark-job-history/application_1701951553986_0001.inprogress\n \r",
"name": "stderr"
}
]
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2023-12-07T13:07:26.955089Z",
"end_time": "2023-12-07T13:07:26.960390Z"
},
"trusted": true
},
"cell_type": "code",
"source": "raw_input.printSchema()",
"execution_count": 36,
"outputs": [
{
"output_type": "stream",
"text": "root\n |-- pmid: string (nullable = true)\n |-- pmcid: string (nullable = true)\n |-- timestamp: string (nullable = true)\n |-- filename: string (nullable = false)\n |-- sentence: struct (nullable = true)\n | |-- co-occurrence: array (nullable = true)\n | | |-- element: struct (containsNull = true)\n | | | |-- association: long (nullable = true)\n | | | |-- end1: long (nullable = true)\n | | | |-- end2: long (nullable = true)\n | | | |-- label1: string (nullable = true)\n | | | |-- label2: string (nullable = true)\n | | | |-- sentEvidenceScore: double (nullable = true)\n | | | |-- start1: long (nullable = true)\n | | | |-- start2: long (nullable = true)\n | | | |-- type: string (nullable = true)\n | |-- matches: array (nullable = true)\n | | |-- element: struct (containsNull = true)\n | | | |-- endInSentence: long (nullable = true)\n | | | |-- label: string (nullable = true)\n | | | |-- startInSentence: long (nullable = true)\n | | | |-- type: string (nullable = true)\n | |-- section: string (nullable = true)\n | |-- sectionEnd: long (nullable = true)\n | |-- sectionStart: long (nullable = true)\n | |-- text: string (nullable = true)\n\n",
"name": "stdout"
}
]
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2023-12-07T13:07:48.678518Z",
"end_time": "2023-12-07T13:09:44.489341Z"
},
"trusted": true
},
"cell_type": "code",
"source": "raw_input.count()",
"execution_count": 37,
"outputs": [
{
"output_type": "stream",
"text": " \r",
"name": "stderr"
},
{
"output_type": "execute_result",
"execution_count": 37,
"data": {
"text/plain": "278"
},
"metadata": {}
}
]
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2023-12-07T13:12:04.868310Z",
"end_time": "2023-12-07T13:12:05.981381Z"
},
"trusted": true
},
"cell_type": "code",
"source": "(\n raw_input\n .select(\n 'pmid',\n f.explode(f.col('sentence.co-occurrence')).alias('coocurrence'),\n f.col('sentence.text').alias('text'),\n f.col('sentence.section').alias('section'),\n )\n .filter(f.col('section') == 'ABSTRACT')\n .count()\n# .show()\n)",
"execution_count": 40,
"outputs": [
{
"output_type": "stream",
"text": " \r",
"name": "stderr"
},
{
"output_type": "execute_result",
"execution_count": 40,
"data": {
"text/plain": "4"
},
"metadata": {}
}
]
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2023-12-07T14:13:56.802160Z",
"end_time": "2023-12-07T14:13:58.144476Z"
},
"trusted": true
},
"cell_type": "code",
"source": "# `sentence` is also set in a later cell; define it here so the notebook runs top to bottom\nsentence = 'We report two cases'\n\n(\n raw_input\n .select(\n 'pmid','filename',\n f.explode(f.col('sentence.co-occurrence')).alias('coocurrence'),\n f.col('sentence.text').alias('text'),\n f.col('sentence.section').alias('section'),\n )\n .filter(f.col('section') == 'ABSTRACT')\n .select(\n '*',\n f.col('coocurrence.label1').alias('label1'),\n f.col('coocurrence.label2').alias('label2')\n )\n .filter(f.col('text').startswith(sentence))\n .drop('coocurrence', 'filename')\n .distinct()\n# .count()\n .show(1000)\n)",
"execution_count": 52,
"outputs": [
{
"output_type": "stream",
"text": "[Stage 204:=========================================> (2274 + 50) / 2727]\r",
"name": "stderr"
},
{
"output_type": "stream",
"text": "+--------+--------------------+--------+------+-------------------+\n| pmid| text| section|label1| label2|\n+--------+--------------------+--------+------+-------------------+\n|35101074|We report two cas...|ABSTRACT| PORCN|developmental delay|\n+--------+--------------------+--------+------+-------------------+\n\n",
"name": "stdout"
},
{
"output_type": "stream",
"text": "\r \r",
"name": "stderr"
}
]
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2023-12-07T14:40:13.084209Z",
"end_time": "2023-12-07T14:40:13.087609Z"
},
"trusted": true
},
"cell_type": "code",
"source": "path_to_file = 'gs://otar025-epmc/ml02/fulltext/2023_05_30/NMP_patch-29-05-2023-201.jsonl'",
"execution_count": 53,
"outputs": []
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2023-12-07T14:12:46.722009Z",
"end_time": "2023-12-07T14:12:47.952739Z"
},
"trusted": true
},
"cell_type": "code",
"source": "sentence = 'We report two cases'\n\n(\n raw_input\n .select(\n 'pmid','filename',\n f.explode(f.col('sentence.matches')).alias('matches'),\n f.col('sentence.text').alias('text'),\n f.col('sentence.section').alias('section'),\n )\n .filter(f.col('text').startswith(sentence))\n .select(\n '*',\n f.col('matches.label').alias('label'),\n f.col('matches.type').alias('type')\n )\n .drop('matches', 'filename')\n .distinct()\n# .count()\n .show(1000)\n)",
"execution_count": 50,
"outputs": [
{
"output_type": "stream",
"text": "\r[Stage 195:==============================> (1690 + 34) / 2727]\r\r[Stage 195:=========================================> (2245 + 33) / 2727]\r",
"name": "stderr"
},
{
"output_type": "stream",
"text": "+--------+--------------------+--------+-------------------+----+\n| pmid| text| section| label|type|\n+--------+--------------------+--------+-------------------+----+\n|35101074|We report two cas...|ABSTRACT| periventricular| DS|\n|35101074|We report two cas...|ABSTRACT|developmental delay| DS|\n|35101074|We report two cas...|ABSTRACT| microcephaly| DS|\n|35101074|We report two cas...|ABSTRACT| PORCN| GP|\n|35101074|We report two cas...|ABSTRACT| epilepsy| DS|\n+--------+--------------------+--------+-------------------+----+\n\n",
"name": "stdout"
},
{
"output_type": "stream",
"text": "\r[Stage 195:==================================================>(2705 + 8) / 2727]\r\r \r",
"name": "stderr"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "### Is there always only one co-occurrence per sentence?"
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2023-12-07T15:16:57.965064Z",
"end_time": "2023-12-07T15:16:58.791155Z"
},
"trusted": true
},
"cell_type": "code",
"source": "cooc_count = (\n spark.read.json(path_to_file)\n .withColumn('sentence', f.explode(f.col('sentences')))\n .select(\n 'pmid',\n f.col('sentence.text').alias('text'),\n f.col('sentence.section').alias('section'),\n f.size(\n f.filter(\n f.col('sentence.matches'),\n lambda col: col.type == 'GP'\n )\n ).alias('match_gp'),\n f.size(\n f.filter(\n f.col('sentence.matches'),\n lambda col: col.type == 'DS'\n )\n ).alias('match_ds'),\n f.size(\n f.filter(\n f.col('sentence.co-occurrence'),\n lambda col: col.type == 'GP-DS'\n )\n ).alias('coocc_ds_gp')\n )\n .filter(f.col('coocc_ds_gp')>0)\n .persist()\n)\n\ncooc_count.show()",
"execution_count": 68,
"outputs": [
{
"output_type": "stream",
"text": "23/12/07 15:16:58 WARN CacheManager: Asked to cache already cached data.\n",
"name": "stderr"
},
{
"output_type": "stream",
"text": "+--------+--------------------+--------+--------+--------+-----------+\n| pmid| text| section|match_gp|match_ds|coocc_ds_gp|\n+--------+--------------------+--------+--------+--------+-----------+\n|34886853|Clinical value of...| title| 1| 2| 2|\n|34886853|Comparing the dia...|ABSTRACT| 4| 4| 9|\n|34886853|The levels of SCC...|ABSTRACT| 1| 1| 1|\n|34886853|SCCA was first di...| INTRO| 1| 1| 1|\n|34886853|It has been confi...| INTRO| 1| 2| 2|\n|34886853|Serum SCCA and CA...| INTRO| 2| 2| 4|\n|34886853|This study intend...| INTRO| 1| 2| 2|\n|34886853|Inclusion criteri...| METHODS| 1| 2| 1|\n|34886853|Relationship betw...| RESULTS| 1| 1| 1|\n|34886853|The positive rate...| RESULTS| 1| 4| 4|\n|34886853|Among them, SCCA ...| DISCUSS| 2| 2| 4|\n|34886853|Among them, SCCA ...| DISCUSS| 3| 1| 2|\n|34886853|CA125 protein is ...| DISCUSS| 1| 3| 3|\n|34886853|This study found ...| DISCUSS| 1| 3| 1|\n|34886853|In summary, the s...| CONCL| 2| 5| 10|\n|34889997|Table 2 shows tha...| RESULTS| 1| 1| 1|\n|34889997|Likewise, LH leve...| RESULTS| 1| 1| 1|\n|34889997|Besides, AMH was ...| DISCUSS| 2| 1| 2|\n|34889997|Serum AMH has bee...| DISCUSS| 1| 1| 1|\n|35119481|Lenvatinib, a mul...|ABSTRACT| 1| 1| 1|\n+--------+--------------------+--------+--------+--------+-----------+\nonly showing top 20 rows\n\n",
"name": "stdout"
}
]
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2023-12-07T15:35:58.602676Z",
"end_time": "2023-12-07T15:35:58.811195Z"
},
"trusted": true
},
"cell_type": "code",
"source": "print(cooc_count.count())\nprint(\n cooc_count\n .filter(\n f.col('coocc_ds_gp') != (f.col('match_gp') * f.col('match_ds'))\n )\n .count()\n)\n(\n cooc_count\n .filter(\n f.col('coocc_ds_gp') != (f.col('match_gp') * f.col('match_ds'))\n )\n .show()\n)",
"execution_count": 84,
"outputs": [
{
"output_type": "stream",
"text": "2612\n816\n+--------+--------------------+--------+--------+--------+-----------+\n| pmid| text| section|match_gp|match_ds|coocc_ds_gp|\n+--------+--------------------+--------+--------+--------+-----------+\n|34886853|Comparing the dia...|ABSTRACT| 4| 4| 9|\n|34886853|Inclusion criteri...| METHODS| 1| 2| 1|\n|34886853|Among them, SCCA ...| DISCUSS| 3| 1| 2|\n|34886853|This study found ...| DISCUSS| 1| 3| 1|\n|35119481|Angiogenesis play...| INTRO| 4| 2| 5|\n|35119481|In patients with ...| INTRO| 3| 1| 1|\n|35119481|Recommendations f...| INTRO| 2| 2| 3|\n|35119507|Our study suggest...|ABSTRACT| 3| 1| 1|\n|35119507|The secretion of ...| RESULTS| 4| 1| 1|\n|35119507|Mouse brain tissu...| RESULTS| 3| 1| 1|\n|35089541|This study showed...|ABSTRACT| 5| 2| 9|\n|35089541|Several studies, ...| INTRO| 4| 2| 7|\n|35089541|Patient were grou...| METHODS| 1| 8| 6|\n|35089541|Regarding EPC OC ...| RESULTS| 4| 1| 3|\n|35089541|Our study showed ...| DISCUSS| 3| 2| 5|\n|35090502|All these data su...|ABSTRACT| 2| 1| 1|\n|35090502|In this study, we...| INTRO| 2| 2| 2|\n|35090502|NQO1 overexpressi...| RESULTS| 4| 1| 1|\n|35090502|Therefore, to fur...| RESULTS| 3| 1| 1|\n|35090502|Previous studies ...| DISCUSS| 1| 2| 1|\n+--------+--------------------+--------+--------+--------+-----------+\nonly showing top 20 rows\n\n",
"name": "stdout"
}
]
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2023-12-07T15:24:05.219318Z",
"end_time": "2023-12-07T15:24:05.534337Z"
},
"trusted": true
},
"cell_type": "code",
"source": "print(\n cooc_count\n .filter(\n ~((f.col('match_gp') == 1) & (f.col('match_ds') ==1))\n )\n .count()\n)\nprint(\n cooc_count\n .filter(\n ~((f.col('match_gp') == 1) & (f.col('match_ds') ==1))\n )\n .filter(\n f.col('coocc_ds_gp') != (f.col('match_gp') * f.col('match_ds'))\n )\n .count()\n)\nprint(\n cooc_count\n .filter(\n ~((f.col('match_gp') == 1) & (f.col('match_ds') ==1))\n )\n .filter(\n f.col('coocc_ds_gp') > (f.col('match_gp') * f.col('match_ds'))\n )\n .count()\n)",
"execution_count": 75,
"outputs": [
{
"output_type": "stream",
"text": "1634\n816\n0\n",
"name": "stdout"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Investigating this strange case, where 4 GP and 4 DS matches produce only 9 GP-DS co-occurrences instead of the 16 possible pairs:\n\n```\n+--------+--------------------+--------+--------+--------+-----------+\n| pmid| text| section|match_gp|match_ds|coocc_ds_gp|\n+--------+--------------------+--------+--------+--------+-----------+\n|34886853|Comparing the dia...|ABSTRACT| 4| 4| 9|\n+--------+--------------------+--------+--------+--------+-----------+\n```"
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2023-12-07T15:33:02.882964Z",
"end_time": "2023-12-07T15:33:03.680864Z"
},
"trusted": true
},
"cell_type": "code",
"source": "(\n spark.read.json(path_to_file)\n .filter(f.col('pmid') == '34886853')\n .withColumn('sentence', f.explode(f.col('sentences')))\n .filter(f.col('sentence.text').startswith('Comparing the dia'))\n .select(\n 'sentence.text'\n )\n .show(1000, truncate=False)\n)",
"execution_count": 83,
"outputs": [
{
"output_type": "stream",
"text": "+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n|text |\n+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n|Comparing the diagnosis results of preoperative MRI scan, serum tumor markers, and postoperative pathological examination using single factor comparison, we determined the MRI scan results, the comprehensive matching rate between serum tumor markers (squamous cell carcinoma antigen (SCCA), carbohydrate antigen 125 (CA125)) and postoperative pathological results, and the differences of sensitivity, specificity, and accuracy in the prediction of lymph node metastasis and para-uterine infiltration of cervical 
cancer.|\n+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n\n",
"name": "stdout"
}
]
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2023-12-07T15:29:43.857506Z",
"end_time": "2023-12-07T15:29:44.646582Z"
},
"trusted": true
},
"cell_type": "code",
"source": "(\n spark.read.json(path_to_file)\n .filter(f.col('pmid') == '34886853')\n .withColumn('sentence', f.explode(f.col('sentences')))\n .filter(f.col('sentence.text').startswith('Comparing the dia'))\n .select(\n 'pmid', 'pmcid', f.explode('sentence.matches').alias('match')\n )\n .select(\n 'pmid', 'pmcid', \n f.col('match.label').alias('label'),\n f.col('match.type').alias('type')\n )\n .show(1000, truncate=False)\n)\n",
"execution_count": 80,
"outputs": [
{
"output_type": "stream",
"text": "+--------+----------+-------------------------------+----+\n|pmid |pmcid |label |type|\n+--------+----------+-------------------------------+----+\n|34886853|PMC8656033|tumor |DS |\n|34886853|PMC8656033|tumor |DS |\n|34886853|PMC8656033|squamous cell carcinoma antigen|GP |\n|34886853|PMC8656033|SCCA |GP |\n|34886853|PMC8656033|carbohydrate antigen 125 |GP |\n|34886853|PMC8656033|CA125 |GP |\n|34886853|PMC8656033|lymph node metastasis |DS |\n|34886853|PMC8656033|cervical cancer |DS |\n+--------+----------+-------------------------------+----+\n\n",
"name": "stdout"
}
]
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2023-12-07T15:31:29.387935Z",
"end_time": "2023-12-07T15:31:30.127261Z"
},
"trusted": true
},
"cell_type": "code",
"source": "(\n spark.read.json(path_to_file)\n .filter(f.col('pmid') == '34886853')\n .withColumn('sentence', f.explode(f.col('sentences')))\n .filter(f.col('sentence.text').startswith('Comparing the dia'))\n .select(\n 'pmid', 'pmcid', f.explode('sentence.co-occurrence').alias('co')\n )\n .select(\n 'pmid', 'pmcid', \n f.col('co.label1').alias('label1'),\n f.col('co.label2').alias('label2'),\n f.col('co.type').alias('type')\n )\n .show(1000, truncate=False)\n)",
"execution_count": 82,
"outputs": [
{
"output_type": "stream",
"text": "+--------+----------+-------------------------------+---------------------+-----+\n|pmid |pmcid |label1 |label2 |type |\n+--------+----------+-------------------------------+---------------------+-----+\n|34886853|PMC8656033|squamous cell carcinoma antigen|tumor |GP-DS|\n|34886853|PMC8656033|squamous cell carcinoma antigen|lymph node metastasis|GP-DS|\n|34886853|PMC8656033|squamous cell carcinoma antigen|cervical cancer |GP-DS|\n|34886853|PMC8656033|SCCA |lymph node metastasis|GP-DS|\n|34886853|PMC8656033|SCCA |cervical cancer |GP-DS|\n|34886853|PMC8656033|carbohydrate antigen 125 |lymph node metastasis|GP-DS|\n|34886853|PMC8656033|carbohydrate antigen 125 |cervical cancer |GP-DS|\n|34886853|PMC8656033|CA125 |lymph node metastasis|GP-DS|\n|34886853|PMC8656033|CA125 |cervical cancer |GP-DS|\n+--------+----------+-------------------------------+---------------------+-----+\n\n",
"name": "stdout"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Doing the same for:\n\n```\n35119481|Angiogenesis play\n```"
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2023-12-07T15:41:25.075331Z",
"end_time": "2023-12-07T15:41:26.688174Z"
},
"trusted": true
},
"cell_type": "code",
"source": "def examine_sentence(pmid: str, sentence_start: str) -> None:\n \"\"\"Show the GP and DS matches, then the GP-DS co-occurrences, of one sentence.\n\n The sentence is selected by PubMed id and by the beginning of its text.\n \"\"\"\n (\n spark.read.json(path_to_file)\n .filter(f.col('pmid') == pmid)\n .withColumn('sentence', f.explode(f.col('sentences')))\n .filter(f.col('sentence.text').startswith(sentence_start))\n .select(\n 'pmid', 'pmcid', f.explode('sentence.matches').alias('match')\n )\n .select(\n 'pmid', 'pmcid', \n f.col('match.label').alias('label'),\n f.col('match.type').alias('type')\n )\n .filter(f.col('type').isin(['DS', 'GP'])) \n .show(1000, truncate=False)\n )\n (\n spark.read.json(path_to_file)\n .filter(f.col('pmid') == pmid)\n .withColumn('sentence', f.explode(f.col('sentences')))\n .filter(f.col('sentence.text').startswith(sentence_start))\n .select(\n 'pmid', 'pmcid', f.explode('sentence.co-occurrence').alias('co')\n )\n .select(\n 'pmid', 'pmcid', \n f.col('co.label1').alias('label1'),\n f.col('co.label2').alias('label2'),\n f.col('co.type').alias('type')\n )\n .filter(f.col('type') == 'GP-DS')\n .show(1000, truncate=False)\n )\n \n \nexamine_sentence('35119481', 'Angiogenesis play')",
"execution_count": 88,
"outputs": [
{
"output_type": "stream",
"text": "+--------+----------+----------------------------------+----+\n|pmid |pmcid |label |type|\n+--------+----------+----------------------------------+----+\n|35119481|PMC8940827|tumor |DS |\n|35119481|PMC8940827|vascular endothelial growth factor|GP |\n|35119481|PMC8940827|VEGF |GP |\n|35119481|PMC8940827|fibroblast growth factor |GP |\n|35119481|PMC8940827|FGF |GP |\n|35119481|PMC8940827|HCC |DS |\n+--------+----------+----------------------------------+----+\n\n+--------+----------+----------------------------------+------+-----+\n|pmid |pmcid |label1 |label2|type |\n+--------+----------+----------------------------------+------+-----+\n|35119481|PMC8940827|vascular endothelial growth factor|tumor |GP-DS|\n|35119481|PMC8940827|vascular endothelial growth factor|HCC |GP-DS|\n|35119481|PMC8940827|VEGF |HCC |GP-DS|\n|35119481|PMC8940827|fibroblast growth factor |HCC |GP-DS|\n|35119481|PMC8940827|FGF |HCC |GP-DS|\n+--------+----------+----------------------------------+------+-----+\n\n",
"name": "stdout"
}
]
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2023-12-07T15:41:54.007820Z",
"end_time": "2023-12-07T15:41:56.075049Z"
},
"trusted": true
},
"cell_type": "code",
"source": "# |35089541|Patient were grou...| METHODS| 1| 8| 6|\nexamine_sentence('35089541', 'Patient were grou')",
"execution_count": 89,
"outputs": [
{
"output_type": "stream",
"text": "+--------+----------+------------------------+----+\n|pmid |pmcid |label |type|\n+--------+----------+------------------------+----+\n|35089541|PMC9098612|hypertension |DS |\n|35089541|PMC9098612|type 2 diabetes mellitus|DS |\n|35089541|PMC9098612|T2DM |DS |\n|35089541|PMC9098612|hemoglobin |GP |\n|35089541|PMC9098612|hypercholesterolaemia |DS |\n|35089541|PMC9098612|obesity |DS |\n|35089541|PMC9098612|premature |DS |\n|35089541|PMC9098612|CAD |DS |\n|35089541|PMC9098612|CAD |DS |\n+--------+----------+------------------------+----+\n\n+--------+----------+----------+---------------------+-----+\n|pmid |pmcid |label1 |label2 |type |\n+--------+----------+----------+---------------------+-----+\n|35089541|PMC9098612|hemoglobin|hypertension |GP-DS|\n|35089541|PMC9098612|hemoglobin|hypercholesterolaemia|GP-DS|\n|35089541|PMC9098612|hemoglobin|obesity |GP-DS|\n|35089541|PMC9098612|hemoglobin|premature |GP-DS|\n|35089541|PMC9098612|hemoglobin|CAD |GP-DS|\n|35089541|PMC9098612|hemoglobin|CAD |GP-DS|\n+--------+----------+----------+---------------------+-----+\n\n",
"name": "stdout"
}
]
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2023-12-07T15:43:09.679950Z",
"end_time": "2023-12-07T15:43:11.123397Z"
},
"trusted": true
},
"cell_type": "code",
"source": "# |35090502|Therefore, to fur...| RESULTS| 3| 1| 1|\nexamine_sentence('35090502', 'Therefore, to fur')",
"execution_count": 90,
"outputs": [
{
"output_type": "stream",
"text": "+--------+----------+-----------------+----+\n|pmid |pmcid |label |type|\n+--------+----------+-----------------+----+\n|35090502|PMC8796493|NQO1 |GP |\n|35090502|PMC8796493|diabetes mellitus|DS |\n|35090502|PMC8796493|NQO1 |GP |\n|35090502|PMC8796493|NQO1 |GP |\n+--------+----------+-----------------+----+\n\n+--------+----------+------+-----------------+-----+\n|pmid |pmcid |label1|label2 |type |\n+--------+----------+------+-----------------+-----+\n|35090502|PMC8796493|NQO1 |diabetes mellitus|GP-DS|\n+--------+----------+------+-----------------+-----+\n\n",
"name": "stdout"
}
]
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2023-12-07T15:44:57.729230Z",
"end_time": "2023-12-07T15:44:59.163844Z"
},
"trusted": true
},
"cell_type": "code",
"source": "# |35090502|All these data su...|ABSTRACT| 2| 1| 1|\nexamine_sentence('35090502', 'All these data su')",
"execution_count": 93,
"outputs": [
{
"output_type": "stream",
"text": "+--------+----------+-----+----+\n|pmid |pmcid |label|type|\n+--------+----------+-----+----+\n|35090502|PMC8796493|NQO1 |GP |\n|35090502|PMC8796493|DN |DS |\n|35090502|PMC8796493|Sirt1|GP |\n+--------+----------+-----+----+\n\n+--------+----------+------+------+-----+\n|pmid |pmcid |label1|label2|type |\n+--------+----------+------+------+-----+\n|35090502|PMC8796493|NQO1 |DN |GP-DS|\n+--------+----------+------+------+-----+\n\n",
"name": "stdout"
}
]
}
],
"metadata": {
"kernelspec": {
"name": "python3",
"display_name": "Python 3",
"language": "python"
},
"language_info": {
"name": "python",
"version": "3.10.8",
"mimetype": "text/x-python",
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"pygments_lexer": "ipython3",
"nbconvert_exporter": "python",
"file_extension": ".py"
},
"gist": {
"id": "741a152efcff6c6a695737a366e52c33",
"data": {
"description": "missing publication in literature",
"public": true
}
},
"_draft": {
"nbviewer_url": "https://gist.github.com/DSuveges/741a152efcff6c6a695737a366e52c33"
}
},
"nbformat": 4,
"nbformat_minor": 5
}