Reproducing cooccurrences from matches
{
"cells": [
{
"metadata": {},
"id": "292a5c20",
"cell_type": "markdown",
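"source": "Because the entities in the matches dataset are already grounded, that dataset alone should be enough to regenerate all cooccurrences: a gene/protein match and a disease match in the same sentence of the same publication form one cooccurrence."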
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2023-12-12T07:13:07.365019Z",
"end_time": "2023-12-12T07:13:18.684065Z"
},
"trusted": true
},
"id": "62acc5c5",
"cell_type": "code",
"source": "from pyspark.sql import SparkSession, functions as f, types as t\nfrom pyspark.sql.window import Window\n\nspark = SparkSession.builder.getOrCreate()\n\nmatches_path = 'gs://open-targets-data-releases/23.12/output/etl/parquet/literature/matches'\npmids = [\n    '35101074', '34886853', '35119481'\n]",
"execution_count": 1,
"outputs": [
{
"output_type": "stream",
"text": "Setting default log level to \"WARN\".\nTo adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n23/12/12 07:13:11 INFO SparkEnv: Registering MapOutputTracker\n23/12/12 07:13:11 INFO SparkEnv: Registering BlockManagerMaster\n23/12/12 07:13:11 INFO SparkEnv: Registering BlockManagerMasterHeartbeat\n23/12/12 07:13:11 INFO SparkEnv: Registering OutputCommitCoordinator\n23/12/12 07:13:17 WARN GhfsStorageStatistics: Detected potential high latency for operation op_get_file_status. latencyMs=211; previousMaxLatencyMs=0; operationCount=1; context=gs://dataproc-temp-europe-west1-426265110888-ymkbpaze/64dcfdf8-46d3-4b5c-aad4-0a12ee0ba91a/spark-job-history\n23/12/12 07:13:17 WARN GhfsStorageStatistics: Detected potential high latency for operation op_mkdirs. latencyMs=235; previousMaxLatencyMs=0; operationCount=1; context=gs://dataproc-temp-europe-west1-426265110888-ymkbpaze/64dcfdf8-46d3-4b5c-aad4-0a12ee0ba91a/spark-job-history\n",
"name": "stderr"
}
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2023-12-11T21:40:22.988501Z",
"start_time": "2023-12-11T21:38:24.896127Z"
},
"trusted": false
},
"id": "c42c2e8b",
"cell_type": "code",
"source": "matches = (\n    spark.read.parquet(matches_path)\n    .filter(f.col('pmid').isin(pmids))\n    .persist()\n)\nmatches.count()",
"execution_count": 2,
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": " \r"
},
{
"data": {
"text/plain": "410"
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2023-12-11T22:41:04.761179Z",
"end_time": "2023-12-11T22:41:05.374964Z"
},
"trusted": true
},
"cell_type": "code",
"source": "matches.unpersist()\n",
"execution_count": 36,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 36,
"data": {
"text/plain": "DataFrame[pmid: string, pmcid: string, pubDate: string, date: date, year: int, month: int, day: int, organisms: array<string>, section: string, text: string, trace_source: string, endInSentence: bigint, label: string, labelN: string, sectionEnd: bigint, sectionStart: bigint, startInSentence: bigint, type: string, keywordId: string, isMapped: boolean]"
},
"metadata": {}
}
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2023-12-11T21:43:31.270071Z",
"start_time": "2023-12-11T21:43:30.581728Z"
},
"trusted": false
},
"id": "2ba87e88",
"cell_type": "code",
"source": "matches.printSchema()\nmatches.show(1, False, True)\nmatches.select('text', 'label', 'sectionEnd', 'sectionStart').show()",
"execution_count": 4,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "root\n |-- pmid: string (nullable = true)\n |-- pmcid: string (nullable = true)\n |-- pubDate: string (nullable = true)\n |-- date: date (nullable = true)\n |-- year: integer (nullable = true)\n |-- month: integer (nullable = true)\n |-- day: integer (nullable = true)\n |-- organisms: array (nullable = true)\n | |-- element: string (containsNull = true)\n |-- section: string (nullable = true)\n |-- text: string (nullable = true)\n |-- trace_source: string (nullable = true)\n |-- endInSentence: long (nullable = true)\n |-- label: string (nullable = true)\n |-- labelN: string (nullable = true)\n |-- sectionEnd: long (nullable = true)\n |-- sectionStart: long (nullable = true)\n |-- startInSentence: long (nullable = true)\n |-- type: string (nullable = true)\n |-- keywordId: string (nullable = true)\n |-- isMapped: boolean (nullable = true)\n\n-RECORD 0--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------\n pmid | 34886853 \n pmcid | PMC8656033 \n pubDate | 2021-12-01 \n date | 2021-12-01 \n year | 2021 \n month | 12 \n day | 1 \n organisms | [human] \n section | abstract \n text | Comparing the diagnosis results of preoperative MRI scan, serum tumor markers, and postoperative pathological examination using single factor comparison, we determined the MRI scan results, the comprehensive matching rate between serum tumor markers (squamous cell carcinoma antigen (SCCA), carbohydrate antigen 125 (CA125)) and postoperative pathological results, and the differences of sensitivity, specificity, and 
accuracy in the prediction of lymph node metastasis and para-uterine infiltration of cervical cancer. \n trace_source | \n endInSentence | 288 \n label | SCCA \n labelN | scca \n sectionEnd | 1088 \n sectionStart | 569 \n startInSentence | 284 \n type | GP \n keywordId | ENSG00000057149 \n isMapped | true \nonly showing top 1 row\n\n"
},
{
"name": "stdout",
"output_type": "stream",
"text": "+--------------------+--------------------+----------+------------+\n| text| label|sectionEnd|sectionStart|\n+--------------------+--------------------+----------+------------+\n|Comparing the dia...| SCCA| 1088| 569|\n|SCCA was first di...| SCCA| null| null|\n|Serum SCCA and CA...| SCCA| null| null|\n|Among them, SCCA ...| SCCA| null| null|\n|Among them, SCCA ...| SCCA| null| null|\n|Among them, SCCA ...| SCCA| null| null|\n|In summary, the s...| SCCA| null| null|\n|Adults with confi...|portal vein throm...| null| null|\n|Recommendations f...| albumin| null| null|\n|SCCA in periphera...|squamous cell car...| null| null|\n|Serum tumor marke...|squamous cell cancer| null| null|\n|SCCA is a highly ...|squamous cell car...| null| null|\n|Lenvatinib is a p...| KIT| null| null|\n|For proteins with...| TGF-beta| null| null|\n|Molecular testing...|Pitt-Hopkins synd...| null| null|\n|Lenvatinib, a mul...| PD-1| 336| 138|\n|Programmed cell d...| PD-1| null| null|\n|Additionally, PD-...| PD-1| null| null|\n|PD-L2, another li...| PD-1| null| null|\n|Pembrolizumab is ...| PD-1| null| null|\n+--------------------+--------------------+----------+------------+\nonly showing top 20 rows\n\n"
}
]
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2023-12-11T22:43:57.702Z"
},
"trusted": true
},
"id": "5b84cb20",
"cell_type": "code",
"source": "matches_fields = [\n    'endInSentence', 'label', 'labelN', 'startInSentence',\n    'type', 'keywordId', 'pmid', 'text'\n]\n\n\n# Resulting pair: \"GP-DS\" -> gene/protein to disease/syndrome cooccurrence\n(\n    spark.read.parquet(matches_path)\n    .filter(f.col('type') == 'GP')\n    .alias('left')\n    .join(\n        (\n            spark.read.parquet(matches_path)\n            # matches_fields already contains pmid and text, so they are not listed again:\n            .select(*matches_fields)\n            .filter(f.col('type') == 'DS')\n            .alias('right')\n        ),\n        # Same publication and same sentence:\n        on=[\n            (f.col('left.pmid') == f.col('right.pmid')) &\n            (f.col('left.text') == f.col('right.text'))\n        ],\n        how='inner'\n    )\n    .select(\n        # Publication data:\n        f.col('left.pmid').alias('pmid'),\n        f.col('left.pmcid').alias('pmcid'),\n        f.col('left.pubDate').alias('pubDate'),\n        f.col('left.year').alias('year'),\n        f.col('left.month').alias('month'),\n        f.col('left.day').alias('day'),\n        f.col('left.organisms').alias('organisms'),\n        # Sentence data:\n        f.col('left.section').alias('section'),\n        f.col('left.text').alias('text'),\n        # Target (gene/protein) data:\n        f.col('left.startInSentence').alias('start1'),\n        f.col('left.endInSentence').alias('end1'),\n        f.col('left.label').alias('label1'),\n        f.col('left.type').alias('type1'),\n        f.col('left.keywordId').alias('keywordId1'),\n        # Disease data:\n        f.col('right.startInSentence').alias('start2'),\n        f.col('right.endInSentence').alias('end2'),\n        f.col('right.label').alias('label2'),\n        f.col('right.type').alias('type2'),\n        f.col('right.keywordId').alias('keywordId2'),\n        # Cooccurrence data (named explicitly so the column is not left with a generated name):\n        f.concat_ws('-', f.col('left.type'), f.col('right.type')).alias('type'),\n    )\n    .write.mode('overwrite').parquet('gs://ot-team/dsuveges/cooccurrence_prototype_2023.23.11')\n)\n\n",
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": "23/12/11 22:44:03 WARN GhfsStorageStatistics: Detected potential high latency for operation op_delete. latencyMs=104; previousMaxLatencyMs=0; operationCount=1; context=gs://ot-team/dsuveges/cooccurrence_prototype_2023.23.11\n23/12/11 23:27:01 WARN GhfsStorageStatistics: Detected potential high latency for operation op_delete. latencyMs=581; previousMaxLatencyMs=104; operationCount=2; context=gs://ot-team/dsuveges/cooccurrence_prototype_2023.23.11/_temporary\n23/12/11 23:27:01 WARN GhfsStorageStatistics: Detected potential high latency for operation stream_write_close_operations. latencyMs=120; previousMaxLatencyMs=0; operationCount=1; context=gs://ot-team/dsuveges/cooccurrence_prototype_2023.23.11/_SUCCESS\n",
"name": "stderr"
}
]
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2023-12-11T22:00:28.460728Z",
"end_time": "2023-12-11T22:00:29.776916Z"
},
"trusted": true
},
"cell_type": "code",
"source": "(\n    cooccurrences\n    .groupby('pmid1')\n    .count()\n    .show(truncate=False)\n)\n",
"execution_count": 18,
"outputs": [
{
"output_type": "stream",
"text": "[Stage 59:=====================================================>(513 + 9) / 522]\r",
"name": "stderr"
},
{
"output_type": "stream",
"text": "+--------+-----+\n|pmid1 |count|\n+--------+-----+\n|35119481|26 |\n|34886853|7 |\n|35101074|6 |\n+--------+-----+\n\n",
"name": "stdout"
},
{
"output_type": "stream",
"text": "\r \r",
"name": "stderr"
}
]
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2023-12-11T21:56:16.440542Z",
"end_time": "2023-12-11T21:56:16.667568Z"
},
"trusted": true
},
"cell_type": "code",
"source": "(\n    matches\n    .select(*[f.col(column).alias(f'{column}1') if column in matches_fields else column for column in matches.columns])\n    .filter(\n        f.col('type') == 'GP'\n    )\n    .alias('left')\n    .show()\n)",
"execution_count": 13,
"outputs": [
{
"output_type": "stream",
"text": "+--------+----------+----------+----------+----+-----+---+--------------------+--------+--------------------+------------+--------------+--------------------+--------------------+----------+------------+----------------+-----+---------------+--------+\n| pmid| pmcid| pubDate| date|year|month|day| organisms| section| text|trace_source|endInSentence1| label1| labelN1|sectionEnd|sectionStart|startInSentence1|type1| keywordId1|isMapped|\n+--------+----------+----------+----------+----+-----+---+--------------------+--------+--------------------+------------+--------------+--------------------+--------------------+----------+------------+----------------+-----+---------------+--------+\n|34886853|PMC8656033|2021-12-01|2021-12-01|2021| 12| 1| [human]|abstract|Comparing the dia...| | 288| SCCA| scca| 1088| 569| 284| GP|ENSG00000057149| true|\n|34886853|PMC8656033|2021-12-01|2021-12-01|2021| 12| 1| [human]| intro|SCCA was first di...| | 4| SCCA| scca| null| null| 0| GP|ENSG00000057149| true|\n|34886853|PMC8656033|2021-12-01|2021-12-01|2021| 12| 1| [human]| intro|Serum SCCA and CA...| | 10| SCCA| scca| null| null| 6| GP|ENSG00000057149| true|\n|34886853|PMC8656033|2021-12-01|2021-12-01|2021| 12| 1| [human]| discuss|Among them, SCCA ...| | 16| SCCA| scca| null| null| 12| GP|ENSG00000057149| true|\n|34886853|PMC8656033|2021-12-01|2021-12-01|2021| 12| 1| [human]| discuss|Among them, SCCA ...| | 16| SCCA| scca| null| null| 12| GP|ENSG00000057149| true|\n|34886853|PMC8656033|2021-12-01|2021-12-01|2021| 12| 1| [human]| discuss|Among them, SCCA ...| | 107| SCCA| scca| null| null| 103| GP|ENSG00000057149| true|\n|34886853|PMC8656033|2021-12-01|2021-12-01|2021| 12| 1| [human]| concl|In summary, the s...| | 26| SCCA| scca| null| null| 22| GP|ENSG00000057149| true|\n|35119481|PMC8940827|2022-04-01|2022-04-01|2022| 4| 1| null| intro|Recommendations f...| | 301| albumin| albumin| null| null| 294| GP|ENSG00000163631| true|\n|35119481|PMC8940827|2022-04-01|2022-04-01|2022| 4| 1| 
null| intro|Lenvatinib is a p...| | 165| KIT| kit| null| null| 162| GP|ENSG00000157404| true|\n|35101074|PMC8802438|2022-01-01|2022-01-01|2022| 1| 1|[mouse, human, ze...| results|For proteins with...| | 207| TGF-beta| tgfbeta| null| null| 199| GP|ENSG00000105329| true|\n|35119481|PMC8940827|2022-04-01|2022-04-01|2022| 4| 1| null|abstract|Lenvatinib, a mul...| | 62| PD-1| 1pd| 336| 138| 58| GP|ENSG00000265681| true|\n|35119481|PMC8940827|2022-04-01|2022-04-01|2022| 4| 1| null| intro|Programmed cell d...| | 37| PD-1| 1pd| null| null| 33| GP|ENSG00000265681| true|\n|35119481|PMC8940827|2022-04-01|2022-04-01|2022| 4| 1| null| intro|Additionally, PD-...| | 18| PD-1| 1pd| null| null| 14| GP|ENSG00000265681| true|\n|35119481|PMC8940827|2022-04-01|2022-04-01|2022| 4| 1| null| intro|PD-L2, another li...| | 29| PD-1| 1pd| null| null| 25| GP|ENSG00000265681| true|\n|35119481|PMC8940827|2022-04-01|2022-04-01|2022| 4| 1| null| intro|Pembrolizumab is ...| | 61| PD-1| 1pd| null| null| 57| GP|ENSG00000265681| true|\n|35101074|PMC8802438|2022-01-01|2022-01-01|2022| 1| 1|[mouse, human, ze...| results|This is accompani...| | 240|Laminin subunit g...|1gammalamininsubunit| null| null| 217| GP|ENSG00000135862| true|\n|35101074|PMC8802438|2022-01-01|2022-01-01|2022| 1| 1|[mouse, human, ze...| title|Novel insights in...| | 25| PORCN| porcn| null| null| 20| GP|ENSG00000102312| true|\n|35101074|PMC8802438|2022-01-01|2022-01-01|2022| 1| 1|[mouse, human, ze...|abstract|Goltz syndrome (G...| | 130| PORCN| porcn| 151| 10| 125| GP|ENSG00000102312| true|\n|35101074|PMC8802438|2022-01-01|2022-01-01|2022| 1| 1|[mouse, human, ze...|abstract|We report two cas...| | 212| PORCN| porcn| null| null| 207| GP|ENSG00000102312| true|\n|35101074|PMC8802438|2022-01-01|2022-01-01|2022| 1| 1|[mouse, human, ze...|abstract|Genotyping reveal...| | 33| PORCN| porcn| null| null| 28| GP|ENSG00000102312| 
true|\n+--------+----------+----------+----------+----+-----+---+--------------------+--------+--------------------+------------+--------------+--------------------+--------------------+----------+------------+----------------+-----+---------------+--------+\nonly showing top 20 rows\n\n",
"name": "stdout"
}
]
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2023-12-11T22:12:41.804318Z",
"end_time": "2023-12-11T22:12:42.924591Z"
},
"trusted": true
},
"cell_type": "code",
"source": "(\n    matches\n    # One group per sentence of a publication:\n    .groupBy('pmid', 'text')\n    .agg(\n        f.size(f.filter(f.collect_list('type'), lambda x: x == 'DS')).alias('diseaseCount'),\n        f.size(f.filter(f.collect_list('type'), lambda x: x == 'GP')).alias('targetCount')\n    )\n    # Every disease match pairs with every target match in the same sentence:\n    .withColumn('expectedCooc', f.col('diseaseCount') * f.col('targetCount'))\n    .filter(f.col('expectedCooc') != 0)\n    .groupBy('pmid')\n    .agg(\n        f.sum(f.col('expectedCooc')).alias('allCoocCount')\n    )\n    .show()\n)",
"execution_count": 27,
"outputs": [
{
"output_type": "stream",
"text": "+--------+------------+\n| pmid|allCoocCount|\n+--------+------------+\n|34886853| 7|\n|35101074| 6|\n|35119481| 26|\n+--------+------------+\n\n",
"name": "stdout"
}
]
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2023-12-12T07:13:30.107730Z",
"end_time": "2023-12-12T07:13:48.809086Z"
},
"trusted": true
},
"cell_type": "code",
"source": "(\n    spark.read.parquet('gs://open-targets-data-releases/23.12/output/etl/parquet/literature/cooccurrences')\n    .filter(f.col('type') == 'GP-DS')\n    .count()\n)",
"execution_count": 2,
"outputs": [
{
"output_type": "stream",
"text": " \r",
"name": "stderr"
},
{
"output_type": "execute_result",
"execution_count": 2,
"data": {
"text/plain": "39556102"
},
"metadata": {}
}
]
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2023-12-12T07:14:00.521926Z",
"end_time": "2023-12-12T07:14:07.322547Z"
},
"trusted": true
},
"cell_type": "code",
"source": "(\n    spark.read.parquet('gs://ot-team/dsuveges/cooccurrence_prototype_2023.23.11')\n    .count()\n)",
"execution_count": 3,
"outputs": [
{
"output_type": "stream",
"text": " \r",
"name": "stderr"
},
{
"output_type": "execute_result",
"execution_count": 3,
"data": {
"text/plain": "59223038"
},
"metadata": {}
}
]
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2023-12-12T07:15:08.900489Z",
"end_time": "2023-12-12T07:15:38.341819Z"
},
"trusted": true
},
"cell_type": "code",
"source": "print(\n    spark.read.parquet('gs://ot-team/dsuveges/cooccurrence_prototype_2023.23.11')\n    .select('keywordId1', 'keywordId2', 'pmid')\n    .distinct()\n    .count()\n)\n\nprint(\n    spark.read.parquet('gs://ot-team/dsuveges/cooccurrence_prototype_2023.23.11')\n    .select('keywordId1', 'keywordId2')\n    .distinct()\n    .count()\n)",
"execution_count": 4,
"outputs": [
{
"output_type": "stream",
"text": " \r",
"name": "stderr"
},
{
"output_type": "stream",
"text": "21252765\n",
"name": "stdout"
},
{
"output_type": "stream",
"text": "[Stage 18:======================================================> (33 + 1) / 34]\r",
"name": "stderr"
},
{
"output_type": "stream",
"text": "2033207\n",
"name": "stdout"
},
{
"output_type": "stream",
"text": "23/12/12 07:21:47 WARN YarnAllocator: Container from a bad node: container_1702299062935_0004_01_000007 on host: ds-genetics-etl-test-m.c.open-targets-eu-dev.internal. Exit status: 143. Diagnostics: [2023-12-12 07:21:47.161]Container killed on request. Exit code is 143\n[2023-12-12 07:21:47.161]Container exited with a non-zero exit code 143. \n[2023-12-12 07:21:47.162]Killed by external signal\n.\n23/12/12 07:21:47 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Requesting driver to remove executor 7 for reason Container from a bad node: container_1702299062935_0004_01_000007 on host: ds-genetics-etl-test-m.c.open-targets-eu-dev.internal. Exit status: 143. Diagnostics: [2023-12-12 07:21:47.161]Container killed on request. Exit code is 143\n[2023-12-12 07:21:47.161]Container exited with a non-zero exit code 143. \n[2023-12-12 07:21:47.162]Killed by external signal\n.\n23/12/12 07:21:47 ERROR YarnScheduler: Lost executor 7 on ds-genetics-etl-test-m.c.open-targets-eu-dev.internal: Container from a bad node: container_1702299062935_0004_01_000007 on host: ds-genetics-etl-test-m.c.open-targets-eu-dev.internal. Exit status: 143. Diagnostics: [2023-12-12 07:21:47.161]Container killed on request. Exit code is 143\n[2023-12-12 07:21:47.161]Container exited with a non-zero exit code 143. \n[2023-12-12 07:21:47.162]Killed by external signal\n.\n",
"name": "stderr"
}
]
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "",
"execution_count": null,
"outputs": []
}
],
"metadata": {
"kernelspec": {
"name": "python3",
"display_name": "Python 3",
"language": "python"
},
"language_info": {
"name": "python",
"version": "3.10.8",
"mimetype": "text/x-python",
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"pygments_lexer": "ipython3",
"nbconvert_exporter": "python",
"file_extension": ".py"
},
"gist": {
"id": "",
"data": {
"description": "Reproducing cooccurrences from matches",
"public": false
}
}
},
"nbformat": 4,
"nbformat_minor": 5
}