@DSuveges
Last active July 5, 2024 08:53
{"cells":[{"cell_type":"code","execution_count":1,"id":"1445fafc","metadata":{"ExecuteTime":{"end_time":"2024-07-05T08:09:15.090852Z","start_time":"2024-07-05T08:09:05.752914Z"}},"outputs":[{"name":"stderr","output_type":"stream","text":["Setting default log level to \"WARN\".\n","To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n","24/07/05 08:09:08 INFO SparkEnv: Registering MapOutputTracker\n","24/07/05 08:09:08 INFO SparkEnv: Registering BlockManagerMaster\n","24/07/05 08:09:08 INFO SparkEnv: Registering BlockManagerMasterHeartbeat\n","24/07/05 08:09:08 INFO SparkEnv: Registering OutputCommitCoordinator\n"]}],"source":["from pyspark.sql import SparkSession, functions as f, types as t\n","\n","spark = SparkSession.builder.getOrCreate()\n","\n","matches_path = 'gs://open-targets-data-releases/24.06/output/etl/parquet/literature/matches/'\n","disease_path = 'gs://open-targets-data-releases/24.06/output/etl/parquet/diseases'"]},{"cell_type":"code","execution_count":12,"id":"470fe46c","metadata":{"ExecuteTime":{"end_time":"2024-07-05T08:21:12.616556Z","start_time":"2024-07-05T08:21:12.269975Z"}},"outputs":[{"name":"stdout","output_type":"stream","text":["+-----------+--------------------+\n","| keywordId| name|\n","+-----------+--------------------+\n","|EFO_0000255|angioimmunoblasti...|\n","|EFO_0000508| genetic disorder|\n","|EFO_0001054| leprosy|\n","|EFO_0004287|ventricular fibri...|\n","|EFO_0004302|anthropometric me...|\n","|EFO_0005039| hippocampal atrophy|\n","|EFO_0005551|dysembryoplastic ...|\n","|EFO_0005608|cortical opacity ...|\n","|EFO_0005622| Crohn's colitis|\n","|EFO_0006810|oleic acid measur...|\n","|EFO_0006932| putamen volume|\n","|EFO_0007185| brucellosis|\n","|EFO_0007225| cowpox|\n","|EFO_0007504| syphilis|\n","|EFO_0007771|pathologic comple...|\n","|EFO_0007932|multiple keratino...|\n","|EFO_0008280|serine protease 2...|\n","|EFO_0008379|P wave terminal f...|\n","|EFO_0008383|treatment 
outcome...|\n","|EFO_0008467|behavioural inhib...|\n","+-----------+--------------------+\n","only showing top 20 rows\n","\n"]},{"name":"stderr","output_type":"stream","text":["24/07/05 08:21:12 WARN CacheManager: Asked to cache already cached data.\n"]}],"source":["disease_df = (\n"," spark.read.parquet(disease_path)\n"," .select(\n"," f.col('id').alias('keywordId'),\n"," f.col('name')\n"," )\n"," .persist()\n",")\n","\n","disease_df.show()\n"]},{"cell_type":"code","execution_count":9,"id":"0e387b61","metadata":{"ExecuteTime":{"end_time":"2024-07-05T08:18:13.701174Z","start_time":"2024-07-05T08:17:14.678532Z"}},"outputs":[{"name":"stderr","output_type":"stream","text":["[Stage 22:=====================================================>(554 + 1) / 555]\r"]},{"name":"stdout","output_type":"stream","text":["+--------+--------------------+--------------------+--------------------+-------------+\n","| pmid| text| label| labelN| keywordId|\n","+--------+--------------------+--------------------+--------------------+-------------+\n","|28404884|Epigenetic silenc...| gastric tumors| gastrictumor| EFO_0003897|\n","|28404884|Statistically dif...| adenocarcinoma| adenocarcinoma| EFO_0000228|\n","|28404884|Taken together, w...| cancer| cancer|MONDO_0004992|\n","|28404884|Methylation of pr...| cancer| cancer|MONDO_0004992|\n","|28404884|Moreover, we cons...| cancer| cancer|MONDO_0004992|\n","|28404884|We collected 123 ...| cancer| cancer|MONDO_0004992|\n","|28404884|Further research ...| cancer| cancer|MONDO_0004992|\n","|28404884|A number of paper...| cancer| cancer|MONDO_0004992|\n","|28404884|Using this cutoff...| cancer| cancer|MONDO_0004992|\n","|28404884|One example is th...| cancer| cancer|MONDO_0004992|\n","|28404884|In 5 years after ...| metastatic disease| diseasmetastat| EFO_0009709|\n","|28404884|In only 8 patient...| metastatic disease| diseasmetastat| EFO_0009709|\n","|28404884|OS was defined as...| metastatic disease| diseasmetastat| 
EFO_0009709|\n","|28404884|The concept of fi...|oral squamous cel...|carcinomacelloral...| EFO_0000199|\n","|28404884|DFNA5 CpG4 methyl...|breast adenocarci...|adenocarcinomabreast| EFO_0000304|\n","|28404884|Clinicopathologic...|breast adenocarci...|adenocarcinomabreast| EFO_0000304|\n","|28404884|Due to the age di...|breast adenocarci...|adenocarcinomabreast| EFO_0000304|\n","|28404884|The x-axis shows ...|breast adenocarci...|adenocarcinomabreast| EFO_0000304|\n","|28404884|The plot shows DF...|breast adenocarci...|adenocarcinomabreast| EFO_0000304|\n","|28404884|We analyzed DFNA5...|breast adenocarci...|adenocarcinomabreast| EFO_0000304|\n","+--------+--------------------+--------------------+--------------------+-------------+\n","only showing top 20 rows\n","\n"]},{"name":"stderr","output_type":"stream","text":["\r"," \r"]},{"data":{"text/plain":["88"]},"execution_count":9,"metadata":{},"output_type":"execute_result"}],"source":["pmid = '28404884'\n","\n","pub_df = (\n"," spark.read.parquet(matches_path)\n"," .select(\n"," 'pmid',\n"," 'text',\n"," 'label',\n"," 'labelN',\n"," 'keywordId'\n"," )\n"," .filter(\n"," (f.col(\"pmid\") == pmid) &\n"," (f.col('type') == 'DS')\n"," )\n"," .distinct()\n"," .persist()\n",")\n","\n","pub_df.show()\n","pub_df.count()"]},{"cell_type":"code","execution_count":14,"id":"4ada7169","metadata":{"ExecuteTime":{"end_time":"2024-07-05T08:22:43.578427Z","start_time":"2024-07-05T08:22:43.292469Z"}},"outputs":[{"name":"stderr","output_type":"stream","text":["24/07/05 08:22:43 WARN AdvancedInferFilter: expression (none#859 = none#1436)\n","24/07/05 08:22:43 WARN AdvancedInferFilter: expression (none#859 = none#1436)\n"]},{"name":"stdout","output_type":"stream","text":["+--------+------------------------+-------------------------------+\n","|pmid |labelN |name |\n","+--------+------------------------+-------------------------------+\n","|28404884|neoplasm |neoplasm |\n","|28404884|tumor |neoplasm |\n","|28404884|cancer |cancer 
|\n","|28404884|diseasmetastat |metastatic neoplasm |\n","|28404884|carcinomacelloralsquamou|oral squamous cell carcinoma |\n","|28404884|lymphmetastasinode |lymph node metastatic carcinoma|\n","|28404884|dci |breast ductal carcinoma in situ|\n","|28404884|gastrictumor |stomach neoplasm |\n","|28404884|adenocarcinoma |adenocarcinoma |\n","|28404884|adenocarcinomabreast |breast adenocarcinoma |\n","+--------+------------------------+-------------------------------+\n","\n"]}],"source":["(\n"," pub_df\n"," .join(\n"," disease_df, on='keywordId', how='inner'\n"," )\n"," .select('pmid', 'labelN', 'name')\n"," .distinct()\n"," .show(210, truncate=False)\n",")"]},{"cell_type":"code","execution_count":null,"id":"de3beeb7","metadata":{},"outputs":[],"source":[]}],"metadata":{"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.11.9"}},"nbformat":4,"nbformat_minor":5}