Last active
July 5, 2024 08:53
Save DSuveges/beacf72b97feafad705b6a83d15167c5 to your computer and use it in GitHub Desktop.
{"cells":[{"cell_type":"code","execution_count":1,"id":"1445fafc","metadata":{"ExecuteTime":{"end_time":"2024-07-05T08:09:15.090852Z","start_time":"2024-07-05T08:09:05.752914Z"}},"outputs":[{"name":"stderr","output_type":"stream","text":["Setting default log level to \"WARN\".\n","To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n","24/07/05 08:09:08 INFO SparkEnv: Registering MapOutputTracker\n","24/07/05 08:09:08 INFO SparkEnv: Registering BlockManagerMaster\n","24/07/05 08:09:08 INFO SparkEnv: Registering BlockManagerMasterHeartbeat\n","24/07/05 08:09:08 INFO SparkEnv: Registering OutputCommitCoordinator\n"]}],"source":["from pyspark.sql import SparkSession, functions as f, types as t\n","\n","spark = SparkSession.builder.getOrCreate()\n","\n","matches_path = 'gs://open-targets-data-releases/24.06/output/etl/parquet/literature/matches/'\n","disease_path = 'gs://open-targets-data-releases/24.06/output/etl/parquet/diseases'"]},{"cell_type":"code","execution_count":12,"id":"470fe46c","metadata":{"ExecuteTime":{"end_time":"2024-07-05T08:21:12.616556Z","start_time":"2024-07-05T08:21:12.269975Z"}},"outputs":[{"name":"stdout","output_type":"stream","text":["+-----------+--------------------+\n","| keywordId| name|\n","+-----------+--------------------+\n","|EFO_0000255|angioimmunoblasti...|\n","|EFO_0000508| genetic disorder|\n","|EFO_0001054| leprosy|\n","|EFO_0004287|ventricular fibri...|\n","|EFO_0004302|anthropometric me...|\n","|EFO_0005039| hippocampal atrophy|\n","|EFO_0005551|dysembryoplastic ...|\n","|EFO_0005608|cortical opacity ...|\n","|EFO_0005622| Crohn's colitis|\n","|EFO_0006810|oleic acid measur...|\n","|EFO_0006932| putamen volume|\n","|EFO_0007185| brucellosis|\n","|EFO_0007225| cowpox|\n","|EFO_0007504| syphilis|\n","|EFO_0007771|pathologic comple...|\n","|EFO_0007932|multiple keratino...|\n","|EFO_0008280|serine protease 2...|\n","|EFO_0008379|P wave terminal f...|\n","|EFO_0008383|treatment 
outcome...|\n","|EFO_0008467|behavioural inhib...|\n","+-----------+--------------------+\n","only showing top 20 rows\n","\n"]},{"name":"stderr","output_type":"stream","text":["24/07/05 08:21:12 WARN CacheManager: Asked to cache already cached data.\n"]}],"source":["disease_df = (\n"," spark.read.parquet(disease_path)\n"," .select(\n"," f.col('id').alias('keywordId'),\n"," f.col('name')\n"," )\n"," .persist()\n",")\n","\n","disease_df.show()\n"]},{"cell_type":"code","execution_count":9,"id":"0e387b61","metadata":{"ExecuteTime":{"end_time":"2024-07-05T08:18:13.701174Z","start_time":"2024-07-05T08:17:14.678532Z"}},"outputs":[{"name":"stderr","output_type":"stream","text":["[Stage 22:=====================================================>(554 + 1) / 555]\r"]},{"name":"stdout","output_type":"stream","text":["+--------+--------------------+--------------------+--------------------+-------------+\n","| pmid| text| label| labelN| keywordId|\n","+--------+--------------------+--------------------+--------------------+-------------+\n","|28404884|Epigenetic silenc...| gastric tumors| gastrictumor| EFO_0003897|\n","|28404884|Statistically dif...| adenocarcinoma| adenocarcinoma| EFO_0000228|\n","|28404884|Taken together, w...| cancer| cancer|MONDO_0004992|\n","|28404884|Methylation of pr...| cancer| cancer|MONDO_0004992|\n","|28404884|Moreover, we cons...| cancer| cancer|MONDO_0004992|\n","|28404884|We collected 123 ...| cancer| cancer|MONDO_0004992|\n","|28404884|Further research ...| cancer| cancer|MONDO_0004992|\n","|28404884|A number of paper...| cancer| cancer|MONDO_0004992|\n","|28404884|Using this cutoff...| cancer| cancer|MONDO_0004992|\n","|28404884|One example is th...| cancer| cancer|MONDO_0004992|\n","|28404884|In 5 years after ...| metastatic disease| diseasmetastat| EFO_0009709|\n","|28404884|In only 8 patient...| metastatic disease| diseasmetastat| EFO_0009709|\n","|28404884|OS was defined as...| metastatic disease| diseasmetastat| 
EFO_0009709|\n","|28404884|The concept of fi...|oral squamous cel...|carcinomacelloral...| EFO_0000199|\n","|28404884|DFNA5 CpG4 methyl...|breast adenocarci...|adenocarcinomabreast| EFO_0000304|\n","|28404884|Clinicopathologic...|breast adenocarci...|adenocarcinomabreast| EFO_0000304|\n","|28404884|Due to the age di...|breast adenocarci...|adenocarcinomabreast| EFO_0000304|\n","|28404884|The x-axis shows ...|breast adenocarci...|adenocarcinomabreast| EFO_0000304|\n","|28404884|The plot shows DF...|breast adenocarci...|adenocarcinomabreast| EFO_0000304|\n","|28404884|We analyzed DFNA5...|breast adenocarci...|adenocarcinomabreast| EFO_0000304|\n","+--------+--------------------+--------------------+--------------------+-------------+\n","only showing top 20 rows\n","\n"]},{"name":"stderr","output_type":"stream","text":["\r"," \r"]},{"data":{"text/plain":["88"]},"execution_count":9,"metadata":{},"output_type":"execute_result"}],"source":["pmid = '28404884'\n","\n","pub_df = (\n"," spark.read.parquet(matches_path)\n"," .select(\n"," 'pmid',\n"," 'text',\n"," 'label',\n"," 'labelN',\n"," 'keywordId'\n"," )\n"," .filter(\n"," (f.col(\"pmid\") == pmid) &\n"," (f.col('type') == 'DS')\n"," )\n"," .distinct()\n"," .persist()\n",")\n","\n","pub_df.show()\n","pub_df.count()"]},{"cell_type":"code","execution_count":14,"id":"4ada7169","metadata":{"ExecuteTime":{"end_time":"2024-07-05T08:22:43.578427Z","start_time":"2024-07-05T08:22:43.292469Z"}},"outputs":[{"name":"stderr","output_type":"stream","text":["24/07/05 08:22:43 WARN AdvancedInferFilter: expression (none#859 = none#1436)\n","24/07/05 08:22:43 WARN AdvancedInferFilter: expression (none#859 = none#1436)\n"]},{"name":"stdout","output_type":"stream","text":["+--------+------------------------+-------------------------------+\n","|pmid |labelN |name |\n","+--------+------------------------+-------------------------------+\n","|28404884|neoplasm |neoplasm |\n","|28404884|tumor |neoplasm |\n","|28404884|cancer |cancer 
|\n","|28404884|diseasmetastat |metastatic neoplasm |\n","|28404884|carcinomacelloralsquamou|oral squamous cell carcinoma |\n","|28404884|lymphmetastasinode |lymph node metastatic carcinoma|\n","|28404884|dci |breast ductal carcinoma in situ|\n","|28404884|gastrictumor |stomach neoplasm |\n","|28404884|adenocarcinoma |adenocarcinoma |\n","|28404884|adenocarcinomabreast |breast adenocarcinoma |\n","+--------+------------------------+-------------------------------+\n","\n"]}],"source":["(\n"," pub_df\n"," .join(\n"," disease_df, on='keywordId', how='inner'\n"," )\n"," .select('pmid', 'labelN', 'name')\n"," .distinct()\n"," .show(210, truncate=False)\n",")"]},{"cell_type":"code","execution_count":null,"id":"de3beeb7","metadata":{},"outputs":[],"source":[]}],"metadata":{"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.11.9"}},"nbformat":4,"nbformat_minor":5}
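The last notebook cell inner-joins the literature matches to the disease index on `keywordId` and keeps the distinct (`pmid`, `labelN`, `name`) rows. A minimal plain-Python sketch of that same logic, runnable without a Spark session, might look like the following. The function name `join_matches_to_names` is hypothetical, the sample rows are a small subset copied from the printed outputs (the ID-to-name pairs are inferred by matching `labelN` across the two tables), and the `HYPOTHETICAL_0000000` row is made up purely to show that an inner join drops unmatched IDs.

```python
# Sample rows copied from the notebook output; NOT the full 88-row result.
matches = [
    {"pmid": "28404884", "labelN": "gastrictumor", "keywordId": "EFO_0003897"},
    {"pmid": "28404884", "labelN": "adenocarcinoma", "keywordId": "EFO_0000228"},
    {"pmid": "28404884", "labelN": "cancer", "keywordId": "MONDO_0004992"},
    # The same label is matched in several sentences, so it repeats.
    {"pmid": "28404884", "labelN": "cancer", "keywordId": "MONDO_0004992"},
    # Made-up row (assumption): an ID that is absent from the disease index.
    {"pmid": "28404884", "labelN": "unmappedlabel", "keywordId": "HYPOTHETICAL_0000000"},
]

# Disease index as keywordId -> name (pairs inferred from the printed tables).
diseases = {
    "EFO_0003897": "stomach neoplasm",
    "EFO_0000228": "adenocarcinoma",
    "MONDO_0004992": "cancer",
}


def join_matches_to_names(matches, diseases):
    """Inner join on keywordId, then keep distinct (pmid, labelN, name) rows."""
    seen = set()
    rows = []
    for match in matches:
        name = diseases.get(match["keywordId"])
        if name is None:
            # Inner join: matches without a disease entry are dropped.
            continue
        row = (match["pmid"], match["labelN"], name)
        if row not in seen:
            # Equivalent of .distinct(): emit each combination only once.
            seen.add(row)
            rows.append(row)
    return rows


rows = join_matches_to_names(matches, diseases)
for row in rows:
    print(row)
```

Unlike Spark's `.distinct()`, this sketch preserves first-seen order, which is only a convenience for reading the output.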