@d0choa
Last active September 15, 2021 17:24
Target specific (NOD2) coding variants (GWAS + ClinVar pathogenic)
import pyspark.sql.functions as F
from pyspark import SparkConf
from pyspark.sql import SparkSession
sparkConf = SparkConf()
spark = (
    SparkSession.builder
    .config(conf=sparkConf)
    .master('yarn')
    .getOrCreate()
)
# Platform evidence data
evidence = spark.read.parquet("gs://open-targets-data-releases/21.06/output/etl/parquet/evidence")
out = (
    evidence
    # NOD2
    .filter(F.col("targetId") == "ENSG00000167207")
    .filter(F.col("variantId").isNotNull())
    # Coding consequences: missense_variant (SO_0001583) or stop_gained (SO_0001587)
    .filter(
        (F.col("variantFunctionalConsequenceId") == "SO_0001583")
        | (F.col("variantFunctionalConsequenceId") == "SO_0001587")
    )
    # .withColumn("clinicalSignificances", F.explode("clinicalSignificances"))
    # GWAS evidence, or any evidence flagged as pathogenic
    .filter(
        (F.col("datasourceId") == "ot_genetics_portal")
        | (F.array_contains(F.col("clinicalSignificances"), "pathogenic"))
    )
    .persist()
    .select(
        "datasourceId", "diseaseId", "variantId", "variantRsId",
        "variantFunctionalConsequenceId", "clinicalSignificances", "literature"
    )
    .sort("variantId")
)
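
The filter chain above can be sanity-checked without a Spark cluster. The sketch below is a plain-Python mirror of the same predicate applied to dict-shaped rows. It is not part of the original gist: `keep_row`, the sample rows, the variant IDs, and the `eva` datasource value are made up for illustration. The consequence IDs are assumed to use the underscore form (`SO_0001583`).

```python
# Plain-Python mirror of the Spark filters, for checking the logic on toy rows.
# keep_row and the sample rows are hypothetical, not from the original gist.
NOD2 = "ENSG00000167207"
CODING_CONSEQUENCES = {"SO_0001583", "SO_0001587"}  # missense_variant, stop_gained


def keep_row(row):
    """Return True if a dict-shaped evidence row would pass the filters."""
    if row.get("targetId") != NOD2:
        return False
    if row.get("variantId") is None:
        return False
    if row.get("variantFunctionalConsequenceId") not in CODING_CONSEQUENCES:
        return False
    is_gwas = row.get("datasourceId") == "ot_genetics_portal"
    # A missing clinicalSignificances value is treated as "not pathogenic".
    is_pathogenic = "pathogenic" in (row.get("clinicalSignificances") or [])
    return is_gwas or is_pathogenic


# Invented rows: variant IDs and the "eva" datasource value are placeholders.
rows = [
    {"targetId": NOD2, "variantId": "16_100_C_T", "datasourceId": "ot_genetics_portal",
     "variantFunctionalConsequenceId": "SO_0001583", "clinicalSignificances": None},
    {"targetId": NOD2, "variantId": "16_200_G_A", "datasourceId": "eva",
     "variantFunctionalConsequenceId": "SO_0001587", "clinicalSignificances": ["pathogenic"]},
    {"targetId": NOD2, "variantId": "16_300_A_G", "datasourceId": "eva",
     "variantFunctionalConsequenceId": "SO_0001583", "clinicalSignificances": ["benign"]},
    {"targetId": NOD2, "variantId": "16_400_T_C", "datasourceId": "ot_genetics_portal",
     "variantFunctionalConsequenceId": "SO_0001627", "clinicalSignificances": None},
]

kept = [r["variantId"] for r in rows if keep_row(r)]
print(kept)
```

Only the first two rows survive: the third is a coding variant but neither GWAS-derived nor pathogenic, and the fourth has a non-coding consequence.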