@falkerl
Last active March 9, 2021 19:54
Vaccine combinations
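import org.apache.spark.sql.functions.{col, lit}

// Load the CSV; the first row holds the column names.
val df = spark.read.option("header", true)
  .csv("/Users/elena/Downloads/vaccine_combinations.csv")
df.createOrReplaceTempView("data")

// Every column except ID is treated as a disease flag.
val diseases = df.columns.filter(_ != "ID")

// Unpivot to one (ID, disease) row for each flag equal to 1.
diseases.map(d => df.where(col(d) === lit(1)).select(col("ID"), lit(d).as("disease")))
  .reduce(_ union _)
  .createOrReplaceTempView("vac2dis")

// Count unordered pairs of rows (v1.ID < v2.ID) that have no disease in common.
spark.sql(
  """select count(*)
    |from data as v1
    |join data as v2 on v1.ID < v2.ID
    |where not exists (
    |  select 1
    |  from vac2dis d1
    |  join vac2dis d2
    |    on d1.disease = d2.disease
    |  where d1.ID = v1.ID and d2.ID = v2.ID
    |)
    |""".stripMargin
).show()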
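The query above counts pairs of IDs whose flagged diseases do not overlap. A minimal sketch of the same counting logic in plain Scala collections, useful for sanity-checking on a small sample without Spark. The object name `DisjointPairs`, the function `countDisjointPairs`, and the sample data are hypothetical and not taken from the dataset:

```scala
object DisjointPairs {
  // coverage: ID -> set of diseases flagged with 1 for that ID (assumed shape).
  // Returns the number of unordered ID pairs that share no disease.
  def countDisjointPairs(coverage: Map[String, Set[String]]): Int = {
    val ids = coverage.keys.toSeq.sorted
    ids.combinations(2).count { pair =>
      (coverage(pair(0)) & coverage(pair(1))).isEmpty
    }
  }

  def main(args: Array[String]): Unit = {
    // Made-up sample, for illustration only.
    val sample = Map(
      "A" -> Set("measles", "mumps"),
      "B" -> Set("mumps"),
      "C" -> Set("tetanus"),
      "D" -> Set.empty[String]
    )
    // Only A and B overlap (mumps), so 5 of the 6 pairs are disjoint.
    println(countDisjointPairs(sample)) // prints 5
  }
}
```

An ID with no flagged diseases, like `D` here, is counted as disjoint from every other ID. This matches the `not exists` form of the SQL, where such an ID has no rows in `vac2dis`.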

falkerl commented Mar 9, 2021

The dataset can be found here.
