Skip to content

Instantly share code, notes, and snippets.

@pavel-filatov
Created May 12, 2021 08:02
Show Gist options
  • Save pavel-filatov/aaea22e304bfdb509866f13034df0d80 to your computer and use it in GitHub Desktop.
Save pavel-filatov/aaea22e304bfdb509866f13034df0d80 to your computer and use it in GitHub Desktop.
Parallel processing with Scala-Spark
object ParallelProcessing {
val queries: List[(String, String)] = List(
("SELECT * FROM ABC", "output1"),
("SELECT * FROM XYZ", "output2")
)
// Just use parallel collection instead of futures, that's it
queries.par foreach {
case (query, path) =>
val dataPath = s"${pathPrefix}/{path}"
executeAndSave(query, dataPath)
}
def executeAndSave(query: String, dataPath: String)(implicit context: Context): Unit = {
println(s"$query starts")
context.spark.sql(query).write.mode("overwrite").parquet(dataPath)
println(s"$query completes")
}
}
@neelbando
Copy link

Hi Pavel.. Is there any way to implement this in Pyspark?

@pavel-filatov
Copy link
Author

Hi @neelbando, nice to hear from you! Didn't expect somebody would actually check this gist :D

This is possible, please check another gist: https://gist.github.com/pavel-filatov/87a68dd621546b9cac1e0d2ea269705f

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment