@c-w
Created January 4, 2019 20:53
Running notebooks in parallel on Azure Databricks
// define the name of the Azure Databricks notebook to run
val notebookToRun = ???

// define some way to generate a sequence of workloads to run
val jobArguments = ???

// define the number of workers per job
val workersPerJob = ???

import java.util.concurrent.Executors

import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// look up the number of workers in the cluster
val workersAvailable = sc.getExecutorMemoryStatus.size

// determine the number of jobs we can run, each with the desired worker count
val totalJobs = workersAvailable / workersPerJob

// look up the notebook context required for the parallel run calls
val context = dbutils.notebook.getContext()

// create a thread pool sized to the number of parallel runs
implicit val executionContext = ExecutionContext.fromExecutorService(
  Executors.newFixedThreadPool(totalJobs))

try {
  val futures = jobArguments.zipWithIndex.map { case (args, i) =>
    Future {
      // ensure the thread knows about the Databricks context
      dbutils.notebook.setContext(context)
      // assign the job to one of up to totalJobs separate scheduler pools
      sc.setLocalProperty("spark.scheduler.pool", s"pool${i % totalJobs}")
      // start the job in the scheduler pool
      dbutils.notebook.run(notebookToRun, timeoutSeconds = 0, args)
    }
  }

  // wait for all the jobs to finish processing
  Await.result(Future.sequence(futures), atMost = Duration.Inf)
} finally {
  // make sure to clean up the thread pool
  executionContext.shutdownNow()
}
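As a hypothetical illustration of how the three placeholders might be filled in (the notebook path, argument maps, and worker count below are made up for this sketch, not part of the gist), one could write:

```scala
// hypothetical values for the three ??? placeholders above
val notebookToRun = "/Shared/process-partition"  // illustrative notebook path

// one argument map per workload; dbutils.notebook.run expects Map[String, String]
val jobArguments: Seq[Map[String, String]] =
  (0 until 8).map(i => Map("partition" -> i.toString))

val workersPerJob = 2  // illustrative worker count per parallel notebook run
```

With a cluster of, say, 8 workers and `workersPerJob = 2`, `totalJobs` would be 4, so the 8 workloads would be spread across 4 scheduler pools running concurrently.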
@gbrueckl

Hi Clemens,
Do you know whether it is possible to modify the ctx received from dbutils.notebook.getContext() before passing it back into dbutils.notebook.setContext(ctx)?

I would like to change/set the notebook_path under ctx.extraConfigs as it is not set when running commands via the Command Execution API

Kind regards
-gerhard

@c-w

c-w commented Oct 19, 2022

I haven't tried it (and haven't been too close to the Databricks world in a while) so the direct answer is: I don't know. However, I assume that it should be possible as long as the context doesn't rely on configurations specific to each future (it's a singleton and we can only have one active context per JVM). Could you try it out and report back?

@gbrueckl

Well, I tried to change it, but when attempting, for example, ctx.extraConfigs["notebook_path"] = "new_path", it complains that update is not implemented for that property - so simply changing it does not work

my idea would now be to create a completely new context and use it for dbutils.notebook.setContext(ctx), but I have no idea how to create a new instance of the ctx object (and I am even more concerned whether that can work at all if, as you say, it's a singleton, which apparently we cannot update)
