Run dask-cuda on a SLURM HPC

If you are using LocalCUDACluster on a single node, you can scale your work out to a SLURM-based HPC system with a few small tweaks.

First, install the dask-hpc-runner package. (Note: this is a prototype and will be merged into dask-jobqueue in the future.)

pip install git+https://github.com/jacobtomlinson/dask-hpc-runner.git

Then replace LocalCUDACluster with the SlurmRunner class.

from dask_hpc_runner import SlurmRunner

# Tell the SLURM Runner to use the Dask CUDA worker class
cluster = SlurmRunner(worker_class="dask_cuda.CUDAWorker")
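
SlurmRunner also accepts a worker_options dictionary whose contents are passed through as keyword arguments to each CUDAWorker. A minimal sketch, where rmm_pool_size is only an illustrative, optional CUDAWorker option:

from dask_hpc_runner import SlurmRunner

# worker_options is forwarded to each dask_cuda.CUDAWorker as keyword arguments.
# rmm_pool_size is shown here purely as an example of such an option.
cluster = SlurmRunner(
    worker_class="dask_cuda.CUDAWorker",
    worker_options={"rmm_pool_size": "10GB"},
)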

Run your code on a SLURM system with srun. With -n4 this typically gives one scheduler process, one process running your client code, and two CUDA worker processes.

srun -n4 python code.py

That's it!

If you run this script outside of SLURM, SlurmRunner will raise a RuntimeError. To make your code more flexible, you can catch this exception and fall back to creating a LocalCUDACluster.

See example.py for a complete example.

from dask_hpc_runner import SlurmRunner
from dask.distributed import Client
from dask_cuda import LocalCUDACluster


def main():
    try:
        # Running under SLURM: the runner starts the scheduler and CUDA workers
        cluster = SlurmRunner(
            worker_class="dask_cuda.CUDAWorker",
            worker_options={},  # Kwargs passed through to CUDAWorker
        )
    except RuntimeError:
        # Not running under SLURM: fall back to a single-node CUDA cluster
        cluster = LocalCUDACluster()

    client = Client(cluster.scheduler_address)
    client.wait_for_workers(2)

    # Run your Dask code here
    client.run(lambda: print("hello"))

    client.close()
    cluster.close()


if __name__ == "__main__":
    main()
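
Once the client is connected you can submit work in the usual way. As a minimal sketch of the "Run your Dask code here" step, assuming CuPy is installed in the worker environment (the gpu_task function here is hypothetical, not part of the example above):

def gpu_task():
    # Do some work on the GPU of whichever worker runs this task
    import cupy

    x = cupy.random.random((1000, 1000))
    return float(x.sum())

future = client.submit(gpu_task)
print(future.result())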