Prom is a 36-node DGX-based Slurm cluster. There are three main partitions:
- main/batch: max 4 nodes per user
- bigjob: max 16 nodes per user
- backfill: no node limit, but jobs run at lower priority and are preemptible
Below are two scripts: dask-scheduler.script and dask-cuda-worker.script. For the interactive workflows I think we should do the following:
- Allocate a node for interactive use: salloc -N1 bash -- this allocates a node we can ssh into and use as the client
- Start the scheduler and a set of dask-cuda workers:
  - sbatch dask-scheduler.script -- scheduler on the main/batch partition
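Since dask-scheduler.script is referenced above but not shown, here is a minimal sketch of what it might look like. The partition name, walltime, and scheduler-file path are all assumptions, not the actual script:

```shell
#!/bin/bash
# Hypothetical sketch of dask-scheduler.script.
# Partition, walltime, and the scheduler-file path are assumptions.
#SBATCH --job-name=dask-scheduler
#SBATCH --partition=batch
#SBATCH --nodes=1
#SBATCH --time=04:00:00

# Write connection info to a shared filesystem so the client
# and the worker job can both find the scheduler.
dask-scheduler --scheduler-file "$HOME/dask-scheduler.json"
```

Using a `--scheduler-file` on a shared filesystem avoids having to copy the scheduler's address by hand between jobs.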
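Likewise, a hedged sketch of what dask-cuda-worker.script could look like; the partition, node count, and scheduler-file path are assumptions:

```shell
#!/bin/bash
# Hypothetical sketch of dask-cuda-worker.script.
# Partition, node count, and the scheduler-file path are assumptions.
#SBATCH --job-name=dask-cuda-workers
#SBATCH --partition=batch
#SBATCH --nodes=4
#SBATCH --time=04:00:00

# srun launches one dask-cuda-worker process per allocated node;
# each process then spawns one worker per GPU on that node.
srun dask-cuda-worker --scheduler-file "$HOME/dask-scheduler.json"
```

With both jobs pointing at the same scheduler file, a client on the salloc'd node can connect via the same path.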