Example Slurm script for Ray
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --cpus-per-task=5
#SBATCH --mem-per-cpu=1GB
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --time=00:30:00
#SBATCH --reservation=test
# Deduce the number of worker nodes from the total number of tasks.
# One node is reserved for the head node.
let "worker_num=(${SLURM_NTASKS} - 1)"
# Define the total number of worker CPU cores available to ray
let "total_cores=${worker_num} * ${SLURM_CPUS_PER_TASK}"
port=6379
ip_head=$(hostname):${port}
export ip_head # Export for later access by trainer.py
# Start the ray head on the node running this batch script.
srun -N 1 -n 1 -c ${SLURM_CPUS_PER_TASK} -w $(hostname) ray start --head --block --dashboard-host 0.0.0.0 --port=${port} &
sleep 5
# Make sure the head starts successfully before any worker does; otherwise
# the workers will not be able to connect to redis. If the head needs longer
# to come up, increase the sleep time above to ensure the proper order.
# Start one ray worker per remaining node, excluding the head node.
srun -N ${worker_num} -n ${worker_num} -c ${SLURM_CPUS_PER_TASK} -x $(hostname) ray start --address $ip_head --block --num-cpus ${SLURM_CPUS_PER_TASK} &
sleep 5
python -u trainer.py foobar ${total_cores} # Pass the total number of worker CPUs
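The trainer.py driver itself is not part of this gist. Below is a minimal sketch of what it might look like, assuming it connects to the running cluster through the ip_head environment variable exported above; the square task and the variable names are made up for illustration, not taken from the original:

# trainer.py -- hypothetical sketch; the actual trainer.py is not included in this gist.
import os
import sys

import ray

# Connect to the head node started by the slurm script. The address was
# exported as ip_head by the batch script before launching this driver.
ray.init(address=os.environ["ip_head"])

@ray.remote
def square(x):
    # Trivial stand-in for real work; runs on whichever worker ray schedules it to.
    return x * x

if __name__ == "__main__":
    tag = sys.argv[1]               # the 'foobar' placeholder argument from the script
    total_cores = int(sys.argv[2])  # total worker CPUs computed by the script
    # Launch one task per allocated worker core and collect the results.
    results = ray.get([square.remote(i) for i in range(total_cores)])
    print(f"{tag}: ran {len(results)} tasks across {total_cores} worker cores")

Submit the batch script with sbatch (e.g. sbatch ray_example.sh, filename assumed); the driver then runs on the same node as the ray head, with the workers joining from the remaining nodes.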