Combining LMOD with DeepSpeed. As a bonus, also add a command to automatically generate a hostfile.
# If we open a session/job that's on a host that starts with gpu* (e.g. gpu512.dodrio.os),
# load PyTorch with CUDA and pdsh.
# This makes sure that deepspeed/pdsh work in multi-node settings.
if [[ $(hostname) == gpu* ]]; then
    module load PyTorch/1.12.0-foss-2022a-CUDA-11.7.0
    module load pdsh/2.34-GCCcore-11.3.0
fi
# Automatically generates a hostfile for the current job in the current directory,
# containing each node name with its number of available GPUs, e.g.:
# gpu512.dodrio.os slots=4
# gpu513.dodrio.os slots=4
mkhostfile() {
    if [[ -v SLURM_JOB_NODELIST ]]; then
        rm -f hostfile
        echo "# Automatically generated hostfile" > hostfile
        # Split the comma-separated node list into an array of node names
        local nodes node n_gpus
        IFS=',' read -ra nodes <<< "$SLURM_JOB_NODELIST"
        for node in "${nodes[@]}"; do
            # Count the GPUs on a node by listing them remotely via pdsh
            n_gpus=$(PDSH_RCMD_TYPE=ssh pdsh -w "$node" nvidia-smi -L | wc -l)
            echo "$node slots=$n_gpus" >> hostfile
        done
    else
        echo "Error: SLURM_JOB_NODELIST environment variable is not set so cannot automatically create hostfile"
    fi
}
BramVanroy commented Jan 21, 2023

Add the lines above to your .bashrc.

The first if-statement is necessary because DeepSpeed uses pdsh: when pdsh SSHes into a new host, the required modules are not yet loaded through Lmod there. By adding this check to our .bashrc we make sure that in every new session on a node whose hostname starts with gpu, the right modules are loaded - even in a non-interactive pdsh/SSH session.
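
As a quick sanity check (a sketch, not part of the gist itself; gpu512.dodrio.os is just the example node name used above), you can verify from within a job that the modules are also visible in a non-interactive pdsh session:

PDSH_RCMD_TYPE=ssh pdsh -w gpu512.dodrio.os 'module list'

If the .bashrc check works as intended, the PyTorch and pdsh modules should show up in the output, prefixed with the node name.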

The second part is a function that automatically generates a hostfile for DeepSpeed in the following format:

hostname1 slots=n_gpus1
hostname2 slots=n_gpus2
...

This can be useful to automate your jobs. You could, for instance, run this function at the start of your Slurm job script before launching DeepSpeed (the function relies on the SLURM_JOB_NODELIST variable, so it only works inside a Slurm job).
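
As a minimal sketch of that idea (train.py is a placeholder for your own training script, and it assumes the snippet above lives in your .bashrc), the relevant part of a job script could look like this:

# Make the mkhostfile function from .bashrc available in the batch script
source ~/.bashrc
# Generate the hostfile for the nodes allocated to this job
mkhostfile
# Point the DeepSpeed launcher at it
deepspeed --hostfile hostfile train.py

DeepSpeed's multi-node launcher then uses pdsh to start processes on every node listed in the hostfile, which is exactly where the module-loading check from the first part comes in.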
