Working with Slurm (GPU cluster)
1. Set up the working environment
Download the Miniconda installer from https://docs.conda.io/en/latest/miniconda.html.
Move to /netscratch/$USER and copy install_miniconda.sh there.
Execute the file using: ./install_miniconda.sh. This will install a basic conda environment.
Create a custom conda env: conda env create --file environment.yml.
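The contents of install_miniconda.sh are not shown in this guide. As a rough sketch, assuming it is a thin wrapper around the official Linux x86_64 Miniconda installer and that conda should live under /netscratch/$USER/miniconda3 (the location usrun.sh mounts environments from), it might look like the following. The function name install_miniconda and the variables PREFIX and INSTALLER_URL are hypothetical.

```shell
#!/bin/sh
# Hypothetical stand-in for install_miniconda.sh; the actual script may differ.

# Assumed install location: usrun.sh mounts environments from
# /netscratch/<user>/miniconda3/envs, so conda is placed under /netscratch.
PREFIX="/netscratch/${USER:-$(id -un)}/miniconda3"

# Assumed installer: latest Miniconda3 build for Linux x86_64.
INSTALLER_URL="https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh"

install_miniconda() {
    installer="$(mktemp)" || return 1
    # Download the installer, then run it non-interactively:
    #   -b  batch mode (no prompts), -p  install prefix
    wget -q -O "$installer" "$INSTALLER_URL" &&
        bash "$installer" -b -p "$PREFIX"
    status=$?
    rm -f "$installer"
    return "$status"
}

# Nothing is installed until the function is called explicitly:
# install_miniconda
```

Once the installation has finished, conda can be put on the PATH with `. "$PREFIX/bin/activate"` before running `conda env create --file environment.yml`.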
2. Create execution files
Create a new usrun.sh file with the following content (update the username {username}, project name {Project_name}, and environment name {venv}; e.g., {username} to paliwal and {venv} to cless):
#!/bin/sh
srun -K -p V100-32GB --ntasks 1 --gpus-per-task 1 --cpus-per-gpu=4 --mem-per-cpu 24G \
--container-image=/netscratch/enroot/nvcr.io_nvidia_pytorch_21.10-py3.sqsh \
--container-workdir="`pwd`" \
--container-mounts=/netscratch/$USER:/netscratch/$USER,/netscratch/enroot:/netscratch/enroot,/ds:/ds:ro,"`pwd`":"`pwd`",/home/{username}/{Project_name}:/home/{username}/{Project_name},/netscratch/{username}/miniconda3/envs/{venv}:/opt/conda/envs/{venv},/home/{username}/.netrc:/home/{username}/.netrc \
"$@"
# To run an interactive session, uncomment the line below and move it above `"$@"`
# --time 03:00:00 --pty /bin/bash \
## If installing packages doesn't work, run an interactive session and
## activate conda with the steps below before installing.
# apt update
# apt install tmux (optional)
# apt-get install ffmpeg libsm6 libxext6 -y (optional)
# conda init
# source /opt/conda/bin/activate
# conda activate {venv}
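The placeholders can simply be edited by hand. As an alternative, here is a small sketch that fills them in with sed and detects any that are left over. The helpers fill_placeholders and has_placeholders are hypothetical, and my_project is a made-up project name; the other values reuse the paliwal and cless example.

```shell
#!/bin/sh
# fill_placeholders USER PROJECT VENV: read a template on stdin and write the
# filled-in text to stdout. Hypothetical helper, not part of Slurm or conda.
fill_placeholders() {
    sed -e "s|{username}|$1|g" \
        -e "s|{Project_name}|$2|g" \
        -e "s|{venv}|$3|g"
}

# has_placeholders: succeed if stdin still contains a {name}-style placeholder.
has_placeholders() {
    grep -q '{[A-Za-z_][A-Za-z_]*}'
}

# Example on a single mount entry taken from the script above.
mount='/netscratch/{username}/miniconda3/envs/{venv}:/opt/conda/envs/{venv}'
printf '%s\n' "$mount" | fill_placeholders paliwal my_project cless
```

For a whole file, the same function can be applied as `fill_placeholders paliwal my_project cless < usrun.sh.template > usrun.sh`, where usrun.sh.template is an assumed copy of the script with the placeholders intact.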
Create a new train.sh file with the following content (update the environment name {venv}):
#!/bin/bash
source activate {venv}
# execution code
python pretrain_or_train_CLESS.py --conf_path confs/OpenEntity_CLESS_conf.py
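As written, train.sh keeps going if the environment fails to activate, so the python command would run with whatever interpreter the container provides. A hedged sketch of a guard that could be placed before the python line follows. It relies on CONDA_DEFAULT_ENV, which conda sets on activation; the function name require_env is hypothetical.

```shell
#!/bin/bash
# require_env NAME: fail unless the conda environment NAME is active.
require_env() {
    if [ "${CONDA_DEFAULT_ENV:-}" != "$1" ]; then
        echo "conda env '$1' is not active (current: '${CONDA_DEFAULT_ENV:-none}')" >&2
        return 1
    fi
}

# Intended use in train.sh, right after `source activate {venv}`:
#   require_env {venv} || exit 1
```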
Specify resources (e.g., GPU type, number of GPUs, RAM) in usrun.sh.
Append the commands to run to train.sh.
Execute in the terminal: ./usrun.sh train.sh.
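To make the resource step concrete: with the flags shown in usrun.sh, each GPU gets 4 CPUs and each CPU gets 24G of memory, so a one-GPU job requests 4 x 24G = 96G. The sketch below parameterizes those flags through environment variables. The variable names (PARTITION, GPUS, CPUS_PER_GPU, MEM_PER_CPU_GB) and the functions srun_resource_args and total_mem_gb are hypothetical, and the script only prints the arguments rather than calling srun.

```shell
#!/bin/sh
# Defaults mirror the values used in usrun.sh; override them from the environment.
PARTITION="${PARTITION:-V100-32GB}"
GPUS="${GPUS:-1}"
CPUS_PER_GPU="${CPUS_PER_GPU:-4}"
MEM_PER_CPU_GB="${MEM_PER_CPU_GB:-24}"

# Print the resource-related srun arguments (dry run, srun is not invoked).
srun_resource_args() {
    echo "-p $PARTITION --ntasks 1 --gpus-per-task $GPUS" \
         "--cpus-per-gpu=$CPUS_PER_GPU --mem-per-cpu ${MEM_PER_CPU_GB}G"
}

# Total memory requested: GPUs x CPUs per GPU x memory per CPU (in GB).
total_mem_gb() {
    echo $((GPUS * CPUS_PER_GPU * MEM_PER_CPU_GB))
}

srun_resource_args
echo "total memory: $(total_mem_gb)G"
```

With the defaults this prints the same resource flags as usrun.sh and a total of 96G; setting GPUS=2 in the environment doubles the total to 192G.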
For more information on Slurm and the parameters of srun, see the srun manual page (man srun).