@nik1806
Last active November 23, 2022 15:55

Working with Slurm (GPU cluster)

1. Set up the working environment

  1. Download the Miniconda installer from https://docs.conda.io/en/latest/miniconda.html.
  2. Move to /netscratch/$USER and copy install_miniconda.sh there.
  3. Execute the file using: ./install_miniconda.sh. This will install a basic conda environment.
  4. Create a custom conda environment: conda env create --file environment.yml.
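
Before moving on, it can help to confirm that the environment was actually created, because the launcher script in the next section mounts it from /netscratch/{username}/miniconda3/envs/{venv}. The following is a minimal sketch of such a check. The helper names `env_exists` and `check_env` and the `CONDA_ROOT` variable are hypothetical, and the default install location is an assumption taken from that mount path.

```shell
#!/bin/sh
# Sketch: check that a conda environment directory exists before mounting it.
# CONDA_ROOT is a hypothetical variable; its default is an assumption based on
# the mount path /netscratch/{username}/miniconda3 used in usrun.sh.
CONDA_ROOT="${CONDA_ROOT:-/netscratch/$USER/miniconda3}"

# env_exists NAME: succeeds if $CONDA_ROOT/envs/NAME is a directory.
env_exists() {
    [ -d "$CONDA_ROOT/envs/$1" ]
}

# check_env NAME: print a short status line for the given environment.
check_env() {
    if env_exists "$1"; then
        echo "found: $CONDA_ROOT/envs/$1"
    else
        echo "missing: $CONDA_ROOT/envs/$1"
    fi
}
```

For example, `check_env cless` reports whether the cless environment is present under the assumed install location.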

2. Create execution files

  1. Create a new file usrun.sh with the following content (replace the username {username}, the environment name {venv}, and the project directory {Project_name}; e.g., set {username} to paliwal and {venv} to cless):
#!/bin/sh
srun -K -p V100-32GB --ntasks 1 --gpus-per-task 1 --cpus-per-gpu=4 --mem-per-cpu 24G \
    --container-image=/netscratch/enroot/nvcr.io_nvidia_pytorch_21.10-py3.sqsh \
    --container-workdir="`pwd`" \
    --container-mounts=/netscratch/$USER:/netscratch/$USER,/netscratch/enroot:/netscratch/enroot,/ds:/ds:ro,"`pwd`":"`pwd`",/home/{username}/{Project_name}:/home/{username}/{Project_name},/netscratch/{username}/miniconda3/envs/{venv}:/opt/conda/envs/{venv},/home/{username}/.netrc:/home/{username}/.netrc \
    "$@"
    # To run an interactive session, uncomment the line below and move it above `"$@"`:
    # --time 03:00:00 --pty /bin/bash \
    
## If installing packages does not work, run an interactive session and use the
## additional steps below to activate conda, then install the packages:
    # apt update
    # apt install tmux (optional)
    # apt-get install ffmpeg libsm6 libxext6  -y (optional)
    # conda init
    # source /opt/conda/bin/activate
    # conda activate {venv}
  2. Create a new file train.sh with the following content (update the environment name {venv}):
#!/bin/bash
source activate {venv}
# execution code
python pretrain_or_train_CLESS.py --conf_path confs/OpenEntity_CLESS_conf.py
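
The placeholders in both files can be edited by hand, but a small substitution helper makes the step repeatable. The sketch below uses `sed` to fill in {username}, {venv}, and {Project_name}. The function name `fill_placeholders` is hypothetical, and it assumes the values contain no `|` or `&` characters, which `sed` would treat specially here.

```shell
#!/bin/sh
# Sketch: replace the {username}, {venv} and {Project_name} placeholders.
# fill_placeholders is a hypothetical helper, not part of Slurm or conda.
# Usage: fill_placeholders FILE USERNAME VENV PROJECT
# The filled-in text is printed to stdout; FILE itself is not modified.
fill_placeholders() {
    sed -e "s|{username}|$2|g" \
        -e "s|{venv}|$3|g" \
        -e "s|{Project_name}|$4|g" \
        "$1"
}
```

A possible use is `fill_placeholders usrun.sh paliwal cless my_project > usrun.filled.sh`, where my_project is a made-up project directory name. Writing to a separate output file avoids truncating the template while `sed` is still reading it.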

3. Code execution

  1. Specify the resources (e.g., GPU type, number of GPUs, RAM) in usrun.sh.
  2. Add the commands to run to train.sh.
  3. Execute in the terminal: ./usrun.sh train.sh.
  • For more information on Slurm and the parameters of srun, see the Slurm documentation for srun.
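
The steps above can be made a little more convenient with two small helpers, sketched here under stated assumptions. `resource_flags` builds the resource part of the `srun` command from environment variables, so the values need not be edited in usrun.sh for every run. `preflight` checks that the job script is executable, which `srun` will typically require. Both function names and the variable names `PARTITION`, `GPUS`, `CPUS_PER_GPU`, and `MEM_PER_CPU` are hypothetical; the defaults mirror the values in usrun.sh.

```shell
#!/bin/sh
# Sketch: helpers for the submission step. All names here are hypothetical.

# resource_flags: print the srun resource options, taking overrides from the
# environment and falling back to the values used in usrun.sh.
resource_flags() {
    printf '%s %s %s %s\n' \
        "-p ${PARTITION:-V100-32GB} --ntasks 1" \
        "--gpus-per-task ${GPUS:-1}" \
        "--cpus-per-gpu=${CPUS_PER_GPU:-4}" \
        "--mem-per-cpu ${MEM_PER_CPU:-24G}"
}

# preflight SCRIPT: succeed only if SCRIPT exists and is executable.
preflight() {
    if [ ! -x "$1" ]; then
        echo "not executable: $1 (try: chmod +x $1)" >&2
        return 1
    fi
}
```

For instance, `preflight train.sh && ./usrun.sh train.sh` submits the job only if train.sh can be executed. To use the flags, usrun.sh could be changed to start with `srun -K $(resource_flags)`, left unquoted so that the shell splits the output into separate options.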