Two main workarounds for mitigating the Hyak I/O issues

  • Containerize the job environment. Apptainer is recommended by the Hyak team both for speeding up Python startup time and for reproducibility.

  • Copy frequently used data to the /tmp dir on the node. According to the Hyak team, /tmp offers around 400GB of isolated, fast SSD storage, and loading/saving data there won't affect others' jobs or slow down Hyak (a short copy sketch follows this list).
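As a rough sketch of the second workaround (the dataset path below is a placeholder), copying data to the node-local SSD and checking the remaining space looks like this:

# check how much node-local scratch space is left before copying
df -h /tmp

# stage the dataset for this job on the node-local SSD (path is a placeholder)
mkdir -p /tmp/data
cp -r /path/to/frequently_used_dataset /tmp/data/

# point your job at /tmp/data/... instead of the network filesystem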

Build a container image on the GPU node

Use salloc to create an interactive session, e.g. salloc -c 8 -p ckpt --time=5-00:00 -n 1 --mem=64G --gpus=a40:1

Then run module load apptainer.
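Put together, the interactive setup is just these two steps (the partition, time, and resource numbers are examples; adjust them to your allocation):

# request an interactive GPU session (example resources)
salloc -c 8 -p ckpt --time=5-00:00 -n 1 --mem=64G --gpus=a40:1

# on the allocated node, load apptainer and confirm it is available
module load apptainer
apptainer --version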

Put the following content into a file called app.def (see the official docs for more info):

Bootstrap: docker
From: nvidia/cuda:11.3.0-cudnn8-runtime-ubuntu20.04

%setup
    echo "SETUP"
    echo "$HOME"
    echo `pwd`

%files
    /tmp/requirements.txt

%post
    # Downloads the latest package lists (important).
    apt-get update -y
    DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
        python3 \
        python3-pip \
        python3-setuptools

    # set python3 to be default python
    update-alternatives --install /usr/bin/python python /usr/bin/python3 1
    
    # Install Python modules.
    pip install torch==1.10.2+cu113 torchvision==0.11.3+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html
    pip install -r /tmp/requirements.txt

    # Reduce the size of the image by deleting the package lists we downloaded,
    # which are useless now.
    apt-get autoremove -y && \
    apt-get clean && \
    rm -rf /root/.cache && \
    rm -rf /var/lib/apt/lists/*

    # see if torch works
    python -m torch.utils.collect_env

    python -c 'import site; print(site.getsitepackages())'

%environment
    export PATH="$HOME/.local/bin:$PATH"
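Note that the %files section copies /tmp/requirements.txt into the image, so that file must exist on the build host first; the packages listed here are only placeholders for whatever your project needs:

# create the requirements file that app.def copies in (package list is a placeholder)
cat > /tmp/requirements.txt <<'EOF'
numpy
tqdm
transformers
EOF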

Then run apptainer build --nv --fakeroot /tmp/app.sif app.def (--nv exposes the GPU environment, --fakeroot avoids needing root permission).

Now you can shell into the container to finish your setup, such as installing Python packages and running setup commands: apptainer shell -B $(pwd) --nv app.sif (-B binds a folder into the container so you can read/write files inside it). Once the environment setup is done, just exit from the container.
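Inside the shell, a quick sanity check (a minimal sketch) before exiting is to confirm that PyTorch sees the GPU:

# run inside the container shell started above
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
exit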

Run a job inside the container image

Next time you can just exec the container to run your jobs, e.g. to install a package from source: apptainer exec -B $(pwd) --nv app.sif pip install -e .

Or run Python jobs: apptainer exec -B $(pwd) --nv app.sif python xxx.py --args xxx

Example run script (job.sh)

#!/bin/bash
module load apptainer

# copy frequently accessed data to /tmp dir, it's much faster!
mkdir -p /tmp/data
cp -r data/training* /tmp/data

n_gpus=$(nvidia-smi --list-gpus | wc -l)

ckpt=ckpt/fancy_model

# fire up a distributed training job
apptainer exec -B $(pwd) --nv app.sif torchrun --nnodes $SLURM_JOB_NUM_NODES --nproc_per_node ${n_gpus}  cli/run.py \
    task=xxx \
    task.image_db_dir=/tmp/data \
    num_workers=$(nproc) \
    train.ckpt=${ckpt} \
    train.epochs=xx \
    train.scheduler.warmup_ratio=0.01 \
    train.optimizer.learning_rate=1e-4 \
    2>&1 | tee data/${ckpt//\//_}.log
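Before handing job.sh to slurm, make sure it is executable; optionally, run it once by hand in an interactive salloc session to catch path errors early:

chmod +x job.sh
# optional smoke test from an interactive session
./job.sh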

Example slurm sbatch script (supports multi-node, multi-GPU jobs)

job.sh is the above example run script.

#!/bin/bash
#SBATCH --job-name=job
#SBATCH --output=data/slurm/job.out
#SBATCH --error=data/slurm/job.err
#SBATCH --time=1-00:00
#SBATCH --account=cse
#SBATCH --partition=gpu-rtx6k
#SBATCH --cpus-per-task=16
#SBATCH --mem=128G
#SBATCH --nodes=1
#SBATCH --gpus=4
#SBATCH --mail-type=ALL
#SBATCH --mail-user=$USER@uw.edu
#SBATCH --signal=B:TERM@120
#SBATCH --exclude=g3007

sbatch_args=$0
echo $sbatch_args

echo "======>start job...."
echo

echo "$(date): job $SLURM_JOBID starting on $SLURM_NODELIST"
srun job.sh
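Assuming the sbatch script above is saved as job.sbatch (the filename is arbitrary), submitting and monitoring it looks like:

mkdir -p data/slurm          # slurm won't create the --output/--error directories for you
sbatch job.sbatch
squeue -u $USER              # check queue status
tail -f data/slurm/job.out   # follow the job output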

FAQs

  • Possible home path permission issue (e.g. you modified the default HOME path): you can prepend the HOME var to apptainer build (e.g. HOME=/mmfs1/home/$USER/ apptainer build --nv --fakeroot /tmp/app.sif app.def)

  • Docker source (e.g. you need to compile CUDA src): try to find a suitable base image from https://hub.docker.com/r/nvidia/cuda/tags or https://catalog.ngc.nvidia.com/containers

  • Python package location: /usr/local/lib/python3.8/ (in fact, this is the key to avoiding loading/importing Python modules from regular storage locations like /mmfs1/home/$USER/.local)
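To double-check that imports really resolve to the image's /usr/local/lib/python3.8/ rather than ~/.local, a quick check (a minimal sketch) is:

# torch should be loaded from inside the image, not from $HOME/.local
apptainer exec --nv app.sif python -c "import torch; print(torch.__file__)"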
