Distributed Model Training in PyTorch DDP

Prerequisites

Ensure GPU device is ready

Check that the GPU is present with lspci.

Ensure enough free disk space

The cuda-toolkit package and PyTorch with CUDA support require around 16 GB of disk space to install.

Install CUDA toolkit

Refer to https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html

Before you do:

  1. Check supported OS and versions. We will use Ubuntu 20.04.

  2. Check supported C/C++ compiler version. We will use GCC 9.4.

  3. Ensure kernel headers and kernel development packages are installed with the EXACT version matching uname -r.

    If you perform a system update which changes the version of the Linux kernel being used, make sure to rerun the commands below to ensure you have the correct kernel headers and kernel development packages installed. Otherwise, the CUDA Driver will fail to work with the new kernel.

    The package name for the headers looks like linux-headers-5.15.0-1042, while the package name for the kernel development files looks like linux-image-5.15.0-1042. Note the prefixes linux-headers- and linux-image-; the remaining part usually matches the output of uname -r.

  4. cuda-toolkit requires 6+ GB of disk space. Ensure you have enough free space, including in /tmp (see the quick check after this list).
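If you want a scripted check of free space (a convenience, not part of the official NVIDIA steps), the Python standard library is enough:

    import shutil

    # Check free space on the root filesystem and where packages are unpacked (/tmp).
    for path in ("/", "/tmp"):
        free_gb = shutil.disk_usage(path).free / 2**30
        print(f"{path}: {free_gb:.1f} GB free")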

Then install the CUDA toolkit on Ubuntu 20.04

Refer to https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#ubuntu

  1. sudo apt-get install linux-headers-$(uname -r)

  2. Install keyring

    wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.1-1_all.deb
    sudo dpkg -i cuda-keyring_1.1-1_all.deb
  3. Install the meta package cuda-toolkit

    sudo apt-get update
    sudo apt-get install cuda-toolkit

Once cuda-toolkit is installed, you can check the installed CUDA version with nvcc --version. The version info is needed to select a proper build of PyTorch.

Install PyTorch with CUDA support

Refer to https://pytorch.org/get-started/locally/

The correct install command depends on OS type and CUDA version. In our case the OS is Linux and the CUDA version is 12.x. So we choose

pip3 install torch torchvision torchaudio
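After installation, one quick way to confirm that PyTorch was built with CUDA support and can see the GPU (a sanity check, not part of the official instructions):

    import torch

    print(torch.__version__)          # PyTorch version
    print(torch.version.cuda)         # CUDA version this PyTorch build was compiled against
    print(torch.cuda.is_available())  # True if the driver and GPU are usable
    print(torch.cuda.device_count())  # number of visible GPUs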

Build your model on PyTorch DDP

Basically, you need to

  1. Wrap your model in a DDP model.
  2. Split your training dataset based on rank and world size at runtime so that each training process works on one subset.
  3. Run your training script with a DDP launcher command line (torch.distributed.run).

DDP takes care of coordinating the training processes and synchronizing gradients among them across nodes.

See https://github.com/coin8086/ml-lab/tree/main/src/pytorch_ddp for sample code.
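For illustration only, here is a minimal sketch of those three steps, independent of the linked sample (the model, dataset, and hyperparameters are placeholders):

    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.utils.data import DataLoader, TensorDataset
    from torch.utils.data.distributed import DistributedSampler

    def main():
        # torch.distributed.run sets RANK, LOCAL_RANK and WORLD_SIZE for each process.
        local_rank = int(os.environ["LOCAL_RANK"])
        use_cuda = torch.cuda.is_available()
        dist.init_process_group(backend="nccl" if use_cuda else "gloo")
        if use_cuda:
            torch.cuda.set_device(local_rank)
        device = torch.device(f"cuda:{local_rank}" if use_cuda else "cpu")

        # 1. Wrap the model in DDP.
        model = torch.nn.Linear(10, 1).to(device)
        ddp_model = DDP(model, device_ids=[local_rank] if use_cuda else None)

        # 2. Split the dataset by rank and world size with DistributedSampler.
        dataset = TensorDataset(torch.randn(1000, 10), torch.randn(1000, 1))
        sampler = DistributedSampler(dataset)
        loader = DataLoader(dataset, batch_size=32, sampler=sampler)

        loss_fn = torch.nn.MSELoss()
        optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

        for epoch in range(2):
            sampler.set_epoch(epoch)  # reshuffle differently each epoch
            for x, y in loader:
                x, y = x.to(device), y.to(device)
                optimizer.zero_grad()
                loss = loss_fn(ddp_model(x), y)
                loss.backward()  # gradients are averaged across all processes here
                optimizer.step()

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

Step 3 is the torch.distributed.run command line shown in the HPC Pack section below.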


Run a training job in HPC Pack

Set up a shared directory

Before you can run a training job, you need a shared directory that can be accessed by all the compute nodes. The directory holds the training code and data (both the input dataset and the output trained model).

You can set up an SMB shared directory on the head node and then mount it on each compute node with CIFS, like this:

  1. On a head node, make a directory app under %CCP_DATA%\SpoolDir, which is already shared as CcpSpoolDir by HPC Pack by default.

  2. On a compute node, mount the app directory like

    sudo mkdir /app
    sudo mount -t cifs //<your head node name>/CcpSpoolDir/app /app -o vers=2.1,domain=<hpc admin domain>,username=<hpc admin>,password=<your password>,dir_mode=0777,file_mode=0777

    NOTE:

    • The password option can be omitted in an interactive shell. You will be prompted for it in that case.
    • dir_mode and file_mode are set to 0777 so that any Linux user can read and write the share. More restrictive permissions are possible but more complicated to configure.
  3. Optionally, make the mount permanent by adding a line to /etc/fstab like

    //<your head node name>/CcpSpoolDir/app /app cifs vers=2.1,domain=<hpc admin domain>,username=<hpc admin>,password=<your password>,dir_mode=0777,file_mode=0777 0 2
    

    Here the password is required.

Start a training job

Download the following files from the sample code into the shared directory %CCP_DATA%\SpoolDir\app:

  • neural_network.py
  • operations.py
  • run_ddp.py

Then create a job with node as the resource unit. The command lines of the job's tasks are all the same, like

python3 -m torch.distributed.run --nnodes=<the number of compute nodes> --nproc_per_node=<the processes on each node> --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=<a node name>:29400 /app/run_ddp.py

Note:

  • nnodes specifies the number of compute nodes for your training job.
  • nproc_per_node specifies the number of processes on each compute node. It cannot exceed the number of GPUs on a node; that is, each GPU can serve at most one process.
  • rdzv_endpoint specifies the host name and port of a node that acts as the rendezvous endpoint. Any node in the training job will work.
  • "/app/run_ddp.py" is the path to your training script. Remember that /app is the shared directory mounted from the head node.


Issues found in HPC Pack

  1. When "Run Command", check "Run as local system account NT AUTHORITY/SYSTEM" to run as root on Linux node.

  2. When setting up the environment with "Run Command", "nvcc --version" failed after cuda-toolkit was installed. The error was

    IaaSCN116 -> Failed
    ---------------------------------------------------------------------------------------------------
    /tmp/nodemanager_task_374_0.SlfRQe/cmd.sh: line 3: nvcc: command not found
    
    Task failed during execution with exit code . Please check task's output for error details.  
    

    However, the command succeeded in a separate SSH shell. It seems the "Run Command" shell doesn't have the proper PATH, which can be seen with

    echo "$(IFS=: ; for p in $PATH; do echo "$p"; done)"

    It can be worked around with bash -ic "nvcc --version" in "Run Command": -i forces an interactive shell, which reads /etc/bash.bashrc, where the correct PATH for CUDA is set up.
