Distributed Model Training in PyTorch DDP

Prerequisites

Ensure GPU device is ready

Check that the GPU is present with lspci.

Ensure enough free disk space

The cuda-toolkit package and PyTorch with CUDA support require around 16 GB of disk space to install.

Install CUDA toolkit

Refer to https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html

Before you do:

  1. Check supported OS and versions. We will use Ubuntu 20.04.

  2. Check supported C/C++ compiler version. We will use GCC 9.4.

  3. Ensure kernel headers and kernel development packages are installed with the EXACT version matching uname -r.

    If you perform a system update which changes the version of the Linux kernel being used, make sure to rerun the commands below to ensure you have the correct kernel headers and kernel development packages installed. Otherwise, the CUDA Driver will fail to work with the new kernel.

    The package name for the headers looks like linux-headers-5.15.0-1042, while the package name for the kernel development files looks like linux-image-5.15.0-1042. Note the prefixes linux-headers- and linux-image-; the remaining part usually matches the output of uname -r.

  4. cuda-toolkit requires 6+ GB of disk space. Ensure you have enough free space, including in /tmp (see the quick check after this list).
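If you want a scripted check of free space (a convenience, not part of the official NVIDIA steps), the Python standard library is enough:

    import shutil

    # Check free space on the root filesystem and where packages are unpacked (/tmp).
    for path in ("/", "/tmp"):
        free_gb = shutil.disk_usage(path).free / 2**30
        print(f"{path}: {free_gb:.1f} GB free")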

Then install the CUDA toolkit on Ubuntu 20.04

Refer to https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#ubuntu

  1. sudo apt-get install linux-headers-$(uname -r)

  2. Install keyring

    wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.1-1_all.deb
    sudo dpkg -i cuda-keyring_1.1-1_all.deb
  3. Install the meta package cuda-toolkit

    sudo apt-get update
    sudo apt-get install cuda-toolkit

Once cuda-toolkit is installed, you can check the installed CUDA version with nvcc --version. The version info is needed to select a proper build of PyTorch.

Install PyTorch with CUDA support

Refer to https://pytorch.org/get-started/locally/

The correct install command depends on OS type and CUDA version. In our case the OS is Linux and the CUDA version is 12.x. So we choose

pip3 install torch torchvision torchaudio
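After installation, one quick way to confirm that PyTorch was built with CUDA support and can see the GPU (a sanity check, not part of the official instructions):

    import torch

    print(torch.__version__)          # PyTorch version
    print(torch.version.cuda)         # CUDA version this PyTorch build was compiled against
    print(torch.cuda.is_available())  # True if the driver and GPU are usable
    print(torch.cuda.device_count())  # number of visible GPUs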

Build your model on PyTorch DDP

Basically, you need to

  1. Wrap your model in a DDP model.
  2. Split your training dataset based on rank and world size at runtime so that each training process works on one subset.
  3. Run your training script with a DDP launcher command line (torch.distributed.run).

DDP takes care of coordinating the training processes and synchronizing gradients among them across nodes.

See https://github.com/coin8086/ml-lab/tree/main/src/pytorch_ddp for sample code.
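For illustration only, here is a minimal sketch of those three steps, independent of the linked sample (the model, dataset, and hyperparameters are placeholders):

    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.utils.data import DataLoader, TensorDataset
    from torch.utils.data.distributed import DistributedSampler

    def main():
        # torch.distributed.run sets RANK, LOCAL_RANK and WORLD_SIZE for each process.
        local_rank = int(os.environ["LOCAL_RANK"])
        use_cuda = torch.cuda.is_available()
        dist.init_process_group(backend="nccl" if use_cuda else "gloo")
        if use_cuda:
            torch.cuda.set_device(local_rank)
        device = torch.device(f"cuda:{local_rank}" if use_cuda else "cpu")

        # 1. Wrap the model in DDP.
        model = torch.nn.Linear(10, 1).to(device)
        ddp_model = DDP(model, device_ids=[local_rank] if use_cuda else None)

        # 2. Split the dataset by rank and world size with DistributedSampler.
        dataset = TensorDataset(torch.randn(1000, 10), torch.randn(1000, 1))
        sampler = DistributedSampler(dataset)
        loader = DataLoader(dataset, batch_size=32, sampler=sampler)

        loss_fn = torch.nn.MSELoss()
        optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

        for epoch in range(2):
            sampler.set_epoch(epoch)  # reshuffle differently each epoch
            for x, y in loader:
                x, y = x.to(device), y.to(device)
                optimizer.zero_grad()
                loss = loss_fn(ddp_model(x), y)
                loss.backward()  # gradients are averaged across all processes here
                optimizer.step()

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

Step 3 is the torch.distributed.run command line shown in the HPC Pack section below.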


Run a training job in HPC Pack

Set up a shared directory

Before you can run a training job, you need a shared directory that can be accessed by all the compute nodes. The directory holds the training code and data (both the input dataset and the output trained model).

You can set up an SMB shared directory on the head node and then mount it on each compute node with CIFS, like this:

  1. On a head node, make a directory app under %CCP_DATA%\SpoolDir, which is already shared as CcpSpoolDir by HPC Pack by default.

  2. On a compute node, mount the app directory like

    sudo mkdir /app
    sudo mount -t cifs //<your head node name>/CcpSpoolDir/app /app -o vers=2.1,domain=<hpc admin domain>,username=<hpc admin>,password=<your password>,dir_mode=0777,file_mode=0777

    NOTE:

    • The password option can be omitted in an interactive shell. You will be prompted for it in that case.
    • dir_mode and file_mode are set to 0777 so that any Linux user can read and write the share. More restrictive permissions are possible but more complicated to configure.
  3. Optionally, make the mount permanent by adding a line to /etc/fstab like

    //<your head node name>/CcpSpoolDir/app /app cifs vers=2.1,domain=<hpc admin domain>,username=<hpc admin>,password=<your password>,dir_mode=0777,file_mode=0777 0 2
    

    Here the password is required.

Start a training job

Download the following files from the sample code into the shared directory %CCP_DATA%\SpoolDir\app:

  • neural_network.py
  • operations.py
  • run_ddp.py

Then create a job with node as the resource unit. The command lines of the job's tasks are all the same, like

python3 -m torch.distributed.run --nnodes=<the number of compute nodes> --nproc_per_node=<the processes on each node> --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=<a node name>:29400 /app/run_ddp.py

Note:

  • nnodes specifies the number of compute nodes for your training job.
  • nproc_per_node specifies the number of processes on each compute node. It cannot exceed the number of GPUs on a node; that is, each GPU can serve at most one process.
  • rdzv_endpoint specifies the host name and port of a node that acts as the rendezvous endpoint. Any node in the training job will work.
  • "/app/run_ddp.py" is the path to your training script. Remember that /app is the shared directory mounted from the head node.


Issues found in HPC Pack

  1. When "Run Command", check "Run as local system account NT AUTHORITY/SYSTEM" to run as root on Linux node.

  2. When setting up the environment with "Run Command", "nvcc --version" failed after cuda-toolkit was installed. The error was

    IaaSCN116 -> Failed
    ---------------------------------------------------------------------------------------------------
    /tmp/nodemanager_task_374_0.SlfRQe/cmd.sh: line 3: nvcc: command not found
    
    Task failed during execution with exit code . Please check task's output for error details.  
    

    However, the command succeeded in a separate SSH shell. It seems the "Run Command" shell doesn't have the proper PATH, which can be seen with

    echo "$(IFS=: ; for p in $PATH; do echo "$p"; done)"

    It can be worked around with bash -ic "nvcc --version" in "Run Command": -i forces an interactive shell, which reads /etc/bash.bashrc, where the correct PATH for CUDA is set up.
