RCCL AllReduce Performance

David M. Rogers, National Center for Computational Science, Oak Ridge National Laboratory. July 12, 2023.

All-reduce is a core functionality of HPC applications like iterative solvers, which sum residuals after a parameter update. AI/ML methods fall in this category, performing a sum-reduction over derivatives of each ML parameter. Another class of codes like principal component analysis and electronic structure methods use all-reduce to create dot-products over large distributed vectors.
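
To fix notation (these symbols are introduced here for clarity and are not part of any library API), a sum all-reduce over $P$ ranks leaves every rank holding the same element-wise total of the per-rank contributions:

$$ y = \sum_{r=0}^{P-1} x_r $$

where $x_r$ is rank $r$'s local vector (of gradients, residuals, or partial dot-products) and the result $y$ is replicated on every rank.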

The NCCL library provided by NVIDIA optimizes the communication graph used by all-reduce to prioritize intra-node communication, where bandwidth is higher, and as a consequence achieves higher overall bandwidth (Jeaugey, 2019). Unmodified NCCL does not make use of libfabric and defaults to TCP sockets on Slingshot 11 (TM) interconnect hardware. However, AWS has published a plugin (Kim, Kheria, Inozemtsev, 2018-2022) that allows NCCL to use libfabric. This plugin has been run on NERSC's Slingshot 11 (TM) network with support from NVIDIA and HPE, and showed very good bandwidth scaling at the 2023 ECP annual meeting (Sudip Dosanjh, throughput plot for FourCastNet on slide 19, ECP Annual Meeting 2023). Without the plugin, Perlmutter showed 2-3x worse bandwidth with NCCL on Slingshot 11 (TM) than on Slingshot 10 (TM) in a 32 MB all-reduce from nccl-tests (Addison and Jeaugey, 2018-2022). Perlmutter has 1 EPYC 7763, 4 A100 GPUs, and 4 Slingshot 11 connections per node. With the plugin, Slingshot 11 (TM) instead showed a 3x improvement over Slingshot 10 (TM).

In this note, we reproduce the above scaling study using the HIP-translated AMD packages (RCCL and the aws-ofi-rccl plugin).

Build Recipes

I set up my environment variables using the following set of modules. At present, these modules are installed on both Crusher and Frontier -- so the workflow is very likely to transfer directly. Executables and libraries may even be binary compatible, so re-compilation can be skipped.

Environment Setup

# env.sh, to be sourced
# Load modules mirroring the Frontier environment using cray compilers
# This file is suitable for running on either OLCF Crusher or Frontier

module load PrgEnv-cray/8.3.3
module load netlib-scalapack/2.2.0
module load cray-fftw/3.3.10.3
module load cray-hdf5-parallel/1.12.1.1
module load craype-accel-amd-gfx90a
module load rocm/5.2.0
module load cmake/3.23.2

module load craype/2.7.19
module load cray-pmi/6.1.3
module load cray-mpich/8.1.17

module load cray-libsci/22.06.1.3

WD=$PWD # set to top-level of local install path

Note that the center is in the process of deprecating the "rocm" module names in favor of "amd", e.g. amd/5.2.0 instead of rocm/5.2.0. The amd module name is consistent with the cce and gcc modules, to which it is an alternative. However, amd/5.2.0 does not define OLCF_ROCM_ROOT like rocm/5.2.0 does; instead, it defines CRAY_AMD_COMPILER_PREFIX. Both define ROCM_PATH. All of these variables currently point to /opt/rocm-5.2.0.
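
As a quick sanity check of which prefix is in effect (a minimal sketch; module's terse listing plus the variables named above are the only ingredients):

source ./env.sh
module -t list 2>&1 | grep -E 'rocm|amd'
echo "ROCM_PATH      = $ROCM_PATH"       # defined by both rocm/ and amd/ modules
echo "OLCF_ROCM_ROOT = $OLCF_ROCM_ROOT"  # defined by rocm/5.2.0 only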

RCCL

Here is the recipe for compiling and installing RCCL. I did not end up using this, since the system contains a compiled rccl at /opt/rocm-5.2.0/rccl.

git clone -b release/rocm-rel-5.2 \
    https://github.com/ROCmSoftwarePlatform/rccl.git

cd rccl
mkdir build && cd build

cmake -DCMAKE_CXX_COMPILER=hipcc \
      -DROCM_PATH=$ROCM_PATH \
      -DCMAKE_HIP_ARCHITECTURES=gfx90a \
      -DCMAKE_INSTALL_PREFIX=$WD \
      ..
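
The configure step above would normally be followed by the standard CMake build and install (a sketch only; I did not run it, since the system copy was used instead):

make -j 16       # builds librccl.so for gfx90a
make install     # installs the headers and library under $WD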

AWS-OFI-RCCL

This is a plug-in to RCCL:

git clone https://github.com/ROCmSoftwarePlatform/aws-ofi-rccl
cd aws-ofi-rccl
# check out the commit matching your ROCm version:
git checkout 8afccecad8c6987163d2783680555a5da51c5452 # for rocm 5.2.0
#git checkout 52cbc4766401be4410fd0aa5207f2f9b6226c321 # latest for rocm 5.4.x

./autogen.sh
GCC=hipcc \
CC=hipcc \
CXX=hipcc \
CFLAGS="-I$OLCF_ROCM_ROOT/rccl/include -I$OLCF_ROCM_ROOT/include" \
LDFLAGS="-Wl,-rpath-link,$OLCF_ROCM_ROOT/lib -L$OLCF_ROCM_ROOT/lib" \
  ./configure \
    --with-libfabric=/opt/cray/libfabric/1.15.0.0 \
    --with-hip=$OLCF_ROCM_ROOT/hip \
    --with-rccl=$OLCF_ROCM_ROOT \
    --with-mpi=$MPICH_DIR \
    --prefix=$WD
make -j8
make install

It can be enabled by adding the install location of aws-ofi-rccl's librccl-net.so to LD_LIBRARY_PATH.
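
With the --prefix=$WD used above, the plugin library lands in $WD/lib, which matches the LD_LIBRARY_PATH setting used in the run commands below:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$WD/lib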

Tests

RCCL-tests (compile without MPI)

$ git clone https://github.com/ROCmSoftwarePlatform/rccl-tests
$ cd rccl-tests
$ git checkout 3fbd3280ce2d747a902176a3c87c8c49c32f3fc2
$ make HIP_HOME=$ROCM_PATH NCCL_HOME=$ROCM_PATH/nccl

$ ./build/all_reduce_perf -b32M -e 32M -f 2 -g 2
# nThreads: 1 nGpus: 2 nRanks: 1 minBytes: 33554432 maxBytes: 33554432 step: 2(factor) warmupIters: 5 iters: 20 validation: 1 
#
# Using devices
#   Rank  0 Pid  96230 on     login2 device  0 [0000:c1:00.0] 
#   Rank  1 Pid  96230 on     login2 device  1 [0000:c6:00.0] 
#
#                                                       out-of-place                       in-place          
#       size         count      type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                       (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
    33554432       8388608     float     sum    356.0   94.25   94.25  0e+00    356.3   94.18   94.18  0e+00
# Errors with asterisks indicate errors that have exceeded the maximum threshold.
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 94.2173 

I compiled without MPI first as a control. The -b, -e, and -f flags select only the 32 MB all-reduce test size. The -g flag tells the single test process to use two GPUs (note nThreads: 1, nGpus: 2 in the header above). Running the resulting binary on the Crusher login node shows good bandwidth between its two GPUs.

RCCL-tests (compile with MPI)

After running this a few times with Slurm process binding enabled, I realized I needed to fix the test's manual selection of GPUs, which defaulted to gpu = local MPI rank. That resulted in errors like crusher016: Test HIP failure all_reduce.cu:41 'hipErrorInvalidDevice'. So I modified src/all_reduce.cu and src/common.cu to comment out all if (args->enable_multiranks) and if (enable_multiranks) checks. Those checks guarded the line gpuid = gpuid % args->localNumDevices, which is needed to ensure that, with process binding, gpuid is always 0, and that, without process binding, gpuid is unaffected. My updates are posted at: https://github.com/frobnitzem/rccl-tests/tree/multigpu

This may not be needed on recent RCCL-tests if they have looked at ROCR_VISIBLE_DEVICES.
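
A quick way to see what each rank is actually handed (nothing rccl-tests specific here, just echoing the variable under the same Slurm flags used below):

$ srun -t 1 -A stf006 -N 1 -n 8 -c7 --gpus-per-task 1 --gpu-bind=closest \
    bash -c 'echo "rank $SLURM_PROCID: ROCR_VISIBLE_DEVICES=$ROCR_VISIBLE_DEVICES"'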

The compile recipe works regardless of the patch mentioned above:

$ make MPI=1 MPI_HOME=$MPICH_DIR HIP_HOME=$ROCM_PATH NCCL_HOME=$ROCM_PATH/nccl LDFLAGS="-L$CRAY_MPICH_ROOTDIR/gtl/lib -lmpi_gtl_hsa"
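
Since GPU-aware MPI on this system goes through the GTL library, it is worth confirming that -lmpi_gtl_hsa actually got linked in (plain ldd, nothing specific to rccl-tests):

$ ldd ./build/all_reduce_perf | grep gtl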

$ srun -t 1 -A stf006 -N 1 -n 8 -c7 --gpus-per-task 1 --gpu-bind=closest ./build/all_reduce_perf -b32M -e 32M -f 2 -g 1
Avg bus bandwidth    : 9.5696

$ srun -t 1 -A stf006 -N 1 -n 1 -c56 --gpus-per-task 8 --gpu-bind=closest ./build/all_reduce_perf -b32M -e 32M -f 2 -g 8

PE 0: 
MPICH ERROR: Unable to use a NIC_POLICY of 'NUMA'. Rank 0 is not confined to a single NUMA node.  There are 4 numa_nodes detected (rc=0).

$ srun -t 1 -A stf006 -N1 -n8 -c7 ./build/all_reduce_perf -b32M -e 32M -f 2 -g 1
# note incorrect binding, gpu = rank
Avg bus bandwidth    : 107.129

$ srun -t 1 -A stf006 -N2 -n16 -c7 ./build/all_reduce_perf -b32M -e 32M -f 2 -g 1
# note incorrect binding, gpu = rank % 8
Avg bus bandwidth    : 5.87527

In the first test, using the correct MPI rank-to-GPU binding, the GCD-to-GCD bandwidth drops substantially (9.6 GB/s here vs. 94 GB/s on the login node). This is likely because inter-GPU communication is going through TCP sockets. In the second test, we tried creating one task holding all 8 GCDs on a node; however, MPI complains about this setup, likely because --gpu-bind=closest confuses Slurm and MPI (the CPU-to-GPU mapping matters). In the last two tests, we see excellent intra-node bandwidth when no process binding is used, but poor inter-node bandwidth.

Those notes on process binding refer to the fact that certain CPUs are "close" to certain GCDs on a Frontier node. The optimal mapping is documented in the OLCF User Guide.

Now we run the same tests using the libfabric plug-in:

$ LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$WD/lib srun -t 1 -A stf006 -N1 -n8 -c7 ./build/all_reduce_perf -b32M -e 32M -f 2 -g 1
# note incorrect binding, gpu = rank
Avg bus bandwidth    : 107.259

$ LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$WD/lib srun -t 1 -A stf006 -N2 -n16 -c7 ./build/all_reduce_perf -b32M -e 32M -f 2 -g 1
# note incorrect binding, gpu = rank % 8
Avg bus bandwidth    : 42.7237

$ LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$WD/lib srun -t 1 -A stf006 -N4 -n32 -c7 ./build/all_reduce_perf -b32M -e 32M -f 2 -g 1
Avg bus bandwidth    : 25.8216

I also confirmed the program is loading the plugin by setting the environment variable NCCL_DEBUG=INFO. With this set, the output contains the line NCCL INFO NET/OFI Using aws-ofi-rccl 1.2.0.
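
For example, rerunning the 2-node case with the debug variable added and filtering for the plugin line (a sketch of the check, not new data):

$ LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$WD/lib NCCL_DEBUG=INFO \
    srun -t 1 -A stf006 -N2 -n16 -c7 ./build/all_reduce_perf -b32M -e 32M -f 2 -g 1 \
    | grep 'NET/OFI'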

Results

Table 1 summarizes timing values reported by all_reduce_perf. Note that multiple tests were run, so the reported average bandwidth does not necessarily correspond exactly to the single time listed. For one trial, the relationship between them is

$$ \mathrm{bandwidth} = \frac{2 \cdot 32\ \mathrm{MB}}{\mathrm{ofi\ time}} $$

Table 1, Summary of timing values reported by all_reduce_perf.

nodes   tcp time / us   ofi time / us   bandwidth / GB/s
    1           547.4           547.5            114
    2           10947          1471.9            42.5
    4           11630          2215.0            14.1
    8           14625          3073.3            21.91
   16               -          3423.5            18.3
   32               -          3281.7            19.0
  960               -           10481             5.96
  960               -           12004             5.20
 3000               -          136298             0.46
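
As a worked check of that relationship, take the 2-node row (treating MB as $10^6$ B):

$$ \frac{2 \cdot 32\ \mathrm{MB}}{1471.9\ \mu\mathrm{s}} = \frac{64 \times 10^{6}\ \mathrm{B}}{1.4719 \times 10^{-3}\ \mathrm{s}} \approx 43.5\ \mathrm{GB/s}, $$

close to the 42.5 GB/s listed; the remaining gap is consistent with the note above about averaging over multiple runs.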

These results compare favorably with Frontier's peak bandwidth specs, which are 200+200 GB/s between the two GCDs of the same MI250X, 50+50 GB/s between GPUs on the same node, and 25+25 GB/s between GPUs on different nodes. Because each node has 4 network links, each of which is 25+25 GB/s, there may even be extra performance to be gained by further adapting RCCL to the Slingshot 11 (TM) network topology.

Note that startup times for RCCL can be very slow (on the order of 10 minutes), especially for large node counts. Also, if the environment variable MPICH_SMP_SINGLE_COPY_MODE=NONE is set, then RCCL with the plugin appears to hang (tried on the 3000-node case), since it does not complete its initialization phase.
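
If that variable has been set by a site or application script, the simplest workaround (an assumption on my part, not something verified at scale) is to clear it before launching:

unset MPICH_SMP_SINGLE_COPY_MODE  # avoid the plugin initialization hang noted above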

Concluding Thoughts

The efficient all-reduce algorithm implemented by NCCL (and RCCL) is worth the programming effort because it outperforms the native MPI all-reduce. That advantage comes from the library's work to understand the network topology and from low-level API calls that transfer data directly. Ideally, those improvements would be incorporated back into MPI's allreduce call, eliminating the programming burden of adding more libraries. However, the MPI specification itself makes this integration difficult because its paradigm is becoming outdated. A major part of the issue is that MPI calls are made from the CPU, while data transfers happen on the GPU. Another difficulty comes from the need to use MPI process topologies and CPU/thread mappings correctly; these are not directly addressed by MPI's list of function calls, but are more semantic in nature. NCCL bypasses these issues by providing its own semantics for threads, events, and streams covering a subset of MPI's functionality.
