David M. Rogers, National Center for Computational Science, Oak Ridge National Laboratory July 12, 2023.
All-reduce is a core functionality of HPC applications like iterative solvers, which sum residuals after a parameter update. AI/ML methods fall in this category, performing a sum-reduction over derivatives of each ML parameter. Another class of codes like principal component analysis and electronic structure methods use all-reduce to create dot-products over large distributed vectors.
The NCCL library provided by NVIDIA optimizes the communication graph used by all-reduce to prioritize intra-node communication, where bandwidth is higher. As a consequence, it achieves higher overall bandwidth (Jeaugey, 2019). Unmodified NCCL does not make use of libfabric, and defaults to using TCP sockets on Slingshot-11 (TM) interconnect hardware. However, a plugin has been published by AWS, [(Kim, Kheria, Inozemtsev, 2018-2022)]