- Software version: MVAPICH 2.3.2-GDR
- Submitter: Andreas Herten (Jülich Supercomputing Center (JSC), Forschungszentrum Jülich)
- System: JUWELS Supercomputer at JSC
- InfiniBand OFED version: 4.6
`MPI_Allreduce` produces wrong results and crashes for small buffers of double-precision values on the GPU.
Files in this repository are provided to reproduce the behavior.
The attached code is based on a reproducer a user of our HPC resources sent in. The line

```c
MPI_Allreduce(MPI_IN_PLACE, dataPtr, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
```

produces wrong results (one rank) or even crashes (more than one rank). In this case, `dataPtr` is a pointer to an array of 10 `double`s. The code works on CPU buffers using MVAPICH2-GDR, and it works on GPU buffers when using other CUDA-aware MPI implementations (OpenMPI, ParaStationMPI).
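For context, the core of the reproducer presumably looks roughly like the following self-contained sketch; apart from the `MPI_Allreduce` line quoted above, this is a paraphrase (buffer setup, GPU selection, printing), not the verbatim attached code:

```c
#include <stdio.h>
#include <mpi.h>
#include <cuda_runtime.h>

#define N 10

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* One GPU per rank, matching the "Running with CUDA" lines in the outputs below. */
    int nGpus;
    cudaGetDeviceCount(&nGpus);
    cudaSetDevice(rank % nGpus);
    printf("Running with CUDA. Number of GPUs: %d; using GPU %d (rank = %d)\n",
           nGpus, rank % nGpus, rank);

    /* Allocate the reduction buffer on the GPU and fill it with 1.0. */
    double ones[N], result[N];
    for (int i = 0; i < N; i++) ones[i] = 1.0;
    double *dataPtr;
    cudaMalloc((void **)&dataPtr, N * sizeof(double));
    cudaMemcpy(dataPtr, ones, N * sizeof(double), cudaMemcpyHostToDevice);

    /* The failing call: in-place sum reduction of a small GPU buffer. */
    MPI_Allreduce(MPI_IN_PLACE, dataPtr, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    /* After the reduction, every entry should equal the number of ranks. */
    cudaMemcpy(result, dataPtr, N * sizeof(double), cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; i++)
        printf("i: %d; data: %f\n", i, result[i]);

    cudaFree(dataPtr);
    MPI_Finalize();
    return 0;
}
```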
The code works on JURECA, another HPC system at JSC, with the same MVAPICH version but a different OFED driver (4.7).
Judging from the experiments performed, there seem to be one or several bugs in how the buffers are handled during reduction.
Please clone this repository. A call to `make` will compile the simplest version of the application, which is the original one. An MVAPICH2-GDR installation and CUDA are needed.
The experiments (described in the following) make use of pre-processor definitions. To pin the compilation strings, another script is used. Please execute

```
./makeExpBinaries.mk
```

to produce the experiment binaries. (Attention: this script is not multi-process safe, i.e. don't run it with `-j`.)
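Conceptually, each experiment binary is a variant of the same source selected at compile time; the following is only an illustration of that mechanism (the macro names are hypothetical, not necessarily those used in this repository):

```c
/* Illustrative only: each experiment binary is compiled with a different -D flag. */
#if defined(EXPERIMENT_1)
    /* Experiment 1: pass 2*N instead of N as the count. */
    MPI_Allreduce(MPI_IN_PLACE, dataPtr, 2 * N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
#elif defined(EXPERIMENT_2)
    /* Experiment 2: reduce out of place into a temporary GPU buffer. */
    MPI_Allreduce(dataPtr, tmpPtr, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
#else
    /* Original version. */
    MPI_Allreduce(MPI_IN_PLACE, dataPtr, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
#endif
```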
When launching with 1 rank and calling `MPI_Allreduce()` as described above, the following output is obtained:
```
➜ srun -n 1 ./mpi-all-reduce.exe
…
i: 0; data: 1.000000
i: 1; data: 1.000000
i: 2; data: 1.000000
i: 3; data: 1.000000
i: 4; data: 1.000000
i: 5; data: 0.000000
i: 6; data: 0.000000
i: 7; data: 0.000000
i: 8; data: 0.000000
i: 9; data: 0.000000
…
```
For some reason, only the first half of the entries of the `dataPtr` array contains the expected data.
If launched with 2 (or more) ranks, the application crashes:
```
➜ srun -n 2 ./mpi-all-reduce.exe
Running with CUDA. Number of GPUs: 4; using GPU 0 (rank = 0)
Running with CUDA. Number of GPUs: 4; using GPU 1 (rank = 1)
…
Test reduction with data on device
[jwc09n000.adm09.juwels.fzj.de:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
[jwc09n000.adm09.juwels.fzj.de:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11)
srun: error: jwc09n000: tasks 0-1: Segmentation fault
```
While pinpointing the issue, we found that when passing 2N as the `count` argument (Experiment 1), at least all entries in the reduced buffer are correct.
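In terms of the call shown above, this amounts to the following change (a sketch; presumably the allocation must be large enough that requesting 2N elements does not over-read the buffer):

```c
/* Experiment 1: pass 2*N as the count for the same N-element reduction. */
MPI_Allreduce(MPI_IN_PLACE, dataPtr, 2 * N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
```

With 1 rank, all entries are now correct: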
```
➜ srun -n 1 ./mpi-all-reduce--exp-1.exe
…
~ EXPERIMENT 1
i: 0; data: 1.000000
i: 1; data: 1.000000
i: 2; data: 1.000000
i: 3; data: 1.000000
i: 4; data: 1.000000
i: 5; data: 1.000000
i: 6; data: 1.000000
i: 7; data: 1.000000
i: 8; data: 1.000000
i: 9; data: 1.000000
```
Unfortunately, for 2 ranks the program crashes in the same manner as the original version did.
Using a second, temporary buffer on the GPU (Experiment 2), we test whether the error is related to in-place reduction.
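A sketch of this out-of-place variant, with `tmpPtr` as an assumed name for the temporary device buffer:

```c
/* Experiment 2: reduce into a second GPU buffer instead of in place. */
double *tmpPtr;
cudaMalloc((void **)&tmpPtr, N * sizeof(double));
MPI_Allreduce(dataPtr, tmpPtr, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
```

For 1 rank, the same behavior as in the original case can be seen: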
```
➜ srun -n 1 ./mpi-all-reduce--exp-2.exe
…
~ EXPERIMENT 2
i: 0; data: 1.000000
i: 1; data: 1.000000
i: 2; data: 1.000000
i: 3; data: 1.000000
i: 4; data: 1.000000
i: 5; data: 0.000000
i: 6; data: 0.000000
i: 7; data: 0.000000
i: 8; data: -nan
i: 9; data: 0.000000
```
But unlike the original case, this version does not crash for multiple ranks. Unfortunately, the result is still wrong:
```
➜ srun -n 2 ./mpi-all-reduce--exp-2.exe
…
~ EXPERIMENT 2
~ EXPERIMENT 2
i: 0; data: 2.000000
i: 1; data: 2.000000
i: 2; data: 2.000000
i: 3; data: 2.000000
i: 4; data: 2.000000
i: 5; data: 0.000000
i: 6; data: 0.000000
i: 7; data: 0.000000
…
```
Using `MPI_Reduce()` instead of `MPI_Allreduce()` (Experiment 3) works as intended, both for single and multiple ranks.
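One plausible form of this variant is sketched below; note that with `MPI_Reduce`, `MPI_IN_PLACE` may only be passed at the root, while all other ranks supply the send buffer (the actual experiment code may differ, e.g. by broadcasting the result afterwards so that every rank can print it):

```c
/* Experiment 3: MPI_Reduce to rank 0 instead of MPI_Allreduce. */
if (rank == 0)
    MPI_Reduce(MPI_IN_PLACE, dataPtr, N, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
else
    MPI_Reduce(dataPtr, NULL, N, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
```

Both runs produce the expected values: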
```
➜ srun -n 1 ./mpi-all-reduce--exp-3.exe
…
~ EXPERIMENT 3
i: 0; data: 1.000000
i: 1; data: 1.000000
i: 2; data: 1.000000
i: 3; data: 1.000000
i: 4; data: 1.000000
i: 5; data: 1.000000
i: 6; data: 1.000000
…
```

```
➜ srun -n 2 ./mpi-all-reduce--exp-3.exe
…
~ EXPERIMENT 3
~ EXPERIMENT 3
i: 0; data: 2.000000
i: 0; data: 2.000000
i: 1; data: 2.000000
i: 2; data: 2.000000
i: 3; data: 2.000000
…
```
We saw that the behavior is different for larger buffers; as an example, we chose N = 1024. In that case, the run with 1 rank works as expected and produces correct results:
```
➜ srun -n 1 ./mpi-all-reduce--exp-large-N.exe
…
i: 768; data: 1.000000
i: 769; data: 1.000000
i: 770; data: 1.000000
i: 771; data: 1.000000
i: 772; data: 1.000000
…
```
But unfortunately, the case with multiple ranks does not produce correct results.
```
➜ srun -n 2 ./mpi-all-reduce--exp-large-N.exe
…
Test reduction with data on device
i: 992; data: 256.000000
i: 993; data: 256.000000
…
```
On 13 January, MVAPICH2-GDR 2.3.3 was installed on JUWELS (and JURECA). Unfortunately, the minimal reproducer still does not function properly, although it behaves differently from the previous experiments.
This MVAPICH version gave a warning message about pointing `$LD_PRELOAD` to `libmpi.so`. With that, and recent gdrcopy improvements, the 2.3.3 module exports the following environment variables:
setenv("LD_PRELOAD","/gpfs/software/juwels/stages/Devel-2019a/software/MVAPICH2/2.3.3-GCC-8.3.0-GDR/lib64/libmpi.so")
setenv("MV2_GPUDIRECT_LIMIT","4194304")
setenv("MV2_PATH","/gpfs/software/juwels/stages/Devel-2019a/software/MVAPICH2/2.3.3-GCC-8.3.0-GDR")
setenv("MV2_ENABLE_AFFINITY","0")
setenv("MV2_USE_CUDA","1")
setenv("MV2_USE_GPUDIRECT_GDRCOPY","1")
Table: "Does it work?". Changed values in bold.
Experiment | 1 Rank | 2 Ranks |
---|---|---|
0 | No | Yes |
1 | Yes | Yes |
2 | No | Yes |
3 | Yes | Yes |
Large N | No | Yes |
The original version still does not work for 1 rank:
```
➜ srun -n 1 ./mpi-all-reduce.exe
…
i: 0; data: 1.000000
i: 1; data: 1.000000
i: 2; data: 1.000000
i: 3; data: 1.000000
i: 4; data: 1.000000
i: 5; data: 0.000000
i: 6; data: 0.000000
i: 7; data: 0.000000
i: 8; data: 0.000000
i: 9; data: 0.000000
```
Contrary to before, a run with 2 ranks not only does not segfault, but even produces correct results:
```
➜ srun -n 2 ./mpi-all-reduce.exe
…
i: 0; data: 2.000000
…
i: 5; data: 2.000000
…
i: 0; data: 2.000000
…
```
For Experiment 1, both cases (1 rank, 2 ranks) now work; before, only the 1-rank version worked.
```
➜ srun -n 1 ./mpi-all-reduce--exp-1.exe
…
i: 0; data: 1.000000
i: 1; data: 1.000000
i: 2; data: 1.000000
i: 3; data: 1.000000
i: 4; data: 1.000000
i: 5; data: 1.000000
i: 6; data: 1.000000
i: 7; data: 1.000000
i: 8; data: 1.000000
i: 9; data: 1.000000
```

```
➜ srun -n 2 ./mpi-all-reduce--exp-1.exe
…
i: 0; data: 2.000000
i: 1; data: 2.000000
```
For Experiment 2, unchanged from before, wrong results are produced for a run with 1 rank:
```
➜ srun -n 1 ./mpi-all-reduce--exp-2.exe
…
i: 0; data: 1.000000
i: 1; data: 1.000000
i: 2; data: 1.000000
i: 3; data: 1.000000
i: 4; data: 1.000000
i: 5; data: 0.000000
i: 6; data: 0.000000
…
```
For 2 ranks, the result is now correct (it was wrong before):
```
➜ srun -n 2 ./mpi-all-reduce--exp-2.exe
…
i: 0; data: 2.000000
…
i: 5; data: 2.000000
```
As before, Experiment 3 works for both 1-rank and 2-rank runs.
```
➜ srun -n 1 ./mpi-all-reduce--exp-3.exe
…
i: 0; data: 1.000000
…
i: 9; data: 1.000000
```

```
➜ srun -n 2 ./mpi-all-reduce--exp-3.exe
…
i: 0; data: 2.000000
…
i: 7; data: 2.000000
…
```
For large N, the behavior has changed completely with respect to the old version. Now, for 1 rank, only half the buffer is filled with the reduced value and the rest is 0 (before, it worked as intended):
```
➜ srun -n 1 ./mpi-all-reduce--exp-large-N.exe
i: 541; data: 0.000000
i: 542; data: 0.000000
i: 543; data: 0.000000
i: 384; data: 1.000000
…
```
For 2 ranks, the result is now correct (before, all memory locations were filled, but with a wrong value):
```
➜ srun -n 2 ./mpi-all-reduce--exp-large-N.exe
…
i: 349; data: 2.000000
i: 350; data: 2.000000
i: 351; data: 2.000000
…
```