- Software version: MVAPICH 2.3.3-GDR (with latest Allreduce Fix)
- Submitter: Andreas Herten (Jülich Supercomputing Center (JSC), Forschungszentrum Jülich)
- System: JUWELS Supercomputer at JSC
- InfiniBand OFED version: 4.6
Update, 27 Jan 2020: See section "Env Variable: MV2_USE_RDMA_CM=0" at the end
A simple MPI program crashes when using multiple nodes.
Files in this repository are provided to reproduce the behavior.
Following up on the previously fixed bug regarding MPI_Allreduce(), we cannot launch a program on two nodes. The problem already occurs for a basic MPI skeleton consisting only of MPI_Init() and MPI_Finalize().
The attached program mpi-init.cu is used to reproduce the behavior. It basically consists of:
std::cout << "Begin." << std::endl;
MPI_Init(&argc,&argv);
std::cout << "End." << std::endl;
MPI_Finalize();
Please compile it with make.
When running mpi-init.exe on one node, everything works as intended:
➜ srun --nodes 1 --ntasks-per-node 1 ./mpi-init.exe
Begin.
End.
… but when launching the executable on two nodes, a crash occurs:
➜ srun --nodes 2 --ntasks-per-node 1 ./mpi-init.exe
Begin.
Begin.
INTERNAL ERROR: invalid error code ffffffff (Ring Index out of range) in rdma_cm_get_local_ip:1556
[jwc09n006.adm09.juwels.fzj.de:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11)
srun: error: jwc09n006: task 1: Segmentation fault
srun: error: jwc09n003: task 0: Terminated
srun: Force Terminated job step 2090291.5
As before, this problem only occurs on our JUWELS system. On JURECA, with OFED 4.7, the program works as expected.
We intend to upgrade the OFED stack on JUWELS to match that of JURECA within a week. If you think the problem is related to the OFED stack, we can postpone further debugging until after the upgrade.
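Setting MV2_USE_RDMA_CM=0 fixes the issue. The first run below reproduces the crash without the variable; the second run, with the variable set, completes on both nodes: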
➜ srun --nodes 2 ./mpi-init.exe
Begin.
Begin.
INTERNAL ERROR: invalid error code ffffffff (Ring Index out of range) in rdma_cm_get_local_ip:1556
[jwc09n012.adm09.juwels.fzj.de:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11)
srun: error: jwc09n012: task 1: Segmentation fault
srun: error: jwc09n009: task 0: Terminated
srun: Force Terminated job step 2095679.0
➜ MV2_USE_RDMA_CM=0 srun --nodes 2 ./mpi-init.exe
Begin.
Begin.
End.
End.