MVAPICH2-GDR Multi-Node MPI Bug

  • Software version: MVAPICH 2.3.3-GDR (with latest Allreduce Fix)
  • Submitter: Andreas Herten (Jülich Supercomputing Center (JSC), Forschungszentrum Jülich)
  • System: JUWELS Supercomputer at JSC
  • InfiniBand OFED version: 4.6

Update, 27 Jan 2020: See section "Env Variable: MV2_USE_RDMA_CM=0" at the end

Short Description

A simple MPI program crashes when using multiple nodes.

Files in this repository are provided to reproduce the behavior.

Description

Following up on the previously fixed bug regarding MPI_Allreduce(), we cannot launch a program across two nodes. The problem already occurs for a basic MPI skeleton consisting only of MPI_Init() and MPI_Finalize().

The attached program mpi-init.cu reproduces the behavior. It essentially consists of

std::cout << "Begin." << std::endl;
MPI_Init(&argc,&argv);

std::cout << "End." << std::endl;
MPI_Finalize();

Please compile it with make.
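
As an optional extra check, not part of the attached files, the reproducer can be extended to print the MPI library version string before MPI_Init(); MPI_Get_library_version() may be called before initialization and confirms which MVAPICH2-GDR build the executable picks up at run time. A minimal sketch:

#include <iostream>
#include <mpi.h>

int main(int argc, char** argv) {
    // Query the library version before MPI_Init(); this call is explicitly
    // allowed pre-initialization by the MPI standard.
    char version[MPI_MAX_LIBRARY_VERSION_STRING];
    int len = 0;
    MPI_Get_library_version(version, &len);
    std::cout << "MPI library: " << version << std::endl;

    std::cout << "Begin." << std::endl;
    MPI_Init(&argc, &argv);

    std::cout << "End." << std::endl;
    MPI_Finalize();
    return 0;
}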

When running mpi-init.exe on one node, everything works as intended:

➜ srun --nodes 1 --ntasks-per-node 1 ./mpi-init.exe
Begin.
End.

… but when launching the executable on two nodes, a crash occurs:

➜ srun --nodes 2 --ntasks-per-node 1 ./mpi-init.exe
Begin.
Begin.
INTERNAL ERROR: invalid error code ffffffff (Ring Index out of range) in rdma_cm_get_local_ip:1556
[jwc09n006.adm09.juwels.fzj.de:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11)
srun: error: jwc09n006: task 1: Segmentation fault
srun: error: jwc09n003: task 0: Terminated
srun: Force Terminated job step 2090291.5

Notes

As before, this problem only occurs on our JUWELS system. On JURECA, with OFED 4.7, the program works as expected.

We intend to upgrade the OFED stack on JUWELS to match JURECA's next week. If you think the problem relates to the OFED stack, we can postpone further debugging until that upgrade is done.

Env Variable: MV2_USE_RDMA_CM=0

Setting MV2_USE_RDMA_CM=0 does fix the issue. For reference, the first run below is without the variable and crashes as before; the second run sets it and completes:

➜ srun --nodes 2 ./mpi-init.exe
Begin.
Begin.
INTERNAL ERROR: invalid error code ffffffff (Ring Index out of range) in rdma_cm_get_local_ip:1556
[jwc09n012.adm09.juwels.fzj.de:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11)
srun: error: jwc09n012: task 1: Segmentation fault
srun: error: jwc09n009: task 0: Terminated
srun: Force Terminated job step 2095679.0

➜ MV2_USE_RDMA_CM=0 srun --nodes 2 ./mpi-init.exe
Begin.
Begin.
End.
End.
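
When switching the workaround on and off during further tests, it can be handy to have the reproducer report the setting itself. The following sketch, again not part of the attached files, simply echoes the variable with standard std::getenv():

#include <cstdlib>
#include <iostream>
#include <mpi.h>

int main(int argc, char** argv) {
    // Report whether MV2_USE_RDMA_CM is set for this run.
    const char* rdma_cm = std::getenv("MV2_USE_RDMA_CM");
    std::cout << "MV2_USE_RDMA_CM=" << (rdma_cm ? rdma_cm : "(unset)") << std::endl;

    MPI_Init(&argc, &argv);
    MPI_Finalize();
    return 0;
}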

Makefile

MPICXX = mpic++
NVCC = nvcc
FLAGS =
MPIFLAGS = -Wall -I$$CUDA_HOME/include/ -L$$CUDA_HOME/lib64/ -lcudart

.PHONY: all
all: mpi-init.exe

# Compile CUDA sources with nvcc
%.o: %.cu Makefile
	$(NVCC) $(FLAGS) -c -o $@ $<

# Link with the MPI C++ wrapper and the CUDA runtime
%.exe: %.o
	$(MPICXX) $(FLAGS) $(MPIFLAGS) -o $@ $<

.PHONY: clean
clean:
	rm *.exe
	rm *.o

mpi-init.cu

// Minimal reproducer: print, initialize MPI, print, finalize.
#include <iostream>
#include <mpi.h>

int main(int argc, char** argv) {
    std::cout << "Begin." << std::endl;   // printed by every rank before MPI_Init()
    MPI_Init(&argc, &argv);

    std::cout << "End." << std::endl;     // reached only if MPI_Init() succeeded
    MPI_Finalize();
    return 0;
}