AndiH/Makefile Secret

## README.md

      
    Raw
  

              README.md
            
          
    MVAPICH2-GDR CUDA-Aware MPI_Allreduce Bug


Software version: MVAPICH 2.3.2-GDR
Submitter: Andreas Herten (Jülich Supercomputing Center (JSC), Forschungszentrum Jülich)
System: JUWELS Supercomputer at JSC
InfiniBand OFED version: 4.6

Short Description

MPI_Allreduce produces wrong results and crashes for small buffers of double precision on the GPU.
Files in this repository are provided to reproduce the behavior.
Description

The attached code is based on a reproducer a user of our HPC resources sent in. The line
MPI_Allreduce(MPI_IN_PLACE, dataPtr, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

produces wrong results (one rank) or even crashes (more than one rank). In this case, dataPtr is a pointer to an array of 10 doubles. The code works on CPU buffers using MVAPICH2-GDR and works when using GPU buffers using other CUDA-aware MPIs (OpenMPI, ParaStationMPI).

The code works on another HPC system of JSC, JURECA. The same MVAPICH version is used, but a different OFED driver (4.7).
Judging from the experiments performed, there seems to be one (several?) bug(s) regarding the handling of the buffers during reduction.
Experiments

Reproduction

Please clone this repository. A call to make will compile the simplest -- which is the original -- version of the application. Needed is a MVAPICH2-GDR installation and CUDA.
The experiments (described in the following) make use of pre-processor definitions. To pin the compilation strings, another script is used. Please execute
./makeExpBinaries.mk
to produce the experiment binaries. (Attention: This script is not multi-process save, i.e. don't run with -j.)
Original Version (Experiment 0)

When launching with 1 rank and calling MPI_Allreduce() as described above, the following output is retrieved:
➜ srun -n 1 ./mpi-all-reduce.exe
…
i: 0; data: 1.000000
i: 1; data: 1.000000
i: 2; data: 1.000000
i: 3; data: 1.000000
i: 4; data: 1.000000
i: 5; data: 0.000000
i: 6; data: 0.000000
i: 7; data: 0.000000
i: 8; data: 0.000000
i: 9; data: 0.000000
…

For some reason, only half of the entries of the dataPtr array contain data.
If launched with 2 (or more) ranks, the application crashes:
➜ srun -n 2 ./mpi-all-reduce.exe
Running with CUDA. Number of GPUs: 4; using GPU 0 (rank = 0)
Running with CUDA. Number of GPUs: 4; using GPU 1 (rank = 1)
…
Test reduction with data on device
[jwc09n000.adm09.juwels.fzj.de:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
[jwc09n000.adm09.juwels.fzj.de:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11)
srun: error: jwc09n000: tasks 0-1: Segmentation fault

Experiment 1: 2N

During pin-pointing the issue, we found that using 2N as the count size, at least all entries in the reduced buffer are correct.
➜ srun -n 1 ./mpi-all-reduce--exp-1.exe
…
~ EXPERIMENT 1
i: 0; data: 1.000000
i: 1; data: 1.000000
i: 2; data: 1.000000
i: 3; data: 1.000000
i: 4; data: 1.000000
i: 5; data: 1.000000
i: 6; data: 1.000000
i: 7; data: 1.000000
i: 8; data: 1.000000
i: 9; data: 1.000000

Unfortunately, the program crashes for 2 ranks in the same manner it did for 1 rank.
Experiment 2: No MPI_IN_PLACE

Using a second, temporary buffer on the GPU, we test if the error is related to in-place reduction. The same behavior as in the original case can be seen for 1 rank:
➜ srun -n 1 ./mpi-all-reduce--exp-2.exe
…
~ EXPERIMENT 2
i: 0; data: 1.000000
i: 1; data: 1.000000
i: 2; data: 1.000000
i: 3; data: 1.000000
i: 4; data: 1.000000
i: 5; data: 0.000000
i: 6; data: 0.000000
i: 7; data: 0.000000
i: 8; data: -nan
i: 9; data: 0.000000

But differently to the original case, this version does not crash for multiple ranks. Unfortunately, the result is still wrong:
➜ srun -n 2 ./mpi-all-reduce--exp-2.exe
…
~ EXPERIMENT 2
~ EXPERIMENT 2
i: 0; data: 2.000000
i: 1; data: 2.000000
i: 2; data: 2.000000
i: 3; data: 2.000000
i: 4; data: 2.000000
i: 5; data: 0.000000
i: 6; data: 0.000000
i: 7; data: 0.000000
…

Experiment 3: MPI_Reduce instead of MPI_Allreduce

Using MPI_Reduce() instead of MPI_Allreduce() works as intended; both for single and multiple ranks.
➜ srun -n 1 ./mpi-all-reduce--exp-3.exe
…
~ EXPERIMENT 3
i: 0; data: 1.000000
i: 1; data: 1.000000
i: 2; data: 1.000000
i: 3; data: 1.000000
i: 4; data: 1.000000
i: 5; data: 1.000000
i: 6; data: 1.000000
…

➜ srun -n 2 ./mpi-all-reduce--exp-3.exe
…
~ EXPERIMENT 3
~ EXPERIMENT 3
i: 0; data: 2.000000
i: 0; data: 2.000000
i: 1; data: 2.000000
i: 2; data: 2.000000
i: 3; data: 2.000000
…

Experiment Large N: Large Buffer (1024 entries)

We saw that the behavior is different for larger buffers, as an example we chose N = 1024.
In that case, the 1-rank-run works as expected and produces correct results:
➜ srun -n 1 ./mpi-all-reduce--exp-large-N.exe
… 
i: 768; data: 1.000000
i: 769; data: 1.000000
i: 770; data: 1.000000
i: 771; data: 1.000000
i: 772; data: 1.000000
…

But unfortunately, the case for multiple ranks does not produce the correct results.
➜ srun -n 2 ./mpi-all-reduce--exp-large-N.exe
… 
Test reduction with data on device
i: 992; data: 256.000000
i: 993; data: 256.000000
…

Experiments with MVAPICH 2.3.3-GDR

On 13 January, 2.3.3 was installed on JUWELS(/JURECA). Unfortunately, the minimal reproducer is still not functioning properly. It is reporting differently from previous experiments, though.
This MVAPICH version gave a warning message relating to pointing $LD_PRELOAD to libmpi.so. With that, and recent gdrcopy improvement, the 2.3.3 module exports the following environment variables:
setenv("LD_PRELOAD","/gpfs/software/juwels/stages/Devel-2019a/software/MVAPICH2/2.3.3-GCC-8.3.0-GDR/lib64/libmpi.so")
setenv("MV2_GPUDIRECT_LIMIT","4194304")
setenv("MV2_PATH","/gpfs/software/juwels/stages/Devel-2019a/software/MVAPICH2/2.3.3-GCC-8.3.0-GDR")
setenv("MV2_ENABLE_AFFINITY","0")
setenv("MV2_USE_CUDA","1")
setenv("MV2_USE_GPUDIRECT_GDRCOPY","1")
Experiments

Table: "Does it work?". Changed values in bold.


Experiment
1 Rank
2 Ranks


0
No
Yes


1
Yes
Yes


2
No
Yes


3
Yes
Yes


Large N
No
Yes


Original Version (Experiment 0)

Does still not work for one rank:
➜ srun -n 1 ./mpi-all-reduce.exe
…
i: 0; data: 1.000000
i: 1; data: 1.000000
i: 2; data: 1.000000
i: 3; data: 1.000000
i: 4; data: 1.000000
i: 5; data: 0.000000
i: 6; data: 0.000000
i: 7; data: 0.000000
i: 8; data: 0.000000
i: 9; data: 0.000000

Contrary to before, a version launched with 2 ranks not only does not seg fault, but even produces correct results!
➜ srun -n 2 ./mpi-all-reduce.exe
…
i: 0; data: 2.000000
…
i: 5; data: 2.000000
…
i: 0; data: 2.000000
…

Experiment 1: 2N

Both cases (1 rank, 2 ranks) work – before, only the 1 rank version worked.
➜ srun -n 1 ./mpi-all-reduce--exp-1.exe
…
i: 0; data: 1.000000
i: 1; data: 1.000000
i: 2; data: 1.000000
i: 3; data: 1.000000
i: 4; data: 1.000000
i: 5; data: 1.000000
i: 6; data: 1.000000
i: 7; data: 1.000000
i: 8; data: 1.000000
i: 9; data: 1.000000
➜ srun -n 2 ./mpi-all-reduce--exp-1.exe
…
i: 0; data: 2.000000
i: 1; data: 2.000000

Experiment 2: No MPI_IN_PLACE

Unchanged, the wrong results are produces for a run with 1 rank:
➜ srun -n 1 ./mpi-all-reduce--exp-2.exe
…
i: 0; data: 1.000000
i: 1; data: 1.000000
i: 2; data: 1.000000
i: 3; data: 1.000000
i: 4; data: 1.000000
i: 5; data: 0.000000
i: 6; data: 0.000000
…

For 2 ranks, the achieved result is correct now (it was wrong before):
➜ srun -n 2 ./mpi-all-reduce--exp-2.exe
…
i: 0; data: 2.000000
…
i: 5; data: 2.000000

Experiment 3: MPI_Reduce instead of MPI_Allreduce

Unchanged as before, both 1 rank and 2 rank runs work.
➜ srun -n 1 ./mpi-all-reduce--exp-3.exe
…
i: 0; data: 1.000000
…
i: 9; data: 1.000000
➜ srun -n 2 ./mpi-all-reduce--exp-3.exe
…
i: 0; data: 2.000000
…
i: 7; data: 2.000000
…

Experiment Large N:  Large Buffer (1024 entries)

The behavior has totally changed w/r/t to the old version. Now, for 1 rank: Only half the buffer is filled with the reduced value, the rest is 0 (before: it worked as intended):
➜ srun -n 1 ./mpi-all-reduce--exp-large-N.exe
i: 541; data: 0.000000
i: 542; data: 0.000000
i: 543; data: 0.000000
i: 384; data: 1.000000
…

For 2 ranks, the result is now correct (before: all memory locations were filled, but with a wrong value):
➜ srun -n 1 ./mpi-all-reduce--exp-large-N.exe
…
i: 349; data: 2.000000
i: 350; data: 2.000000
i: 351; data: 2.000000
…


## makeExpBinaries.mk
#!/usr/bin/make -f

.PHONY: gen_all_executables
gen_all_executables: mpi-all-reduce--exp-1.exe mpi-all-reduce--exp-2.exe mpi-all-reduce--exp-large-N.exe mpi-all-reduce--exp-3.exe mpi-all-reduce.exe

mpi-all-reduce.exe: mpi-all-reduce.cu
	make EXT_VERBOSE=-DVERBOSE EXT_N=-DN=10 EXT_EXPERIMENT=-DEXPERIMENT=0 -B

mpi-all-reduce--exp-1.exe: mpi-all-reduce.cu
	make EXT_VERBOSE=-DVERBOSE EXT_N=-DN=10 EXT_EXPERIMENT=-DEXPERIMENT=1 -B
	mv mpi-all-reduce.exe $@
mpi-all-reduce--exp-2.exe: mpi-all-reduce.cu
	make EXT_VERBOSE=-DVERBOSE EXT_N=-DN=10 EXT_EXPERIMENT=-DEXPERIMENT=2 -B
	mv mpi-all-reduce.exe $@
mpi-all-reduce--exp-3.exe: mpi-all-reduce.cu
	make EXT_VERBOSE=-DVERBOSE EXT_N=-DN=10 EXT_EXPERIMENT=-DEXPERIMENT=3 -B
	mv mpi-all-reduce.exe $@
mpi-all-reduce--exp-large-N.exe: mpi-all-reduce.cu
	make EXT_VERBOSE=-DVERBOSE EXT_N=-DN=1024 EXT_EXPERIMENT=-DEXPERIMENT=0 -B
	mv mpi-all-reduce.exe $@


## Makefile
MPICXX = mpic++
NVCC = nvcc

EXT_N ?=-DN=100
EXT_VERBOSE ?=-UVERBOSE
EXT_EXPERIMENT ?=-DEXPERIMENT=0

FLAGS =
MPIFLAGS = -Wall -I$$CUDA_HOME/include/ -L$$CUDA_HOME/lib64/ -lcudart
EXT = $(EXT_N) $(EXT_VERBOSE) $(EXT_EXPERIMENT)

.PHONY: all

all: mpi-all-reduce.exe

%.o: %.cu Makefile
	$(NVCC) $(FLAGS) $(EXT) -c -o $@ $<

%.exe: %.o
	$(MPICXX) $(FLAGS) $(MPIFLAGS) $(EXT) -o $@ $<

.PHONY: clean
clean:
	rm *.exe
	rm *.o

## mpi-all-reduce.cu
// MVAPICH2-GDR MPI_Allreduce CUDA Bug Reproducer
// Original example by M. Schmitt (PKS, MPG)
// Extended for analysis and showcase of behaviors by A. Herten (JSC, FZJ)

#include <iostream>
#include <mpi.h>
#include <vector>


#ifndef ENV_LOCAL_RANK
#define ENV_LOCAL_RANK "MPI_LOCALRANKID"
#endif


#ifndef N
#define N 10
#endif

//Error handling macro
#define CUDA_CHECK(call) \
    if((call) != cudaSuccess) { \
            cudaError_t err = cudaGetLastError(); \
                    std::cerr << "CUDA error calling \""#call"\", code is " << err << std::endl; }


__global__ void print_kernel(double * data, size_t length) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < length;
         i += blockDim.x * gridDim.x)
        printf("i: %d; data: %f\n", i, data[i]);
}

int main(int argc, char** argv) {
    if(getenv(ENV_LOCAL_RANK) != NULL) {
        char * localRankStr = NULL;
        int rank = 0, devCount = 0, dev = 0;

        if ((localRankStr = getenv(ENV_LOCAL_RANK)) != NULL)
        {
            rank = atoi(localRankStr);
        }

        CUDA_CHECK(cudaGetDeviceCount(&devCount));
        dev = rank % devCount;
        CUDA_CHECK(cudaSetDevice(dev));
        std::cout << "Running with CUDA. Number of GPUs: " << devCount << "; using GPU " << dev << " (rank = " << rank << ")" << std::endl;
    }

    MPI_Init(&argc,&argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int commSize;
    MPI_Comm_size(MPI_COMM_WORLD, &commSize);

    std::vector<double> hostData(N, 1.);

    double* dataPtr; CUDA_CHECK(cudaMalloc(&dataPtr, N*sizeof(double)));
    CUDA_CHECK(cudaMemcpy(dataPtr, &hostData[0], N*sizeof(double), cudaMemcpyHostToDevice));

    if(!rank) std::cout << "\nTest reduction with data on host\n";
    MPI_Allreduce(MPI_IN_PLACE, &hostData[0], N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    MPI_Barrier(MPI_COMM_WORLD);

    if(!rank) {
        bool success = true;
        for(auto x : hostData) {
            if(abs(1.-x/commSize) > 1e-14) {
                std::cout << x << " != " << commSize << '\n';
                success = false;
            }
        }
        std::cout << (success ? "Done." : "Something went wrong.") << std::endl << std::endl;
    }

    if(!rank) std::cout << "Test reduction with data on device" << std::endl;
#if !defined(EXPERIMENT) || EXPERIMENT == 0
#if defined(VERBOSE) && (EXPERIMENT > 0)
    std::cout << "~ EXPERIMENT " << EXPERIMENT << std::endl;
#endif
    MPI_Allreduce(MPI_IN_PLACE, dataPtr, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
#elif EXPERIMENT == 1
#ifdef VERBOSE
    std::cout << "~ EXPERIMENT " << EXPERIMENT << std::endl;
#endif
    MPI_Allreduce(MPI_IN_PLACE, dataPtr, 2*N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
#elif EXPERIMENT == 2
#ifdef VERBOSE
    std::cout << "~ EXPERIMENT " << EXPERIMENT << std::endl;
#endif
    double* dataPtr2; CUDA_CHECK(cudaMalloc(&dataPtr2, N*sizeof(double)));
    CUDA_CHECK(cudaMemcpy(dataPtr2, &hostData[0], N*sizeof(double), cudaMemcpyHostToDevice));
    MPI_Allreduce(dataPtr, dataPtr2, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    CUDA_CHECK(cudaMemcpy(dataPtr, dataPtr2, N*sizeof(double), cudaMemcpyDeviceToDevice));
#elif EXPERIMENT == 3
#ifdef VERBOSE
    std::cout << "~ EXPERIMENT " << EXPERIMENT << std::endl;
#endif
    double* dataPtr2; CUDA_CHECK(cudaMalloc(&dataPtr2, N*sizeof(double)));
    CUDA_CHECK(cudaMemcpy(dataPtr2, &hostData[0], N*sizeof(double), cudaMemcpyHostToDevice));
    MPI_Reduce(dataPtr, dataPtr2, N, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    CUDA_CHECK(cudaMemcpy(dataPtr, dataPtr2, N*sizeof(double), cudaMemcpyDeviceToDevice));
#endif
    MPI_Barrier(MPI_COMM_WORLD);

#ifdef VERBOSE
    print_kernel<<<1, N>>>(dataPtr, N);
    cudaDeviceSynchronize();
#endif

    // Reset host array
    hostData.resize(N, 0.);
    CUDA_CHECK(cudaMemcpy(&hostData[0], dataPtr, N*sizeof(double), cudaMemcpyDeviceToHost));

    if(!rank) {
        bool success = true;
        for(auto x : hostData) {
            if(abs(1.-x/commSize) > 1e-14) {
                std::cout << x << " != " << commSize << std::endl;
                success = false;
            }
        }
        std::cout << (success ? "Done." : "Something went wrong.") << std::endl << std::endl;
    }

    CUDA_CHECK(cudaFree(dataPtr));

    MPI_Finalize();

    return 0;
}
	#!/usr/bin/make -f

	.PHONY: gen_all_executables
	gen_all_executables: mpi-all-reduce--exp-1.exe mpi-all-reduce--exp-2.exe mpi-all-reduce--exp-large-N.exe mpi-all-reduce--exp-3.exe mpi-all-reduce.exe

	mpi-all-reduce.exe: mpi-all-reduce.cu
	make EXT_VERBOSE=-DVERBOSE EXT_N=-DN=10 EXT_EXPERIMENT=-DEXPERIMENT=0 -B

	mpi-all-reduce--exp-1.exe: mpi-all-reduce.cu
	make EXT_VERBOSE=-DVERBOSE EXT_N=-DN=10 EXT_EXPERIMENT=-DEXPERIMENT=1 -B
	mv mpi-all-reduce.exe $@
	mpi-all-reduce--exp-2.exe: mpi-all-reduce.cu
	make EXT_VERBOSE=-DVERBOSE EXT_N=-DN=10 EXT_EXPERIMENT=-DEXPERIMENT=2 -B
	mv mpi-all-reduce.exe $@
	mpi-all-reduce--exp-3.exe: mpi-all-reduce.cu
	make EXT_VERBOSE=-DVERBOSE EXT_N=-DN=10 EXT_EXPERIMENT=-DEXPERIMENT=3 -B
	mv mpi-all-reduce.exe $@
	mpi-all-reduce--exp-large-N.exe: mpi-all-reduce.cu
	make EXT_VERBOSE=-DVERBOSE EXT_N=-DN=1024 EXT_EXPERIMENT=-DEXPERIMENT=0 -B
	mv mpi-all-reduce.exe $@
	MPICXX = mpic++
	NVCC = nvcc

	EXT_N ?=-DN=100
	EXT_VERBOSE ?=-UVERBOSE
	EXT_EXPERIMENT ?=-DEXPERIMENT=0

	FLAGS =
	MPIFLAGS = -Wall -I$$CUDA_HOME/include/ -L$$CUDA_HOME/lib64/ -lcudart
	EXT = $(EXT_N) $(EXT_VERBOSE) $(EXT_EXPERIMENT)

	.PHONY: all

	all: mpi-all-reduce.exe

	%.o: %.cu Makefile
	$(NVCC) $(FLAGS) $(EXT) -c -o $@ $<

	%.exe: %.o
	$(MPICXX) $(FLAGS) $(MPIFLAGS) $(EXT) -o $@ $<

	.PHONY: clean
	clean:
	rm *.exe
	rm *.o
	// MVAPICH2-GDR MPI_Allreduce CUDA Bug Reproducer
	// Original example by M. Schmitt (PKS, MPG)
	// Extended for analysis and showcase of behaviors by A. Herten (JSC, FZJ)

	#include <iostream>
	#include <mpi.h>
	#include <vector>


	#ifndef ENV_LOCAL_RANK
	#define ENV_LOCAL_RANK "MPI_LOCALRANKID"
	#endif


	#ifndef N
	#define N 10
	#endif

	//Error handling macro
	#define CUDA_CHECK(call) \
	if((call) != cudaSuccess) { \
	cudaError_t err = cudaGetLastError(); \
	std::cerr << "CUDA error calling \""#call"\", code is " << err << std::endl; }


	__global__ void print_kernel(double * data, size_t length) {
	for (int i = blockIdx.x * blockDim.x + threadIdx.x;
	i < length;
	i += blockDim.x * gridDim.x)
	printf("i: %d; data: %f\n", i, data[i]);
	}

	int main(int argc, char** argv) {
	if(getenv(ENV_LOCAL_RANK) != NULL) {
	char * localRankStr = NULL;
	int rank = 0, devCount = 0, dev = 0;

	if ((localRankStr = getenv(ENV_LOCAL_RANK)) != NULL)
	{
	rank = atoi(localRankStr);
	}

	CUDA_CHECK(cudaGetDeviceCount(&devCount));
	dev = rank % devCount;
	CUDA_CHECK(cudaSetDevice(dev));
	std::cout << "Running with CUDA. Number of GPUs: " << devCount << "; using GPU " << dev << " (rank = " << rank << ")" << std::endl;
	}

	MPI_Init(&argc,&argv);
	int rank;
	MPI_Comm_rank(MPI_COMM_WORLD, &rank);
	int commSize;
	MPI_Comm_size(MPI_COMM_WORLD, &commSize);

	std::vector<double> hostData(N, 1.);

	double* dataPtr; CUDA_CHECK(cudaMalloc(&dataPtr, N*sizeof(double)));
	CUDA_CHECK(cudaMemcpy(dataPtr, &hostData[0], N*sizeof(double), cudaMemcpyHostToDevice));

	if(!rank) std::cout << "\nTest reduction with data on host\n";
	MPI_Allreduce(MPI_IN_PLACE, &hostData[0], N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
	MPI_Barrier(MPI_COMM_WORLD);

	if(!rank) {
	bool success = true;
	for(auto x : hostData) {
	if(abs(1.-x/commSize) > 1e-14) {
	std::cout << x << " != " << commSize << '\n';
	success = false;
	}
	}
	std::cout << (success ? "Done." : "Something went wrong.") << std::endl << std::endl;
	}

	if(!rank) std::cout << "Test reduction with data on device" << std::endl;
	#if !defined(EXPERIMENT) \|\| EXPERIMENT == 0
	#if defined(VERBOSE) && (EXPERIMENT > 0)
	std::cout << "~ EXPERIMENT " << EXPERIMENT << std::endl;
	#endif
	MPI_Allreduce(MPI_IN_PLACE, dataPtr, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
	#elif EXPERIMENT == 1
	#ifdef VERBOSE
	std::cout << "~ EXPERIMENT " << EXPERIMENT << std::endl;
	#endif
	MPI_Allreduce(MPI_IN_PLACE, dataPtr, 2*N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
	#elif EXPERIMENT == 2
	#ifdef VERBOSE
	std::cout << "~ EXPERIMENT " << EXPERIMENT << std::endl;
	#endif
	double* dataPtr2; CUDA_CHECK(cudaMalloc(&dataPtr2, N*sizeof(double)));
	CUDA_CHECK(cudaMemcpy(dataPtr2, &hostData[0], N*sizeof(double), cudaMemcpyHostToDevice));
	MPI_Allreduce(dataPtr, dataPtr2, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
	CUDA_CHECK(cudaMemcpy(dataPtr, dataPtr2, N*sizeof(double), cudaMemcpyDeviceToDevice));
	#elif EXPERIMENT == 3
	#ifdef VERBOSE
	std::cout << "~ EXPERIMENT " << EXPERIMENT << std::endl;
	#endif
	double* dataPtr2; CUDA_CHECK(cudaMalloc(&dataPtr2, N*sizeof(double)));
	CUDA_CHECK(cudaMemcpy(dataPtr2, &hostData[0], N*sizeof(double), cudaMemcpyHostToDevice));
	MPI_Reduce(dataPtr, dataPtr2, N, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
	CUDA_CHECK(cudaMemcpy(dataPtr, dataPtr2, N*sizeof(double), cudaMemcpyDeviceToDevice));
	#endif
	MPI_Barrier(MPI_COMM_WORLD);

	#ifdef VERBOSE
	print_kernel<<<1, N>>>(dataPtr, N);
	cudaDeviceSynchronize();
	#endif

	// Reset host array
	hostData.resize(N, 0.);
	CUDA_CHECK(cudaMemcpy(&hostData[0], dataPtr, N*sizeof(double), cudaMemcpyDeviceToHost));

	if(!rank) {
	bool success = true;
	for(auto x : hostData) {
	if(abs(1.-x/commSize) > 1e-14) {
	std::cout << x << " != " << commSize << std::endl;
	success = false;
	}
	}
	std::cout << (success ? "Done." : "Something went wrong.") << std::endl << std::endl;
	}

	CUDA_CHECK(cudaFree(dataPtr));

	MPI_Finalize();

	return 0;
	}