Startup steps for a Julia CUDA MPI application relying on ParallelStencil.jl and ImplicitGlobalGrid.jl.
GPU cluster configuration:
- CUDA 11.0
- CUDA-aware OpenMPI 3.0.6
- gcc 8.3
The following steps should enable a successful multi-GPU run:
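As a rough sketch of how the two packages fit together (array names and grid sizes below are illustrative placeholders, not from the original), a typical startup initializes ParallelStencil with the CUDA backend and lets ImplicitGlobalGrid set up the Cartesian MPI topology:

```julia
# Sketch only: usual ParallelStencil.jl + ImplicitGlobalGrid.jl startup.
# nx, ny and the array T are illustrative placeholders.
using ParallelStencil, ImplicitGlobalGrid
using CUDA
@init_parallel_stencil(CUDA, Float64, 2)         # CUDA backend, 2-D kernels

nx, ny = 256, 256                                # local (per-GPU) grid size
me, dims, nprocs = init_global_grid(nx, ny, 1)   # Cartesian MPI grid; selects the GPU per rank
T = @zeros(nx, ny)                               # device array on the selected GPU
update_halo!(T)                                  # CUDA-aware halo exchange
finalize_global_grid()
```

`init_global_grid` also initializes MPI (unless told otherwise) and maps one GPU per rank, which is why the manual device-selection snippets below are useful to understand what happens under the hood.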
using BenchmarkTools, AMDGPU

function diff2D_step_inbounds!(T2, T, Ci, lam, dt, _dx, _dy)
    ix = (workgroupIdx().x - 1) * workgroupDim().x + workitemIdx().x
    iy = (workgroupIdx().y - 1) * workgroupDim().y + workitemIdx().y
    if (ix>1 && ix<size(T2,1) && iy>1 && iy<size(T2,2))
        @inbounds T2[ix,iy] = T[ix,iy] + dt*(Ci[ix,iy]*(
            - ((-lam*(T[ix+1,iy] - T[ix,iy])*_dx) - (-lam*(T[ix,iy] - T[ix-1,iy])*_dx))*_dx
            - ((-lam*(T[ix,iy+1] - T[ix,iy])*_dy) - (-lam*(T[ix,iy] - T[ix,iy-1])*_dy))*_dy ))
    end
    return
end
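A host-side launch of the kernel above could look as follows. This is a sketch with placeholder sizes and values; note that the interpretation of `gridsize` in AMDGPU.jl's `@roc` macro has changed between versions (number of workgroups in recent releases, total workitems in older ones), so check the version you use:

```julia
# Illustrative launch of diff2D_step_inbounds!; sizes and values are placeholders.
nx, ny = 256, 256
T  = AMDGPU.rand(Float64, nx, ny)
T2 = copy(T)
Ci = AMDGPU.ones(Float64, nx, ny)
lam = 1.0; dt = 1e-4; _dx = 1.0/0.1; _dy = 1.0/0.1
threads = (32, 8)                                     # workgroup size
grid    = (cld(nx, threads[1]), cld(ny, threads[2]))  # workgroups (recent AMDGPU.jl convention)
@roc groupsize=threads gridsize=grid diff2D_step_inbounds!(T2, T, Ci, lam, dt, _dx, _dy)
AMDGPU.synchronize()
```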
using MPI
using CUDA
MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
size = MPI.Comm_size(comm)
dst = mod(rank+1, size)
src = mod(rank-1, size)
println("rank=$rank, size=$size, dst=$dst, src=$src")
N = 4
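The snippet above stops after defining `N`; a possible continuation (a sketch, not the original code) passes GPU buffers directly to MPI, which is exactly what the CUDA-aware OpenMPI build enables:

```julia
# Sketch continuing the ring example above; buffer contents are illustrative.
send_msg = CUDA.fill(Float64(rank), N)    # device buffer handed straight to MPI
recv_msg = CUDA.zeros(Float64, N)         # CUDA-aware MPI writes into device memory
req_recv = MPI.Irecv!(recv_msg, comm; source=src)
req_send = MPI.Isend(send_msg, comm; dest=dst)
MPI.Waitall([req_recv, req_send])
println("rank=$rank received $(Array(recv_msg))")
MPI.Finalize()
```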
using MPI
using CUDA
MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
# select device
comm_l = MPI.Comm_split_type(comm, MPI.COMM_TYPE_SHARED, rank)
rank_l = MPI.Comm_rank(comm_l)
gpu_id = CUDA.device!(rank_l)
using MPI
using AMDGPU
MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
# select device
comm_l = MPI.Comm_split_type(comm, MPI.COMM_TYPE_SHARED, rank)
rank_l = MPI.Comm_rank(comm_l)
device = AMDGPU.device_id!(rank_l+1)
gpu_id = AMDGPU.device_id(AMDGPU.device())
using MPI
using AMDGPU
MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
size = MPI.Comm_size(comm)
dst = mod(rank+1, size)
src = mod(rank-1, size)
println("rank=$rank, size=$size, dst=$dst, src=$src")
N = 4
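As with the CUDA variant, this snippet stops after defining `N`; a possible continuation (a sketch, not the original code) exchanges device buffers directly, which requires a ROCm-aware MPI build:

```julia
# Sketch continuing the AMD GPU ring example; buffer contents are illustrative.
send_msg = ROCArray(fill(Float64(rank), N))   # device buffer handed straight to MPI
recv_msg = AMDGPU.zeros(Float64, N)
req_recv = MPI.Irecv!(recv_msg, comm; source=src)
req_send = MPI.Isend(send_msg, comm; dest=dst)
MPI.Waitall([req_recv, req_send])
println("rank=$rank received $(Array(recv_msg))")
MPI.Finalize()
```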