@lebedov
Last active December 2, 2022 04:23
Demo of how to pass GPU memory managed by pycuda to mpi4py.
#!/usr/bin/env python
"""
Demo of how to pass GPU memory managed by pycuda to mpi4py.
Notes
-----
This code can be used to perform peer-to-peer communication of data via
NVIDIA's GPUDirect technology if mpi4py has been built against a
CUDA-enabled MPI implementation.
"""
import atexit
import sys
# PyCUDA 2014.1 and later have built-in support for wrapping GPU memory with a
# buffer interface:
import pycuda
if pycuda.VERSION >= (2014, 1):
    bufint = lambda arr: arr.gpudata.as_buffer(arr.nbytes)
else:
    import cffi
    ffi = cffi.FFI()
    bufint = lambda arr: ffi.buffer(ffi.cast('void *', arr.ptr), arr.nbytes)
import numpy as np
from mpi4py import MPI
import pycuda.driver as drv
import pycuda.gpuarray as gpuarray
drv.init()
def dtype_to_mpi(t):
    # Map a numpy dtype to the corresponding MPI datatype, handling the
    # different attribute names used by older and newer mpi4py releases.
    if hasattr(MPI, '_typedict'):
        mpi_type = MPI._typedict[np.dtype(t).char]
    elif hasattr(MPI, '__TypeDict__'):
        mpi_type = MPI.__TypeDict__[np.dtype(t).char]
    else:
        raise ValueError('cannot convert type')
    return mpi_type
comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()
N_gpu = drv.Device(0).count()
if N_gpu < 2:
    sys.stdout.write('at least 2 GPUs required')
else:
    dev = drv.Device(rank)
    ctx = dev.make_context()
    atexit.register(ctx.pop)
    atexit.register(MPI.Finalize)

    if rank == 0:
        x_gpu = gpuarray.arange(100, 200, 10, dtype=np.double)
        print('before (%i): ' % rank + str(x_gpu))
        comm.Send([bufint(x_gpu), dtype_to_mpi(x_gpu.dtype)], dest=1)
        print('sent')
        print('after (%i): ' % rank + str(x_gpu))
    elif rank == 1:
        x_gpu = gpuarray.zeros(10, dtype=np.double)
        print('before (%i): ' % rank + str(x_gpu))
        comm.Recv([bufint(x_gpu), dtype_to_mpi(x_gpu.dtype)], source=0)
        print('received')
        print('after (%i): ' % rank + str(x_gpu))
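To try the demo, launch one MPI rank per GPU on a node with at least two devices, e.g. mpiexec -n 2 python demo.py, where demo.py is whatever name you saved this gist under. Because Send and Recv are handed raw device pointers via bufint, the transfer is only expected to succeed when mpi4py has been built against a CUDA-aware MPI implementation, as noted in the docstring.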
lebedov commented Dec 24, 2014

MPI.Finalize() must be explicitly called on exit before PyCUDA cleanup to prevent errors. See this thread for more information.
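For context on why the ordering works: atexit runs its handlers in last-in, first-out order, so registering ctx.pop first and MPI.Finalize second (as the gist does) causes MPI.Finalize to run before the CUDA context is popped at interpreter exit. A minimal standalone sketch of that LIFO behaviour, requiring no GPU or MPI:

import atexit

def first():
    print('registered first, runs last')

def second():
    print('registered second, runs first')

# atexit executes handlers in reverse order of registration (LIFO),
# so "second" prints before "first" when the interpreter exits.
atexit.register(first)
atexit.register(second)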

olddaos commented Sep 15, 2015

Thanks! But why does ffi.cast('void *', arr.ptr) work for GPU memory? It also seems that as_buffer shouldn't work, because plain MPI requires objects that support the buffer protocol in host memory, while a DeviceAllocation lives in device memory. Reading the as_buffer documentation and its source (https://github.com/inducer/pycuda/blob/fde69b0502d944a2d41e1f1b2d0b78352815d487/src/cpp/cuda.hpp#L1547), I don't see anywhere that a device-to-host copy would be initiated by creating a buffer object from a DeviceAllocation. Is this example meant to demonstrate copying via GPU-to-host transfers?

Thank you!
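As far as one can tell from that source, as_buffer only wraps the raw device pointer in a buffer object; no device-to-host copy is made, which is why the gist depends on a CUDA-aware MPI (GPUDirect) that can read device memory directly. With a plain MPI build, one would stage through host memory explicitly instead. A rough sketch of that host-staged variant, reusing comm, rank, np, and gpuarray from the listing above (illustrative only, not part of the gist):

# Host-staged alternative for a non-CUDA-aware MPI build.
if rank == 0:
    x_gpu = gpuarray.arange(100, 200, 10, dtype=np.double)
    x_host = x_gpu.get()                     # explicit device-to-host copy
    comm.Send([x_host, MPI.DOUBLE], dest=1)  # MPI only ever touches host memory
elif rank == 1:
    x_host = np.zeros(10, dtype=np.double)
    comm.Recv([x_host, MPI.DOUBLE], source=0)
    x_gpu = gpuarray.to_gpu(x_host)          # explicit host-to-device copy

Those staging copies are exactly what the GPUDirect path in the gist avoids.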
