@mrocklin
Last active July 5, 2022 16:03
Peacekeep3r commented Jul 23, 2020

This example is great and seems to be everywhere on the internet, but I think there is a bug in how it uses CuPy arrays. For one thing, you should get identical (?) performance when feeding it NumPy arrays, since the calculation is done on the GPU either way. More importantly, I think that using CuPy arrays causes timeit to show only the kernel launch time, before anything has actually been calculated. Can you please check this again? This is a top Google search result for NumPy GPU stencils. Try printing the output, which forces the calculation to actually run. I get around 160 ms!

Sadly, the CPU version using parallel computing is still faster, even for big arrays (60 ms)! The original stencil function is just slow in Numba. Better to write the loop manually:

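import numpy as np
from numba import njit, prange

@njit(parallel=True)  # njit already implies nopython mode
def smooth_cpu(x, out_cpu):
    # 3x3 mean filter over the interior; border cells of out_cpu are left untouched
    for i in prange(1, x.shape[0] - 1):
        for j in range(1, x.shape[1] - 1):
            out_cpu[i, j] = (x[i - 1, j - 1] + x[i - 1, j] + x[i - 1, j + 1] +
                             x[i, j - 1] + x[i, j] + x[i, j + 1] +
                             x[i + 1, j - 1] + x[i + 1, j] + x[i + 1, j + 1]) / 9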

Edit: it seems I was wrong. The difference is mostly due to data transfer time, since the CuPy arrays are already on the GPU. I still think it needs a "cuda.synchronize()" for a fair comparison, which increases the running time quite a lot.
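To illustrate why the synchronize call matters for timing, here is a minimal sketch that runs without a GPU. It does not use Numba or CuPy: `FakeAsyncKernel` and `time_call` are hypothetical names made up for this example, and the "kernel" is just a background thread that sleeps. The point is only that an asynchronous launch returns before the work is done, so a timer that stops right after the launch under-reports the cost.

```python
import threading
import time


def time_call(fn, sync=None, repeat=3):
    """Return the best wall-clock time of fn(); if given, sync() runs before the clock stops."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for the asynchronous work to finish
        best = min(best, time.perf_counter() - start)
    return best


class FakeAsyncKernel:
    """Hypothetical stand-in for a GPU kernel: launch() returns at once, work runs in the background."""

    def __init__(self, work_seconds):
        self.work_seconds = work_seconds
        self._thread = None

    def launch(self):
        # Start the "kernel" and return immediately, like an asynchronous GPU launch.
        self._thread = threading.Thread(target=time.sleep, args=(self.work_seconds,))
        self._thread.start()

    def synchronize(self):
        # Block until the most recently launched work has finished.
        if self._thread is not None:
            self._thread.join()


kernel = FakeAsyncKernel(work_seconds=0.05)

launch_only = time_call(kernel.launch)   # clock stops right after the launch
kernel.synchronize()                     # drain leftover background work before the next measurement
with_sync = time_call(kernel.launch, sync=kernel.synchronize)

print(f"launch only: {launch_only * 1e3:.2f} ms")
print(f"with sync:   {with_sync * 1e3:.2f} ms")
```

On a real GPU the same pattern would presumably pass `numba.cuda.synchronize` (or `cupy.cuda.Stream.null.synchronize` for CuPy) as the `sync` argument; I have not benchmarked that here.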

Karpisek commented Dec 9, 2020

It's being referenced from the Dask documentation as well...
