@sonots
Last active July 21, 2020 13:09

It occurred when I did not wait for the GPU kernel to finish before reading the result on the host.

#include <stdio.h>
#include <cuda_runtime.h>

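__global__
void my_kernel(int val, int *A, int N)
{
    // Only threadIdx.x is used, so with the <<<2,2>>> launch below
    // both blocks write A[0] and A[1]; the other elements are never set.
    int i = threadIdx.x;
    if (i < N) A[i] = val;
}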

void
my_kernel_launch()
{
    int N = 9;
    int *a;
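    cudaMallocManaged((void**)&a, N*sizeof(int));
    // Kernel launches are asynchronous: control returns to the host immediately.
    my_kernel<<<2,2>>>(2, a, N);
    // cudaDeviceSynchronize(); // <= need this! Without it, the loop below reads a[] while the GPU may still be running.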
    for (int i = 0; i < N; i++) {
        printf("%d ", a[i]);
    }
    printf("\n");
}

int main(int argc, char **argv)
{
    my_kernel_launch();
}
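
For comparison, here is a minimal sketch of a synchronized version, assuming the fix is exactly the `cudaDeviceSynchronize()` call flagged above. The `CHECK` macro, the `fill_kernel` and `fill_and_sum` names, and the choice of 4 threads per block are hypothetical and not part of the original gist. It also uses a global thread index so that every element is written.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Hypothetical helper (not in the original gist): abort on any CUDA error.
#define CHECK(call)                                                 \
    do {                                                            \
        cudaError_t err_ = (call);                                  \
        if (err_ != cudaSuccess) {                                  \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,      \
                    cudaGetErrorString(err_));                      \
            exit(1);                                                \
        }                                                           \
    } while (0)

// Each thread writes one element, addressed by its global index.
__global__ void fill_kernel(int val, int *A, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) A[i] = val;
}

// Fills n managed ints with val on the GPU, then sums them on the host.
int fill_and_sum(int val, int n)
{
    int *a = NULL;
    CHECK(cudaMallocManaged((void**)&a, n * sizeof(int)));

    int threads = 4;                              // assumed block size
    int blocks = (n + threads - 1) / threads;     // enough blocks to cover n
    fill_kernel<<<blocks, threads>>>(val, a, n);
    CHECK(cudaGetLastError());        // catches launch configuration errors
    CHECK(cudaDeviceSynchronize());   // wait for the kernel before touching a[]

    int sum = 0;
    for (int i = 0; i < n; i++) sum += a[i];      // safe: the GPU is idle now
    CHECK(cudaFree(a));
    return sum;
}

int main(void)
{
    printf("%d\n", fill_and_sum(2, 9));
    return 0;
}
```

Checking the return value of `cudaDeviceSynchronize()` is a design choice here: errors raised while the kernel runs are only reported at the next synchronizing call, so this is where they would surface.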
@trickarcher

Thanks a lot! This saved me a bunch of time. However, I couldn't find the reason why the call is needed. I am using Unified Memory too, but the behaviour is inconsistent. Why is an explicit call required after the kernel launch, especially with Unified Memory?
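
One possible explanation, offered as a hedged note rather than a definitive answer: kernel launches are asynchronous, so the host loop can start while the kernel is still running. On devices that report `concurrentManagedAccess` as 0 (for example pre-Pascal GPUs), the host is not allowed to touch managed memory while the GPU is active, and doing so can fault. On devices that report 1, the access itself is allowed, but without synchronization the host may still read values the kernel has not written yet. That difference could account for behaviour that varies between machines. The sketch below queries the attribute; the function name `concurrent_managed_access` is hypothetical.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Returns 1 if the device supports concurrent host/GPU access to managed
// memory, 0 if it does not, and -1 if the query itself fails.
int concurrent_managed_access(int device)
{
    int value = 0;
    cudaError_t err = cudaDeviceGetAttribute(
        &value, cudaDevAttrConcurrentManagedAccess, device);
    if (err != cudaSuccess) {
        fprintf(stderr, "query failed: %s\n", cudaGetErrorString(err));
        return -1;
    }
    return value;
}

int main(void)
{
    int supported = concurrent_managed_access(0);  // device 0 is an assumption
    printf("concurrentManagedAccess = %d\n", supported);
    return 0;
}
```

Whatever value is reported, synchronizing before the host reads the buffer remains necessary to see the kernel's results.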