That's not "a bit". The memory allocation is likely more expensive than the entire compute kernel, especially in the case of cudaMalloc. To be fair, you should allocate a new GPU output array and have the kernel populate it from the input array. You could also try running the CPU kernel in place by passing the input array in as the output as well.
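Something like this sketch is what I have in mind, assuming a Numba CUDA elementwise kernel (the kernel body, array size, and launch configuration are all placeholders, not taken from the notebook):

```python
import numpy as np
from numba import cuda

@cuda.jit
def scale_kernel(inp, out):
    i = cuda.grid(1)
    if i < inp.size:
        out[i] = inp[i] * 2.0  # placeholder compute standing in for the real kernel

x = np.arange(1_000_000, dtype=np.float32)
d_x = cuda.to_device(x)
d_out = cuda.device_array_like(d_x)  # output allocated once, outside any timed region

threads_per_block = 256
blocks = (x.size + threads_per_block - 1) // threads_per_block
scale_kernel[blocks, threads_per_block](d_x, d_out)  # kernel fills the preallocated output
cuda.synchronize()
```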
For 1d arrays you can use
This may be capturing JIT compilation on each run, since it's only one loop. I'd recommend running an equal number of loops for both CPU and GPU so the caching of the JIT kernel compilation is fair to both.
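For example, reusing the hypothetical `scale_kernel` and device arrays from the sketch above, a warm-up launch before the timed loop keeps the compilation cost out of the measurement:

```python
import time

# Warm-up launch pays the JIT compilation cost once, up front
scale_kernel[blocks, threads_per_block](d_x, d_out)
cuda.synchronize()

n_loops = 100
start = time.perf_counter()
for _ in range(n_loops):
    scale_kernel[blocks, threads_per_block](d_x, d_out)
cuda.synchronize()  # launches are async; wait for completion before stopping the clock
print(f"avg kernel time: {(time.perf_counter() - start) / n_loops * 1e3:.3f} ms")
```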
This is why we have pool memory allocators, yes?
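For instance, CuPy routes GPU allocations through a pool by default, so freed blocks are reused for later same-size requests instead of paying for a fresh cudaMalloc each time (whether this notebook uses CuPy is an assumption on my part):

```python
import cupy as cp

# CuPy's default memory pool caches freed device blocks for reuse
pool = cp.get_default_memory_pool()

a = cp.arange(1_000_000, dtype=cp.float32)
out = cp.empty_like(a)
del out
out = cp.empty_like(a)  # served from the pool's cached block, no new cudaMalloc
print(pool.used_bytes(), pool.total_bytes())
```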
I tried this briefly and wasn't able to get it to work. I also ended up timing the allocation on the CPU side, and it was only 40-50 ms, about 10% of the total compute time. I agree, though, that this would be worth investigating further if someone does a real benchmark here (that's not my intention for this particular notebook).
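For reference, a minimal way to isolate the host-side allocation time (the array size here is a placeholder, not the notebook's actual array):

```python
import time
import numpy as np

# np.empty may defer physical page allocation, so the buffer is
# written once to make the full allocation cost visible
start = time.perf_counter()
buf = np.empty(100_000_000, dtype=np.float32)
buf.fill(0.0)
print(f"allocate + touch: {(time.perf_counter() - start) * 1e3:.1f} ms")
```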
I've rerun it several times within the same process (to avoid the JIT compilation) and didn't notice any difference.