This gist serves to document certain cupy calls and the wall-clock timings associated with them. For each code snippet (usually timed simply with %%timeit), the corresponding timing output is provided. For most/all calls, -n 100 is added because the processing time grows sharply when the GPU is flooded with too many loops from %%timeit (the cause is unclear, but it probably results from queuing too many kernel launches together).
Yes, wall-clock time is not the correct way to measure GPU processing times, but when control is switching back and forth between the interpreter and the device, it is the more 'reasonable' number to look at.
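For reference, device-side timings can be measured with CUDA events rather than wall-clock time. The sketch below is a minimal pattern, not anything cupy ships: the helper name gpu_time is our own, and the demo is guarded so the snippet still loads when cupy or a GPU is unavailable.

```python
# Sketch: timing a cupy operation with CUDA events instead of wall-clock time.
# The helper name gpu_time is hypothetical; cupy only provides the Event API.
try:
    import cupy as cp
except ImportError:  # cupy not installed; keep the helper importable anyway
    cp = None

def gpu_time(fn, *args):
    """Return the device-side duration of fn(*args) in milliseconds."""
    start, end = cp.cuda.Event(), cp.cuda.Event()
    start.record()
    fn(*args)
    end.record()
    end.synchronize()  # wait for the kernel to actually finish
    return cp.cuda.get_elapsed_time(start, end)

if cp is not None:
    try:
        b = cp.zeros(int(1e6), dtype=cp.complex64)
        print(gpu_time(cp.fft.fft, b))  # ms, includes full kernel execution
    except Exception:
        pass  # running the demo requires a working GPU
```

Unlike %%timeit, this measures the kernel itself rather than just the launch overhead.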
%%timeit -n 100
b = cp.zeros(int(1e6), dtype=cp.complex64)
7.68 µs ± 919 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit -n 100
b = cp.zeros(int(1e7), dtype=cp.complex64)
7.13 µs ± 553 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
Interestingly, there is no significant difference in allocation timings; the zero-fill kernel is launched asynchronously (and the allocation is served from cupy's memory pool), so these numbers likely reflect launch overhead more than the allocation itself.
a = cp.zeros(int(1e6), dtype=cp.complex64)
%%timeit -n 100
a[:] = 0
4.97 µs ± 531 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
a = cp.zeros(int(1e7), dtype=cp.complex64)
%%timeit -n 100
a[:] = 0
4.7 µs ± 359 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
Zero-filling in place is slightly faster than reallocating.
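The reuse pattern generalises: allocate once up front, then refill the buffer inside the hot loop. cupy mirrors the numpy API, so the sketch below uses numpy purely so it runs on CPU; swap np for cp unchanged.

```python
import numpy as np  # cupy (cp) exposes the same calls

buf = np.zeros(int(1e6), dtype=np.complex64)  # allocate once, up front

for _ in range(3):       # hot loop: reuse the buffer instead of reallocating
    buf[:] = 0           # in-place zero fill, no new allocation
    buf[:10] = 1 + 1j    # ... do work into buf ...

assert buf.dtype == np.complex64 and buf[0] == 1 + 1j
```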
b = cp.random.randn(int(1e6)) + cp.random.randn(int(1e6))*1j
%%timeit -n 100
d = cp.fft.fft(b)
40.9 µs ± 1.43 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
a = cp.random.randn(int(1e6)) + cp.random.randn(int(1e6))*1j
a = a.reshape((1000,1000))
%%timeit -n 100
d = cp.fft.fft(a, axis=1)
41.4 µs ± 1.09 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit -n 100
d = cp.fft.fft(a, axis=0)
41.7 µs ± 674 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
Not much difference in speed versus the single long FFT; the operation is likely memory-bandwidth limited at this point.
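As a sanity check on the reshape trick, an FFT along axis=1 of the 2D array is exactly a row-wise batch of shorter FFTs. The equivalence can be demonstrated on CPU with numpy, whose API cupy mirrors (smaller sizes used here purely for speed):

```python
import numpy as np  # cp.fft.fft in cupy has the same signature

rng = np.random.default_rng(0)
a = (rng.standard_normal(10_000) + 1j * rng.standard_normal(10_000)).reshape(100, 100)

batched = np.fft.fft(a, axis=1)   # one batched call: 100 FFTs of length 100
row0    = np.fft.fft(a[0])        # single FFT of the first row

assert np.allclose(batched[0], row0)
```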
a = cp.random.randn(int(1e6)) + cp.random.randn(int(1e6))*1j
b = cp.zeros_like(a)
%%timeit -n 100
b[:] = a
7.56 µs ± 462 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit -n 100
cp.copyto(b,a)
11.8 µs ± 589 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
copyto() is actually worse!
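Both spellings produce identical contents; the difference is purely call overhead. A CPU sketch of the equivalence, with numpy standing in for cupy:

```python
import numpy as np

a = np.arange(8, dtype=np.complex128)
b = np.zeros_like(a)
c = np.zeros_like(a)

b[:] = a          # slice assignment
np.copyto(c, a)   # explicit copyto

assert np.array_equal(b, a) and np.array_equal(c, a)
```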
a = cp.random.randn(int(1e6)) + cp.random.randn(int(1e6))*1j
b = cp.zeros_like(a)
%%timeit -n 100
for i in range(1000):
    b[i*1000:(i+1)*1000] = a[i*1000:(i+1)*1000]
8.8 ms ± 259 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit -n 100
for i in range(1000):
    cp.copyto(b[i*1000:(i+1)*1000], a[i*1000:(i+1)*1000])
16 ms ± 213 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Performance here is about three orders of magnitude worse than the previous section (8.8 ms vs 7.56 µs). The interpreter loop is likely at fault: computing/parsing the index limits and launching 1000 tiny copies from Python takes far longer than the copies themselves.
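The takeaway is to replace interpreter-side chunk loops with a single device-side slice copy whenever the chunks tile the array. A CPU sketch of the equivalence (numpy standing in for cupy, smaller sizes for brevity):

```python
import numpy as np

a = np.arange(10_000, dtype=np.complex128)
chunked = np.zeros_like(a)
whole   = np.zeros_like(a)

# 10 Python-level iterations, each issuing one small copy
for i in range(10):
    chunked[i*1000:(i+1)*1000] = a[i*1000:(i+1)*1000]

# one copy covering the same elements: same result, far fewer launches
whole[:] = a

assert np.array_equal(chunked, whole)
```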
a = cp.random.randn(int(1e6)) + cp.random.randn(int(1e6))*1j
a = a.reshape((1000,1000))
%%timeit -n 100
b = a.flatten()
22.7 µs ± 1.49 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit -n 100
b = a.reshape(-1)
1.25 µs ± 106 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
Invoking .reshape() is much faster! This is expected: on a contiguous array, reshape() returns a view without copying any data, whereas flatten() always allocates and copies.
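The gap follows from view-versus-copy semantics, which np.shares_memory makes explicit below; cupy's reshape() and flatten() behave the same way for contiguous arrays.

```python
import numpy as np

a = np.arange(9).reshape(3, 3)

flat_view = a.reshape(-1)   # view: no data is copied
flat_copy = a.flatten()     # always allocates and copies

assert np.shares_memory(a, flat_view)
assert not np.shares_memory(a, flat_copy)
```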
b = cp.random.randn(int(1e6)) + cp.random.randn(int(1e6))*1j
a = b.reshape((1000,1000))
# Baseline
%%timeit -n 100
d = cp.linalg.norm(b)**2
104 µs ± 2.81 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# Flattening
%%timeit -n 100
d = cp.linalg.norm(a.flatten())**2
130 µs ± 1.58 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# Reshaping
%%timeit -n 100
d = cp.linalg.norm(a.reshape(-1))**2
107 µs ± 2.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# Norm rows then sum
%%timeit -n 100
d = cp.sum(cp.linalg.norm(a, axis=1)**2)
97.3 µs ± 1.58 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# Abs, square, then sum
%%timeit -n 100
d = cp.sum(cp.abs(a)**2)
84 µs ± 1.75 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
This is by far the most interesting result: there appears to be some optimal 'cache'/working-set size for the norm, since splitting the array into rows and then summing even outstrips the direct call on the entire array. It is likely that cupy does not optimise the norm() call for arrays of this length.
However, the abs() then sum() call is clearly the fastest.
Note that all of these calls return the same result (up to floating-point error).
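The claimed equivalence is easy to verify on CPU: all five expressions compute Σ|x|² and agree up to floating-point rounding (numpy standing in for cupy, smaller sizes for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
b = rng.standard_normal(10_000) + 1j * rng.standard_normal(10_000)
a = b.reshape(100, 100)

results = [
    np.linalg.norm(b)**2,
    np.linalg.norm(a.flatten())**2,
    np.linalg.norm(a.reshape(-1))**2,
    np.sum(np.linalg.norm(a, axis=1)**2),
    np.sum(np.abs(a)**2),
]

assert np.allclose(results, results[0])
```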
b = cp.random.randn(int(1e6)) + cp.random.randn(int(1e6))*1j
a = b.reshape((1000,1000))
c = cp.zeros_like(b)
# Baseline, square, new array
%%timeit -n 100
d = b**2
34.2 µs ± 846 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
# Baseline, multiply, new array
%%timeit -n 100
d = b*b
12.6 µs ± 826 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
# Baseline, multiply, preallocated
%%timeit -n 100
cp.multiply(b,b,out=c)
8.33 µs ± 527 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
# Reshaped, square, new array
%%timeit -n 100
d = a**2
33.1 µs ± 1.34 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# Reshaped, multiply, new array
%%timeit -n 100
d = a*a
12.5 µs ± 1.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# Reshaped, multiply, preallocated
c = cp.zeros_like(a)
%%timeit -n 100
cp.multiply(a,a,out=c)
8.33 µs ± 673 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
# Reshaped, multiply, in-place
%%timeit -n 100
cp.multiply(a,a,out=a)
8.39 µs ± 531 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
The obvious trend: preallocate and use the 'out' argument when performing multiplies, even for squares.
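The preallocated-out pattern, in one place (numpy shown; replace np with cp unchanged):

```python
import numpy as np

rng = np.random.default_rng(0)
b = rng.standard_normal(1000) + 1j * rng.standard_normal(1000)
out = np.empty_like(b)        # preallocate once, outside the hot loop

np.multiply(b, b, out=out)    # writes into out: no fresh allocation per call

assert np.allclose(out, b**2)  # same values as the allocating spelling
```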