@leofang
Created March 31, 2020 19:40
test CUB kernels
import sys
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

import numpy as np
import cupy as cp
from cupyx.time import repeat


shape = (256, 512, 512)
a = cp.random.random(shape)
a_np = cp.asnumpy(a)

CUB_supported = ['sum', 'prod', 'min', 'max', 'argmin', 'argmax']
REST = ['amin', 'amax', 'nanmin', 'nanmax', 'nanargmin', 'nanargmax',
        'mean', 'nanmean', 'var', 'nanvar', 'nansum', 'nanprod',
        'all', 'any', 'count_nonzero']

for reduce_func in CUB_supported + REST:
    for axis in [(2,), (1, 2), (0, 1, 2)]:  # [(0,1,2)]:
        print("testing", reduce_func, "with axis = ", axis, '...')
        func = getattr(cp, reduce_func)

        # get numpy answer for comparison
        if reduce_func not in ('argmin', 'argmax', 'nanargmin', 'nanargmax'):
            ans = getattr(np, reduce_func)(a_np, axis)
        elif len(axis) == 1:
            ans = getattr(np, reduce_func)(a_np, axis[0])
        else:
            ans = None

        # old kernel: both CUB code paths disabled
        cp.cuda.cub_enabled = False
        cp.core.cub_block_reduction_enabled = False
        data = repeat(func, (a, axis), n=100)
        results = [data._to_str_per_item('GPU', data.gpu_times)]
        print('{:<10s} (old kernel):{}'.format(reduce_func, ' '.join(results)))
        b = func(a, axis)

        # CUB device: device-wide reduction, only for the supported functions
        if reduce_func in CUB_supported:
            cp.cuda.cub_enabled = True
            cp.core.cub_block_reduction_enabled = False
            data = repeat(func, (a, axis), n=100)
            results = [data._to_str_per_item('GPU', data.gpu_times)]
            print('{:<10s} (CUB device):{}'.format(reduce_func, ' '.join(results)))
            c = func(a, axis)
        else:
            print('{:<10s} (CUB device):{}'.format(reduce_func, ' (CUB device-wide reduction not available)'))
            c = None

        # CUB blocks: block-level reduction
        cp.cuda.cub_enabled = False
        cp.core.cub_block_reduction_enabled = True
        data = repeat(func, (a, axis), n=100)
        results = [data._to_str_per_item('GPU', data.gpu_times)]
        print('{:<10s} (CUB blocks):{}'.format(reduce_func, ' '.join(results)))
        d = func(a, axis)

        # compare every GPU result against the numpy answer, when there is one
        try:
            cp.cuda.cub_enabled = False
            cp.core.cub_block_reduction_enabled = False
            if ans is not None:
                assert cp.allclose(ans, b)
                if c is not None:
                    assert cp.allclose(ans, c)
                assert cp.allclose(ans, d)
        except AssertionError:
            print("Result not match! (function: {}, axis: {})".format(reduce_func, axis), file=sys.stderr)
            raise
        finally:
            print()
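
The script switches code paths by assigning to module-level flags before each timing run and resetting them by hand. A minimal sketch of a way to scope such a switch is shown below. The helper name `temporary_attrs` and the `fake_cuda` stand-in object are hypothetical and not part of CuPy. Whether assigning to `cp.cuda.cub_enabled` or `cp.core.cub_block_reduction_enabled` still has any effect depends on the CuPy version in use.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def temporary_attrs(obj, **overrides):
    """Set attributes on obj for the duration of a with-block, then restore them."""
    # Remember the current values so they can be put back afterwards.
    saved = {name: getattr(obj, name) for name in overrides}
    try:
        for name, value in overrides.items():
            setattr(obj, name, value)
        yield obj
    finally:
        # Runs even if the body raises, so a failed run cannot leak a flag.
        for name, value in saved.items():
            setattr(obj, name, value)


# Stand-in for a module such as cp.cuda (assumption: a real run would pass
# the module object itself instead of this namespace).
fake_cuda = SimpleNamespace(cub_enabled=False)

with temporary_attrs(fake_cuda, cub_enabled=True):
    inside = fake_cuda.cub_enabled   # value seen while the block runs
after = fake_cuda.cub_enabled        # value after the block has exited
```

With a helper like this, each timing run could sit inside its own `with` block, and an exception partway through would not leave a flag in the wrong state for the next run.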

leofang commented Mar 31, 2020

CUDA 9.2 + P100:
("old kernel" is CuPy's original implementation, "CUB device" uses cupy.cuda.cub if available, and "CUB blocks" refers to this PR.)

testing sum with axis =  (2,) ...
sum        (old kernel):    GPU: 4990.667 us   +/-224.063 (min: 4809.152 / max: 5335.168) us
sum        (CUB device):    GPU: 1006.034 us   +/- 2.244 (min: 1002.944 / max: 1017.920) us
sum        (CUB blocks):    GPU:  988.458 us   +/-14.419 (min:  968.448 / max: 1041.056) us

testing sum with axis =  (1, 2) ...
sum        (old kernel):    GPU: 1220.164 us   +/-56.665 (min: 1178.688 / max: 1323.968) us
sum        (CUB device):    GPU:  992.146 us   +/- 2.545 (min:  987.328 / max: 1004.224) us
sum        (CUB blocks):    GPU: 1231.308 us   +/- 2.996 (min: 1220.544 / max: 1247.680) us

testing sum with axis =  (0, 1, 2) ...
sum        (old kernel):    GPU:60721.186 us   +/- 7.636 (min:60699.745 / max:60738.495) us
sum        (CUB device):    GPU:  991.094 us   +/- 2.004 (min:  986.400 / max:  997.600) us
sum        (CUB blocks):    GPU: 1019.814 us   +/-15.642 (min:  997.248 / max: 1071.424) us

testing prod with axis =  (2,) ...
prod       (old kernel):    GPU: 4810.658 us   +/- 1.593 (min: 4807.840 / max: 4817.696) us
prod       (CUB device):    GPU: 1006.842 us   +/- 1.879 (min: 1003.456 / max: 1015.648) us
prod       (CUB blocks):    GPU:  989.030 us   +/-14.076 (min:  969.792 / max: 1038.432) us

testing prod with axis =  (1, 2) ...
prod       (old kernel):    GPU: 1230.642 us   +/-52.468 (min: 1196.928 / max: 1337.824) us
prod       (CUB device):    GPU: 1028.420 us   +/- 2.352 (min: 1024.704 / max: 1038.368) us
prod       (CUB blocks):    GPU: 1245.720 us   +/-16.414 (min: 1173.216 / max: 1310.048) us

testing prod with axis =  (0, 1, 2) ...
prod       (old kernel):    GPU:60720.479 us   +/- 8.317 (min:60703.232 / max:60761.536) us
prod       (CUB device):    GPU:  997.772 us   +/- 2.252 (min:  993.600 / max: 1006.720) us
prod       (CUB blocks):    GPU: 1010.511 us   +/-17.689 (min:  983.840 / max: 1079.008) us

testing min with axis =  (2,) ...
min        (old kernel):    GPU: 7278.341 us   +/- 3.611 (min: 7272.064 / max: 7297.952) us
min        (CUB device):    GPU: 1164.450 us   +/- 3.243 (min: 1158.048 / max: 1178.336) us
min        (CUB blocks):    GPU: 1049.255 us   +/- 1.673 (min: 1047.136 / max: 1061.280) us

testing min with axis =  (1, 2) ...
min        (old kernel):    GPU: 1538.007 us   +/- 2.069 (min: 1532.352 / max: 1543.072) us
min        (CUB device):    GPU:  999.870 us   +/- 3.480 (min:  995.840 / max: 1018.784) us
min        (CUB blocks):    GPU: 1530.452 us   +/-16.109 (min: 1427.808 / max: 1567.840) us

testing min with axis =  (0, 1, 2) ...
min        (old kernel):    GPU:68405.005 us   +/-27.187 (min:68339.165 / max:68464.447) us
min        (CUB device):    GPU: 1132.716 us   +/- 2.607 (min: 1127.424 / max: 1141.664) us
min        (CUB blocks):    GPU: 1147.597 us   +/- 2.324 (min: 1145.088 / max: 1162.880) us

testing max with axis =  (2,) ...
max        (old kernel):    GPU: 7278.655 us   +/- 2.921 (min: 7272.128 / max: 7285.408) us
max        (CUB device):    GPU: 1165.593 us   +/- 3.729 (min: 1159.200 / max: 1185.248) us
max        (CUB blocks):    GPU: 1075.101 us   +/-15.097 (min: 1053.984 / max: 1138.592) us

testing max with axis =  (1, 2) ...
max        (old kernel):    GPU: 1537.825 us   +/- 2.782 (min: 1532.032 / max: 1551.008) us
max        (CUB device):    GPU:  999.421 us   +/- 3.458 (min:  994.784 / max: 1020.160) us
max        (CUB blocks):    GPU: 1507.362 us   +/-21.421 (min: 1419.968 / max: 1532.896) us

testing max with axis =  (0, 1, 2) ...
max        (old kernel):    GPU:68406.172 us   +/-30.235 (min:68326.942 / max:68487.518) us
max        (CUB device):    GPU: 1132.194 us   +/- 3.130 (min: 1121.920 / max: 1140.032) us
max        (CUB blocks):    GPU: 1183.299 us   +/-16.642 (min: 1161.440 / max: 1247.680) us

testing argmin with axis =  (2,) ...
argmin     (old kernel):    GPU: 7284.553 us   +/- 2.061 (min: 7280.640 / max: 7290.560) us
argmin     (CUB device):    GPU: 7291.877 us   +/- 5.732 (min: 7285.216 / max: 7332.960) us
argmin     (CUB blocks):    GPU: 1898.855 us   +/- 3.189 (min: 1895.200 / max: 1914.400) us

testing argmin with axis =  (1, 2) ...
argmin     (old kernel):    GPU: 1663.525 us   +/- 4.502 (min: 1654.656 / max: 1683.392) us
argmin     (CUB device):    GPU: 1668.816 us   +/- 3.791 (min: 1661.312 / max: 1682.016) us
argmin     (CUB blocks):    GPU: 2583.306 us   +/- 6.405 (min: 2574.208 / max: 2602.336) us

testing argmin with axis =  (0, 1, 2) ...
argmin     (old kernel):    GPU:80232.939 us   +/-17.090 (min:80189.346 / max:80269.981) us
argmin     (CUB device):    GPU:  977.945 us   +/- 1.861 (min:  974.112 / max:  983.296) us
argmin     (CUB blocks):    GPU: 1548.706 us   +/-17.541 (min: 1527.744 / max: 1623.424) us

testing argmax with axis =  (2,) ...
argmax     (old kernel):    GPU: 7282.514 us   +/- 2.724 (min: 7277.920 / max: 7300.736) us
argmax     (CUB device):    GPU: 7289.048 us   +/- 3.116 (min: 7283.904 / max: 7304.640) us
argmax     (CUB blocks):    GPU: 1914.180 us   +/-13.747 (min: 1896.896 / max: 1961.920) us

testing argmax with axis =  (1, 2) ...
argmax     (old kernel):    GPU: 1662.225 us   +/- 3.287 (min: 1654.880 / max: 1679.520) us
argmax     (CUB device):    GPU: 1668.483 us   +/- 4.119 (min: 1661.600 / max: 1693.312) us
argmax     (CUB blocks):    GPU: 2597.427 us   +/-16.496 (min: 2574.752 / max: 2658.336) us

testing argmax with axis =  (0, 1, 2) ...
argmax     (old kernel):    GPU:80235.772 us   +/-16.621 (min:80194.496 / max:80274.529) us
argmax     (CUB device):    GPU:  977.794 us   +/- 2.587 (min:  973.888 / max:  997.376) us
argmax     (CUB blocks):    GPU: 1531.457 us   +/- 6.958 (min: 1525.536 / max: 1565.600) us

testing amin with axis =  (2,) ...
amin       (old kernel):    GPU: 7277.595 us   +/- 2.922 (min: 7271.520 / max: 7287.392) us
amin       (CUB device):    (CUB device-wide reduction not available)
amin       (CUB blocks):    GPU: 1050.224 us   +/- 2.604 (min: 1048.192 / max: 1064.736) us

testing amin with axis =  (1, 2) ...
amin       (old kernel):    GPU: 1538.464 us   +/- 2.590 (min: 1534.272 / max: 1553.952) us
amin       (CUB device):    (CUB device-wide reduction not available)
amin       (CUB blocks):    GPU: 1509.618 us   +/-18.288 (min: 1420.512 / max: 1527.808) us

testing amin with axis =  (0, 1, 2) ...
amin       (old kernel):    GPU:68408.069 us   +/-28.961 (min:68328.163 / max:68460.579) us
amin       (CUB device):    (CUB device-wide reduction not available)
amin       (CUB blocks):    GPU: 1148.078 us   +/- 2.981 (min: 1144.416 / max: 1164.128) us

testing amax with axis =  (2,) ...
amax       (old kernel):    GPU: 7277.381 us   +/- 3.130 (min: 7271.744 / max: 7291.360) us
amax       (CUB device):    (CUB device-wide reduction not available)
amax       (CUB blocks):    GPU: 1050.536 us   +/- 2.236 (min: 1048.768 / max: 1065.344) us

testing amax with axis =  (1, 2) ...
amax       (old kernel):    GPU: 1538.486 us   +/- 2.074 (min: 1533.696 / max: 1543.392) us
amax       (CUB device):    (CUB device-wide reduction not available)
amax       (CUB blocks):    GPU: 1511.346 us   +/-21.175 (min: 1422.176 / max: 1653.792) us

testing amax with axis =  (0, 1, 2) ...
amax       (old kernel):    GPU:68407.898 us   +/-30.126 (min:68329.025 / max:68484.093) us
amax       (CUB device):    (CUB device-wide reduction not available)
amax       (CUB blocks):    GPU: 1147.089 us   +/- 1.814 (min: 1143.968 / max: 1161.344) us

testing nanmin with axis =  (2,) ...
nanmin     (old kernel):    GPU: 6991.311 us   +/- 4.643 (min: 6978.464 / max: 7006.528) us
nanmin     (CUB device):    (CUB device-wide reduction not available)
nanmin     (CUB blocks):    GPU:  994.882 us   +/- 5.229 (min:  987.584 / max: 1016.160) us

testing nanmin with axis =  (1, 2) ...
nanmin     (old kernel):    GPU: 1540.578 us   +/-23.190 (min: 1446.240 / max: 1560.320) us
nanmin     (CUB device):    (CUB device-wide reduction not available)
nanmin     (CUB blocks):    GPU: 1375.904 us   +/-59.778 (min: 1280.640 / max: 1554.528) us

testing nanmin with axis =  (0, 1, 2) ...
nanmin     (old kernel):    GPU:62372.306 us   +/-14.778 (min:62346.687 / max:62481.857) us
nanmin     (CUB device):    (CUB device-wide reduction not available)
nanmin     (CUB blocks):    GPU: 1080.148 us   +/- 4.415 (min: 1070.336 / max: 1097.984) us

testing nanmax with axis =  (2,) ...
nanmax     (old kernel):    GPU: 6994.590 us   +/- 3.793 (min: 6985.184 / max: 7011.488) us
nanmax     (CUB device):    (CUB device-wide reduction not available)
nanmax     (CUB blocks):    GPU:  997.170 us   +/- 9.406 (min:  989.888 / max: 1075.936) us

testing nanmax with axis =  (1, 2) ...
nanmax     (old kernel):    GPU: 1541.859 us   +/-24.514 (min: 1448.064 / max: 1562.304) us
nanmax     (CUB device):    (CUB device-wide reduction not available)
nanmax     (CUB blocks):    GPU: 1386.268 us   +/-57.013 (min: 1302.144 / max: 1537.888) us

testing nanmax with axis =  (0, 1, 2) ...
nanmax     (old kernel):    GPU:62371.801 us   +/- 9.200 (min:62346.497 / max:62399.902) us
nanmax     (CUB device):    (CUB device-wide reduction not available)
nanmax     (CUB blocks):    GPU: 1078.648 us   +/- 6.158 (min: 1068.096 / max: 1101.536) us

testing nanargmin with axis =  (2,) ...
nanargmin  (old kernel):    GPU: 7517.221 us   +/- 7.920 (min: 7494.016 / max: 7537.056) us
nanargmin  (CUB device):    (CUB device-wide reduction not available)
nanargmin  (CUB blocks):    GPU: 1957.636 us   +/- 2.488 (min: 1954.400 / max: 1974.016) us

testing nanargmin with axis =  (1, 2) ...
nanargmin  (old kernel):    GPU: 1847.492 us   +/- 3.122 (min: 1840.896 / max: 1861.760) us
nanargmin  (CUB device):    (CUB device-wide reduction not available)
nanargmin  (CUB blocks):    GPU: 2915.788 us   +/- 6.104 (min: 2905.344 / max: 2931.424) us

testing nanargmin with axis =  (0, 1, 2) ...
nanargmin  (old kernel):    GPU:81800.252 us   +/-15.694 (min:81771.713 / max:81842.079) us
nanargmin  (CUB device):    (CUB device-wide reduction not available)
nanargmin  (CUB blocks):    GPU: 1617.921 us   +/-14.655 (min: 1598.272 / max: 1671.904) us

testing nanargmax with axis =  (2,) ...
nanargmax  (old kernel):    GPU: 7518.298 us   +/- 6.558 (min: 7495.232 / max: 7533.824) us
nanargmax  (CUB device):    (CUB device-wide reduction not available)
nanargmax  (CUB blocks):    GPU: 1975.568 us   +/-15.276 (min: 1955.552 / max: 2026.144) us

testing nanargmax with axis =  (1, 2) ...
nanargmax  (old kernel):    GPU: 1846.127 us   +/- 2.757 (min: 1839.552 / max: 1853.088) us
nanargmax  (CUB device):    (CUB device-wide reduction not available)
nanargmax  (CUB blocks):    GPU: 2916.888 us   +/- 7.969 (min: 2905.152 / max: 2948.512) us

testing nanargmax with axis =  (0, 1, 2) ...
nanargmax  (old kernel):    GPU:81800.073 us   +/-16.591 (min:81750.145 / max:81849.121) us
nanargmax  (CUB device):    (CUB device-wide reduction not available)
nanargmax  (CUB blocks):    GPU: 1621.587 us   +/-15.801 (min: 1598.848 / max: 1673.728) us

testing mean with axis =  (2,) ...
mean       (old kernel):    GPU: 4912.017 us   +/- 2.263 (min: 4909.056 / max: 4925.952) us
mean       (CUB device):    (CUB device-wide reduction not available)
mean       (CUB blocks):    GPU:  986.478 us   +/-16.677 (min:  954.496 / max: 1053.504) us

testing mean with axis =  (1, 2) ...
mean       (old kernel):    GPU: 1213.078 us   +/-52.004 (min: 1180.704 / max: 1325.280) us
mean       (CUB device):    (CUB device-wide reduction not available)
mean       (CUB blocks):    GPU: 1240.546 us   +/-20.976 (min: 1135.360 / max: 1293.440) us

testing mean with axis =  (0, 1, 2) ...
mean       (old kernel):    GPU:60706.537 us   +/- 7.010 (min:60680.767 / max:60722.782) us
mean       (CUB device):    (CUB device-wide reduction not available)
mean       (CUB blocks):    GPU:  994.593 us   +/-15.427 (min:  981.888 / max: 1085.344) us

testing nanmean with axis =  (2,) ...
nanmean    (old kernel):    GPU: 5706.016 us   +/- 1.958 (min: 5703.168 / max: 5719.584) us
nanmean    (CUB device):    (CUB device-wide reduction not available)
nanmean    (CUB blocks):    GPU:  975.555 us   +/-14.005 (min:  955.808 / max: 1043.840) us

testing nanmean with axis =  (1, 2) ...
nanmean    (old kernel):    GPU: 1548.820 us   +/- 3.529 (min: 1542.560 / max: 1568.832) us
nanmean    (CUB device):    (CUB device-wide reduction not available)
nanmean    (CUB blocks):    GPU: 1352.323 us   +/-53.238 (min: 1171.392 / max: 1465.216) us

testing nanmean with axis =  (0, 1, 2) ...
nanmean    (old kernel):    GPU:66024.642 us   +/-34.199 (min:65944.382 / max:66108.192) us
nanmean    (CUB device):    (CUB device-wide reduction not available)
nanmean    (CUB blocks):    GPU: 1041.405 us   +/-12.325 (min: 1022.176 / max: 1111.168) us

testing var with axis =  (2,) ...
var        (old kernel):    GPU: 6639.957 us   +/- 5.696 (min: 6625.760 / max: 6660.704) us
var        (CUB device):    (CUB device-wide reduction not available)
var        (CUB blocks):    GPU: 2681.771 us   +/- 4.928 (min: 2669.888 / max: 2694.816) us

testing var with axis =  (1, 2) ...
var        (old kernel):    GPU:15182.817 us   +/-125.514 (min:14972.288 / max:15646.464) us
var        (CUB device):    (CUB device-wide reduction not available)
var        (CUB blocks):    GPU:15155.650 us   +/-113.951 (min:14979.904 / max:15504.448) us

testing var with axis =  (0, 1, 2) ...
var        (old kernel):    GPU:1871212.219 us   +/-328.295 (min:1870476.318 / max:1872004.028) us
var        (CUB device):    (CUB device-wide reduction not available)
var        (CUB blocks):    GPU:68161.514 us   +/-13.974 (min:68126.114 / max:68190.620) us

testing nanvar with axis =  (2,) ...
nanvar     (old kernel):    GPU:11508.903 us   +/- 2.874 (min:11504.384 / max:11521.824) us
nanvar     (CUB device):    (CUB device-wide reduction not available)
nanvar     (CUB blocks):    GPU: 4745.822 us   +/-12.190 (min: 4735.488 / max: 4795.680) us

testing nanvar with axis =  (1, 2) ...
nanvar     (old kernel):    GPU:27257.526 us   +/-19.881 (min:27204.224 / max:27351.233) us
nanvar     (CUB device):    (CUB device-wide reduction not available)
nanvar     (CUB blocks):    GPU:27215.992 us   +/-41.680 (min:27105.600 / max:27278.624) us

testing nanvar with axis =  (0, 1, 2) ...
nanvar     (old kernel):    GPU:4169462.632 us   +/-66.678 (min:4169269.043 / max:4169678.223) us
nanvar     (CUB device):    (CUB device-wide reduction not available)
nanvar     (CUB blocks):    GPU:185918.323 us   +/-39.052 (min:185835.587 / max:186036.774) us

testing nansum with axis =  (2,) ...
nansum     (old kernel):    GPU: 4341.788 us   +/- 1.886 (min: 4339.328 / max: 4355.648) us
nansum     (CUB device):    (CUB device-wide reduction not available)
nansum     (CUB blocks):    GPU:  952.576 us   +/- 2.080 (min:  950.464 / max:  964.800) us

testing nansum with axis =  (1, 2) ...
nansum     (old kernel):    GPU: 1244.563 us   +/- 3.199 (min: 1239.072 / max: 1256.672) us
nansum     (CUB device):    (CUB device-wide reduction not available)
nansum     (CUB blocks):    GPU: 1242.923 us   +/- 8.976 (min: 1164.128 / max: 1262.944) us

testing nansum with axis =  (0, 1, 2) ...
nansum     (old kernel):    GPU:63443.560 us   +/- 7.134 (min:63428.032 / max:63470.818) us
nansum     (CUB device):    (CUB device-wide reduction not available)
nansum     (CUB blocks):    GPU:  981.682 us   +/- 2.185 (min:  979.904 / max:  995.488) us

testing nanprod with axis =  (2,) ...
nanprod    (old kernel):    GPU: 4341.943 us   +/- 2.839 (min: 4339.360 / max: 4360.128) us
nanprod    (CUB device):    (CUB device-wide reduction not available)
nanprod    (CUB blocks):    GPU:  958.980 us   +/- 9.647 (min:  950.464 / max: 1041.760) us

testing nanprod with axis =  (1, 2) ...
nanprod    (old kernel):    GPU: 1243.993 us   +/- 3.449 (min: 1238.272 / max: 1259.232) us
nanprod    (CUB device):    (CUB device-wide reduction not available)
nanprod    (CUB blocks):    GPU: 1265.072 us   +/-14.266 (min: 1244.480 / max: 1312.800) us

testing nanprod with axis =  (0, 1, 2) ...
nanprod    (old kernel):    GPU:63448.173 us   +/- 6.305 (min:63426.399 / max:63468.258) us
nanprod    (CUB device):    (CUB device-wide reduction not available)
nanprod    (CUB blocks):    GPU:  983.126 us   +/- 2.103 (min:  980.704 / max: 1000.384) us

testing all with axis =  (2,) ...
all        (old kernel):    GPU: 4552.869 us   +/- 2.967 (min: 4549.888 / max: 4567.616) us
all        (CUB device):    (CUB device-wide reduction not available)
all        (CUB blocks):    GPU:  970.425 us   +/-10.555 (min:  953.728 / max: 1007.424) us

testing all with axis =  (1, 2) ...
all        (old kernel):    GPU: 1214.092 us   +/- 3.496 (min: 1207.200 / max: 1232.224) us
all        (CUB device):    (CUB device-wide reduction not available)
all        (CUB blocks):    GPU: 1246.993 us   +/-14.867 (min: 1226.368 / max: 1308.960) us

testing all with axis =  (0, 1, 2) ...
all        (old kernel):    GPU:63041.968 us   +/- 3.802 (min:63033.054 / max:63055.008) us
all        (CUB device):    (CUB device-wide reduction not available)
all        (CUB blocks):    GPU:  990.109 us   +/-11.881 (min:  967.232 / max: 1045.632) us

testing any with axis =  (2,) ...
any        (old kernel):    GPU: 4547.434 us   +/- 1.077 (min: 4545.696 / max: 4552.096) us
any        (CUB device):    (CUB device-wide reduction not available)
any        (CUB blocks):    GPU:  976.362 us   +/-14.041 (min:  959.360 / max: 1032.736) us

testing any with axis =  (1, 2) ...
any        (old kernel):    GPU: 1219.624 us   +/- 3.723 (min: 1212.544 / max: 1234.624) us
any        (CUB device):    (CUB device-wide reduction not available)
any        (CUB blocks):    GPU: 1216.872 us   +/- 9.261 (min: 1127.904 / max: 1233.024) us

testing any with axis =  (0, 1, 2) ...
any        (old kernel):    GPU:63813.432 us   +/-12.887 (min:63778.816 / max:63849.182) us
any        (CUB device):    (CUB device-wide reduction not available)
any        (CUB blocks):    GPU: 1002.218 us   +/-16.331 (min:  974.336 / max: 1071.168) us

testing count_nonzero with axis =  (2,) ...
count_nonzero (old kernel):    GPU: 4328.737 us   +/- 3.623 (min: 4325.152 / max: 4345.504) us
count_nonzero (CUB device):    (CUB device-wide reduction not available)
count_nonzero (CUB blocks):    GPU:  982.579 us   +/-12.322 (min:  959.200 / max: 1018.944) us

testing count_nonzero with axis =  (1, 2) ...
count_nonzero (old kernel):    GPU: 1233.267 us   +/- 2.633 (min: 1228.672 / max: 1241.184) us
count_nonzero (CUB device):    (CUB device-wide reduction not available)
count_nonzero (CUB blocks):    GPU: 1216.972 us   +/-44.600 (min: 1120.448 / max: 1306.560) us

testing count_nonzero with axis =  (0, 1, 2) ...
count_nonzero (old kernel):    GPU:64116.835 us   +/-18.003 (min:64073.792 / max:64186.974) us
count_nonzero (CUB device):    (CUB device-wide reduction not available)
count_nonzero (CUB blocks):    GPU: 1014.876 us   +/-17.970 (min:  982.816 / max: 1087.008) us
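
To compare the three variants at a glance, the timing lines above can be reduced to speedup ratios relative to the old kernel. The following is a rough sketch that assumes the exact line layout printed by the script. The names `TIMING_LINE`, `parse_timing_line` and `speedups` are made up for this illustration, and the sample lines are copied from the `sum`, `axis = (2,)` block of the P100 run.

```python
import re

# Matches one timing line as printed by the benchmark, e.g.
# "sum        (old kernel):    GPU: 4990.667 us   +/-224.063 (min: ... / max: ...) us"
TIMING_LINE = re.compile(
    r"^(?P<func>\S+)\s+\((?P<variant>[^)]+)\):\s*GPU:\s*(?P<mean>[\d.]+) us"
)


def parse_timing_line(line):
    """Return (func, variant, mean_us), or None if the line carries no timing."""
    match = TIMING_LINE.match(line.strip())
    if match is None:
        return None
    return match["func"], match["variant"], float(match["mean"])


def speedups(lines):
    """Map (func, variant) to old-kernel mean / variant mean for one axis block."""
    means = {}
    for line in lines:
        parsed = parse_timing_line(line)
        if parsed is not None:
            func, variant, mean_us = parsed
            means[(func, variant)] = mean_us
    result = {}
    for (func, variant), mean_us in means.items():
        baseline = means.get((func, "old kernel"))
        if variant != "old kernel" and baseline is not None:
            result[(func, variant)] = baseline / mean_us
    return result


sample = [
    "sum        (old kernel):    GPU: 4990.667 us   +/-224.063 (min: 4809.152 / max: 5335.168) us",
    "sum        (CUB device):    GPU: 1006.034 us   +/- 2.244 (min: 1002.944 / max: 1017.920) us",
    "sum        (CUB blocks):    GPU:  988.458 us   +/-14.419 (min:  968.448 / max: 1041.056) us",
]
for key, ratio in speedups(sample).items():
    print(key, round(ratio, 2))
```

Lines reporting that the CUB device-wide reduction is not available do not match the pattern and are skipped. The function keys results only by function and variant, so it should be fed one axis block at a time.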


leofang commented Apr 1, 2020

CUDA 10.0 + RTX 2080 Ti:

testing sum with axis =  (2,) ...
sum        (old kernel):    GPU: 4159.497 us   +/-405.757 (min: 3860.896 / max: 5381.728) us
sum        (CUB device):    GPU: 1142.829 us   +/-35.089 (min:  986.304 / max: 1169.760) us
sum        (CUB blocks):    GPU: 1138.303 us   +/-11.385 (min: 1131.520 / max: 1217.984) us

testing sum with axis =  (1, 2) ...
sum        (old kernel):    GPU: 1083.180 us   +/-53.013 (min:  938.624 / max: 1113.952) us
sum        (CUB device):    GPU: 1117.975 us   +/-27.751 (min:  957.120 / max: 1132.416) us
sum        (CUB blocks):    GPU: 1130.267 us   +/-27.995 (min:  971.200 / max: 1150.368) us

testing sum with axis =  (0, 1, 2) ...
sum        (old kernel):    GPU:38371.774 us   +/-32.665 (min:38208.576 / max:38410.976) us
sum        (CUB device):    GPU: 1097.675 us   +/-27.765 (min:  938.336 / max: 1113.280) us
sum        (CUB blocks):    GPU: 1167.509 us   +/-26.820 (min: 1044.896 / max: 1189.824) us

testing prod with axis =  (2,) ...
prod       (old kernel):    GPU: 4030.226 us   +/-18.521 (min: 3873.664 / max: 4079.168) us
prod       (CUB device):    GPU: 1152.048 us   +/- 2.670 (min: 1146.112 / max: 1164.064) us
prod       (CUB blocks):    GPU: 1122.420 us   +/-56.175 (min:  972.736 / max: 1201.024) us

testing prod with axis =  (1, 2) ...
prod       (old kernel):    GPU: 1077.918 us   +/-57.911 (min:  937.536 / max: 1113.440) us
prod       (CUB device):    GPU: 1120.289 us   +/-23.133 (min:  958.336 / max: 1138.656) us
prod       (CUB blocks):    GPU: 1130.728 us   +/-27.959 (min:  971.008 / max: 1151.136) us

testing prod with axis =  (0, 1, 2) ...
prod       (old kernel):    GPU:38373.287 us   +/-19.281 (min:38206.017 / max:38432.896) us
prod       (CUB device):    GPU: 1099.334 us   +/-22.900 (min:  938.496 / max: 1108.512) us
prod       (CUB blocks):    GPU: 1167.291 us   +/-21.194 (min: 1040.672 / max: 1183.616) us

testing min with axis =  (2,) ...
min        (old kernel):    GPU: 4084.754 us   +/-34.300 (min: 3910.240 / max: 4121.600) us
min        (CUB device):    GPU: 2366.779 us   +/-28.071 (min: 2207.456 / max: 2381.120) us
min        (CUB blocks):    GPU: 2287.402 us   +/-27.909 (min: 2128.704 / max: 2296.224) us

testing min with axis =  (1, 2) ...
min        (old kernel):    GPU: 1209.743 us   +/-45.193 (min: 1058.816 / max: 1281.920) us
min        (CUB device):    GPU: 1188.406 us   +/-71.327 (min: 1067.072 / max: 1244.864) us
min        (CUB blocks):    GPU: 3079.536 us   +/-16.804 (min: 2913.856 / max: 3088.384) us

testing min with axis =  (0, 1, 2) ...
min        (old kernel):    GPU:65523.585 us   +/-24.032 (min:65359.619 / max:65550.018) us
min        (CUB device):    GPU: 1172.083 us   +/-15.943 (min: 1015.936 / max: 1188.672) us
min        (CUB blocks):    GPU: 2754.994 us   +/-12.025 (min: 2717.440 / max: 2769.952) us

testing max with axis =  (2,) ...
max        (old kernel):    GPU: 4098.848 us   +/-23.508 (min: 3934.336 / max: 4111.936) us
max        (CUB device):    GPU: 2370.094 us   +/-22.693 (min: 2205.504 / max: 2406.944) us
max        (CUB blocks):    GPU: 2281.976 us   +/-41.136 (min: 2128.896 / max: 2323.712) us

testing max with axis =  (1, 2) ...
max        (old kernel):    GPU: 1218.720 us   +/-24.551 (min: 1061.344 / max: 1231.136) us
max        (CUB device):    GPU: 1230.850 us   +/-16.701 (min: 1069.664 / max: 1261.152) us
max        (CUB blocks):    GPU: 3072.603 us   +/-36.640 (min: 2914.592 / max: 3090.624) us

testing max with axis =  (0, 1, 2) ...
max        (old kernel):    GPU:65528.301 us   +/-23.639 (min:65365.822 / max:65551.453) us
max        (CUB device):    GPU: 1174.942 us   +/- 7.697 (min: 1161.280 / max: 1241.696) us
max        (CUB blocks):    GPU: 2757.204 us   +/- 9.050 (min: 2717.664 / max: 2778.400) us

testing argmin with axis =  (2,) ...
argmin     (old kernel):    GPU: 4461.026 us   +/-29.950 (min: 4299.520 / max: 4502.752) us
argmin     (CUB device):    GPU: 4471.823 us   +/-31.472 (min: 4313.024 / max: 4491.456) us
argmin     (CUB blocks):    GPU: 3019.353 us   +/-23.323 (min: 2856.768 / max: 3037.984) us

testing argmin with axis =  (1, 2) ...
argmin     (old kernel):    GPU: 1560.079 us   +/-22.999 (min: 1400.320 / max: 1577.056) us
argmin     (CUB device):    GPU: 1560.169 us   +/-31.487 (min: 1402.240 / max: 1579.872) us
argmin     (CUB blocks):    GPU: 3961.720 us   +/-28.149 (min: 3802.336 / max: 3984.064) us

testing argmin with axis =  (0, 1, 2) ...
argmin     (old kernel):    GPU:88031.716 us   +/-26.109 (min:87861.633 / max:88106.689) us
argmin     (CUB device):    GPU: 1506.377 us   +/-46.813 (min: 1371.392 / max: 1550.944) us
argmin     (CUB blocks):    GPU: 3474.096 us   +/- 7.337 (min: 3459.936 / max: 3495.424) us

testing argmax with axis =  (2,) ...
argmax     (old kernel):    GPU: 4473.267 us   +/-29.312 (min: 4310.208 / max: 4507.296) us
argmax     (CUB device):    GPU: 4477.163 us   +/-25.553 (min: 4312.800 / max: 4507.744) us
argmax     (CUB blocks):    GPU: 3023.725 us   +/-18.631 (min: 2858.144 / max: 3046.112) us

testing argmax with axis =  (1, 2) ...
argmax     (old kernel):    GPU: 1562.218 us   +/-32.919 (min: 1400.192 / max: 1611.584) us
argmax     (CUB device):    GPU: 1569.789 us   +/-40.921 (min: 1402.784 / max: 1608.992) us
argmax     (CUB blocks):    GPU: 4000.629 us   +/-26.367 (min: 3835.328 / max: 4026.048) us

testing argmax with axis =  (0, 1, 2) ...
argmax     (old kernel):    GPU:88749.531 us   +/-31.298 (min:88575.394 / max:88814.140) us
argmax     (CUB device):    GPU: 1527.943 us   +/-21.475 (min: 1386.144 / max: 1559.424) us
argmax     (CUB blocks):    GPU: 3495.164 us   +/- 4.540 (min: 3487.296 / max: 3511.808) us

testing amin with axis =  (2,) ...
amin       (old kernel):    GPU: 4158.543 us   +/-38.735 (min: 3998.624 / max: 4192.576) us
amin       (CUB device):        (CUB device-wide reduction not available)
amin       (CUB blocks):    GPU: 2330.791 us   +/-16.167 (min: 2172.736 / max: 2341.696) us

testing amin with axis =  (1, 2) ...
amin       (old kernel):    GPU: 1240.774 us   +/-11.165 (min: 1133.600 / max: 1249.952) us
amin       (CUB device):        (CUB device-wide reduction not available)
amin       (CUB blocks):    GPU: 3121.667 us   +/-33.249 (min: 2961.120 / max: 3164.640) us

testing amin with axis =  (0, 1, 2) ...
amin       (old kernel):    GPU:66612.955 us   +/-20.001 (min:66442.848 / max:66666.557) us
amin       (CUB device):        (CUB device-wide reduction not available)
amin       (CUB blocks):    GPU: 2798.982 us   +/- 4.875 (min: 2760.896 / max: 2812.384) us

testing amax with axis =  (2,) ...
amax       (old kernel):    GPU: 4153.098 us   +/-38.910 (min: 3998.016 / max: 4178.048) us
amax       (CUB device):        (CUB device-wide reduction not available)
amax       (CUB blocks):    GPU: 2327.452 us   +/-16.746 (min: 2163.904 / max: 2341.216) us

testing amax with axis =  (1, 2) ...
amax       (old kernel):    GPU: 1239.937 us   +/-16.702 (min: 1077.504 / max: 1247.616) us
amax       (CUB device):        (CUB device-wide reduction not available)
amax       (CUB blocks):    GPU: 3122.654 us   +/-18.702 (min: 2962.080 / max: 3141.216) us

testing amax with axis =  (0, 1, 2) ...
amax       (old kernel):    GPU:66607.313 us   +/-29.667 (min:66440.926 / max:66660.637) us
amax       (CUB device):        (CUB device-wide reduction not available)
amax       (CUB blocks):    GPU: 2799.492 us   +/- 6.595 (min: 2760.896 / max: 2813.920) us

testing nanmin with axis =  (2,) ...
nanmin     (old kernel):    GPU: 3748.271 us   +/-53.535 (min: 3613.216 / max: 3859.488) us
nanmin     (CUB device):        (CUB device-wide reduction not available)
nanmin     (CUB blocks):    GPU: 1126.258 us   +/-27.870 (min:  965.664 / max: 1138.272) us

testing nanmin with axis =  (1, 2) ...
nanmin     (old kernel):    GPU: 1120.401 us   +/-28.170 (min:  959.648 / max: 1134.272) us
nanmin     (CUB device):        (CUB device-wide reduction not available)
nanmin     (CUB blocks):    GPU: 1415.120 us   +/-31.366 (min: 1260.992 / max: 1431.392) us

testing nanmin with axis =  (0, 1, 2) ...
nanmin     (old kernel):    GPU:44062.278 us   +/-35.477 (min:43886.337 / max:44105.282) us
nanmin     (CUB device):        (CUB device-wide reduction not available)
nanmin     (CUB blocks):    GPU: 1334.702 us   +/-10.854 (min: 1244.608 / max: 1351.104) us

testing nanmax with axis =  (2,) ...
nanmax     (old kernel):    GPU: 3744.612 us   +/-44.138 (min: 3632.800 / max: 3846.304) us
nanmax     (CUB device):        (CUB device-wide reduction not available)
nanmax     (CUB blocks):    GPU: 1128.282 us   +/-23.219 (min:  966.560 / max: 1150.560) us

testing nanmax with axis =  (1, 2) ...
nanmax     (old kernel):    GPU: 1115.195 us   +/-41.428 (min:  957.184 / max: 1139.648) us
nanmax     (CUB device):        (CUB device-wide reduction not available)
nanmax     (CUB blocks):    GPU: 1421.446 us   +/-18.968 (min: 1260.480 / max: 1475.328) us

testing nanmax with axis =  (0, 1, 2) ...
nanmax     (old kernel):    GPU:44058.180 us   +/-41.610 (min:43879.841 / max:44128.609) us
nanmax     (CUB device):        (CUB device-wide reduction not available)
nanmax     (CUB blocks):    GPU: 1325.475 us   +/-30.487 (min: 1235.840 / max: 1356.640) us

testing nanargmin with axis =  (2,) ...
nanargmin  (old kernel):    GPU: 4764.826 us   +/-32.746 (min: 4604.000 / max: 4803.488) us
nanargmin  (CUB device):        (CUB device-wide reduction not available)
nanargmin  (CUB blocks):    GPU: 3365.695 us   +/-33.226 (min: 3203.872 / max: 3419.648) us

testing nanargmin with axis =  (1, 2) ...
nanargmin  (old kernel):    GPU: 1920.273 us   +/-23.212 (min: 1758.720 / max: 1936.128) us
nanargmin  (CUB device):        (CUB device-wide reduction not available)
nanargmin  (CUB blocks):    GPU: 4288.908 us   +/-27.625 (min: 4127.520 / max: 4306.304) us

testing nanargmin with axis =  (0, 1, 2) ...
nanargmin  (old kernel):    GPU:110877.606 us   +/-23.823 (min:110721.474 / max:110942.268) us
nanargmin  (CUB device):        (CUB device-wide reduction not available)
nanargmin  (CUB blocks):    GPU: 3784.081 us   +/- 4.173 (min: 3775.072 / max: 3804.800) us

testing nanargmax with axis =  (2,) ...
nanargmax  (old kernel):    GPU: 4781.100 us   +/-43.501 (min: 4605.216 / max: 4820.672) us
nanargmax  (CUB device):        (CUB device-wide reduction not available)
nanargmax  (CUB blocks):    GPU: 3389.230 us   +/-23.426 (min: 3228.288 / max: 3422.912) us

testing nanargmax with axis =  (1, 2) ...
nanargmax  (old kernel):    GPU: 1932.439 us   +/-19.733 (min: 1774.432 / max: 1941.728) us
nanargmax  (CUB device):        (CUB device-wide reduction not available)
nanargmax  (CUB blocks):    GPU: 4313.878 us   +/-35.263 (min: 4156.384 / max: 4336.704) us

testing nanargmax with axis =  (0, 1, 2) ...
nanargmax  (old kernel):    GPU:111807.415 us   +/-25.919 (min:111639.137 / max:111855.804) us
nanargmax  (CUB device):        (CUB device-wide reduction not available)
nanargmax  (CUB blocks):    GPU: 3808.783 us   +/- 3.996 (min: 3799.424 / max: 3821.568) us

testing mean with axis =  (2,) ...
mean       (old kernel):    GPU: 4608.910 us   +/-23.902 (min: 4448.192 / max: 4629.056) us
mean       (CUB device):        (CUB device-wide reduction not available)
mean       (CUB blocks):    GPU: 1297.813 us   +/-27.920 (min: 1138.912 / max: 1319.680) us

testing mean with axis =  (1, 2) ...
mean       (old kernel):    GPU: 1099.979 us   +/-22.353 (min:  941.952 / max: 1110.656) us
mean       (CUB device):        (CUB device-wide reduction not available)
mean       (CUB blocks):    GPU: 1151.320 us   +/-11.933 (min: 1036.576 / max: 1167.872) us

testing mean with axis =  (0, 1, 2) ...
mean       (old kernel):    GPU:39350.566 us   +/-23.792 (min:39193.569 / max:39391.457) us
mean       (CUB device):        (CUB device-wide reduction not available)
mean       (CUB blocks):    GPU: 1199.286 us   +/-19.951 (min: 1077.056 / max: 1218.720) us

testing nanmean with axis =  (2,) ...
nanmean    (old kernel):    GPU: 3170.737 us   +/-21.619 (min: 3008.896 / max: 3204.000) us
nanmean    (CUB device):        (CUB device-wide reduction not available)
nanmean    (CUB blocks):    GPU: 1530.691 us   +/-37.290 (min: 1380.032 / max: 1548.960) us

testing nanmean with axis =  (1, 2) ...
nanmean    (old kernel):    GPU: 1105.564 us   +/-35.401 (min:  948.768 / max: 1127.968) us
nanmean    (CUB device):        (CUB device-wide reduction not available)
nanmean    (CUB blocks):    GPU: 1490.137 us   +/-16.748 (min: 1326.528 / max: 1507.136) us

testing nanmean with axis =  (0, 1, 2) ...
nanmean    (old kernel):    GPU:48621.619 us   +/-33.912 (min:48461.346 / max:48662.529) us
nanmean    (CUB device):        (CUB device-wide reduction not available)
nanmean    (CUB blocks):    GPU: 1518.977 us   +/-21.871 (min: 1395.968 / max: 1544.160) us

testing var with axis =  (2,) ...
var        (old kernel):    GPU: 6004.076 us   +/-69.531 (min: 5687.840 / max: 6040.224) us
var        (CUB device):        (CUB device-wide reduction not available)
var        (CUB blocks):    GPU: 2748.602 us   +/-60.475 (min: 2411.712 / max: 2832.192) us

testing var with axis =  (1, 2) ...
var        (old kernel):    GPU: 9496.720 us   +/-39.205 (min: 9334.592 / max: 9566.240) us
var        (CUB device):        (CUB device-wide reduction not available)
var        (CUB blocks):    GPU: 9543.002 us   +/-43.328 (min: 9379.264 / max: 9595.968) us

testing var with axis =  (0, 1, 2) ...
var        (old kernel):    GPU:1277100.681 us   +/-248.080 (min:1276486.694 / max:1277801.025) us
var        (CUB device):        (CUB device-wide reduction not available)
var        (CUB blocks):    GPU:46306.821 us   +/-39.049 (min:46137.119 / max:46408.001) us

testing nanvar with axis =  (2,) ...
nanvar     (old kernel):    GPU:12259.616 us   +/-31.414 (min:12096.192 / max:12284.320) us
nanvar     (CUB device):        (CUB device-wide reduction not available)
nanvar     (CUB blocks):    GPU: 8146.314 us   +/-23.980 (min: 7981.248 / max: 8182.304) us

testing nanvar with axis =  (1, 2) ...
nanvar     (old kernel):    GPU:49811.653 us   +/-144.943 (min:49449.310 / max:50002.239) us
nanvar     (CUB device):        (CUB device-wide reduction not available)
nanvar     (CUB blocks):    GPU:50196.915 us   +/-145.196 (min:49802.654 / max:50374.561) us

testing nanvar with axis =  (0, 1, 2) ...
nanvar     (old kernel):    GPU:3455061.255 us   +/-166.683 (min:3453450.195 / max:3455130.615) us
nanvar     (CUB device):        (CUB device-wide reduction not available)
nanvar     (CUB blocks):    GPU:384908.101 us   +/-497.698 (min:381440.948 / max:385063.110) us

testing nansum with axis =  (2,) ...
nansum     (old kernel):    GPU: 4583.659 us   +/-32.734 (min: 4422.624 / max: 4637.920) us
nansum     (CUB device):        (CUB device-wide reduction not available)
nansum     (CUB blocks):    GPU: 1493.893 us   +/-19.721 (min: 1333.056 / max: 1509.568) us

testing nansum with axis =  (1, 2) ...
nansum     (old kernel):    GPU: 1109.703 us   +/-22.828 (min:  950.496 / max: 1132.064) us
nansum     (CUB device):        (CUB device-wide reduction not available)
nansum     (CUB blocks):    GPU: 1471.623 us   +/-17.535 (min: 1309.056 / max: 1526.176) us

testing nansum with axis =  (0, 1, 2) ...
nansum     (old kernel):    GPU:45635.562 us   +/-28.171 (min:45462.914 / max:45686.687) us
nansum     (CUB device):        (CUB device-wide reduction not available)
nansum     (CUB blocks):    GPU: 1498.623 us   +/-18.228 (min: 1372.832 / max: 1516.384) us

testing nanprod with axis =  (2,) ...
nanprod    (old kernel):    GPU: 4546.808 us   +/-28.821 (min: 4381.056 / max: 4594.432) us
nanprod    (CUB device):        (CUB device-wide reduction not available)
nanprod    (CUB blocks):    GPU: 1495.981 us   +/-16.283 (min: 1340.576 / max: 1527.840) us

testing nanprod with axis =  (1, 2) ...
nanprod    (old kernel):    GPU: 1100.172 us   +/-45.119 (min:  947.264 / max: 1179.648) us
nanprod    (CUB device):        (CUB device-wide reduction not available)
nanprod    (CUB blocks):    GPU: 1477.601 us   +/-28.550 (min: 1315.200 / max: 1511.072) us

testing nanprod with axis =  (0, 1, 2) ...
nanprod    (old kernel):    GPU:45632.190 us   +/-30.316 (min:45465.408 / max:45693.760) us
nanprod    (CUB device):        (CUB device-wide reduction not available)
nanprod    (CUB blocks):    GPU: 1496.186 us   +/-22.149 (min: 1370.848 / max: 1511.776) us

testing all with axis =  (2,) ...
all        (old kernel):    GPU: 2050.136 us   +/- 9.295 (min: 2035.232 / max: 2063.520) us
all        (CUB device):        (CUB device-wide reduction not available)
all        (CUB blocks):    GPU: 1095.139 us   +/-28.510 (min:  934.304 / max: 1108.512) us

testing all with axis =  (1, 2) ...
all        (old kernel):    GPU: 1098.593 us   +/-21.025 (min:  938.016 / max: 1108.864) us
all        (CUB device):        (CUB device-wide reduction not available)
all        (CUB blocks):    GPU: 1085.781 us   +/-54.378 (min:  939.424 / max: 1133.504) us

testing all with axis =  (0, 1, 2) ...
all        (old kernel):    GPU:42157.684 us   +/-42.135 (min:41983.265 / max:42210.655) us
all        (CUB device):        (CUB device-wide reduction not available)
all        (CUB blocks):    GPU: 1104.489 us   +/-19.050 (min:  948.352 / max: 1146.080) us

testing any with axis =  (2,) ...
any        (old kernel):    GPU: 2059.591 us   +/-19.744 (min: 1916.896 / max: 2086.016) us
any        (CUB device):        (CUB device-wide reduction not available)
any        (CUB blocks):    GPU: 1092.373 us   +/-36.118 (min:  934.976 / max: 1103.616) us

testing any with axis =  (1, 2) ...
any        (old kernel):    GPU: 1100.307 us   +/-16.240 (min:  940.064 / max: 1105.600) us
any        (CUB device):        (CUB device-wide reduction not available)
any        (CUB blocks):    GPU: 1103.470 us   +/-30.003 (min:  944.768 / max: 1122.496) us

testing any with axis =  (0, 1, 2) ...
any        (old kernel):    GPU:42582.782 us   +/-35.315 (min:42407.135 / max:42617.569) us
any        (CUB device):        (CUB device-wide reduction not available)
any        (CUB blocks):    GPU: 1099.377 us   +/-21.385 (min:  950.144 / max: 1117.376) us

testing count_nonzero with axis =  (2,) ...
count_nonzero (old kernel):    GPU: 2139.850 us   +/-10.695 (min: 2125.664 / max: 2181.952) us
count_nonzero (CUB device):        (CUB device-wide reduction not available)
count_nonzero (CUB blocks):    GPU: 1098.308 us   +/-24.435 (min:  939.808 / max: 1112.288) us

testing count_nonzero with axis =  (1, 2) ...
count_nonzero (old kernel):    GPU: 1097.295 us   +/-28.293 (min:  936.000 / max: 1110.208) us
count_nonzero (CUB device):        (CUB device-wide reduction not available)
count_nonzero (CUB blocks):    GPU: 1104.680 us   +/-17.343 (min:  943.072 / max: 1132.128) us

testing count_nonzero with axis =  (0, 1, 2) ...
count_nonzero (old kernel):    GPU:43069.612 us   +/-28.566 (min:42907.009 / max:43110.111) us
count_nonzero (CUB device):        (CUB device-wide reduction not available)
count_nonzero (CUB blocks):    GPU: 1101.698 us   +/-15.388 (min:  958.176 / max: 1151.520) us
