@shunting314
Created April 10, 2023 20:23
/scratch/shunting/miniconda3/envs/pytorch/lib/python3.10/site-packages/fbgemm_gpu/fbgemm_gpu_py.so: undefined symbol: _ZNK5torch8autograd4Node4nameEv
[2023-04-10 20:22:17,671] torch._inductor.utils: [WARNING] make_fallback(aten.cumprod): a decomposition exists, we should switch to it
/scratch/shunting/pytorch/torch/cuda/__init__.py:333: UserWarning: Failed to initialize NumPy: module compiled against API version 0x10 but this version of numpy is 0xe (Triggered internally at /scratch/shunting/pytorch/torch/csrc/utils/tensor_numpy.cpp:84.)
self.prev_idx = torch.cuda._exchange_device(self.idx)
0.412024
STAGE:2023-04-10 20:23:05 1876117:1876117 ActivityProfilerController.cpp:311] Completed Stage: Warm Up
0.441794
STAGE:2023-04-10 20:23:10 1876117:1876117 ActivityProfilerController.cpp:317] Completed Stage: Collection
STAGE:2023-04-10 20:23:10 1876117:1876117 ActivityProfilerController.cpp:321] Completed Stage: Post Processing
Profiling result for a compiled module of benchmark mixnet_l:
Chrome trace for the profile is written to /tmp/compiled_module_profile.json
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  -------------  ------------  --------------------------------------------------------------------------------
Name                                                       Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  Input Shapes
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  -------------  ------------  --------------------------------------------------------------------------------
void at::native::(anonymous namespace)::conv_depthwi...         0.00%       0.000us         0.00%       0.000us       0.000us        1.146s        27.04%        1.146s      409.361us          2800  []
aten::_conv_depthwise2d                                         0.03%     746.000us         0.06%       1.771ms      17.710us     223.289ms         5.27%     223.407ms        2.234ms           100  [[128, 64, 112, 112], [64, 1, 7, 7], [], [], [], [], []]
void at::native::elementwise_kernel<128, 4, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us     213.651ms         5.04%     213.651ms       79.130us          2700  []
aten::_conv_depthwise2d                                         0.03%     846.000us         0.07%       1.883ms      18.830us     151.362ms         3.57%     151.710ms        1.517ms           100  [[128, 64, 112, 112], [64, 1, 5, 5], [], [], [], [], []]
aten::copy_                                                     0.09%       2.604ms         0.15%       4.378ms      14.593us     128.437ms         3.03%     128.931ms      429.770us           300  [[128, 64, 112, 112], [128, 64, 112, 112], []]
aten::_conv_depthwise2d                                         0.08%       2.263ms         0.20%       5.626ms      18.753us     128.003ms         3.02%     128.405ms      428.017us           300  [[128, 156, 14, 14], [156, 1, 9, 9], [], [], [], [], []]
triton_poi_fused__native_batch_norm_legit_functional...         0.00%       0.000us         0.00%       0.000us       0.000us     116.193ms         2.74%     116.193ms        1.162ms           100  []
void cudnn::ops::nhwcToNchwKernel<__half, __half, fl...         0.00%       0.000us         0.00%       0.000us       0.000us     114.959ms         2.71%     114.959ms       21.690us          5300  []
void cudnn::ops::nchwToNhwcKernel<__half, __half, fl...         0.00%       0.000us         0.00%       0.000us       0.000us     111.197ms         2.62%     111.197ms       19.857us          5600  []
aten::_conv_depthwise2d                                         0.08%       2.426ms         0.20%       5.765ms      19.217us      99.583ms         2.35%      99.621ms      332.070us           300  [[128, 120, 14, 14], [120, 1, 9, 9], [], [], [], [], []]
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  -------------  ------------  --------------------------------------------------------------------------------
Self CPU time total: 2.881s
Self CUDA time total: 4.239s
== triton_pointwise category kernels ==
Kernel Self CUDA TIME (ms) Count Percent
------------------------------------------------------------------------------------------------------------- --------------------- ------- ---------
triton_poi_fused__native_batch_norm_legit_functional_relu_threshold_backward_18_0d1d2d3d4d5d6d7d 1.16193 1.0 2.82%
triton_poi_fused_cat_13_0d1d2d 0.90115 2.0 2.19%
triton_poi_fused__native_batch_norm_legit_functional_add_clone_fill_mul_sigmoid_sub_76_0d1d2d3d4d5d6d7d 0.62698 4.0 1.52%
triton_poi_fused_cat_74_0d1d2d 0.60706 12.0 1.47%
triton_poi_fused__native_batch_norm_legit_functional_add_clone_fill_mul_sigmoid_sub_48_0d1d2d3d4d5d6d7d 0.44994 1.0 1.09%
triton_poi_fused__native_batch_norm_legit_functional_relu_6_0d1d2d3d4d5d6d 0.30071 2.0 0.73%
triton_poi_fused_mul_sigmoid_silu_88_0d1d2d3d 0.30064 3.0 0.73%
triton_poi_fused__native_batch_norm_legit_functional_relu_threshold_backward_27_0d1d2d3d4d5d6d7d 0.29111 1.0 0.71%
triton_poi_fused_cat_213_0d12d 0.25414 9.0 0.62%
triton_poi_fused_add_233_0d1d2 0.2328 58.0 0.57%
triton_poi_fused_cat_22_0d1d2d 0.22745 3.0 0.55%
triton_poi_fused__native_batch_norm_legit_functional_add_9_0d1d2d3d4d5d6d7d 0.22214 1.0 0.54%
triton_poi_fused__native_batch_norm_legit_functional_add_clone_fill_mul_sigmoid_sub_118_0d1d2d3d4d5d6d7d 0.21558 3.0 0.52%
triton_poi_fused_mul_sigmoid_silu_135_0d1d2d3d 0.19592 4.0 0.48%
triton_poi_fused__native_batch_norm_legit_functional_relu_threshold_backward_43_0d1d2d3d4d5d6d7d 0.18113 1.0 0.44%
triton_poi_fused_cat_127_0d1d2d 0.1781 12.0 0.43%
triton_poi_fused_split_with_sizes_78_0d1d2d 0.16518 3.0 0.40%
triton_poi_fused__native_batch_norm_legit_functional_add_clone_fill_mul_sigmoid_sub_157_0d1d2d3d4d5d6d7d 0.16508 3.0 0.40%
triton_poi_fused_split_with_sizes_90_0d1d2d 0.16024 3.0 0.39%
triton_poi_fused_split_with_sizes_89_0d1d2d 0.15689 3.0 0.38%
triton_poi_fused_cat_116_0d1d2d 0.15342 6.0 0.37%
triton_poi_fused_split_with_sizes_80_0d1d2d 0.15215 3.0 0.37%
triton_poi_fused_cat_36_0d1d2d 0.14546 2.0 0.35%
triton_poi_fused__native_batch_norm_legit_functional_relu_41_0d1d2d3d4d5d6d 0.14524 1.0 0.35%
triton_poi_fused_cat_165_0d1d2d 0.14513 12.0 0.35%
triton_poi_fused__native_batch_norm_legit_functional_add_clone_fill_mul_sigmoid_sub_203_0d1d2d3d4d5d6d7d 0.13141 3.0 0.32%
triton_poi_fused_cat_155_0d1d2d 0.12154 6.0 0.29%
triton_poi_fused_mul_sigmoid_silu_173_0d1d2d3d 0.11606 3.0 0.28%
triton_poi_fused__native_batch_norm_legit_functional_add_clone_fill_mul_sigmoid_sub_181_0d1d2d3d4d5d6d7d 0.11109 1.0 0.27%
triton_poi_fused_mul_sigmoid_silu_221_0d1d2d3d 0.10429 3.0 0.25%
triton_poi_fused_cat_29_0d1d2d 0.09194 4.0 0.22%
triton_poi_fused__to_copy_1_0d1d2d 0.08923 1.0 0.22%
triton_poi_fused_split_with_sizes_137_0d1d2d 0.08018 3.0 0.19%
triton_poi_fused_split_with_sizes_11_0d1d2d 0.08014 1.0 0.19%
triton_poi_fused_cat_57_0d1d2d 0.07731 4.0 0.19%
triton_poi_fused_split_with_sizes_10_0d1d2d 0.07567 1.0 0.18%
triton_poi_fused_split_with_sizes_50_0d1d2d 0.07543 1.0 0.18%
triton_poi_fused__native_batch_norm_legit_functional_add_93_0d1d2d3d4d5d6d7d 0.07448 3.0 0.18%
triton_poi_fused_mul_sigmoid_silu_66_0d1d2d3d 0.07291 1.0 0.18%
triton_poi_fused__native_batch_norm_legit_functional_add_clone_fill_mul_sigmoid_silu_sub_142_0d1d2d3d4d5d6d7d 0.07202 1.0 0.17%
triton_poi_fused__native_batch_norm_legit_functional_add_45_0d1d2d3d4d5d6d7d 0.06743 1.0 0.16%
triton_poi_fused_split_with_sizes_54_0d1d2d 0.06524 1.0 0.16%
triton_poi_fused_split_with_sizes_175_0d1d2d 0.06522 3.0 0.16%
triton_poi_fused_split_with_sizes_56_0d1d2d 0.06507 1.0 0.16%
triton_poi_fused_split_with_sizes_52_0d1d2d 0.06506 1.0 0.16%
triton_poi_fused_split_with_sizes_136_0d1d2d 0.06332 3.0 0.15%
triton_poi_fused__native_batch_norm_legit_functional_add_178_0d1d2d3d4d5d6d7d 0.05623 3.0 0.14%
triton_poi_fused_split_with_sizes_223_0d1d2d 0.0539 3.0 0.13%
triton_poi_fused_cat_92_0d1d2d 0.04831 6.0 0.12%
triton_poi_fused__native_batch_norm_legit_functional_34_0d1d2d3d4d5d6d 0.04776 1.0 0.12%
triton_poi_fused_split_with_sizes_122_0d1d2d 0.04568 3.0 0.11%
triton_poi_fused_split_with_sizes_124_0d1d2d 0.0455 3.0 0.11%
triton_poi_fused_split_with_sizes_126_0d1d2d 0.04502 3.0 0.11%
triton_poi_fused_split_with_sizes_120_0d1d2d 0.04409 3.0 0.11%
triton_poi_fused_split_with_sizes_174_0d1d2d 0.04396 3.0 0.11%
triton_poi_fused_cat_177_0d1d2d 0.04199 6.0 0.10%
triton_poi_fused_split_with_sizes_164_0d1d2d 0.03984 3.0 0.10%
triton_poi_fused_split_with_sizes_162_0d1d2d 0.03962 3.0 0.10%
triton_poi_fused__native_batch_norm_legit_functional_add_140_0d1d2d3d4d5d6d7d 0.03946 3.0 0.10%
triton_poi_fused_split_with_sizes_96_0d1d2d 0.0388 1.0 0.09%
triton_poi_fused_cat_139_0d1d2d 0.03767 6.0 0.09%
triton_poi_fused_split_with_sizes_160_0d1d2d 0.0376 3.0 0.09%
triton_poi_fused_split_with_sizes_158_0d1d2d 0.03599 3.0 0.09%
triton_poi_fused_cat_226_0d12d 0.03569 3.0 0.09%
triton_poi_fused_split_with_sizes_209_0d1d2d 0.03517 3.0 0.09%
triton_poi_fused_split_with_sizes_211_0d1d2d 0.03483 3.0 0.08%
triton_poi_fused__native_batch_norm_legit_functional_add_227_0d1d2d3d4d5d6d7d 0.03306 3.0 0.08%
triton_poi_fused_split_with_sizes_98_0d1d2d 0.03257 1.0 0.08%
triton_poi_fused_split_with_sizes_100_0d1d2d 0.03254 1.0 0.08%
triton_poi_fused_split_with_sizes_205_0d1d2d 0.0322 3.0 0.08%
triton_poi_fused_split_with_sizes_222_0d1d2d 0.03217 3.0 0.08%
triton_poi_fused_cat_212_0d1d2d 0.0317 3.0 0.08%
triton_poi_fused_split_with_sizes_207_0d1d2d 0.03111 3.0 0.08%
triton_poi_fused_cat_101_0d1d2d 0.03087 3.0 0.07%
triton_poi_fused__to_copy_224_0d1d2d 0.02979 6.0 0.07%
triton_poi_fused_cat_190_0d1d2d 0.02917 4.0 0.07%
triton_poi_fused_mul_sigmoid_silu_108_0d1d2d3d 0.02565 1.0 0.06%
triton_poi_fused__to_copy_115_0d1d2d 0.0255 6.0 0.06%
triton_poi_fused__to_copy_138_0d1d2d 0.02484 6.0 0.06%
triton_poi_fused__to_copy_91_0d1d2d 0.02478 6.0 0.06%
triton_poi_fused__to_copy_73_0d1d2d 0.02456 6.0 0.06%
triton_poi_fused__to_copy_176_0d1d2d 0.02452 6.0 0.06%
triton_poi_fused__to_copy_154_0d1d2d 0.02424 6.0 0.06%
triton_poi_fused_split_with_sizes_183_0d1d2d 0.02185 1.0 0.05%
triton_poi_fused_split_with_sizes_185_0d1d2d 0.02124 1.0 0.05%
triton_poi_fused_split_with_sizes_189_0d1d2d 0.02081 1.0 0.05%
triton_poi_fused_split_with_sizes_187_0d1d2d 0.02066 1.0 0.05%
triton_poi_fused_mul_sigmoid_silu_197_0d1d2d3d 0.019 1.0 0.05%
triton_poi_fused__native_batch_norm_legit_functional_72_0d1d2d3d4d5d6d 0.01898 1.0 0.05%
triton_poi_fused__to_copy_201_0d1d2d 0.01804 3.0 0.04%
triton_poi_fused_cat_225_0d1d2d 0.018 3.0 0.04%
triton_poi_fused__to_copy_convolution_168_0d1d2d 0.01631 4.0 0.04%
triton_poi_fused__to_copy_42_0d1d2 0.01629 4.0 0.04%
triton_poi_fused__to_copy_215_0d1d2d 0.01625 3.0 0.04%
triton_poi_fused__to_copy_convolution_133_0d1d2d 0.01607 4.0 0.04%
triton_poi_fused__to_copy_convolution_134_0d1d2d 0.01605 4.0 0.04%
triton_poi_fused__to_copy_convolution_87_0d1d2d 0.01602 4.0 0.04%
triton_poi_fused__to_copy_convolution_silu_169_0d1d2d3d 0.01602 4.0 0.04%
triton_poi_fused__to_copy_convolution_86_0d1d2d 0.016 4.0 0.04%
triton_poi_fused__to_copy_167_0d1d2d 0.01511 3.0 0.04%
triton_poi_fused__to_copy_218_0d1d2d 0.0151 3.0 0.04%
triton_poi_fused__native_batch_norm_legit_functional_153_0d1d2d3d4d5d6d 0.01416 1.0 0.03%
triton_poi_fused__to_copy_129_0d1d2d 0.01381 3.0 0.03%
triton_poi_fused__to_copy_82_0d1d2d 0.01316 3.0 0.03%
triton_poi_fused__to_copy_convolution_130_0d1d2 0.01259 3.0 0.03%
triton_poi_fused__to_copy_79_0d1d2 0.01253 3.0 0.03%
triton_poi_fused__to_copy_204_0d1d2 0.01251 3.0 0.03%
triton_poi_fused__to_copy_77_0d1d2 0.01238 3.0 0.03%
triton_poi_fused__to_copy_170_0d1d2d 0.01225 3.0 0.03%
triton_poi_fused__to_copy_159_0d1d2 0.01221 3.0 0.03%
triton_poi_fused__to_copy_206_0d1d2 0.01221 3.0 0.03%
triton_poi_fused__to_copy_208_0d1d2 0.01221 3.0 0.03%
triton_poi_fused__to_copy_convolution_216_0d1d2 0.0122 3.0 0.03%
triton_poi_fused__to_copy_convolution_silu_131_0d1d2d3d 0.01219 3.0 0.03%
triton_poi_fused__to_copy_210_0d1d2 0.01219 3.0 0.03%
triton_poi_fused__to_copy_119_0d1d2 0.01217 3.0 0.03%
triton_poi_fused__to_copy_163_0d1d2 0.01215 3.0 0.03%
triton_poi_fused__to_copy_convolution_83_0d1d2 0.01212 3.0 0.03%
triton_poi_fused__to_copy_convolution_220_0d1d2d 0.01211 3.0 0.03%
triton_poi_fused__to_copy_convolution_219_0d1d2d 0.0121 3.0 0.03%
triton_poi_fused__to_copy_125_0d1d2 0.01209 3.0 0.03%
triton_poi_fused__to_copy_132_0d1d2d 0.01207 3.0 0.03%
triton_poi_fused__to_copy_convolution_171_0d1d2d 0.01203 3.0 0.03%
triton_poi_fused__to_copy_85_0d1d2d 0.01202 3.0 0.03%
triton_poi_fused__to_copy_161_0d1d2 0.01202 3.0 0.03%
triton_poi_fused__to_copy_convolution_172_0d1d2d 0.01201 3.0 0.03%
triton_poi_fused__to_copy_convolution_silu_84_0d1d2d3d 0.012 3.0 0.03%
triton_poi_fused__to_copy_121_0d1d2 0.012 3.0 0.03%
triton_poi_fused__to_copy_123_0d1d2 0.012 3.0 0.03%
triton_poi_fused__to_copy_convolution_silu_217_0d1d2d3d 0.012 3.0 0.03%
triton_poi_fused__to_copy_t_231_0d1d2d3d 0.01196 1.0 0.03%
triton_poi_fused__native_batch_norm_legit_functional_114_0d1d2d3d4d5d6d 0.011 1.0 0.03%
triton_poi_fused__native_batch_norm_legit_functional_200_0d1d2d3d4d5d6d 0.009 1.0 0.02%
triton_poi_fused__to_copy_28_0d1d2d 0.00841 2.0 0.02%
triton_poi_fused__to_copy_12_0d1d2d 0.00817 2.0 0.02%
triton_poi_fused__to_copy_35_0d1d2d 0.00815 2.0 0.02%
triton_poi_fused__to_copy_44_0d1d2d 0.00808 2.0 0.02%
triton_poi_fused__to_copy_228_0d1d2d 0.006 1.0 0.01%
triton_poi_fused__to_copy_198_0d1d2d 0.00526 1.0 0.01%
triton_poi_fused__to_copy_148_0d1d2d 0.00509 1.0 0.01%
triton_poi_fused__to_copy_19_0d1d2d 0.00508 1.0 0.01%
triton_poi_fused__to_copy_49_0d1d2 0.00506 1.0 0.01%
triton_poi_fused__to_copy_144_0d1d2d 0.00504 1.0 0.01%
triton_poi_fused__to_copy_179_0d1d2d 0.005 1.0 0.01%
triton_poi_fused__to_copy_95_0d1d2d 0.00496 1.0 0.01%
triton_poi_fused__to_copy_109_0d1d2d 0.00463 1.0 0.01%
triton_poi_fused__to_copy_67_0d1d2d 0.00442 1.0 0.01%
triton_poi_fused__to_copy_143_0d1d2d 0.0044 1.0 0.01%
triton_poi_fused__to_copy_8_0d1d2d 0.0043 1.0 0.01%
triton_poi_fused__to_copy_46_0d1d2d 0.0043 1.0 0.01%
triton_poi_fused__to_copy_193_0d1d2d 0.00429 1.0 0.01%
triton_poi_fused__to_copy_convolution_105_0d1d2 0.00428 1.0 0.01%
triton_poi_fused__to_copy_convolution_61_0d1d2 0.00427 1.0 0.01%
triton_poi_fused__to_copy_7_0d1d2d 0.00426 1.0 0.01%
triton_poi_fused__to_copy_194_0d1d2d 0.00426 1.0 0.01%
triton_poi_fused__to_copy_94_0d1d2d 0.00424 1.0 0.01%
triton_poi_fused__to_copy_60_0d1d2d 0.00421 1.0 0.01%
triton_poi_fused__to_copy_141_0d1d2d 0.0042 1.0 0.01%
triton_poi_fused__to_copy_53_0d1d2 0.00412 1.0 0.01%
triton_poi_fused__to_copy_55_0d1d2 0.0041 1.0 0.01%
triton_poi_fused__to_copy_97_0d1d2d 0.0041 1.0 0.01%
triton_poi_fused__to_copy_104_0d1d2d 0.0041 1.0 0.01%
triton_poi_fused__to_copy_147_0d1d2d 0.0041 1.0 0.01%
triton_poi_fused__to_copy_182_0d1d2d 0.0041 1.0 0.01%
triton_poi_fused__to_copy_20_0d1d2d 0.00408 1.0 0.01%
triton_poi_fused__to_copy_convolution_196_0d1d2d 0.00408 1.0 0.01%
triton_poi_fused__to_copy_convolution_silu_62_0d1d2d3d 0.00405 1.0 0.01%
triton_poi_fused__to_copy_convolution_145_0d1d2 0.00404 1.0 0.01%
triton_poi_fused__to_copy_21_0d1d2d 0.00402 1.0 0.01%
triton_poi_fused__to_copy_51_0d1d2 0.00402 1.0 0.01%
triton_poi_fused__to_copy_99_0d1d2d 0.00402 1.0 0.01%
triton_poi_fused__to_copy_186_0d1d2d 0.00402 1.0 0.01%
triton_poi_fused__to_copy_0_0d1d2d 0.00401 1.0 0.01%
triton_poi_fused__to_copy_188_0d1d2d 0.00401 1.0 0.01%
triton_poi_fused__to_copy_232_0d1d2 0.00401 1.0 0.01%
triton_poi_fused__to_copy_63_0d1d2d 0.004 1.0 0.01%
triton_poi_fused__to_copy_convolution_64_0d1d2d 0.004 1.0 0.01%
triton_poi_fused__to_copy_convolution_65_0d1d2d 0.004 1.0 0.01%
triton_poi_fused__to_copy_convolution_silu_106_0d1d2d3d 0.004 1.0 0.01%
triton_poi_fused__to_copy_107_0d1d2d 0.004 1.0 0.01%
triton_poi_fused__to_copy_convolution_silu_146_0d1d2d3d 0.004 1.0 0.01%
triton_poi_fused__to_copy_184_0d1d2d 0.004 1.0 0.01%
triton_poi_fused__to_copy_convolution_195_0d1d2d 0.004 1.0 0.01%
Total 11.7756 28.58%
== triton_reduction category kernels ==
Kernel Self CUDA TIME (ms) Count Percent
----------------------------------------------------------------------------- --------------------- ------- ---------
triton_red_fused__native_batch_norm_legit_functional_16_0d1d2d3d4 0.90075 1.0 2.19%
triton_red_fused__native_batch_norm_legit_functional_75_0d1d2d3d4d5d6d7d8d9d 0.89278 7.0 2.17%
triton_red_fused__native_batch_norm_legit_functional_14_0d1d2d3 0.76553 1.0 1.86%
triton_red_fused__native_batch_norm_legit_functional_117_0d1d2d3d4d5d6d7d8d9d 0.42087 8.0 1.02%
triton_red_fused__native_batch_norm_legit_functional_47_0d1d2d3d4d5d6d7d8d9d 0.36698 1.0 0.89%
triton_red_fused__native_batch_norm_legit_functional_2_0d1d2d3d 0.28408 3.0 0.69%
triton_red_fused__native_batch_norm_legit_functional_202_0d1d2d3d4d5d6d7d8d9d 0.26829 6.0 0.65%
triton_red_fused__native_batch_norm_legit_functional_4_0d1d2d3d4d 0.24866 3.0 0.60%
triton_red_fused__native_batch_norm_legit_functional_156_0d1d2d3d4d5d6d7d8d9d 0.20066 6.0 0.49%
triton_red_fused__native_batch_norm_legit_functional_37_0d1d2d3d 0.17452 2.0 0.42%
triton_red_fused__native_batch_norm_legit_functional_39_0d1d2d3d4d 0.14895 2.0 0.36%
triton_red_fused__native_batch_norm_legit_functional_23_0d1d2d3d 0.13137 1.0 0.32%
triton_red_fused__native_batch_norm_legit_functional_25_0d1d2d3d4d 0.11635 1.0 0.28%
triton_red_fused__native_batch_norm_legit_functional_58_0d1d2d3d4d5d6d7d8d9d 0.10565 1.0 0.26%
triton_red_fused__native_batch_norm_legit_functional_32_0d1d2d34 0.1038 2.0 0.25%
triton_red_fused__native_batch_norm_legit_functional_70_0d1d2d34 0.096 4.0 0.23%
triton_red_fused__native_batch_norm_legit_functional_180_0d1d2d3d4d5d6d7d8d9d 0.09328 1.0 0.23%
triton_red_fused__native_batch_norm_legit_functional_30_0d1d23 0.08158 2.0 0.20%
triton_red_fused__native_batch_norm_legit_functional_199_0d1d2d3d4d5d6d7d89d 0.05204 4.0 0.13%
triton_red_fused__native_batch_norm_legit_functional_229_0d1d2d3d4d5d6d7d8d9d 0.04595 1.0 0.11%
triton_red_fused__native_batch_norm_legit_functional_68_0d1d23 0.04528 4.0 0.11%
triton_red_fused__native_batch_norm_legit_functional_149_0d1d2d3d 0.03006 4.0 0.07%
triton_red_fused__native_batch_norm_legit_functional_191_0d1d2d3d4d5d6d7d8d9d 0.02892 1.0 0.07%
triton_red_fused__native_batch_norm_legit_functional_151_0d1d2d3d4d 0.02812 4.0 0.07%
triton_red_fused__native_batch_norm_legit_functional_112_0d1d2d3d4d 0.02676 4.0 0.06%
triton_red_fused__native_batch_norm_legit_functional_110_0d1d2d3d 0.0246 4.0 0.06%
triton_red_fused__native_batch_norm_legit_functional_102_0d1d2d3d4d5d6d7d8d9d 0.02358 1.0 0.06%
Total 5.70541 13.85%
== triton_persistent_reduction category kernels ==
Kernel Self CUDA TIME (ms) Count Percent
------------------------------------------------------------------------------------------------------------ --------------------- ------- ---------
triton_per_fused__native_batch_norm_legit_functional_mean_silu_81_0d1d2d3d4d5d6d7d8d 0.51878 3.0 1.26%
triton_per_fused__native_batch_norm_legit_functional_mean_silu_128_0d1d2d3d4d5d6d7d8 0.34535 4.0 0.84%
triton_per_fused__native_batch_norm_legit_functional_mean_silu_166_0d1d2d3d4d5d6d7d8 0.20116 3.0 0.49%
triton_per_fused__native_batch_norm_legit_functional_mean_silu_214_0d1d2d3d4d5d6d7d8 0.12603 3.0 0.31%
triton_per_fused__native_batch_norm_legit_functional_mean_silu_59_0d1d2d3d4d5d6d7d8d 0.12503 1.0 0.30%
triton_per_fused__native_batch_norm_legit_functional_mean_silu_103_0d1d2d3d4d5d6d7d8 0.04802 1.0 0.12%
triton_per_fused__native_batch_norm_legit_functional_mean_relu_threshold_backward_view_230_0d1d2d3d4d5d6d7d8 0.03647 1.0 0.09%
triton_per_fused__native_batch_norm_legit_functional_mean_silu_192_0d1d2d3d4d5d6d7d8 0.02708 1.0 0.07%
triton_per_fused__native_batch_norm_legit_functional_69_0d1d2d3d45 0.01617 4.0 0.04%
triton_per_fused__native_batch_norm_legit_functional_150_0d1d2d3d4d5 0.01612 4.0 0.04%
triton_per_fused__native_batch_norm_legit_functional_152_0d1d2d3d4d5d6 0.01612 4.0 0.04%
triton_per_fused__native_batch_norm_legit_functional_111_0d1d2d3d45 0.0161 4.0 0.04%
triton_per_fused__native_batch_norm_legit_functional_71_0d1d2d3d4d56 0.01608 4.0 0.04%
triton_per_fused__native_batch_norm_legit_functional_113_0d1d2d3d4d56 0.01603 4.0 0.04%
triton_per_fused__native_batch_norm_legit_functional_5_0d1d2d3d4d5d6 0.0125 3.0 0.03%
triton_per_fused__native_batch_norm_legit_functional_3_0d1d2d3d4d5 0.01246 3.0 0.03%
triton_per_fused__native_batch_norm_legit_functional_31_0d1d2d3d45 0.00839 2.0 0.02%
triton_per_fused__native_batch_norm_legit_functional_38_0d1d2d3d45 0.00832 2.0 0.02%
triton_per_fused__native_batch_norm_legit_functional_33_0d1d2d3d4d56 0.00824 2.0 0.02%
triton_per_fused__native_batch_norm_legit_functional_40_0d1d2d3d4d56 0.00813 2.0 0.02%
triton_per_fused__native_batch_norm_legit_functional_15_0d1d2d3d4d5 0.005 1.0 0.01%
triton_per_fused__native_batch_norm_legit_functional_17_0d1d2d3d4d5d6 0.005 1.0 0.01%
triton_per_fused__native_batch_norm_legit_functional_26_0d1d2d3d4d5d6 0.005 1.0 0.01%
triton_per_fused__native_batch_norm_legit_functional_24_0d1d2d3d4d5 0.00403 1.0 0.01%
Total 1.60161 3.89%
== unknown category kernels ==
Kernel Self CUDA TIME (ms) Count Percent
------------------------------------------------------------------------------------------------------------------------ --------------------- ------- ---------
void at::native::(anonymous namespace)::conv_depthwise2d_forward_kernel<0, c10::Half, int>(at::GenericPackedTensorAccess 11.4621 28.0 27.82%
void at::native::elementwise_kernel<128, 4, at::native::gpu_kernel_impl<at::native::direct_copy_kernel_cuda(at::TensorIt 2.13651 27.0 5.19%
void cudnn::ops::nhwcToNchwKernel<__half, __half, float, true, false, (cudnnKernelDataType_t)0>(cudnn::ops::nhwc2nchw_pa 1.14959 53.0 2.79%
void cudnn::ops::nchwToNhwcKernel<__half, __half, float, false, true, (cudnnKernelDataType_t)0>(cudnn::ops::nchw2nhwc_pa 1.11197 56.0 2.70%
ampere_fp16_s16816gemm_fp16_128x128_ldg8_f2f_stages_32x5_nn 0.91332 4.0 2.22%
void conv2d_c1_k1_nchw_shmem_tiling_kernel<__half, float, 3, 3, 1, 1, 4, 4, 10, 40, 8, true, true>(cudnnTensorStruct, __ 0.6539 5.0 1.59%
void at::native::(anonymous namespace)::conv_depthwise2d_forward_kernel<3, c10::Half, int>(at::GenericPackedTensorAccess 0.60411 4.0 1.47%
void cutlass::Kernel<cutlass_75_wmma_tensorop_f16_s161616gemm_f16_16x16_64x2_nn_align1>(cutlass_75_wmma_tensorop_f16_s16 0.57361 2.0 1.39%
void cutlass_cudnn::Kernel<cutlass_tensorop_f16_s16816fprop_optimized_f16_128x128_32x3_nhwc>(cutlass_tensorop_f16_s16816 0.46809 16.0 1.14%
void conv2d_c1_k1_nchw_shmem_tiling_kernel<__half, float, 5, 5, 1, 1, 4, 4, 10, 40, 8, true, true>(cudnnTensorStruct, __ 0.44476 3.0 1.08%
sm80_xmma_fprop_implicit_gemm_indexed_wo_smem_f16f16_f16f32_f32_nhwckrsc_nhwc_tilesize128x32x16_stage1_warpsize4x1x1_g1_ 0.31587 7.0 0.77%
sm80_xmma_fprop_implicit_gemm_indexed_wo_smem_f16f16_f16f32_f32_nhwckrsc_nhwc_tilesize128x32x16_stage1_warpsize4x1x1_g1_ 0.28941 9.0 0.70%
void tensorTransformGeneric<__half, __half, float, true, false, false, (cudnnKernelDataType_t)0>(cudnnTensorTransformStr 0.28638 15.0 0.70%
ampere_fp16_scudnn_fp16_128x32_relu_small_nn_v1 0.26938 1.0 0.65%
ampere_fp16_s16816gemm_fp16_64x64_ldg8_f2f_nn 0.26615 3.0 0.65%
ampere_fp16_scudnn_fp16_128x32_relu_interior_nn_v1 0.24332 2.0 0.59%
void conv2d_c1_k1_nchw_shmem_tiling_kernel<__half, float, 5, 5, 1, 1, 7, 2, 20, 20, 8, true, true>(cudnnTensorStruct, __ 0.24088 6.0 0.58%
void conv2d_c1_k1_nchw_shmem_tiling_kernel<__half, float, 3, 3, 1, 1, 7, 2, 20, 20, 8, true, true>(cudnnTensorStruct, __ 0.23271 7.0 0.56%
sm80_xmma_fprop_implicit_gemm_f16f16_f16f32_f32_nhwckrsc_nhwc_tilesize64x128x32_stage5_warpsize2x2x1_g1_tensor16x8x16_t1 0.19418 6.0 0.47%
ampere_fp16_s16816gemm_fp16_128x64_ldg8_f2f_nn 0.1762 1.0 0.43%
sm80_xmma_fprop_implicit_gemm_f16f16_f16f32_f32_nhwckrsc_nhwc_tilesize256x64x32_stage3_warpsize4x1x1_g1_tensor16x8x16_t1 0.17046 6.0 0.41%
void cutlass_cudnn::Kernel<cutlass_tensorop_f16_s16816fprop_optimized_f16_64x256_32x4_nhwc>(cutlass_tensorop_f16_s16816f 0.1696 3.0 0.41%
void cutlass::Kernel<cutlass_80_tensorop_f16_s16816gemm_f16_64x64_64x4_nn_align2>(cutlass_80_tensorop_f16_s16816gemm_f16 0.13782 1.0 0.33%
void conv2d_c1_k1_nchw_shmem_tiling_kernel<__half, float, 5, 5, 1, 1, 2, 2, 10, 10, 8, true, true>(cudnnTensorStruct, __ 0.135 3.0 0.33%
void cutlass_cudnn::Kernel<cutlass_tensorop_f16_s16816fprop_optimized_f16_256x128_32x3_nhwc>(cutlass_tensorop_f16_s16816 0.13195 6.0 0.32%
Memset (Device) 0.11677 27.0 0.28%
void conv2d_c1_k1_nchw_shmem_tiling_kernel<__half, float, 3, 3, 1, 1, 2, 2, 10, 10, 8, true, true>(cudnnTensorStruct, __ 0.08156 3.0 0.20%
void cutlass::Kernel<cutlass_80_wmma_tensorop_f16_s161616gemm_f16_32x32_128x2_tn_align2>(cutlass_80_wmma_tensorop_f16_s1 0.0785 8.0 0.19%
sm80_xmma_fprop_implicit_gemm_f16f16_f16f32_f32_nhwckrsc_nhwc_tilesize64x32x64_stage5_warpsize2x2x1_g1_tensor16x8x16_t1r 0.04316 4.0 0.10%
sm80_xmma_fprop_implicit_gemm_indexed_wo_smem_f16f16_f16f32_f32_nhwckrsc_nhwc_tilesize128x32x64_stage1_warpsize4x1x1_g1_ 0.03964 4.0 0.10%
sm80_xmma_fprop_implicit_gemm_f16f16_f16f32_f32_nhwckrsc_nhwc_tilesize128x128x32_stage4_warpsize2x2x1_g1_tensor16x8x16_t 0.03497 1.0 0.08%
sm80_xmma_fprop_implicit_gemm_indexed_wo_smem_f16f16_f16f32_f32_nhwckrsc_nhwc_tilesize128x32x64_stage1_warpsize4x1x1_g1_ 0.024 3.0 0.06%
void cutlass_cudnn::Kernel<cutlass_tensorop_f16_s16816fprop_optimized_f16_64x64_32x10_nhwc>(cutlass_tensorop_f16_s16816f 0.0211 3.0 0.05%
sm80_xmma_gemm_f16f16_f16f32_f32_tn_n_tilesize32x32x64_stage6_warpsize2x2x1_tensor16x8x16_kernel 0.02107 3.0 0.05%
void at::native::(anonymous namespace)::distribution_elementwise_grid_stride_kernel<float, 4, at::native::templates::cud 0.01888 4.22 0.05%
ampere_fp16_s16816gemm_fp16_64x64_sliced1x2_ldg8_relu_f2f_stages_64x6_tn 0.013 1.0 0.03%
void cask_cudnn::computeOffsetsKernel<false, false>(cask_cudnn::ComputeOffsetsParams) 0.01276 3.0 0.03%
void cutlass::Kernel<cutlass_80_wmma_tensorop_f16_s161616gemm_f16_32x32_128x1_tn_align2>(cutlass_80_wmma_tensorop_f16_s1 0.00785 1.0 0.02%
void cutlass::Kernel<cutlass_80_tensorop_s16816gemm_f16_64x64_32x6_tn_align8>(cutlass_80_tensorop_s16816gemm_f16_64x64_3 0.007 1.0 0.02%
void cutlass::Kernel<cutlass_80_wmma_tensorop_f16_s161616gemm_f16_32x32_32x1_tn_align2>(cutlass_80_wmma_tensorop_f16_s16 0.005 1.0 0.01%
void splitKreduce_kernel<32, 16, int, float, __half, float, __half, true, false, false>(cublasSplitKParams<float>, float 0.00402 1.0 0.01%
void at::native::vectorized_elementwise_kernel<4, at::native::FillFunctor<long>, at::detail::Array<char*, 1> >(int, at:: 0.00178 0.58 0.00%
Total 23.3083 56.57%
Percent of time when GPU is busy: 102.88%
Total wall time 41.202 ms
Output for tabulate: mixnet_l, 28.58%, 13.85%, 3.89%, 0.00%, 56.57%, 102.88%, 41.202ms
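
The per-category percentages in the summary line appear to be each category's "Total" self CUDA time divided by the total wall time. The sketch below is not part of the profiler; it only re-derives those figures from the rounded numbers printed in this log, and the names `category_ms`, `wall_ms`, and `percent_of_wall` are made up for illustration. Because the inputs are already rounded, the recomputed GPU-busy figure may differ from the logged 102.88% in the last digit.

```python
# Per-iteration "Total" self CUDA times (ms), copied from the category tables above.
category_ms = {
    "triton_pointwise": 11.7756,
    "triton_reduction": 5.70541,
    "triton_persistent_reduction": 1.60161,
    "unknown": 23.3083,
}
wall_ms = 41.202  # "Total wall time" from the log


def percent_of_wall(ms: float, wall: float) -> float:
    """Share of wall time spent in a category (assumed formula, inferred from the log)."""
    return ms / wall * 100.0


percents = {name: percent_of_wall(ms, wall_ms) for name, ms in category_ms.items()}
for name, pct in percents.items():
    print(f"{name}: {pct:.2f}%")

# Summed kernel time can exceed wall time, which is how a busy figure above 100% arises.
busy = sum(category_ms.values()) / wall_ms * 100.0
print(f"GPU busy (recomputed from rounded totals): {busy:.2f}%")
```

The summed kernel time (about 42.39 ms) also agrees with the profiler table's "Self CUDA time total: 4.239s" if that figure covers 100 profiled iterations, which the "# of Calls" column suggests but the log does not state outright.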