(pytorch) [shunting@devgpu002.lla3 ~/ws/pytorch (loaf)]$ python ~/
/home/shunting/ws/miniconda3/envs/pytorch/lib/python3.10/site-packages/huggingface_hub/ FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Compiled module path: /tmp/torchinductor_shunting/tmpav4o5
# AOT ID: ['1_inference']
from ctypes import c_void_p, c_long
import torch
import math
import random
import os
import tempfile
from math import inf, nan
from torch._inductor.hooks import run_intermediate_hooks
Profiling result for a compiled module of benchmark pnasnet5large:
Chrome trace for the profile is written to /tmp/compiled_module_profile.json
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ --------------------------------------------------------------------------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls Input Shapes
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ --------------------------------------------------------------------------------
# AOT ID: ['0_backward']
from ctypes import c_void_p, c_long
import torch
import math
import random
import os
import tempfile
from math import inf, nan
from torch._inductor.hooks import run_intermediate_hooks
(pytorch) [shunting@devgpu002.lla3 ~/ws/pytorch (loaf)]$ python benchmarks/dynamo/ --ci --accuracy --timing --explain --export-aot-inductor --device cuda --inference --bfloat16 --only sam_fast
loading model: 0it [00:00, ?it/s]INFO:root:running build_ext
INFO:root:running build_ext
INFO:root:running build_ext
INFO:root:running build_ext
INFO:root:running build_ext
INFO:root:running build_ext
INFO:root:running build_ext
INFO:root:running build_ext
INFO:root:running build_ext
Thread 227 "pt_autograd_0" received signal SIGSEGV, Segmentation fault.
[Switching to LWP 1007598]
0x00007ffff7c8cee4 in pthread_mutex_lock@@GLIBC_2.2.5 () from /lib64/
(gdb) bt
#0 0x00007ffff7c8cee4 in pthread_mutex_lock@@GLIBC_2.2.5 () from /lib64/
#1 0x00007ffe1029189c in torch::autograd::ForwardGrad::clear() () from /home/shunting/ws/vision/torchvision/
#2 0x00007ffe102a6565 in torch::autograd::CppNode<vision::ops::(anonymous namespace)::ROIAlignFunction>::release_variables() ()
from /home/shunting/ws/vision/torchvision/
#3 0x00007fffedf68dd2 in torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffe
(pytorch) [shunting@devgpu005.nha1 ~/ws/pytorch (acc)]$ time python benchmarks/dynamo/ --performance --training --amp --backend inductor --disable-cudagraphs --device cuda --only vision_maskrcnn
loading model: 0it [00:05, ?it/s]
cuda train vision_maskrcnn
Traceback (most recent call last):
File "/home/shunting/ws/pytorch/benchmarks/dynamo/", line 2335, in validate_model
self.model_iter_fn(model, example_inputs)
File "/home/shunting/ws/pytorch/benchmarks/dynamo/", line 466, in forward_and_backward_pass
pred = mod(*cloned_inputs)
File "/home/shunting/ws/pytorch/torch/nn/modules/", line 1716, in _wrapped_call_impl
2024-07-04T23:20:39.8000729Z loading model: 0it [00:00, ?it/s]WARNING:common:Model pyhpc_turbulent_kinetic_energy does not support bfloat16, running with amp instead
2024-07-04T23:20:39.9485124Z loading model: 0it [00:01, ?it/s]
2024-07-04T23:20:39.9486440Z WARNING:common:Model pyhpc_turbulent_kinetic_energy does not support bfloat16, running with amp instead
2024-07-04T23:20:39.9487446Z cuda eval pyhpc_turbulent_kinetic_energy
2024-07-04T23:20:39.9743673Z WARNING:common:Model pyhpc_turbulent_kinetic_energy does not support bfloat16, running with amp instead
2024-07-04T23:21:01.2023259Z ERROR:common:
2024-07-04T23:21:01.2024509Z Traceback (most recent call last):
2024-07-04T23:21:01.2025803Z File "/var/lib/jenkins/workspace/benchmarks/dynamo/", line 2642, in check_accuracy
2024-07-04T23:21:01.2027561Z new_result = optimized_model_iter_fn(model_copy, example_inputs)
(pytorch) [shunting@devgpu005.nha1 ~/ws/pytorch (acc-sebotnet33ts_256)]$ TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/ --accuracy --no-translation-validation --training --amp --backend inductor --device cuda --only LayoutLMForMaskedLM
loading model: 0it [00:06, ?it/s]
cuda train LayoutLMForMaskedLM
AUTOTUNE addmm(512x3072, 512x768, 768x3072)
triton_mm_130 0.0209 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=3, num_warps=4
triton_mm_131 0.0217 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=5, num_warps=8
triton_mm_124 0.0227 ms 92.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=3, num_warps=4
triton_mm_129 0.0240 ms 87.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLO