(pytorch) [shunting@devgpu002.lla3 ~/ws/pytorch (loaf)]$ python ~/t.py
/home/shunting/ws/miniconda3/envs/pytorch/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Compiled module path: /tmp/torchinductor_shunting/tmpav4o5
# AOT ID: ['1_inference']
from ctypes import c_void_p, c_long
import torch
import math
import random
import os
import tempfile
from math import inf, nan
from torch._inductor.hooks import run_intermediate_hooks
0.016571
0.022813
Profiling result for a compiled module of benchmark pnasnet5large:
Chrome trace for the profile is written to /tmp/compiled_module_profile.json
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ --------------------------------------------------------------------------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls Input Shapes
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ --------------------------------------------------------------------------------
sm90_xmma_gemm_bf16bf
0.012985
0.020379
Profiling result for a compiled module of benchmark pnasnet5large:
Chrome trace for the profile is written to /tmp/compiled_module_profile.json
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ --------------------------------------------------------------------------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls Input Shapes
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ --------------------------------------------------------------------------------
sm90_xmma_gemm_bf16bf
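
The "Chrome trace" above can be loaded into chrome://tracing or Perfetto. For reference, a similar per-kernel table and trace can also be produced directly with torch.profiler; a minimal sketch follows, using a placeholder model and shapes rather than the pnasnet5large setup profiled above:

import torch
from torch.profiler import profile, ProfilerActivity

# Placeholder model/input; the runs above profile pnasnet5large in bfloat16.
model = torch.nn.Linear(1024, 1024).cuda().bfloat16()
x = torch.randn(512, 1024, device="cuda", dtype=torch.bfloat16)

compiled = torch.compile(model)
compiled(x)  # warm up so compilation time is not included in the profile

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    compiled(x)

# Prints a table in the same format as above, sorted by CUDA time.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
# Writes a Chrome trace like /tmp/compiled_module_profile.json above.
prof.export_chrome_trace("/tmp/compiled_module_profile.json")
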
# AOT ID: ['0_backward']
from ctypes import c_void_p, c_long
import torch
import math
import random
import os
import tempfile
from math import inf, nan
from torch._inductor.hooks import run_intermediate_hooks
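
The prologue above is the start of the Inductor-generated Python module for the '0_backward' graph. One way to surface this generated code without digging through /tmp is the logging knob below; a minimal sketch, assuming a recent PyTorch where TORCH_LOGS / set_logs(output_code=...) is available, and using a stand-in function rather than the benchmark model:

import torch
import torch._logging

# Equivalent to running with TORCH_LOGS="output_code": print each
# Inductor-generated module (like the one whose prologue appears above).
torch._logging.set_logs(output_code=True)

def f(x):
    return torch.nn.functional.gelu(x) * 2

compiled = torch.compile(f)
compiled(torch.randn(64, 64, device="cuda"))
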
(pytorch) [shunting@devgpu002.lla3 ~/ws/pytorch (loaf)]$ python benchmarks/dynamo/torchbench.py --ci --accuracy --timing --explain --export-aot-inductor --device cuda --inference --bfloat16 --only sam_fast
loading model: 0it [00:00, ?it/s]INFO:root:running build_ext
INFO:root:running build_ext
INFO:root:running build_ext
INFO:root:running build_ext
INFO:root:running build_ext
INFO:root:running build_ext
INFO:root:running build_ext
INFO:root:running build_ext
INFO:root:running build_ext
Thread 227 "pt_autograd_0" received signal SIGSEGV, Segmentation fault.
[Switching to LWP 1007598]
0x00007ffff7c8cee4 in pthread_mutex_lock@@GLIBC_2.2.5 () from /lib64/libc.so.6
(gdb) bt
#0 0x00007ffff7c8cee4 in pthread_mutex_lock@@GLIBC_2.2.5 () from /lib64/libc.so.6
#1 0x00007ffe1029189c in torch::autograd::ForwardGrad::clear() () from /home/shunting/ws/vision/torchvision/_C.so
#2 0x00007ffe102a6565 in torch::autograd::CppNode<vision::ops::(anonymous namespace)::ROIAlignFunction>::release_variables() ()
from /home/shunting/ws/vision/torchvision/_C.so
#3 0x00007fffedf68dd2 in torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffe
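
The backtrace ends in torchvision's CppNode<ROIAlignFunction>::release_variables() while the autograd engine evaluates a node. A hypothetical minimal sketch that drives that same code path through a plain roi_align backward is below; this is an illustration under that assumption, not the actual sam_fast failure case above:

import torch
from torchvision.ops import roi_align

# Exercise ROIAlignFunction forward + backward so the autograd engine
# later calls release_variables() on its saved state.
x = torch.randn(1, 3, 64, 64, device="cuda", requires_grad=True)
# Each box is (batch_index, x1, y1, x2, y2).
boxes = torch.tensor([[0.0, 0.0, 0.0, 32.0, 32.0]], device="cuda")
out = roi_align(x, boxes, output_size=(7, 7), spatial_scale=1.0, sampling_ratio=2)
out.sum().backward()
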
(pytorch) [shunting@devgpu005.nha1 ~/ws/pytorch (acc)]$ time python benchmarks/dynamo/torchbench.py --performance --training --amp --backend inductor --disable-cudagraphs --device cuda --only vision_maskrcnn
loading model: 0it [00:05, ?it/s]
cuda train vision_maskrcnn
Traceback (most recent call last):
File "/home/shunting/ws/pytorch/benchmarks/dynamo/common.py", line 2335, in validate_model
self.model_iter_fn(model, example_inputs)
File "/home/shunting/ws/pytorch/benchmarks/dynamo/torchbench.py", line 466, in forward_and_backward_pass
pred = mod(*cloned_inputs)
File "/home/shunting/ws/pytorch/torch/nn/modules/module.py", line 1716, in _wrapped_call_impl
2024-07-04T23:20:39.8000729Z loading model: 0it [00:00, ?it/s]WARNING:common:Model pyhpc_turbulent_kinetic_energy does not support bfloat16, running with amp instead
2024-07-04T23:20:39.9484088Z
2024-07-04T23:20:39.9485124Z loading model: 0it [00:01, ?it/s]
2024-07-04T23:20:39.9486440Z WARNING:common:Model pyhpc_turbulent_kinetic_energy does not support bfloat16, running with amp instead
2024-07-04T23:20:39.9487446Z cuda eval pyhpc_turbulent_kinetic_energy
2024-07-04T23:20:39.9743673Z WARNING:common:Model pyhpc_turbulent_kinetic_energy does not support bfloat16, running with amp instead
2024-07-04T23:21:01.2023259Z ERROR:common:
2024-07-04T23:21:01.2024509Z Traceback (most recent call last):
2024-07-04T23:21:01.2025803Z File "/var/lib/jenkins/workspace/benchmarks/dynamo/common.py", line 2642, in check_accuracy
2024-07-04T23:21:01.2027561Z new_result = optimized_model_iter_fn(model_copy, example_inputs)
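
check_accuracy in benchmarks/dynamo/common.py (the frame above) reruns the model through the optimized path and compares against an eager baseline. A simplified sketch of that comparison, assuming a single tensor output; the real harness also handles output pytrees, per-dtype tolerances, and other fallbacks:

import torch

def simple_check_accuracy(model, example_inputs, rtol=1e-3, atol=1e-3):
    # Eager reference result.
    expected = model(*example_inputs)
    # Optimized (compiled) result, roughly what optimized_model_iter_fn produces.
    actual = torch.compile(model)(*example_inputs)
    return torch.allclose(expected, actual, rtol=rtol, atol=atol)
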
(pytorch) [shunting@devgpu005.nha1 ~/ws/pytorch (acc-sebotnet33ts_256)]$ TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/huggingface.py --accuracy --no-translation-validation --training --amp --backend inductor --device cuda --only LayoutLMForMaskedLM
loading model: 0it [00:06, ?it/s]
cuda train LayoutLMForMaskedLM
AUTOTUNE addmm(512x3072, 512x768, 768x3072)
triton_mm_130 0.0209 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=3, num_warps=4
triton_mm_131 0.0217 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=5, num_warps=8
triton_mm_124 0.0227 ms 92.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=3, num_warps=4
triton_mm_129 0.0240 ms 87.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLO
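
TORCHINDUCTOR_MAX_AUTOTUNE=1 (set in the command above) is what triggers the AUTOTUNE listing, benchmarking Triton GEMM template configs against each other for each matmul. The same behavior can be requested from Python; a small sketch whose shapes match the addmm(512x3072, 512x768, 768x3072) candidate above, with float16 dtypes assumed for the --amp run:

import torch
import torch._inductor.config

# Equivalent to TORCHINDUCTOR_MAX_AUTOTUNE=1 for this process.
torch._inductor.config.max_autotune = True
# Alternatively: torch.compile(f, mode="max-autotune")

def f(bias, x, w):
    # addmm(512x3072, 512x768, 768x3072), as in the AUTOTUNE header above.
    return torch.addmm(bias, x, w)

compiled = torch.compile(f)
bias = torch.randn(512, 3072, device="cuda", dtype=torch.float16)
x = torch.randn(512, 768, device="cuda", dtype=torch.float16)
w = torch.randn(768, 3072, device="cuda", dtype=torch.float16)
compiled(bias, x, w)
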