(pytorch) [shunting@devgpu002.lla3 ~/ws/pytorch (loaf)]$ python ~/t.py
/home/shunting/ws/miniconda3/envs/pytorch/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Compiled module path: /tmp/torchinductor_shunting/tmpav4o5
# AOT ID: ['1_inference']
from ctypes import c_void_p, c_long
import torch
import math
import random
import os
import tempfile
from math import inf, nan
from torch._inductor.hooks import run_intermediate_hooks
0.016571
0.022813
Profiling result for a compiled module of benchmark pnasnet5large:
Chrome trace for the profile is written to /tmp/compiled_module_profile.json
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ --------------------------------------------------------------------------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls Input Shapes
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ --------------------------------------------------------------------------------
sm90_xmma_gemm_bf16bf
0.012985
0.020379
Profiling result for a compiled module of benchmark pnasnet5large:
Chrome trace for the profile is written to /tmp/compiled_module_profile.json
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ --------------------------------------------------------------------------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls Input Shapes
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ --------------------------------------------------------------------------------
sm90_xmma_gemm_bf16bf
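
The "Chrome trace" above can be loaded into chrome://tracing or Perfetto. For reference, a similar per-kernel table and trace can also be produced directly with torch.profiler; a minimal sketch follows, using a placeholder model and shapes rather than the pnasnet5large setup profiled above:

import torch
from torch.profiler import profile, ProfilerActivity

# Placeholder model/input; the runs above profile pnasnet5large in bfloat16.
model = torch.nn.Linear(1024, 1024).cuda().bfloat16()
x = torch.randn(512, 1024, device="cuda", dtype=torch.bfloat16)

compiled = torch.compile(model)
compiled(x)  # warm up so compilation time is not included in the profile

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    compiled(x)

# Prints a table in the same format as above, sorted by CUDA time.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
# Writes a Chrome trace like /tmp/compiled_module_profile.json above.
prof.export_chrome_trace("/tmp/compiled_module_profile.json")
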
# AOT ID: ['0_backward']
from ctypes import c_void_p, c_long
import torch
import math
import random
import os
import tempfile
from math import inf, nan
from torch._inductor.hooks import run_intermediate_hooks
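
The prologue above is the start of the Inductor-generated Python module for the '0_backward' graph. One way to surface this generated code without digging through /tmp is the logging knob below; a minimal sketch, assuming a recent PyTorch where TORCH_LOGS / set_logs(output_code=...) is available, and using a stand-in function rather than the benchmark model:

import torch
import torch._logging

# Equivalent to running with TORCH_LOGS="output_code": print each
# Inductor-generated module (like the one whose prologue appears above).
torch._logging.set_logs(output_code=True)

def f(x):
    return torch.nn.functional.gelu(x) * 2

compiled = torch.compile(f)
compiled(torch.randn(64, 64, device="cuda"))
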
(pytorch) [shunting@devgpu002.lla3 ~/ws/pytorch (loaf)]$ python benchmarks/dynamo/torchbench.py --ci --accuracy --timing --explain --export-aot-inductor --device cuda --inference --bfloat16 --only sam_fast
loading model: 0it [00:00, ?it/s]INFO:root:running build_ext
INFO:root:running build_ext
INFO:root:running build_ext
INFO:root:running build_ext
INFO:root:running build_ext
INFO:root:running build_ext
INFO:root:running build_ext
INFO:root:running build_ext
INFO:root:running build_ext
Thread 227 "pt_autograd_0" received signal SIGSEGV, Segmentation fault.
[Switching to LWP 1007598]
0x00007ffff7c8cee4 in pthread_mutex_lock@@GLIBC_2.2.5 () from /lib64/libc.so.6
(gdb) bt
#0 0x00007ffff7c8cee4 in pthread_mutex_lock@@GLIBC_2.2.5 () from /lib64/libc.so.6
#1 0x00007ffe1029189c in torch::autograd::ForwardGrad::clear() () from /home/shunting/ws/vision/torchvision/_C.so
#2 0x00007ffe102a6565 in torch::autograd::CppNode<vision::ops::(anonymous namespace)::ROIAlignFunction>::release_variables() ()
from /home/shunting/ws/vision/torchvision/_C.so
#3 0x00007fffedf68dd2 in torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffe
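
The backtrace ends in torchvision's CppNode<ROIAlignFunction>::release_variables() while the autograd engine evaluates a node. A hypothetical minimal sketch that drives that same code path through a plain roi_align backward is below; this is an illustration under that assumption, not the actual sam_fast failure case above:

import torch
from torchvision.ops import roi_align

# Exercise ROIAlignFunction forward + backward so the autograd engine
# later calls release_variables() on its saved state.
x = torch.randn(1, 3, 64, 64, device="cuda", requires_grad=True)
# Each box is (batch_index, x1, y1, x2, y2).
boxes = torch.tensor([[0.0, 0.0, 0.0, 32.0, 32.0]], device="cuda")
out = roi_align(x, boxes, output_size=(7, 7), spatial_scale=1.0, sampling_ratio=2)
out.sum().backward()
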
(pytorch) [shunting@devgpu005.nha1 ~/ws/pytorch (acc)]$ time python benchmarks/dynamo/torchbench.py --performance --training --amp --backend inductor --disable-cudagraphs --device cuda --only vision_maskrcnn
loading model: 0it [00:05, ?it/s]
cuda train vision_maskrcnn
Traceback (most recent call last):
File "/home/shunting/ws/pytorch/benchmarks/dynamo/common.py", line 2335, in validate_model
self.model_iter_fn(model, example_inputs)
File "/home/shunting/ws/pytorch/benchmarks/dynamo/torchbench.py", line 466, in forward_and_backward_pass
pred = mod(*cloned_inputs)
File "/home/shunting/ws/pytorch/torch/nn/modules/module.py", line 1716, in _wrapped_call_impl
2024-07-04T23:20:39.8000729Z loading model: 0it [00:00, ?it/s]WARNING:common:Model pyhpc_turbulent_kinetic_energy does not support bfloat16, running with amp instead
2024-07-04T23:20:39.9484088Z
2024-07-04T23:20:39.9485124Z loading model: 0it [00:01, ?it/s]
2024-07-04T23:20:39.9486440Z WARNING:common:Model pyhpc_turbulent_kinetic_energy does not support bfloat16, running with amp instead
2024-07-04T23:20:39.9487446Z cuda eval pyhpc_turbulent_kinetic_energy
2024-07-04T23:20:39.9743673Z WARNING:common:Model pyhpc_turbulent_kinetic_energy does not support bfloat16, running with amp instead
2024-07-04T23:21:01.2023259Z ERROR:common:
2024-07-04T23:21:01.2024509Z Traceback (most recent call last):
2024-07-04T23:21:01.2025803Z File "/var/lib/jenkins/workspace/benchmarks/dynamo/common.py", line 2642, in check_accuracy
2024-07-04T23:21:01.2027561Z new_result = optimized_model_iter_fn(model_copy, example_inputs)
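
check_accuracy in benchmarks/dynamo/common.py (the frame above) reruns the model through the optimized path and compares against an eager baseline. A simplified sketch of that comparison, assuming a single tensor output; the real harness also handles output pytrees, per-dtype tolerances, and other fallbacks:

import torch

def simple_check_accuracy(model, example_inputs, rtol=1e-3, atol=1e-3):
    # Eager reference result.
    expected = model(*example_inputs)
    # Optimized (compiled) result, roughly what optimized_model_iter_fn produces.
    actual = torch.compile(model)(*example_inputs)
    return torch.allclose(expected, actual, rtol=rtol, atol=atol)
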
(pytorch) [shunting@devgpu005.nha1 ~/ws/pytorch (acc-sebotnet33ts_256)]$ TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/huggingface.py --accuracy --no-translation-validation --training --amp --backend inductor --device cuda --only LayoutLMForMaskedLM
loading model: 0it [00:06, ?it/s]
cuda train LayoutLMForMaskedLM
AUTOTUNE addmm(512x3072, 512x768, 768x3072)
triton_mm_130 0.0209 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=3, num_warps=4
triton_mm_131 0.0217 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=5, num_warps=8
triton_mm_124 0.0227 ms 92.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=3, num_warps=4
triton_mm_129 0.0240 ms 87.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLO
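
TORCHINDUCTOR_MAX_AUTOTUNE=1 (set in the command above) is what triggers the AUTOTUNE listing, benchmarking Triton GEMM template configs against each other for each matmul. The same behavior can be requested from Python; a small sketch whose shapes match the addmm(512x3072, 512x768, 768x3072) candidate above, with float16 dtypes assumed for the --amp run:

import torch
import torch._inductor.config

# Equivalent to TORCHINDUCTOR_MAX_AUTOTUNE=1 for this process.
torch._inductor.config.max_autotune = True
# Alternatively: torch.compile(f, mode="max-autotune")

def f(bias, x, w):
    # addmm(512x3072, 512x768, 768x3072), as in the AUTOTUNE header above.
    return torch.addmm(bias, x, w)

compiled = torch.compile(f)
bias = torch.randn(512, 3072, device="cuda", dtype=torch.float16)
x = torch.randn(512, 768, device="cuda", dtype=torch.float16)
w = torch.randn(768, 3072, device="cuda", dtype=torch.float16)
compiled(bias, x, w)
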