@bertmaher
Created April 9, 2024 16:46
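
The log below appears to list, for each Triton commit (hash and subject line), a measured per-launch walltime in milliseconds, i.e. the overhead of dispatching an already-compiled @triton.jit kernel, tracked across the commit history. As a minimal sketch only (an assumption about the methodology, not the actual harness behind this gist; the kernel _noop and helper launch_walltime_ms are hypothetical names), one way to collect such a number is to check out and install each commit, then time repeated launches of a trivial pre-compiled kernel:

import time

import torch
import triton
import triton.language as tl


@triton.jit
def _noop(x_ptr, n_elements, BLOCK: tl.constexpr):
    # Trivial kernel: one masked load/store so only the launch path is exercised.
    offs = tl.arange(0, BLOCK)
    mask = offs < n_elements
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(x_ptr + offs, x, mask=mask)


def launch_walltime_ms(iters: int = 1000) -> float:
    # Hypothetical helper: average wall time per launch, in milliseconds.
    x = torch.zeros(1024, device="cuda")
    grid = (1,)
    _noop[grid](x, x.numel(), BLOCK=1024)  # warm up: compile and cache once
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        _noop[grid](x, x.numel(), BLOCK=1024)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1e3


if __name__ == "__main__":
    print(f"walltime ms: {launch_walltime_ms():.3f}")

Because the kernel is already compiled after the warm-up launch, the loop above mostly measures Python-side JIT dispatch cost, which is the quantity that changes commit to commit in the log below.
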
874f1987f [frontend] Improve function def parsing with stacked decorators. (#3564)
walltime ms: 0.350
8237f1b45 [FRONTEND][NFC] Frontend cleanup (#3541)
walltime ms: 0.358
88abff689 [FRONTEND] changed hook format and added launch metadata for external tools (#3492)
walltime ms: 0.356
95d9b7f4a [FRONTEND][BACKEND] Move conversion of sw only fp8 types into the front end. (#3477)
walltime ms: 0.355
dca2d07c4 [FRONTEND] Fix arg name conflict bug (#3383)
walltime ms: 0.357
f08bdc1a8 [CACHE] Verify that when preloading a kernel its name matches what we have in specialization_data (#3395)
walltime ms: 0.357
55bb88744 [CACHE] Adding RuntimeError on signature mismatch with the cached function (#3389)
walltime ms: 0.366
d42ca115c Adding tl.const annotation to mark and validate that const tensors are not being stored to (#3360)
walltime ms: 0.269
72cba380a [AMD] Add amd f8 datatype (#3322)
walltime ms: 0.191
5a7bf72e2 [Easy][FRONTEND] Add pre run hooks to JITFunction (#3314)
walltime ms: 0.194
18cb30ca7 [easy][nfc] Consistently use sha256 for hashing (#3246)
walltime ms: 0.193
c681b5390 [RUNTIME] Include higher order function arguments in cache_key (#3137)
walltime ms: 0.190
d32880ce5 [INTERPRETER] Revive flash attention test (#3158)
walltime ms: 0.281
98b5945d2 [FRONTEND] Fix dtype serialization when preloading with dtype const (#3129)
walltime ms: 0.191
91641c329 [FRONTEND] Support preloading kernel (#3121)
walltime ms: 0.188
d883e9570 [FRONTEND] Remove specialization for divisible by 8 (#3122)
walltime ms: 0.175
b6e24b699 [FRONTEND] Allow `tl.{u}int{width}` annotations to bypass opportunistic value-based JIT-specialization (#3102)
walltime ms: 0.174
00c144eec Revert "[FRONTEND] Allow `tl.{u}int{width}` annotations to bypass opportunistic value-based JIT-specialization" (#3103)
walltime ms: 0.162
5aef9810f [FRONTEND] Allow `tl.{u}int{width}` annotations to bypass opportunistic value-based JIT-specialization
walltime ms: 0.173
dd2a32363 Remove experimental TMA and Warp specialization features (#3080)
walltime ms: 0.165
b844d519b [RUNTIME] Allow setting active driver (#2973)
walltime ms: 0.172
f3e2d8408 [FRONTEND] make CompiledKernel `metadata` a namedtuple instead of a dict, and pass it to hook in lieu of kernel object (#2929)
walltime ms: 0.169
8594268c8 [FRONTEND] Update jit.py to delay the import of InterpretedFunction to avoid being dependent on numpy by default (#2904)
walltime ms: 0.169
48034034c [FRONTEND] use standard plugin interface for CUDA (#2887)
walltime ms: 0.171
53d868113 [CLEANUP] Fix typos across the project (#2876)
walltime ms: 0.166
03ceaa64c [BACKEND] clean-up how we use LLVM (#2844)
walltime ms: 0.168
03678a3af [FRONTEND] make some cuda-specific functions more general; remove triton-translate (#2811)
walltime ms: 0.128
73a331925 [FRONTEND] split pybind11 src into multiple files (#2810)
walltime ms: 0.128
c6040bcbd When computing cache keys, be more strict about checkint the module name (#2713)
walltime ms: 0.128
755002bd3 [FRONTEND] clean-up runtime/jit.py (#2756)
walltime ms: 0.128
f2bc68ec0 Rewrite some very frequently called "try" statements (was too expensive). (#2742)
walltime ms: 0.147
72c983392 [FRONTEND] refactor `compiler` submodule (#2701)
walltime ms: 0.158
9998b1064 [RUNTIME] Ensure changed line numbers invalidate cache (#2600)
walltime ms: 0.166
df08301e7 Reformat Python code with yapf. (#2589)
walltime ms: 0.176
943330790 [FRONTEND] add do_not_specialize property back to JITFunction (#2573)
walltime ms: 0.178
12f906287 [FRONTEND] Refactor jit.py. (#2556)
walltime ms: 0.165
f88b01f55 Apply `ruff` pre-commit to python/triton/runtime. (#2558)
walltime ms: 0.102
768fc1fcd [FRONTEND] change hash to not require ptxas (#2476)
walltime ms: 0.103
29828fe49 [FRONTEND] add option to disable fp mul/add fusion (#2495)
walltime ms: 0.105
cb83b42ed [FRONTEND] using closure to create jit launcher (#2289)
walltime ms: 0.103
e686b4d6d [FRONTEND] interpreter rewrite (#2321)
walltime ms: 0.074
37f12497b [FRONTEND] Add PyTorch fp8 dtypes to Triton (#2279)
walltime ms: 0.075
9e9fbe01f [FRONTEND] Fix specialization on triton integer types (#2236)
walltime ms: 0.076
c6d33dceb [ROCM] Core Functionality for AMD (#1983)
walltime ms: 0.087
ab3e8b0da [FRONTEND] fix handling of do_not_specialize with interior constantexprs (#2188)
walltime ms: 0.088
ebfe0ffb2 [FRONTEND] fix for undefined dtypes in jit during loading defaults (#2114)
walltime ms: 0.087
6cb67185f [FRONTEND]To use proper default num_warps and num_stages based on the device backend in JITFucntion (#2130)
walltime ms: 0.074
23dd11d47 [BACKEND] Solidify f8e4m3 (#2105)
walltime ms: 0.064
fc667d1f8 [FRONTEND] fix new absolute imports (#2072)
walltime ms: 0.064
98372f46d [FRONTEND] Remove extra calls to _get_config causing runtime overhead (#2094)
walltime ms: 0.075
a01c116f7 [FRONTEND/BACKEND] Revived Float8E4B15x4 (#2090)
walltime ms: 0.257
776b3784c [FRONTEND] further improve version_key speed (#2073)
walltime ms: 0.248
0e11257b8 [FRONTEND] improve speed of computing version_key (#2071)
walltime ms: 0.249
30a331e62 [FRONTEND] Support jit functions without arguments (#2043)
walltime ms: 0.246