Can we leverage the framework-specific profilers and maybe build a layer on top to augment their features and functionality?
This is what an MXNet profiler dump looks like:
Profile Statistics.
Note that counter items are counter values and not time units.
Device Storage
=================
Name             Total Count       Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----             -----------       ---------    -------------    -------------    -------------
Memory: gpu/0            722    2642262.5000        8388.6084     2654455.2500     1323033.3750
Memory: cpu/0            474          0.0000           0.0000       18984.9609        9492.4805

MXNET_C_API
=================
Name                       Total Count    Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                       -----------    ---------    -------------    -------------    -------------
MXAutogradMarkVariables            133       0.7330           0.0040           0.0080          0.0055
MXNDArrayFree                      411       0.8010           0.0000           0.1230          0.0019
...

operator
=================
Name              Total Count    Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----              -----------    ---------    -------------    -------------    -------------
FullyConnected              2       0.5520           0.2750           0.2770          0.2760
Flatten                     2       0.3770           0.1880           0.1890          0.1885
...
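As a minimal sketch of the kind of layer we could build on top, the text dump above can be parsed into structured records. The column layout follows the dump (Name, Total Count, Time, Min, Max, Avg); the sample rows are taken from the `operator` section, and the helper name is ours, not an MXNet API.

```python
def parse_section(lines):
    """Parse rows of one dump section into dicts keyed by column name."""
    records = []
    for line in lines:
        parts = line.split()
        if len(parts) < 6:
            continue  # separator rows, '...', or blank lines
        # The name may itself contain spaces (e.g. 'Memory: gpu/0');
        # the last five whitespace-separated fields are always numeric.
        *name, count, total, min_t, max_t, avg_t = parts
        records.append({
            "name": " ".join(name),
            "count": int(count),
            "total_ms": float(total),
            "min_ms": float(min_t),
            "max_ms": float(max_t),
            "avg_ms": float(avg_t),
        })
    return records

# Sample rows copied from the 'operator' section of the dump above.
sample = [
    "FullyConnected 2 0.5520 0.2750 0.2770 0.2760",
    "Flatten 2 0.3770 0.1880 0.1890 0.1885",
]
ops = parse_section(sample)
print(ops[0]["name"], ops[0]["avg_ms"])  # → FullyConnected 0.276
```

Once the dump is structured like this, the augmentation layer can sort, diff across runs, or flag outlier operators instead of asking users to eyeball the raw text.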
The profiler dump takes some effort and familiarity to read, and it gives only a 0 ft view of the model's internals. Profiling a model also often involves pulling in other third-party tools and packages. This is where the Tornasole profiler can pitch in.
Generally, though, a Python-level profiler won't give correct timings here because CUDA kernel launches are asynchronous, I believe. However, I suspect that setting the CUDA_LAUNCH_BLOCKING environment variable would force synchronous launches; that might be the trick to getting any Python profiler to report accurate numbers.
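A rough sketch of that idea with the standard-library cProfile: the environment variable has to be set before the framework initializes CUDA, so that each kernel launch blocks until the kernel completes and the launching Python call is charged the real GPU time. The `train_step` here is a pure-Python placeholder standing in for a forward/backward pass, since no GPU is assumed.

```python
import cProfile
import os
import pstats

# Must be set before any CUDA context is created; with it set, kernel
# launches block, so wall-clock time lands on the Python call that
# launched the kernel instead of on some later sync point.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

def train_step():
    # Placeholder workload; in real use this would be framework code
    # (forward pass, backward pass) launching CUDA kernels.
    return sum(i * i for i in range(10_000))

profiler = cProfile.Profile()
profiler.enable()
result = train_step()
profiler.disable()

stats = pstats.Stats(profiler).sort_stats("cumulative")
# stats.print_stats(5)  # uncomment to show the top entries by cumulative time
print(result)
```

The same pattern should work with any Python profiler (py-spy, line_profiler, etc.), with the usual caveat that blocking launches change the program's performance characteristics, so the numbers describe the serialized execution, not the normal overlapped one.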
We can also monitor GPU utilization with tools like nvidia-smi <>.
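For programmatic use, nvidia-smi has a machine-readable mode, e.g. `nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv,noheader,nounits`. A small sketch of consuming that output; the sample string stands in for a live call (no GPU assumed here), where in practice it would come from `subprocess.run([...]).stdout`.

```python
import csv
import io

# Stand-in for the stdout of:
#   nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv,noheader,nounits
# One line per GPU: index, utilization %, memory used (MiB).
sample_output = "0, 87, 10240\n1, 12, 2048\n"

gpus = []
for row in csv.reader(io.StringIO(sample_output), skipinitialspace=True):
    index, util, mem = row
    gpus.append({
        "index": int(index),
        "util_pct": int(util),
        "mem_used_mib": int(mem),
    })

print(gpus[0])  # → {'index': 0, 'util_pct': 87, 'mem_used_mib': 10240}
```

Polling this in a background thread alongside the framework profiler would let the layer correlate per-operator timings with overall GPU utilization and memory pressure.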