Brief notes for master-gpu
I left out files that I am not familiar with. | |
spack/package.py | 131 ++ | |
spack/packages.yaml | 16 +- | |
Add new packages for NVIDIA GPUs | |
src/lib/analysis/Args.hpp | 3 + | |
Initialize instruction mix analysis files. All the features are implemented in master-gpu-instruction-analyzer branch. | |
src/lib/analysis/ArgsHPCProf.cpp | 44 +- | |
Find cubin structure files | |
src/lib/analysis/CallPath-CudaCFG.cpp | 942 ++++++++ | |
src/lib/analysis/CallPath-CudaCFG.hpp | 202 ++ | |
Approximate GPU calling context trees when there are GPU INST metrics | |
src/lib/analysis/CallPath.cpp | 91 +- | |
Do not coalesce call instructions | |
src/lib/banal/ElfHelper.cpp | 11 +- | |
src/lib/banal/ElfHelper.hpp | 21 +- | |
Cubin elf format helper | |
src/lib/banal/Fatbin.cpp | 13 +- | |
Handle nvidia’s fatbin that contains both CPU and GPU binaries | |
src/lib/banal/RelocateCubin.cpp | 273 ++- | |
src/lib/banal/RelocateCubin.hpp | 1 + | |
Relocate cubin symbol table | |
src/lib/banal/Struct-Inline.cpp | 33 +- | |
src/lib/banal/Struct-Inline.hpp | 31 +- | |
src/lib/banal/Struct-Output.cpp | 76 +- | |
Handle call instructions and nvidia device tag | |
src/lib/banal/Struct.cpp | 446 ++-- | |
If nvdisasm succeeds, analyze loops; if nvdisasm fails, only find line mappings | |
src/lib/banal/cuda/CFGParser.cpp | 495 ++++ | |
src/lib/banal/cuda/CFGParser.hpp | 52 + | |
Parse the dot file dumped by nvdisasm and construct cuda blocks and functions | |
src/lib/banal/cuda/CudaBlock.cpp | 34 + | |
src/lib/banal/cuda/CudaBlock.hpp | 26 + | |
Transform a CUDA Block to a Dyninst Block | |
src/lib/banal/cuda/CudaCFGFactory.cpp | 92 + | |
src/lib/banal/cuda/CudaCFGFactory.hpp | 32 + | |
Construct a Dyninst CFGFactory | |
src/lib/banal/cuda/CudaCodeSource.cpp | 15 + | |
src/lib/banal/cuda/CudaCodeSource.hpp | 55 + | |
Construct a Dyninst CodeSource | |
src/lib/banal/cuda/CudaFunction.cpp | 12 + | |
src/lib/banal/cuda/CudaFunction.hpp | 25 + | |
Transform a CUDA Function to a Dyninst Function | |
src/lib/banal/cuda/DotCFG.hpp | 224 ++ | |
CUDA function and block data structures | |
src/lib/banal/cuda/Graph.hpp | 52 + | |
src/lib/banal/cuda/GraphReader.cpp | 85 + | |
src/lib/banal/cuda/GraphReader.hpp | 35 + | |
Use boost APIs to read CFG structures from a dot file | |
src/lib/banal/cuda/ReadCubinCFG.cpp | 319 +++ | |
src/lib/banal/cuda/ReadCubinCFG.hpp | 17 + | |
Find nvdisasm and apply it on every valid function | |
src/lib/prof-lean/crypto-hash.c | 299 +++ | |
src/lib/prof-lean/crypto-hash.h | 151 ++ | |
src/lib/prof-lean/hash.c | 105 + | |
src/lib/prof-lean/hash.h | 86 + | |
compute a hash to identify unique memory segments | |
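The idea of identifying a memory segment by a hash of its contents can be sketched as follows. This is only an illustration: the notes do not say which algorithm crypto-hash.c uses, so SHA-256 is an assumption here, and `segment_hash` is a hypothetical helper name.

```python
import hashlib

def segment_hash(segment: bytes) -> str:
    """Return a hex digest that identifies the contents of a memory segment."""
    # SHA-256 is an assumption; the real crypto-hash.c may use another algorithm.
    return hashlib.sha256(segment).hexdigest()

# Segments with identical bytes get the same identity, so a segment
# seen twice is stored only once.
seen = {}
for blob in (b"\x7fELF-cubin-A", b"\x7fELF-cubin-B", b"\x7fELF-cubin-A"):
    seen.setdefault(segment_hash(blob), blob)
```

Keying on the digest rather than on an address means the identity does not depend on where the segment happens to be loaded.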
src/lib/prof-lean/placeholders.c | 87 +- | |
src/lib/prof-lean/placeholders.h | 51 + | |
Add special placeholder frames for CUDA and OpenMP | |
src/lib/prof/CCT-Tree.cpp | 30 +- | |
src/lib/prof/CCT-Tree.hpp | 188 +- | |
Add SCC frames; add clone functions | |
src/lib/prof/CallPath-Profile.cpp | 20 +- | |
Hide some metrics in hpcviewer | |
src/lib/prof/Struct-Tree.hpp | 37 +- | |
Differentiate call statements and ordinary statements | |
src/lib/profxml/PGMDocHandler.cpp | 60 +- | |
src/lib/profxml/PGMDocHandler.hpp | 3 + | |
Parse call and device tags in struct files | |
src/tool/hpcrun/cct/cct-node-vector.c | 71 + | |
src/tool/hpcrun/cct/cct-node-vector.h | 31 + | |
We used this data structure initially, but I believe no code uses it anymore | |
src/tool/hpcrun/cct/cct.c | 458 +++- | |
src/tool/hpcrun/cct/cct.h | 43 +- | |
1. Add functions for counting and writing dummy nodes | |
2. Add functions for wait-free queues used in OMPT | |
src/tool/hpcrun/cct2metrics.c | 98 +- | |
src/tool/hpcrun/cct2metrics.h | 25 +- | |
1. Fix bugs when multiple metric kinds are used | |
2. Add a new function for moving metrics from one node to another | |
src/tool/hpcrun/control-knob.c | 101 + | |
src/tool/hpcrun/control-knob.h | 31 + | |
1. hpcrun -ck KEY1=VALUE,KEY2=VALUE | |
e.g., control device buffer size on GPUs | |
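A minimal sketch of how a `KEY1=VALUE,KEY2=VALUE` string could be parsed. This is not the code in control-knob.c; the function name and the knob names below are hypothetical and only show the format.

```python
def parse_control_knobs(spec: str) -> dict:
    """Parse 'KEY1=VALUE,KEY2=VALUE' into a dict of string values."""
    knobs = {}
    for item in spec.split(","):
        if not item:
            continue  # tolerate empty items such as a trailing comma
        key, sep, value = item.partition("=")
        if not sep or not key:
            raise ValueError(f"malformed knob: {item!r}")
        knobs[key.strip()] = value.strip()
    return knobs

# HYPOTHETICAL_BUFFER_SIZE and HYPOTHETICAL_FLAG are invented keys for illustration.
knobs = parse_control_knobs("HYPOTHETICAL_BUFFER_SIZE=8388608,HYPOTHETICAL_FLAG=1")
buffer_size = int(knobs["HYPOTHETICAL_BUFFER_SIZE"])
```

Values stay strings until a consumer converts them, so each knob can decide its own type.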
src/tool/hpcrun/device-finalizers.c | 23 + | |
src/tool/hpcrun/device-finalizers.h | 23 + | |
Flush back buffers on the devices when a thread terminates | |
src/tool/hpcrun/device-initializers.c | 41 + | |
src/tool/hpcrun/device-initializers.h | 6 + | |
Before program execution, register a module ignore function to ignore cuda modules. | |
The module ignore function is called when a new module is loaded. | |
src/tool/hpcrun/hpcrun-initializers.c | 43 + | |
src/tool/hpcrun/hpcrun-initializers.h | 24 + | |
Used by OMPT | |
src/tool/hpcrun/hpcrun_flag_stacks.c | 51 + | |
src/tool/hpcrun/hpcrun_flag_stacks.h | 12 + | |
Used to handle recursive dlopen and dlclose. | |
In monitor_pre_dlopen, push a flag that indicates if pre_dlopen succeeds or not. | |
In monitor_dlopen, pop the current dlopen flag and perform the dlopen only if flag==true. | |
monitor_dlclose and monitor_post_dlclose are analogous | |
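The push/pop discipline described above can be sketched like this. The functions `on_pre_dlopen` and `on_dlopen` are hypothetical stand-ins for the monitor callbacks, and the list `handled` only records which loads would have been processed.

```python
dlopen_flags = []  # stack of booleans, one per dlopen currently in flight
handled = []       # paths whose dlopen handling was actually performed

def on_pre_dlopen(succeeded: bool) -> None:
    # Remember whether the pre-dlopen step succeeded for this nesting level.
    dlopen_flags.append(succeeded)

def on_dlopen(path: str) -> None:
    # Pop the flag pushed by the matching pre-dlopen; act only if it was true.
    if dlopen_flags.pop():
        handled.append(path)

# A nested (recursive) dlopen: the inner pre-dlopen fails, the outer one succeeds.
on_pre_dlopen(True)        # outer load begins
on_pre_dlopen(False)       # inner load begins, pre step failed
on_dlopen("libinner.so")   # pops False, skipped
on_dlopen("libouter.so")   # pops True, handled
```

Because calls nest, a stack pairs each dlopen with the flag from its own pre-dlopen rather than with the most recent one overall.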
src/tool/hpcrun/loadmap.c | 14 +- | |
use load map callback to invoke module ignore lookup while call stack unwinding | |
src/tool/hpcrun/main.c | 129 +- | |
src/tool/hpcrun/main.h | 3 + | |
1. Ignore threads created by cuda modules | |
2. dlopen handlers | |
src/tool/hpcrun/memory/hpcrun-malloc.h | 1 + | |
src/tool/hpcrun/memory/mem.c | 16 + | |
Add a safe malloc that could be used concurrently with event callback. | |
Previously, if an event was triggered during hpcrun_malloc, some values might be changed by the event. | |
In hpcrun_safe_malloc, events are not allowed while the allocation is in progress. | |
src/tool/hpcrun/messages/debug-flag.c | 3 +- | |
src/tool/hpcrun/messages/messages-sync.c | 2 +- | |
Be careful, I suspect there are some problems with these files | |
src/tool/hpcrun/metrics.c | 478 ++-- | |
src/tool/hpcrun/metrics.h | 65 +- | |
Fix bugs for sparse kind. The bugs were triggered when multiple metric kinds were used, for example, CPUTIME and nvidia-cuda events. | |
src/tool/hpcrun/module-ignore-map.c | 311 +++ | |
src/tool/hpcrun/module-ignore-map.h | 123 + | |
Ignore libcuda, libcuda_rt, and libcupti | |
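A rough sketch of the module-ignore idea: a callback runs when a module is loaded, records the address range of ignored libraries, and a lookup answers whether an address falls in one of them. All function names are hypothetical, and matching on the library name before `.so` is an assumption, not necessarily what module-ignore-map.c does.

```python
import os

# Library names are taken from the note above.
IGNORED_LIBRARIES = {"libcuda", "libcuda_rt", "libcupti"}

ignored_ranges = []  # (start, end) address ranges of ignored modules

def should_ignore(module_path: str) -> bool:
    # "libcuda.so.1" -> "libcuda"; the matching rule is an assumption.
    name = os.path.basename(module_path).split(".so")[0]
    return name in IGNORED_LIBRARIES

def on_module_load(path: str, start: int, end: int) -> None:
    # Called when a new module is loaded.
    if should_ignore(path):
        ignored_ranges.append((start, end))

def address_is_ignored(addr: int) -> bool:
    # Lookup of the kind that could be used during call stack unwinding.
    return any(start <= addr < end for start, end in ignored_ranges)

on_module_load("/usr/lib/libcuda.so.1", 0x1000, 0x2000)
on_module_load("/usr/lib/libm.so.6", 0x3000, 0x4000)
```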
.../hpcrun/sample-sources/nvidia/cubin-hash-map.c | 175 ++ | |
.../hpcrun/sample-sources/nvidia/cubin-hash-map.h | 47 + | |
<cubin id, cubin hash> | |
.../hpcrun/sample-sources/nvidia/cubin-id-map.c | 217 ++ | |
.../hpcrun/sample-sources/nvidia/cubin-id-map.h | 72 + | |
<cubin id, hpctoolkit module id> | |
.../hpcrun/sample-sources/nvidia/cubin-symbols.c | 298 +++ | |
.../hpcrun/sample-sources/nvidia/cubin-symbols.h | 81 + | |
relocate function addresses in cubins and return a vector of symbols | |
src/tool/hpcrun/sample-sources/nvidia/cuda-api.c | 211 ++ | |
src/tool/hpcrun/sample-sources/nvidia/cuda-api.h | 95 + | |
use cuda api to look up gpu device property | |
.../hpcrun/sample-sources/nvidia/cuda-device-map.c | 265 +++ | |
.../hpcrun/sample-sources/nvidia/cuda-device-map.h | 121 + | |
<device id, cuda properties> | |
.../nvidia/cuda-state-placeholders.c | 89 + | |
.../nvidia/cuda-state-placeholders.h | 27 + | |
cuda state markers | |
.../hpcrun/sample-sources/nvidia/cupti-analysis.c | 132 ++ | |
.../hpcrun/sample-sources/nvidia/cupti-analysis.h | 27 + | |
analyze kernel occupancy and sm efficiency | |
src/tool/hpcrun/sample-sources/nvidia/cupti-api.c | 1947 ++++++++++++++++ | |
src/tool/hpcrun/sample-sources/nvidia/cupti-api.h | 276 +++ | |
cupti api wrappers. We check if a program links with libcuda or libcuda_rt. | |
If yes, we enable cupti apis. We register cupti callbacks at the start of a program execution. | |
We enable activity tracing when a new cuda context is created in the callback. | |
For cuda driver and runtime activities, we unwind the call stack in the callback function and pass the call path to the cupti thread. | |
The cupti thread sends the measurement data back to the corresponding application thread. | |
.../nvidia/cupti-correlation-id-map.c | 228 ++ | |
.../nvidia/cupti-correlation-id-map.h | 93 + | |
<cuda correlation id, external correlation id> | |
.../sample-sources/nvidia/cupti-function-id-map.c | 180 ++ | |
.../sample-sources/nvidia/cupti-function-id-map.h | 55 + | |
<cuda function id, cuda function index> | |
.../sample-sources/nvidia/cupti-host-op-map.c | 236 ++ | |
.../sample-sources/nvidia/cupti-host-op-map.h | 80 + | |
<external correlation id, cct node> | |
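The two maps above can be chained to attribute a GPU activity record to a CCT node. The sketch below uses plain dicts and hypothetical function names; it only mirrors the key/value pairs listed in these notes, not the actual CUPTI callback signatures.

```python
correlation_id_map = {}  # cuda correlation id -> external correlation id
host_op_map = {}         # external correlation id -> cct node (a dict of metrics here)

def on_api_enter(external_id: int, cct_node: dict) -> None:
    # Record the call path node captured when the application thread enters a CUDA API.
    host_op_map[external_id] = cct_node

def on_external_correlation(cuda_id: int, external_id: int) -> None:
    # Record the link between the CUDA correlation id and our own id.
    correlation_id_map[cuda_id] = external_id

def attribute_activity(cuda_id: int, metric: str, value: float) -> bool:
    # Follow cuda id -> external id -> cct node, then add the metric value.
    external_id = correlation_id_map.get(cuda_id)
    node = host_op_map.get(external_id)
    if node is None:
        return False  # unknown correlation; nothing to attribute
    node[metric] = node.get(metric, 0) + value
    return True

kernel_node = {}
on_api_enter(external_id=7, cct_node=kernel_node)
on_external_correlation(cuda_id=1001, external_id=7)
attributed = attribute_activity(1001, "kernel_time_ns", 2500)
```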
src/tool/hpcrun/sample-sources/nvidia/cupti-node.c | 219 ++ | |
src/tool/hpcrun/sample-sources/nvidia/cupti-node.h | 180 ++ | |
.../hpcrun/sample-sources/nvidia/cupti-record.c | 124 + | |
.../hpcrun/sample-sources/nvidia/cupti-record.h | 77 + | |
.../hpcrun/sample-sources/nvidia/cupti-stack.c | 84 + | |
.../hpcrun/sample-sources/nvidia/cupti-stack.h | 63 + | |
Structures and communication methods for wait-free channels. This code is refactored in another branch. | |
src/tool/hpcrun/sample-sources/nvidia/nvidia.c | 1048 +++++++++ | |
src/tool/hpcrun/sample-sources/nvidia/nvidia.h | 14 + | |
Initialize gpu metrics and cupti setting; attribute gpu metrics | |
src/tool/hpcrun/thread_data.c | 64 +- | |
src/tool/hpcrun/thread_data.h | 52 +- | |
Create a hash map to store recipes when a thread is initialized | |
src/tool/hpcrun/unwind/common/uw_hash.c | 174 ++ | |
src/tool/hpcrun/unwind/common/uw_hash.h | 134 ++ | |
src/tool/hpcrun/unwind/common/uw_recipe_map.c | 198 +- | |
src/tool/hpcrun/unwind/common/uw_recipe_map.h | 1 + | |
query and update the thread local hash map | |
src/tool/hpcstruct/Args.cpp | 188 +- | |
src/tool/hpcstruct/Args.hpp | 7 +- | |
Add arguments for outputting struct files | |
src/tool/hpcstruct/main.cpp | 92 +- | |
Pay attention: I am not sure why this file differs between master-gpu and master | |