@Jokeren
Last active August 27, 2019 23:57
Brief notes for master-gpu
I left out files that I am not familiar with.
spack/package.py | 131 ++
spack/packages.yaml | 16 +-
Add new packages for NVIDIA GPUs
src/lib/analysis/Args.hpp | 3 +
Initialize instruction mix analysis files. All the features are implemented in master-gpu-instruction-analyzer branch.
src/lib/analysis/ArgsHPCProf.cpp | 44 +-
Find cubin structure files
src/lib/analysis/CallPath-CudaCFG.cpp | 942 ++++++++
src/lib/analysis/CallPath-CudaCFG.hpp | 202 ++
Approximate GPU calling context trees when there are GPU INST metrics
src/lib/analysis/CallPath.cpp | 91 +-
Do not coalesce call instructions
src/lib/banal/ElfHelper.cpp | 11 +-
src/lib/banal/ElfHelper.hpp | 21 +-
Cubin elf format helper
src/lib/banal/Fatbin.cpp | 13 +-
Handle nvidia’s fatbin that contains both CPU and GPU binaries
src/lib/banal/RelocateCubin.cpp | 273 ++-
src/lib/banal/RelocateCubin.hpp | 1 +
Relocate cubin symbol table
src/lib/banal/Struct-Inline.cpp | 33 +-
src/lib/banal/Struct-Inline.hpp | 31 +-
src/lib/banal/Struct-Output.cpp | 76 +-
Handle call instructions and nvidia device tag
src/lib/banal/Struct.cpp | 446 ++--
If nvdisasm succeeds, analyze loops; if nvdisasm fails, only find line mappings
src/lib/banal/cuda/CFGParser.cpp | 495 ++++
src/lib/banal/cuda/CFGParser.hpp | 52 +
parse dot file dumped by nvdisasm, construct cuda blocks and functions
src/lib/banal/cuda/CudaBlock.cpp | 34 +
src/lib/banal/cuda/CudaBlock.hpp | 26 +
Transform a CUDA Block to a Dyninst Block
src/lib/banal/cuda/CudaCFGFactory.cpp | 92 +
src/lib/banal/cuda/CudaCFGFactory.hpp | 32 +
Construct a Dyninst CFGFactory
src/lib/banal/cuda/CudaCodeSource.cpp | 15 +
src/lib/banal/cuda/CudaCodeSource.hpp | 55 +
Construct a Dyninst CodeSource
src/lib/banal/cuda/CudaFunction.cpp | 12 +
src/lib/banal/cuda/CudaFunction.hpp | 25 +
Transform a CUDA Function to a Dyninst Function
src/lib/banal/cuda/DotCFG.hpp | 224 ++
CUDA function and block data structures
src/lib/banal/cuda/Graph.hpp | 52 +
src/lib/banal/cuda/GraphReader.cpp | 85 +
src/lib/banal/cuda/GraphReader.hpp | 35 +
Use boost APIs to read CFG structures from a dot file
src/lib/banal/cuda/ReadCubinCFG.cpp | 319 +++
src/lib/banal/cuda/ReadCubinCFG.hpp | 17 +
Find nvdisasm and apply it on every valid function
src/lib/prof-lean/crypto-hash.c | 299 +++
src/lib/prof-lean/crypto-hash.h | 151 ++
src/lib/prof-lean/hash.c | 105 +
src/lib/prof-lean/hash.h | 86 +
compute a hash to identify unique memory segments
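As a rough illustration of identifying a memory segment by its hash, here is a minimal sketch. It uses 64-bit FNV-1a as a stand-in, which is not a cryptographic hash and is not taken from crypto-hash.c; the names segment_hash and same_segment are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: 64-bit FNV-1a over a memory segment.
 * Stand-in for illustration only; the branch's crypto-hash.c is not shown here. */
uint64_t segment_hash(const void *data, size_t len)
{
  const unsigned char *bytes = data;
  uint64_t h = 14695981039346656037ULL;  /* FNV-1a offset basis */
  assert(data != NULL || len == 0);
  for (size_t i = 0; i < len; i++) {
    h ^= bytes[i];
    h *= 1099511628211ULL;               /* FNV-1a prime */
  }
  return h;
}

/* Treat two segments as the same when both length and hash match. */
int same_segment(const void *a, size_t a_len, const void *b, size_t b_len)
{
  return a_len == b_len && segment_hash(a, a_len) == segment_hash(b, b_len);
}
```

With a hash like this, a cubin that is loaded twice at different addresses would map to the same key, while a cubin with different contents would get a different one.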
src/lib/prof-lean/placeholders.c | 87 +-
src/lib/prof-lean/placeholders.h | 51 +
Add special placeholder frames for CUDA and OpenMP
src/lib/prof/CCT-Tree.cpp | 30 +-
src/lib/prof/CCT-Tree.hpp | 188 +-
Add SCC frames; add clone functions
src/lib/prof/CallPath-Profile.cpp | 20 +-
Hide some metrics in hpcviewer
src/lib/prof/Struct-Tree.hpp | 37 +-
Differentiate call statements and ordinary statements
src/lib/profxml/PGMDocHandler.cpp | 60 +-
src/lib/profxml/PGMDocHandler.hpp | 3 +
Parse call and device tags in struct files
src/tool/hpcrun/cct/cct-node-vector.c | 71 +
src/tool/hpcrun/cct/cct-node-vector.h | 31 +
We used this data structure initially, but I believe no code uses it anymore
src/tool/hpcrun/cct/cct.c | 458 +++-
src/tool/hpcrun/cct/cct.h | 43 +-
1. Add functions for counting and writing dummy nodes
2. Add functions for wait-free queues used in OMPT
src/tool/hpcrun/cct2metrics.c | 98 +-
src/tool/hpcrun/cct2metrics.h | 25 +-
1. Fix bugs when multiple metric kinds are used
2. Add a new function for moving metrics from one node to another
src/tool/hpcrun/control-knob.c | 101 +
src/tool/hpcrun/control-knob.h | 31 +
1. hpcrun -ck KEY1=VALUE,KEY2=VALUE
e.g., control device buffer size on GPUs
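To make the KEY1=VALUE,KEY2=VALUE format concrete, here is a minimal sketch of how such a string could be split into a lookup table. The table layout and the names knobs_parse and knobs_get are assumptions for illustration; control-knob.c may be organized differently.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_KNOBS 16
#define MAX_LEN 64

/* Hypothetical storage; the real control-knob.c may keep knobs differently. */
typedef struct {
  char key[MAX_LEN];
  char value[MAX_LEN];
} knob_t;

static knob_t knobs[MAX_KNOBS];
static int num_knobs = 0;

/* Parse "KEY1=VALUE,KEY2=VALUE"; return the number of knobs stored. */
int knobs_parse(const char *spec)
{
  const char *p = spec;
  assert(spec != NULL);
  num_knobs = 0;
  while (*p != '\0' && num_knobs < MAX_KNOBS) {
    size_t pair_len = strcspn(p, ",");          /* up to next ',' or end */
    const char *eq = memchr(p, '=', pair_len);
    if (eq != NULL && eq != p) {                /* skip entries with no key or no '=' */
      size_t klen = (size_t)(eq - p);
      size_t vlen = pair_len - klen - 1;
      if (klen < MAX_LEN && vlen < MAX_LEN) {
        memcpy(knobs[num_knobs].key, p, klen);
        knobs[num_knobs].key[klen] = '\0';
        memcpy(knobs[num_knobs].value, eq + 1, vlen);
        knobs[num_knobs].value[vlen] = '\0';
        num_knobs++;
      }
    }
    p += pair_len;
    if (*p == ',') p++;
  }
  return num_knobs;
}

/* Return the value for key, or NULL if the knob was not set. */
const char *knobs_get(const char *key)
{
  for (int i = 0; i < num_knobs; i++)
    if (strcmp(knobs[i].key, key) == 0) return knobs[i].value;
  return NULL;
}
```

A caller would run knobs_parse once on the argument passed to -ck and then query individual keys with knobs_get, falling back to a default when the result is NULL.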
src/tool/hpcrun/device-finalizers.c | 23 +
src/tool/hpcrun/device-finalizers.h | 23 +
Flush back buffers on the devices when a thread terminates
src/tool/hpcrun/device-initializers.c | 41 +
src/tool/hpcrun/device-initializers.h | 6 +
Before program execution, register a module ignore function to ignore cuda modules.
The module ignore function is called when a new module is loaded.
src/tool/hpcrun/hpcrun-initializers.c | 43 +
src/tool/hpcrun/hpcrun-initializers.h | 24 +
Used by OMPT
src/tool/hpcrun/hpcrun_flag_stacks.c | 51 +
src/tool/hpcrun/hpcrun_flag_stacks.h | 12 +
Used to handle recursive dlopen and dlclose.
In monitor_pre_dlopen, push a flag that indicates if pre_dlopen succeeds or not.
In monitor_dlopen, pop the current dlopen flag and handle the dlopen only if flag==true.
monitor_dlclose and monitor_post_dlclose are analogous
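The push/pop pairing described above can be sketched as a small stack of booleans. This is only an illustration under assumptions: the type flag_stack_t and the hooks on_pre_dlopen and on_dlopen are hypothetical names, and the real hpcrun_flag_stacks.c is presumably per-thread and may differ in layout.

```c
#include <assert.h>
#include <stdbool.h>

#define FLAG_STACK_MAX 32

/* Hypothetical fixed-size stack of flags, one per in-flight dlopen. */
typedef struct {
  bool flags[FLAG_STACK_MAX];
  int top;
} flag_stack_t;

void flag_stack_push(flag_stack_t *s, bool flag)
{
  assert(s->top < FLAG_STACK_MAX);
  s->flags[s->top++] = flag;
}

bool flag_stack_pop(flag_stack_t *s)
{
  assert(s->top > 0);
  return s->flags[--s->top];
}

/* Sketch of the pre hook: record whether the pre-dlopen step succeeded. */
void on_pre_dlopen(flag_stack_t *s, bool pre_ok)
{
  flag_stack_push(s, pre_ok);
}

/* Sketch of the post hook: true means the dlopen handling should run. */
bool on_dlopen(flag_stack_t *s)
{
  return flag_stack_pop(s);
}
```

Because a dlopen can trigger another dlopen before it returns, the inner call pushes and pops its own flag first, so each post hook sees the flag that its matching pre hook recorded.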
src/tool/hpcrun/loadmap.c | 14 +-
Use the load map callback to invoke the module ignore lookup during call stack unwinding
src/tool/hpcrun/main.c | 129 +-
src/tool/hpcrun/main.h | 3 +
1. Ignore threads created by cuda modules
2. dlopen handlers
src/tool/hpcrun/memory/hpcrun-malloc.h | 1 +
src/tool/hpcrun/memory/mem.c | 16 +
Add a safe malloc that could be used concurrently with event callback.
Previously, if an event was triggered during hpcrun_malloc, some values might be changed by the event.
In hpcrun_safe_malloc, events are not allowed while the malloc is in progress.
src/tool/hpcrun/messages/debug-flag.c | 3 +-
src/tool/hpcrun/messages/messages-sync.c | 2 +-
Be careful, I suspect there are some problems with these files
src/tool/hpcrun/metrics.c | 478 ++--
src/tool/hpcrun/metrics.h | 65 +-
Fix bugs for sparse metric kinds. The bugs were triggered when multiple metric kinds were used together, for example CPUTIME and nvidia-cuda events.
src/tool/hpcrun/module-ignore-map.c | 311 +++
src/tool/hpcrun/module-ignore-map.h | 123 +
Ignore libcuda, libcudart, and libcupti
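A minimal sketch of deciding whether a loaded module should be ignored, matching on the file name only. This is an assumption made for illustration: module_ignored is a hypothetical name, and the real module-ignore-map.c may key on load modules or address ranges rather than paths. The CUDA runtime library appears on disk as libcudart.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Library name stems to ignore; a '.' must follow the stem in the file name. */
static const char *ignored_modules[] = { "libcuda", "libcudart", "libcupti" };

/* Hypothetical check: true if path names one of the ignored libraries. */
bool module_ignored(const char *path)
{
  const char *slash;
  const char *base;
  assert(path != NULL);
  slash = strrchr(path, '/');
  base = slash ? slash + 1 : path;   /* strip the directory part */
  for (size_t i = 0; i < sizeof ignored_modules / sizeof ignored_modules[0]; i++) {
    size_t n = strlen(ignored_modules[i]);
    if (strncmp(base, ignored_modules[i], n) == 0 && base[n] == '.')
      return true;
  }
  return false;
}
```

Requiring a '.' after the stem keeps an unrelated library whose name merely starts with libcuda from being ignored.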
.../hpcrun/sample-sources/nvidia/cubin-hash-map.c | 175 ++
.../hpcrun/sample-sources/nvidia/cubin-hash-map.h | 47 +
<cubin id, cubin hash>
.../hpcrun/sample-sources/nvidia/cubin-id-map.c | 217 ++
.../hpcrun/sample-sources/nvidia/cubin-id-map.h | 72 +
<cubin id, hpctoolkit module id>
.../hpcrun/sample-sources/nvidia/cubin-symbols.c | 298 +++
.../hpcrun/sample-sources/nvidia/cubin-symbols.h | 81 +
relocate function addresses in cubins and return a vector of symbols
src/tool/hpcrun/sample-sources/nvidia/cuda-api.c | 211 ++
src/tool/hpcrun/sample-sources/nvidia/cuda-api.h | 95 +
use cuda api to look up gpu device property
.../hpcrun/sample-sources/nvidia/cuda-device-map.c | 265 +++
.../hpcrun/sample-sources/nvidia/cuda-device-map.h | 121 +
<device id, cuda properties>
.../nvidia/cuda-state-placeholders.c | 89 +
.../nvidia/cuda-state-placeholders.h | 27 +
cuda state markers
.../hpcrun/sample-sources/nvidia/cupti-analysis.c | 132 ++
.../hpcrun/sample-sources/nvidia/cupti-analysis.h | 27 +
analyze kernel occupancy and sm efficiency
src/tool/hpcrun/sample-sources/nvidia/cupti-api.c | 1947 ++++++++++++++++
src/tool/hpcrun/sample-sources/nvidia/cupti-api.h | 276 +++
CUPTI API wrappers. We check whether a program links with libcuda or libcudart.
If it does, we enable the CUPTI APIs. We register CUPTI callbacks at the start of program execution.
We enable activity tracing in the callback when a new CUDA context is created.
For CUDA driver and runtime activities, we unwind the call stack in the callback function and pass the call path to the cupti thread.
The cupti thread sends the measurement data back to the corresponding application thread.
.../nvidia/cupti-correlation-id-map.c | 228 ++
.../nvidia/cupti-correlation-id-map.h | 93 +
<cuda correlation id, external correlation id>
.../sample-sources/nvidia/cupti-function-id-map.c | 180 ++
.../sample-sources/nvidia/cupti-function-id-map.h | 55 +
<cuda function id, cuda function index>
.../sample-sources/nvidia/cupti-host-op-map.c | 236 ++
.../sample-sources/nvidia/cupti-host-op-map.h | 80 +
<external correlation id, cct node>
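One way to read the correlation-id and host-op maps listed above is as a two-step lookup from a GPU activity record back to a CPU calling context node. The sketch below assumes that reading. It uses linear arrays for brevity, and every name in it (cct_node_t as a stand-in type, corr_map_insert, host_op_map_insert, lookup_cct_node) is hypothetical rather than taken from the branch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAP_CAP 64

/* Stand-in for hpcrun's CCT node type. */
typedef struct cct_node { int id; } cct_node_t;

typedef struct { uint32_t correlation_id; uint64_t external_id; } corr_entry_t;
typedef struct { uint64_t external_id; cct_node_t *node; } host_op_entry_t;

static corr_entry_t corr_map[MAP_CAP];        /* <cuda correlation id, external correlation id> */
static size_t corr_count = 0;
static host_op_entry_t host_op_map[MAP_CAP];  /* <external correlation id, cct node> */
static size_t host_op_count = 0;

void corr_map_insert(uint32_t correlation_id, uint64_t external_id)
{
  assert(corr_count < MAP_CAP);
  corr_map[corr_count].correlation_id = correlation_id;
  corr_map[corr_count].external_id = external_id;
  corr_count++;
}

void host_op_map_insert(uint64_t external_id, cct_node_t *node)
{
  assert(host_op_count < MAP_CAP);
  host_op_map[host_op_count].external_id = external_id;
  host_op_map[host_op_count].node = node;
  host_op_count++;
}

/* Resolve a correlation id to a CCT node through both maps; NULL if unknown. */
cct_node_t *lookup_cct_node(uint32_t correlation_id)
{
  for (size_t i = 0; i < corr_count; i++) {
    if (corr_map[i].correlation_id != correlation_id) continue;
    for (size_t j = 0; j < host_op_count; j++)
      if (host_op_map[j].external_id == corr_map[i].external_id)
        return host_op_map[j].node;
    return NULL;  /* external id has no registered host operation */
  }
  return NULL;    /* correlation id was never recorded */
}
```

In this reading, the application thread would register the external id and its CCT node when it issues a CUDA call, and the thread processing activity records would later resolve each record's correlation id to that node.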
src/tool/hpcrun/sample-sources/nvidia/cupti-node.c | 219 ++
src/tool/hpcrun/sample-sources/nvidia/cupti-node.h | 180 ++
.../hpcrun/sample-sources/nvidia/cupti-record.c | 124 +
.../hpcrun/sample-sources/nvidia/cupti-record.h | 77 +
.../hpcrun/sample-sources/nvidia/cupti-stack.c | 84 +
.../hpcrun/sample-sources/nvidia/cupti-stack.h | 63 +
Structures and communication methods for wait-free channels. This code is refactored in another branch.
src/tool/hpcrun/sample-sources/nvidia/nvidia.c | 1048 +++++++++
src/tool/hpcrun/sample-sources/nvidia/nvidia.h | 14 +
Initialize gpu metrics and cupti setting; attribute gpu metrics
src/tool/hpcrun/thread_data.c | 64 +-
src/tool/hpcrun/thread_data.h | 52 +-
Create a hash map to store unwind recipes when a thread is initialized
src/tool/hpcrun/unwind/common/uw_hash.c | 174 ++
src/tool/hpcrun/unwind/common/uw_hash.h | 134 ++
src/tool/hpcrun/unwind/common/uw_recipe_map.c | 198 +-
src/tool/hpcrun/unwind/common/uw_recipe_map.h | 1 +
query and update the thread local hash map
src/tool/hpcstruct/Args.cpp | 188 +-
src/tool/hpcstruct/Args.hpp | 7 +-
Add arguments for outputting struct files
src/tool/hpcstruct/main.cpp | 92 +-
Pay attention: I am not sure why there is a difference between master-gpu and master for this file