Jokeren/NOTE

## NOTE
I left files that I am not familiar with.

spack/package.py                               	|  131 ++
spack/packages.yaml                            	|   16 +-

Add new packages for NVIDIA GPUs

 src/lib/analysis/Args.hpp                      	|	3 +

Initialize instruction mix analysis files. All the features are implemented in the master-gpu-instruction-analyzer branch.

 src/lib/analysis/ArgsHPCProf.cpp               	|   44 +-

Find cubin structure files

 src/lib/analysis/CallPath-CudaCFG.cpp          	|  942 ++++++++
 src/lib/analysis/CallPath-CudaCFG.hpp          	|  202 ++

Approximate GPU calling context trees when there are GPU INST metrics

 src/lib/analysis/CallPath.cpp                  	|   91 +-

Do not coalesce call instructions

 src/lib/banal/ElfHelper.cpp                    	|   11 +-
 src/lib/banal/ElfHelper.hpp                    	|   21 +-

Cubin elf format helper

 src/lib/banal/Fatbin.cpp                       	|   13 +-

Handle nvidia’s fatbin that contains both CPU and GPU binaries

 src/lib/banal/RelocateCubin.cpp                	|  273 ++-
 src/lib/banal/RelocateCubin.hpp                	|	1 +

Relocate cubin symbol table

 src/lib/banal/Struct-Inline.cpp                	|   33 +-
 src/lib/banal/Struct-Inline.hpp                	|   31 +-
 src/lib/banal/Struct-Output.cpp                	|   76 +-

Handle call instructions and nvidia device tag

 src/lib/banal/Struct.cpp                       	|  446 ++--

If nvdisasm succeeds, analyze loops; if nvdisasm fails, only find line mappings

 src/lib/banal/cuda/CFGParser.cpp               	|  495 ++++
 src/lib/banal/cuda/CFGParser.hpp               	|   52 +

parse dot file dumped by nvdisasm, construct cuda blocks and functions

 src/lib/banal/cuda/CudaBlock.cpp               	|   34 +
 src/lib/banal/cuda/CudaBlock.hpp               	|   26 +

Transform a CUDA Block to a Dyninst Block

 src/lib/banal/cuda/CudaCFGFactory.cpp          	|   92 +
 src/lib/banal/cuda/CudaCFGFactory.hpp          	|   32 +

Construct a Dyninst CFGFactory

 src/lib/banal/cuda/CudaCodeSource.cpp          	|   15 +
 src/lib/banal/cuda/CudaCodeSource.hpp          	|   55 +

Construct a Dyninst CodeSource

 src/lib/banal/cuda/CudaFunction.cpp            	|   12 +
 src/lib/banal/cuda/CudaFunction.hpp            	|   25 +

	Transform a CUDA Function to a Dyninst Function

 src/lib/banal/cuda/DotCFG.hpp                  	|  224 ++

	CUDA function and block data structures

 src/lib/banal/cuda/Graph.hpp                   	|   52 +
 src/lib/banal/cuda/GraphReader.cpp             	|   85 +
 src/lib/banal/cuda/GraphReader.hpp             	|   35 +

Use boost APIs to read CFG structures from a dot file

 src/lib/banal/cuda/ReadCubinCFG.cpp            	|  319 +++
 src/lib/banal/cuda/ReadCubinCFG.hpp            	|   17 +

	Find nvdisasm and apply it on every valid function

 src/lib/prof-lean/crypto-hash.c                	|  299 +++
 src/lib/prof-lean/crypto-hash.h                	|  151 ++
 src/lib/prof-lean/hash.c                       	|  105 +
 src/lib/prof-lean/hash.h                       	|   86 +

compute a hash to identify unique memory segments

 src/lib/prof-lean/placeholders.c               	|   87 +-
 src/lib/prof-lean/placeholders.h               	|   51 +

Add special placeholder frames for CUDA and OpenMP

 src/lib/prof/CCT-Tree.cpp                      	|   30 +-
 src/lib/prof/CCT-Tree.hpp                      	|  188 +-

	Add SCC frames; add clone functions

 src/lib/prof/CallPath-Profile.cpp              	|   20 +-

	Hide some metrics in hpcviewer

 src/lib/prof/Struct-Tree.hpp                   	|   37 +-

Differentiate call statements and ordinary statements

 src/lib/profxml/PGMDocHandler.cpp              	|   60 +-
 src/lib/profxml/PGMDocHandler.hpp              	|	3 +

Parse call and device tags in struct files

 src/tool/hpcrun/cct/cct-node-vector.c          	|   71 +
 src/tool/hpcrun/cct/cct-node-vector.h          	|   31 +

We used this data structure initially, but I believe no code uses it anymore

 src/tool/hpcrun/cct/cct.c                      	|  458 +++-

 src/tool/hpcrun/cct/cct.h                      	|   43 +-

1.	Add functions for counting and writing dummy nodes
2.	Add functions for wait-free queues used in OMPT

 src/tool/hpcrun/cct2metrics.c                  	|   98 +-
 src/tool/hpcrun/cct2metrics.h                  	|   25 +-

1.	Fix bugs when multiple metric kinds are used
2.	Add a new function for moving metrics from one node to another

 src/tool/hpcrun/control-knob.c                 	|  101 +
 src/tool/hpcrun/control-knob.h                 	|   31 +

1.	hpcrun -ck KEY1=VALUE,KEY2=VALUE
e.g., control device buffer size on GPUs
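
A minimal sketch of how a `KEY1=VALUE,KEY2=VALUE` string could be split into knobs. This is not the code in control-knob.c; the names `knob_t`, `knob_parse`, `knob_get`, the table sizes, and the example keys are all hypothetical.

```c
#include <assert.h>
#include <string.h>

#define KNOB_MAX 16   /* hypothetical limit on the number of knobs */
#define KNOB_LEN 64   /* hypothetical limit on key and value length */

/* Hypothetical knob table; not the layout used in control-knob.c. */
typedef struct {
  char key[KNOB_LEN];
  char value[KNOB_LEN];
} knob_t;

static knob_t knobs[KNOB_MAX];
static int knob_count = 0;

/* Parse "KEY1=VALUE,KEY2=VALUE" into the table.
   Returns the number of knobs stored, or -1 on malformed input or overflow
   (pairs parsed before the error remain in the table). */
int knob_parse(const char *spec)
{
  assert(spec != NULL);
  knob_count = 0;

  const char *p = spec;
  while (*p != '\0') {
    /* One pair spans [p, end): end is the next comma or the terminator. */
    const char *end = strchr(p, ',');
    if (end == NULL) end = p + strlen(p);

    const char *eq = memchr(p, '=', (size_t)(end - p));
    if (eq == NULL || eq == p || knob_count == KNOB_MAX) return -1;

    size_t klen = (size_t)(eq - p);
    size_t vlen = (size_t)(end - eq - 1);
    if (klen >= KNOB_LEN || vlen >= KNOB_LEN) return -1;

    memcpy(knobs[knob_count].key, p, klen);
    knobs[knob_count].key[klen] = '\0';
    memcpy(knobs[knob_count].value, eq + 1, vlen);
    knobs[knob_count].value[vlen] = '\0';
    knob_count++;

    p = (*end == ',') ? end + 1 : end;   /* skip the separator, if any */
  }
  return knob_count;
}

/* Look up a knob by name; returns NULL when the key was not set. */
const char *knob_get(const char *key)
{
  for (int i = 0; i < knob_count; i++) {
    if (strcmp(knobs[i].key, key) == 0) return knobs[i].value;
  }
  return NULL;
}
```

For example, after `knob_parse("BUFFER_SIZE=1024,MODE=fast")`, `knob_get("BUFFER_SIZE")` yields the string `"1024"`; the caller converts it to a number where needed.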

 src/tool/hpcrun/device-finalizers.c            	|   23 +
 src/tool/hpcrun/device-finalizers.h            	|   23 +

Flush back buffers on the devices when a thread terminates

 src/tool/hpcrun/device-initializers.c          	|   41 +
 src/tool/hpcrun/device-initializers.h          	|	6 +

Before program execution, register a module ignore function to ignore cuda modules.
The module ignore function is called when a new module is loaded.

 src/tool/hpcrun/hpcrun-initializers.c          	|   43 +
 src/tool/hpcrun/hpcrun-initializers.h          	|   24 +

	Used by OMPT

 src/tool/hpcrun/hpcrun_flag_stacks.c           	|   51 +
 src/tool/hpcrun/hpcrun_flag_stacks.h           	|   12 +

	Used to handle recursive dlopen and dlclose.
	In monitor_pre_dlopen, push a flag that indicates if pre_dlopen succeeds or not.
	In monitor_dlopen, pop the current dlopen flag and perform the dlopen only if flag==true.
	monitor_dlclose and monitor_post_dlclose are analogous

 src/tool/hpcrun/loadmap.c                      	|   14 +-

	Use the load map callback to invoke the module ignore lookup during call stack unwinding

 src/tool/hpcrun/main.c                         	|  129 +-
 src/tool/hpcrun/main.h                         	|	3 +

1.	Ignore threads created by cuda modules
2.	dlopen handlers

 src/tool/hpcrun/memory/hpcrun-malloc.h         	|	1 +
 src/tool/hpcrun/memory/mem.c                   	|   16 +

Add a safe malloc that could be used concurrently with event callback.
Previously, if an event was triggered during hpcrun_malloc, some values might be changed by the event.
In hpcrun_safe_malloc, events are not allowed for the duration of the malloc.

 src/tool/hpcrun/messages/debug-flag.c          	|	3 +-
 src/tool/hpcrun/messages/messages-sync.c       	|    2 +-

Be careful, I suspect there are some problems with these files

 src/tool/hpcrun/metrics.c                      	|  478 ++--
 src/tool/hpcrun/metrics.h                      	|   65 +-

Fix bugs for sparse metric kinds. The bugs were triggered when multiple metric kinds were used, for example CPUTIME and nvidia-cuda events.

 src/tool/hpcrun/module-ignore-map.c            	|  311 +++
 src/tool/hpcrun/module-ignore-map.h            	|  123 +

Ignore libcuda, libcuda_rt, and libcupti

 .../hpcrun/sample-sources/nvidia/cubin-hash-map.c  |  175 ++
 .../hpcrun/sample-sources/nvidia/cubin-hash-map.h  |   47 +

	<cubin id, cubin hash>

 .../hpcrun/sample-sources/nvidia/cubin-id-map.c	|  217 ++

 .../hpcrun/sample-sources/nvidia/cubin-id-map.h	|   72 +

	<cubin id, hpctoolkit module id>

 .../hpcrun/sample-sources/nvidia/cubin-symbols.c   |  298 +++
 .../hpcrun/sample-sources/nvidia/cubin-symbols.h   |   81 +

relocate function addresses in cubins and return a vector of symbols

 src/tool/hpcrun/sample-sources/nvidia/cuda-api.c   |  211 ++
 src/tool/hpcrun/sample-sources/nvidia/cuda-api.h   |   95 +

use cuda api to look up gpu device property

 .../hpcrun/sample-sources/nvidia/cuda-device-map.c |  265 +++
 .../hpcrun/sample-sources/nvidia/cuda-device-map.h |  121 +

	<device id, cuda properties>

 .../nvidia/cuda-state-placeholders.c           	|   89 +
 .../nvidia/cuda-state-placeholders.h           	|   27 +

	cuda state markers

 .../hpcrun/sample-sources/nvidia/cupti-analysis.c  |  132 ++
 .../hpcrun/sample-sources/nvidia/cupti-analysis.h  |   27 +

	analyze kernel occupancy and sm efficiency

 src/tool/hpcrun/sample-sources/nvidia/cupti-api.c  | 1947 ++++++++++++++++
 src/tool/hpcrun/sample-sources/nvidia/cupti-api.h  |  276 +++

	cupti api wrappers. We check if a program links with libcuda or libcuda_rt.
	If yes, we enable cupti apis. We register cupti callbacks at the start of a program execution.
	We enable activity tracing when a new cuda context is created in the callback.
	For cuda driver and runtime activities, we unwind the call stack in the callback function and pass the call path to the cupti thread.
	The cupti thread sends the measurement data back to the corresponding application thread.

 .../nvidia/cupti-correlation-id-map.c          	|  228 ++

 .../nvidia/cupti-correlation-id-map.h          	|   93 +

	<cuda correlation id, external correlation id>

 .../sample-sources/nvidia/cupti-function-id-map.c  |  180 ++
 .../sample-sources/nvidia/cupti-function-id-map.h  |   55 +

	<cuda function id, cuda function index>

 .../sample-sources/nvidia/cupti-host-op-map.c  	|  236 ++
 .../sample-sources/nvidia/cupti-host-op-map.h  	|   80 +

	<external correlation id, cct node>

 src/tool/hpcrun/sample-sources/nvidia/cupti-node.c |  219 ++
 src/tool/hpcrun/sample-sources/nvidia/cupti-node.h |  180 ++

 .../hpcrun/sample-sources/nvidia/cupti-record.c	|  124 +
 .../hpcrun/sample-sources/nvidia/cupti-record.h	|   77 +

 .../hpcrun/sample-sources/nvidia/cupti-stack.c 	|   84 +
 .../hpcrun/sample-sources/nvidia/cupti-stack.h 	|   63 +

Structures and communication methods for wait-free channels. This code has been refactored in another branch.

 src/tool/hpcrun/sample-sources/nvidia/nvidia.c 	| 1048 +++++++++
 src/tool/hpcrun/sample-sources/nvidia/nvidia.h 	|   14 +

Initialize gpu metrics and cupti setting; attribute gpu metrics

 src/tool/hpcrun/thread_data.c                  	|   64 +-
 src/tool/hpcrun/thread_data.h                  	|   52 +-

Create a hash map to store unwind recipes when a thread is initialized

 src/tool/hpcrun/unwind/common/uw_hash.c        	|  174 ++
 src/tool/hpcrun/unwind/common/uw_hash.h        	|  134 ++

 src/tool/hpcrun/unwind/common/uw_recipe_map.c  	|  198 +-
 src/tool/hpcrun/unwind/common/uw_recipe_map.h  	|    1 +

query and update the thread local hash map

 src/tool/hpcstruct/Args.cpp                    	|  188 +-
 src/tool/hpcstruct/Args.hpp                    	|	7 +-

Add arguments for outputting struct files

 src/tool/hpcstruct/main.cpp                    	|   92 +-

Pay attention: I am not sure why there is a difference between master-gpu and master for this file