chichunchen/LearnNvprof.md

## LearnNvprof.md

      
    Raw
  

              LearnNvprof.md
            
          
    nvprof - NVCC Profiler

It is Nvidia's Profiler, profiles any executable including CUDA programs.
How to use it?

nvprof ./executable

In case if you want the obsolute url /usr/local/cuda/bin/nvprof or /usr/local/cuda-<MAJOR.minor>/bin/nvprof where MAJOR.minor is your CUDA version installed.
What all you can see


Number of times kernel is invoked
Kernel execution time
Time taken to DtoH and HtoD
..

How to get all/certain meterics from nvprof


How to find all the metric available for the device? It is a big list see at EOF.

nvprof --query-metrics


How to query for all metric?

nvprof --metrics all ./executable

How to query for a specific metric? say  Dram reads.

nvprof --metrics dram_read_transactions ./executable
List


Available Metrics:  Name   Description
inst_per_warp:  Average number of instructions executed by each warp
branch_efficiency:  Ratio of non-divergent branches to total branches
warp_execution_efficiency:  Ratio of the average active threads per warp to the maximum number of threads per warp supported on a multiprocessor
warp_nonpred_execution_efficiency:  Ratio of the average active threads per warp executing non-predicated instructions to the maximum number of threads per warp supported on a multiprocessor
inst_replay_overhead:  Average number of replays for each instruction executed
shared_load_transactions_per_request:  Average number of shared memory load transactions performed for each shared memory load
shared_store_transactions_per_request:  Average number of shared memory store transactions performed for each shared memory store
local_load_transactions_per_request:  Average number of local memory load transactions performed for each local memory load
local_store_transactions_per_request:  Average number of local memory store transactions performed for each local memory store
gld_transactions_per_request:  Average number of global memory load transactions performed for each global memory load.
gst_transactions_per_request:  Average number of global memory store transactions performed for each global memory store
shared_store_transactions:  Number of shared memory store transactions
shared_load_transactions:  Number of shared memory load transactions
local_load_transactions:  Number of local memory load transactions
local_store_transactions:  Number of local memory store transactions
gld_transactions:  Number of global memory load transactions
gst_transactions:  Number of global memory store transactions
sysmem_read_transactions:  Number of system memory read transactions
sysmem_write_transactions:  Number of system memory write transactions
l2_read_transactions:  Memory read transactions seen at L2 cache for all read requests
l2_write_transactions:  Memory write transactions seen at L2 cache for all write requests
dram_read_transactions:  Device memory read transactions
dram_write_transactions:  Device memory write transactions
global_hit_rate:  Hit rate for global loads in unified l1/tex cache
local_hit_rate:  Hit rate for local loads and stores
gld_requested_throughput:  Requested global memory load throughput
gst_requested_throughput:  Requested global memory store throughput
gld_throughput:  Global memory load throughput
gst_throughput:  Global memory store throughput
local_memory_overhead:  Ratio of local memory traffic to total memory traffic between the L1 and L2 caches
tex_cache_hit_rate:  Unified cache hit rate
l2_tex_read_hit_rate:  Hit rate at L2 cache for all read requests from texture cache
l2_tex_write_hit_rate:  Hit Rate at L2 cache for all write requests from texture cache
dram_read_throughput:  Device memory read throughput
dram_write_throughput:  Device memory write throughput
tex_cache_throughput:  Unified cache throughput
l2_tex_read_throughput:  Memory read throughput seen at L2 cache for read requests from the texture cache
l2_tex_write_throughput:  Memory write throughput seen at L2 cache for write requests from the texture cache
l2_read_throughput:  Memory read throughput seen at L2 cache for all read requests
l2_write_throughput:  Memory write throughput seen at L2 cache for all write requests
sysmem_read_throughput:  System memory read throughput
sysmem_write_throughput:  System memory write throughput
local_load_throughput:  Local memory load throughput
local_store_throughput:  Local memory store throughput
shared_load_throughput:  Shared memory load throughput
shared_store_throughput:  Shared memory store throughput
gld_efficiency:  Ratio of requested global memory load throughput to required global memory load throughput.
gst_efficiency:  Ratio of requested global memory store throughput to required global memory store throughput.
tex_cache_transactions:  Unified cache read transactions
flop_count_dp:  Number of double-precision floating-point operations executed by non-predicated threads (add, multiply, and multiply-accumulate). Each multiply-accumulate operation contributes 2 to the count.
flop_count_dp_add:  Number of double-precision floating-point add operations executed by non-predicated threads.
flop_count_dp_fma:  Number of double-precision floating-point multiply-accumulate operations executed by non-predicated threads. Each multiply-accumulate operation contributes 1 to the count.
flop_count_dp_mul:  Number of double-precision floating-point multiply operations executed by non-predicated threads.
flop_count_sp:  Number of single-precision floating-point operations executed by non-predicated threads (add, multiply, and multiply-accumulate). Each multiply-accumulate operation contributes 2 to the count. The count does not include special operations.
flop_count_sp_add:  Number of single-precision floating-point add operations executed by non-predicated threads.
flop_count_sp_fma:  Number of single-precision floating-point multiply-accumulate operations executed by non-predicated threads. Each multiply-accumulate operation contributes 1 to the count.
flop_count_sp_mul:  Number of single-precision floating-point multiply operations executed by non-predicated threads.
flop_count_sp_special:  Number of single-precision floating-point special operations executed by non-predicated threads.
inst_executed:  The number of instructions executed
inst_issued:  The number of instructions issued
dram_utilization:  The utilization level of the device memory relative to the peak utilization on a scale of 0 to 10
sysmem_utilization:  The utilization level of the system memory relative to the peak utilization
stall_inst_fetch:  Percentage of stalls occurring because the next assembly instruction has not yet been fetched
stall_exec_dependency:  Percentage of stalls occurring because an input required by the instruction is not yet available
stall_memory_dependency:  Percentage of stalls occurring because a memory operation cannot be performed due to the required resources not being available or fully utilized, or because too many requests of a given type are outstanding
stall_texture:  Percentage of stalls occurring because the texture sub-system is fully utilized or has too many outstanding requests
stall_sync:  Percentage of stalls occurring because the warp is blocked at a __syncthreads() call
stall_other:  Percentage of stalls occurring due to miscellaneous reasons
stall_constant_memory_dependency:  Percentage of stalls occurring because of immediate constant cache miss
stall_pipe_busy:  Percentage of stalls occurring because a compute operation cannot be performed because the compute pipeline is busy
shared_efficiency:  Ratio of requested shared memory throughput to required shared memory throughput
inst_fp_32:  Number of single-precision floating-point instructions executed by non-predicated threads (arithmetic, compare, etc.)
inst_fp_64:  Number of double-precision floating-point instructions executed by non-predicated threads (arithmetic, compare, etc.)
inst_integer:  Number of integer instructions executed by non-predicated threads
inst_bit_convert:  Number of bit-conversion instructions executed by non-predicated threads
inst_control:  Number of control-flow instructions executed by non-predicated threads (jump, branch, etc.)
inst_compute_ld_st:  Number of compute load/store instructions executed by non-predicated threads
inst_misc:  Number of miscellaneous instructions executed by non-predicated threads
inst_inter_thread_communication:  Number of inter-thread communication instructions executed by non-predicated threads
issue_slots:  The number of issue slots used
cf_issued:  Number of issued control-flow instructions
cf_executed:  Number of executed control-flow instructions
ldst_issued:  Number of issued local, global, shared and texture memory load and store instructions
ldst_executed:  Number of executed local, global, shared and texture memory load and store instructions
atomic_transactions:  Global memory atomic and reduction transactions
atomic_transactions_per_request:  Average number of global memory atomic and reduction transactions performed for each atomic and reduction instruction
l2_atomic_throughput:  Memory read throughput seen at L2 cache for atomic and reduction requests
l2_atomic_transactions:  Memory read transactions seen at L2 cache for atomic and reduction requests
l2_tex_read_transactions:  Memory read transactions seen at L2 cache for read requests from the texture cache
stall_memory_throttle:  Percentage of stalls occurring because of memory throttle
stall_not_selected:  Percentage of stalls occurring because warp was not selected
l2_tex_write_transactions:  Memory write transactions seen at L2 cache for write requests from the texture cache
flop_count_hp:  Number of half-precision floating-point operations executed by non-predicated threads (add, multiply, and multiply-accumulate). Each multiply-accumulate operation contributes 2 to the count.
flop_count_hp_add:  Number of half-precision floating-point add operations executed by non-predicated threads.
flop_count_hp_mul:  Number of half-precision floating-point multiply operations executed by non-predicated threads.
flop_count_hp_fma:  Number of half-precision floating-point multiply-accumulate operations executed by non-predicated threads. Each multiply-accumulate operation contributes 1 to the count.
inst_fp_16:  Number of half-precision floating-point instructions executed by non-predicated threads (arithmetic, compare, etc.)
ipc:  Instructions executed per cycle
issued_ipc:  Instructions issued per cycle
issue_slot_utilization:  Percentage of issue slots that issued at least one instruction, averaged across all cycles
sm_efficiency:  The percentage of time at least one warp is active on a specific multiprocessor
achieved_occupancy:  Ratio of the average active warps per active cycle to the maximum number of warps supported on a multiprocessor
eligible_warps_per_cycle:  Average number of warps that are eligible to issue per active cycle
shared_utilization:  The utilization level of the shared memory relative to peak utilization
l2_utilization:  The utilization level of the L2 cache relative to the peak utilization on a scale of 0 to 10
tex_utilization:  The utilization level of the unified cache relative to the peak utilization
ldst_fu_utilization:  The utilization level of the multiprocessor function units that execute shared load, shared store and constant load instructions
cf_fu_utilization:  The utilization level of the multiprocessor function units that execute control-flow instructions on a scale of 0 to 10
tex_fu_utilization:  The utilization level of the multiprocessor function units that execute global, local and texture memory instructions on a scale of 0 to 10
special_fu_utilization:  The utilization level of the multiprocessor function units that execute sin, cos, ex2, popc, flo, and similar instructions on a scale of 0 to 10
half_precision_fu_utilization:  The utilization level of the multiprocessor function units that execute 16 bit floating-point instructions and integer instructions on a scale of 0 to 10
single_precision_fu_utilization:  The utilization level of the multiprocessor function units that execute single-precision floating-point instructions and integer instructions
double_precision_fu_utilization:  The utilization level of the multiprocessor function units that execute double-precision floating-point instructions
flop_hp_efficiency:  Ratio of achieved to peak half-precision floating-point operations
flop_sp_efficiency:  Ratio of achieved to peak single-precision floating-point operations
flop_dp_efficiency:  Ratio of achieved to peak double-precision floating-point operations
sysmem_read_utilization:  The read utilization level of the system memory relative to the peak utilization on a scale of 0 to 10
sysmem_write_utilization:  The write utilization level of the system memory relative to the peak utilization on a scale of 0 to 10