How to use NVIDIA profiler

The NVIDIA profilers are usually located at /usr/local/cuda/bin.

Non-Visual Profiler

$ nvprof python train_mnist.py

I prefer to use --print-gpu-trace, which prints a timestamped record of every kernel launch and memory copy.

$ nvprof --print-gpu-trace python train_mnist.py
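
If the trace gets long, nvprof can also write its output to a file with the --log-file option; something like:

$ nvprof --print-gpu-trace --log-file gpu_trace.log python train_mnist.py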

Visual Profiler

On GPU machine, run

$ nvprof -o prof.nvvp python train_mnist.py

Copy prof.nvvp into your local machine

$ scp your_gpu_machine:/path/to/prof.nvvp .

Then, run nvvp (nvidia visual profiler) on your local machine:

$ nvvp prof.nvvp

This is more comfortable than running the Visual Profiler over X11 forwarding or a similar remote-display setup.
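
If you also want the metrics used by nvvp's guided analysis, nvprof can collect them up front with --analysis-metrics (this makes the profiled run considerably slower); for example:

$ nvprof --analysis-metrics -o prof_analysis.nvvp python train_mnist.py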

sonots commented Jul 11, 2017

For macOS: install the CUDA Toolkit for macOS; nvvp is included in it. Run the nvvp command from a terminal. You do not need a GPU in your Mac.

https://developer.nvidia.com/cuda-downloads

$ /Developer/NVIDIA/CUDA-9.0/bin/nvvp prof.nvvp

sonots commented Oct 19, 2017

An example of nvprof result:

$ nvprof python examples/stream/cusolver.py
==27986== NVPROF is profiling process 27986, command: python examples/stream/cusolver.py
==27986== Profiling application: python examples/stream/cusolver.py
==27986== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 41.70%  125.73us         4  31.431us  30.336us  33.312us  void nrm2_kernel<double, double, double, int=0, int=0, int=128, int=0>(cublasNrm2Params<double, double>)
 21.94%  66.144us        36  1.8370us  1.7600us  2.1760us  [CUDA memcpy DtoH]
 13.77%  41.536us        48     865ns     800ns  1.4400us  [CUDA memcpy HtoD]
  3.02%  9.1200us         2  4.5600us  3.8720us  5.2480us  void syhemv_kernel<double, int=64, int=128, int=4, int=5, bool=1, bool=0>(cublasSyhemvParams<double>)
  2.65%  8.0000us         2  4.0000us  3.8720us  4.1280us  void gemv2T_kernel_val<double, double, double, int=128, int=16, int=2, int=2, bool=0>(int, int, double, double const *, int, double const *, int, double, double*, int)
  2.63%  7.9360us         2  3.9680us  3.8720us  4.0640us  cupy_copy
  2.44%  7.3600us         2  3.6800us  3.1680us  4.1920us  void syr2_kernel<double, int=128, int=5, bool=1>(cublasSyher2Params<double>, int, double const *, double)
  2.23%  6.7200us         2  3.3600us  3.2960us  3.4240us  void dot_kernel<double, double, double, int=128, int=0, int=0>(cublasDotParams<double, double>)
  1.88%  5.6640us         2  2.8320us  2.7840us  2.8800us  void reduce_1Block_kernel<double, double, double, int=128, int=7>(double*, int, double*)
  1.74%  5.2480us         2  2.6240us  2.5600us  2.6880us  void ger_kernel<double, double, int=256, int=5, bool=0>(cublasGerParams<double, double>)
  1.57%  4.7360us         2  2.3680us  2.1760us  2.5600us  void axpy_kernel_val<double, double, int=0>(cublasAxpyParamsVal<double, double, double>)
  1.28%  3.8720us         2  1.9360us  1.7920us  2.0800us  void lacpy_kernel<double, int=5, int=3>(int, int, double const *, int, double*, int, int, int)
  1.19%  3.5840us         2  1.7920us  1.6960us  1.8880us  void scal_kernel_val<double, double, int=0>(cublasScalParamsVal<double, double>)
  0.98%  2.9440us         2  1.4720us  1.2160us  1.7280us  void reset_diagonal_real<double, int=8>(int, double*, int)
  0.98%  2.9440us         4     736ns     736ns     736ns  [CUDA memset]

==27986== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 60.34%  408.55ms         9  45.395ms  4.8480us  407.94ms  cudaMalloc
 37.60%  254.60ms         2  127.30ms     556ns  254.60ms  cudaFree
  0.94%  6.3542ms       712  8.9240us     119ns  428.32us  cuDeviceGetAttribute
  0.72%  4.8747ms         8  609.33us  320.37us  885.26us  cuDeviceTotalMem
  0.10%  693.60us        82  8.4580us  2.8370us  72.004us  cudaMemcpyAsync
  0.08%  511.79us         1  511.79us  511.79us  511.79us  cudaHostAlloc
  0.08%  511.75us         8  63.969us  41.317us  99.232us  cuDeviceGetName
  0.05%  310.04us         1  310.04us  310.04us  310.04us  cuModuleLoadData
  0.03%  234.87us        24  9.7860us  5.7190us  50.465us  cudaLaunch
  0.01%  50.874us         2  25.437us  16.898us  33.976us  cuLaunchKernel
  0.01%  49.923us         2  24.961us  15.602us  34.321us  cudaMemcpy
  0.01%  47.622us         4  11.905us  8.6190us  19.889us  cudaMemsetAsync
  0.01%  44.811us         2  22.405us  9.5590us  35.252us  cudaStreamDestroy
  0.01%  35.136us        27  1.3010us     289ns  5.8480us  cudaGetDevice
  0.00%  31.113us        24  1.2960us     972ns  3.2380us  cudaStreamSynchronize
  0.00%  30.736us         2  15.368us  4.4580us  26.278us  cudaStreamCreate
  0.00%  13.932us        17     819ns     414ns  3.7090us  cudaEventCreateWithFlags
  0.00%  13.678us        70     195ns     130ns     801ns  cudaSetupArgument
  0.00%  12.050us         4  3.0120us  2.1290us  4.5130us  cudaFuncGetAttributes
  0.00%  10.407us        22     473ns     268ns  1.9540us  cudaDeviceGetAttribute
  0.00%  10.370us        40     259ns     126ns  1.4100us  cudaGetLastError
  0.00%  9.9680us        16     623ns     185ns  2.9600us  cuDeviceGet
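
If the whole-program profile is too noisy, profiling can be restricted to a region of interest by starting nvprof with --profile-from-start off and toggling the profiler from inside the script. A minimal sketch with CuPy (the script name, array size, and the matrix multiply are just placeholders):

# region_profile.py -- run as: nvprof --profile-from-start off python region_profile.py
import cupy
from cupy.cuda import profiler

x = cupy.random.rand(1024, 1024)
cupy.cuda.Stream.null.synchronize()   # keep setup work out of the profile

profiler.start()                      # nvprof starts collecting here
y = x @ x                             # only this region appears in the profile
cupy.cuda.Stream.null.synchronize()   # make sure the kernels have finished
profiler.stop()                       # nvprof stops collecting here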

sonots commented Oct 19, 2017

An example of nvprof --print-gpu-trace result:

$ nvprof --print-gpu-trace python examples/stream/cusolver.py
==28079== NVPROF is profiling process 28079, command: python examples/stream/cusolver.py
==28079== Profiling application: python examples/stream/cusolver.py
==28079== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput           Device   Context    Stream  Name
652.12ms  1.5360us                    -               -         -         -         -       72B  44.703MB/s  GeForce GTX TIT         1         7  [CUDA memcpy HtoD]
885.35ms  3.5520us              (1 1 1)         (9 1 1)        35        0B        0B         -           -  GeForce GTX TIT         1        13  cupy_copy [412]
1.17031s  1.2160us                    -               -         -         -         -      112B  87.838MB/s  GeForce GTX TIT         1         7  [CUDA memcpy HtoD]
1.17104s  1.2800us                    -               -         -         -         -        4B  2.9802MB/s  GeForce GTX TIT         1        13  [CUDA memcpy HtoD]
1.17117s  2.2400us                    -               -         -         -         -       72B  30.654MB/s  GeForce GTX TIT         1        13  [CUDA memcpy DtoH]
1.17119s     864ns                    -               -         -         -         -        4B  4.4152MB/s  GeForce GTX TIT         1        13  [CUDA memcpy HtoD]
1.17123s  1.3760us              (1 1 1)       (256 1 1)         8        0B        0B         -           -  GeForce GTX TIT         1        13  void reset_diagonal_real<double, int=8>(int, double*, int) [840]
1.17125s     768ns                    -               -         -         -         -       16B  19.868MB/s  GeForce GTX TIT         1        13  [CUDA memset]
1.17127s  32.928us              (1 1 1)       (128 1 1)        30  1.0000KB        0B         -           -  GeForce GTX TIT         1        13  void nrm2_kernel<double, double, double, int=0, int=0, int=128, int=0>(cublasNrm2Params<double, double>) [848]
1.17130s  30.016us              (1 1 1)       (128 1 1)        30  1.0000KB        0B         -           -  GeForce GTX TIT         1        13  void nrm2_kernel<double, double, double, int=0, int=0, int=128, int=0>(cublasNrm2Params<double, double>) [853]
1.17134s  2.0160us                    -               -         -         -         -        8B  3.7844MB/s  GeForce GTX TIT         1        13  [CUDA memcpy DtoH]
1.17135s  1.7920us                    -               -         -         -         -        8B  4.2575MB/s  GeForce GTX TIT         1        13  [CUDA memcpy DtoH]
1.17137s  1.8560us              (1 1 1)       (384 1 1)        10        0B        0B         -           -  GeForce GTX TIT         1        13  void scal_kernel_val<double, double, int=0>(cublasScalParamsVal<double, double>) [863]
1.17138s     832ns                    -               -         -         -         -        8B  9.1699MB/s  GeForce GTX TIT         1        13  [CUDA memcpy HtoD]
1.17138s     864ns                    -               -         -         -         -        8B  8.8303MB/s  GeForce GTX TIT         1        13  [CUDA memcpy HtoD]
1.17139s  1.8240us                    -               -         -         -         -        8B  4.1828MB/s  GeForce GTX TIT         1        13  [CUDA memcpy DtoH]
1.17140s  1.8880us                    -               -         -         -         -        8B  4.0410MB/s  GeForce GTX TIT         1        13  [CUDA memcpy DtoH]
1.17141s     864ns                    -               -         -         -         -        8B  8.8303MB/s  GeForce GTX TIT         1        13  [CUDA memcpy HtoD]
1.17142s     832ns                    -               -         -         -         -        8B  9.1699MB/s  GeForce GTX TIT         1        13  [CUDA memcpy HtoD]
1.17143s  5.6320us             (64 1 1)       (128 1 1)        48  5.5000KB        0B         -           -  GeForce GTX TIT         1        13  void syhemv_kernel<double, int=64, int=128, int=4, int=5, bool=1, bool=0>(cublasSyhemvParams<double>) [875]
1.17145s  3.9360us              (1 1 1)       (128 1 1)        14  1.0000KB        0B         -           -  GeForce GTX TIT         1        13  void dot_kernel<double, double, double, int=128, int=0, int=0>(cublasDotParams<double, double>) [882]
1.17146s  3.0400us              (1 1 1)       (128 1 1)        16  1.5000KB        0B         -           -  GeForce GTX TIT         1        13  void reduce_1Block_kernel<double, double, double, int=128, int=7>(double*, int, double*) [888]

[omitted]

sonots commented Oct 19, 2017

An example of nvvp result:

[nvvp timeline screenshots]

@shizukanaskytree

Thank you for your post!

rohith14 commented Nov 1, 2018

Hi,
I get "No kernels were profiled" as shown below. Any idea why?

$ nvprof --print-gpu-trace python train_mnist.py --network mlp --num-epochs 1
INFO:root:start with arguments Namespace(add_stn=False, batch_size=64, disp_batches=100, dtype='float32', gc_threshold=0.5, gc_type='none', gpus=None, kv_store='device', load_epoch=None, lr=0.05, lr_factor=0.1, lr_step_epochs='10', model_prefix=None, mom=0.9, monitor=0, network='mlp', num_classes=10, num_epochs=1, num_examples=60000, num_layers=None, optimizer='sgd', test_io=0, top_k=0, wd=0.0001)
==27259== NVPROF is profiling process 27259, command: python train_mnist.py --network mlp --num-epochs 1
INFO:root:Epoch[0] Batch [100] Speed: 39195.15 samples/sec accuracy=0.779548
INFO:root:Epoch[0] Batch [200] Speed: 54730.25 samples/sec accuracy=0.915781
INFO:root:Epoch[0] Batch [300] Speed: 52417.13 samples/sec accuracy=0.923281
INFO:root:Epoch[0] Batch [400] Speed: 52111.75 samples/sec accuracy=0.935781
INFO:root:Epoch[0] Batch [500] Speed: 52394.11 samples/sec accuracy=0.946250
INFO:root:Epoch[0] Batch [600] Speed: 52185.51 samples/sec accuracy=0.947656
INFO:root:Epoch[0] Batch [700] Speed: 52354.36 samples/sec accuracy=0.953125
INFO:root:Epoch[0] Batch [800] Speed: 52180.64 samples/sec accuracy=0.958750
INFO:root:Epoch[0] Batch [900] Speed: 52312.02 samples/sec accuracy=0.953281
INFO:root:Epoch[0] Train-accuracy=0.955236
INFO:root:Epoch[0] Time cost=3.287
INFO:root:Epoch[0] Validation-accuracy=0.963774
==27259== Profiling application: python train_mnist.py --network mlp --num-epochs 1
==27259== Profiling result:
No kernels were profiled.

@Double1996

Thank you very much!

@dorraghzela

Hello,

Thank you for sharing this. I have a silly question: does the profiled time for, say, [CUDA memcpy] include the time of the corresponding cudaMemcpy API call?

Thank you

jrevels commented Nov 1, 2019

Thanks!

To help out other macOS users in case they run into the same problem I did:

Make sure to point nvvp to a supported version of the JRE:

/Developer/NVIDIA/CUDA-10.1/bin/nvvp -vm /Library/Java/JavaVirtualMachines/jdk1.8.0_151.jdk/Contents/Home/jre/bin/java

Also make sure that the argument to -vm is an absolute path; it doesn't seem to understand relative paths.
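
If you are not sure where a Java 8 runtime lives, /usr/libexec/java_home can print its path; for example:

$ /Developer/NVIDIA/CUDA-10.1/bin/nvvp -vm "$(/usr/libexec/java_home -v 1.8)/jre/bin/java"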

boydad commented Nov 13, 2019

@jrevels, Thank you!

There are also problems with installing the CUDA toolkit on Catalina. They can be solved by creating a /Developer link that points to some writable location. On Catalina you do this by adding a line
Developer /Users/userName
to the file /etc/synthetic.conf, as described here.
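
A minimal sketch of that edit, assuming the username is userName (the two fields in synthetic.conf should be separated by a tab, and the link only appears after a reboot):

$ sudo sh -c 'printf "Developer\t/Users/userName\n" >> /etc/synthetic.conf'
$ sudo reboot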

@sumit-byte

Hello, I am having a problem viewing results in the Visual Profiler. Every time I open it and select a file, it returns an error saying "The application being profiled returned a non-zero code".

@HenryYihengXu

Hello, do you know how to see GPU utilization with nvvp? I can only see the duration of each function, but I would also like to see the percentage of GPU utilization during a function.

keke8273 commented Mar 4, 2020

I have to use the --profile-child-processes option to get profiling to work on Windows.
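
For example (with -o, a %p in the output file name is replaced by the process ID, so each child process gets its own profile file):

$ nvprof --profile-child-processes -o prof_%p.nvvp python train_mnist.py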

@hudannag

I get no output from nvprof, whether I pass arguments or not.
It doesn't work with an .exe built with VS2019, nor with nvprof python program.py.

Can you please help?

@hitvoice

On Mac, nvvp prof.nvvp doesn't work.
An absolute path (instead of a relative path like "prof.nvvp") has to be provided.

@danpovey

My recommendation to anyone who wants to install nvvp on an up-to-date Mac (Catalina) is not to even try.
First you have to disable Gatekeeper:
sudo spctl --master-disable
but then, after you download nvvp, it won't run because it wants an outdated Java runtime (JRE) that is hard to get or install.
I still haven't figured it out; it looks very, very painful.

@1chimaruGin

Thanks man! Great help

ccjjs commented Jun 12, 2021

PDD XBJ GOD, the GOAT.
