Last active April 24, 2024 13:54
How to use NVIDIA profiler

nvprof and nvvp are usually located at /usr/local/cuda/bin.

Non-Visual Profiler

$ nvprof python

I prefer to use --print-gpu-trace, which lists every kernel launch and memory copy in chronological order.

$ nvprof --print-gpu-trace python

Visual Profiler

On the GPU machine, run

$ nvprof -o prof.nvvp python

Copy prof.nvvp to your local machine:

$ scp your_gpu_machine:/path/to/prof.nvvp .

Then, run nvvp (NVIDIA Visual Profiler) on your local machine:

$ nvvp prof.nvvp

This is more comfortable than running nvvp on the GPU machine over X11 forwarding.
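
The three steps above can also be scripted. The following is a minimal Python sketch, not part of the original gist. It only builds the command lines and prints them; actually running them (for example with subprocess.run) is left to the caller. The script name train.py, the host your_gpu_machine and the directory /path/to are placeholders.

```python
import shlex


def nvvp_workflow(script, host, remote_dir, profile="prof.nvvp"):
    """Return the three commands of the remote-profiling workflow as argument lists.

    All argument values are placeholders supplied by the caller.
    """
    return [
        # 1. On the GPU machine: record a profile to a file.
        ["nvprof", "-o", profile, "python", script],
        # 2. On the local machine: fetch the recorded profile.
        ["scp", f"{host}:{remote_dir}/{profile}", "."],
        # 3. On the local machine: open it in the Visual Profiler.
        ["nvvp", profile],
    ]


commands = nvvp_workflow("train.py", "your_gpu_machine", "/path/to")
for cmd in commands:
    print("$", shlex.join(cmd))
```

Keeping the commands as argument lists avoids shell-quoting problems if a path contains spaces.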

sonots commented Oct 19, 2017

An example of nvprof --print-gpu-trace result:

$ nvprof --print-gpu-trace python examples/stream/
==28079== NVPROF is profiling process 28079, command: python examples/stream/
==28079== Profiling application: python examples/stream/
==28079== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput           Device   Context    Stream  Name
652.12ms  1.5360us                    -               -         -         -         -       72B  44.703MB/s  GeForce GTX TIT         1         7  [CUDA memcpy HtoD]
885.35ms  3.5520us              (1 1 1)         (9 1 1)        35        0B        0B         -           -  GeForce GTX TIT         1        13  cupy_copy [412]
1.17031s  1.2160us                    -               -         -         -         -      112B  87.838MB/s  GeForce GTX TIT         1         7  [CUDA memcpy HtoD]
1.17104s  1.2800us                    -               -         -         -         -        4B  2.9802MB/s  GeForce GTX TIT         1        13  [CUDA memcpy HtoD]
1.17117s  2.2400us                    -               -         -         -         -       72B  30.654MB/s  GeForce GTX TIT         1        13  [CUDA memcpy DtoH]
1.17119s     864ns                    -               -         -         -         -        4B  4.4152MB/s  GeForce GTX TIT         1        13  [CUDA memcpy HtoD]
1.17123s  1.3760us              (1 1 1)       (256 1 1)         8        0B        0B         -           -  GeForce GTX TIT         1        13  void reset_diagonal_real<double, int=8>(int, double*, int) [840]
1.17125s     768ns                    -               -         -         -         -       16B  19.868MB/s  GeForce GTX TIT         1        13  [CUDA memset]
1.17127s  32.928us              (1 1 1)       (128 1 1)        30  1.0000KB        0B         -           -  GeForce GTX TIT         1        13  void nrm2_kernel<double, double, double, int=0, int=0, int=128, int=0>(cublasNrm2Params<double, double>) [848]
1.17130s  30.016us              (1 1 1)       (128 1 1)        30  1.0000KB        0B         -           -  GeForce GTX TIT         1        13  void nrm2_kernel<double, double, double, int=0, int=0, int=128, int=0>(cublasNrm2Params<double, double>) [853]
1.17134s  2.0160us                    -               -         -         -         -        8B  3.7844MB/s  GeForce GTX TIT         1        13  [CUDA memcpy DtoH]
1.17135s  1.7920us                    -               -         -         -         -        8B  4.2575MB/s  GeForce GTX TIT         1        13  [CUDA memcpy DtoH]
1.17137s  1.8560us              (1 1 1)       (384 1 1)        10        0B        0B         -           -  GeForce GTX TIT         1        13  void scal_kernel_val<double, double, int=0>(cublasScalParamsVal<double, double>) [863]
1.17138s     832ns                    -               -         -         -         -        8B  9.1699MB/s  GeForce GTX TIT         1        13  [CUDA memcpy HtoD]
1.17138s     864ns                    -               -         -         -         -        8B  8.8303MB/s  GeForce GTX TIT         1        13  [CUDA memcpy HtoD]
1.17139s  1.8240us                    -               -         -         -         -        8B  4.1828MB/s  GeForce GTX TIT         1        13  [CUDA memcpy DtoH]
1.17140s  1.8880us                    -               -         -         -         -        8B  4.0410MB/s  GeForce GTX TIT         1        13  [CUDA memcpy DtoH]
1.17141s     864ns                    -               -         -         -         -        8B  8.8303MB/s  GeForce GTX TIT         1        13  [CUDA memcpy HtoD]
1.17142s     832ns                    -               -         -         -         -        8B  9.1699MB/s  GeForce GTX TIT         1        13  [CUDA memcpy HtoD]
1.17143s  5.6320us             (64 1 1)       (128 1 1)        48  5.5000KB        0B         -           -  GeForce GTX TIT         1        13  void syhemv_kernel<double, int=64, int=128, int=4, int=5, bool=1, bool=0>(cublasSyhemvParams<double>) [875]
1.17145s  3.9360us              (1 1 1)       (128 1 1)        14  1.0000KB        0B         -           -  GeForce GTX TIT         1        13  void dot_kernel<double, double, double, int=128, int=0, int=0>(cublasDotParams<double, double>) [882]
1.17146s  3.0400us              (1 1 1)       (128 1 1)        16  1.5000KB        0B         -           -  GeForce GTX TIT         1        13  void reduce_1Block_kernel<double, double, double, int=128, int=7>(double*, int, double*) [888]
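
To help read the table, here is a small Python sketch (not from the original post) that converts the Start and Duration values to seconds and recomputes the Throughput column from Size and Duration. The interpretation that "MB/s" means 2**20 bytes per second is inferred from the numbers in this particular trace, not taken from the nvprof documentation, so treat it as an assumption. The helper names are made up for illustration.

```python
import re

# Seconds per unit for the suffixes that appear in the Start and Duration columns.
_UNIT_SECONDS = {"ns": 1e-9, "us": 1e-6, "ms": 1e-3, "s": 1.0}


def to_seconds(text):
    """Convert a time such as '864ns', '1.5360us' or '1.17031s' to seconds."""
    match = re.fullmatch(r"([0-9.]+)(ns|us|ms|s)", text)
    if match is None:
        raise ValueError(f"unrecognized time: {text!r}")
    value, unit = match.groups()
    return float(value) * _UNIT_SECONDS[unit]


def throughput_mb_per_s(size_bytes, duration):
    """Size / Duration, in units of 2**20 bytes per second (assumed unit)."""
    return size_bytes / to_seconds(duration) / 2**20


# First memcpy row of the trace: 72B copied in 1.5360us.
print(f"{throughput_mb_per_s(72, '1.5360us'):.3f}MB/s")  # prints 44.703MB/s
```

The same calculation reproduces the second memcpy row (112B in 1.2160us gives about 87.838MB/s). With transfers this small, the duration is dominated by fixed overhead, so these throughput figures say little about the achievable bandwidth.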


sonots commented Oct 19, 2017

An example of nvvp result:



Thank you for your post!

rohith14 commented Nov 1, 2018

I get "No kernels were profiled", as shown below. Any idea why?

$ nvprof --print-gpu-trace python --network mlp --num-epochs 1
INFO:root:start with arguments Namespace(add_stn=False, batch_size=64, disp_batches=100, dtype='float32', gc_threshold=0.5, gc_type='none', gpus=None, kv_store='device', load_epoch=None, lr=0.05, lr_factor=0.1, lr_step_epochs='10', model_prefix=None, mom=0.9, monitor=0, network='mlp', num_classes=10, num_epochs=1, num_examples=60000, num_layers=None, optimizer='sgd', test_io=0, top_k=0, wd=0.0001)
==27259== NVPROF is profiling process 27259, command: python --network mlp --num-epochs 1
INFO:root:Epoch[0] Batch [100] Speed: 39195.15 samples/sec accuracy=0.779548
INFO:root:Epoch[0] Batch [200] Speed: 54730.25 samples/sec accuracy=0.915781
INFO:root:Epoch[0] Batch [300] Speed: 52417.13 samples/sec accuracy=0.923281
INFO:root:Epoch[0] Batch [400] Speed: 52111.75 samples/sec accuracy=0.935781
INFO:root:Epoch[0] Batch [500] Speed: 52394.11 samples/sec accuracy=0.946250
INFO:root:Epoch[0] Batch [600] Speed: 52185.51 samples/sec accuracy=0.947656
INFO:root:Epoch[0] Batch [700] Speed: 52354.36 samples/sec accuracy=0.953125
INFO:root:Epoch[0] Batch [800] Speed: 52180.64 samples/sec accuracy=0.958750
INFO:root:Epoch[0] Batch [900] Speed: 52312.02 samples/sec accuracy=0.953281
INFO:root:Epoch[0] Train-accuracy=0.955236
INFO:root:Epoch[0] Time cost=3.287
INFO:root:Epoch[0] Validation-accuracy=0.963774
==27259== Profiling application: python --network mlp --num-epochs 1
==27259== Profiling result:
No kernels were profiled.

Thank you very much!

Thank you for sharing this. I have a quick question: does the time reported for an operation such as a memcpy include the time spent in the cudaMemcpy API call itself?

Thank you

jrevels commented Nov 1, 2019


To help out other macOS users in case they run into the same problem I did:

Make sure to point nvvp to a supported version of the JRE:

/Developer/NVIDIA/CUDA-10.1/bin/nvvp -vm /Library/Java/JavaVirtualMachines/jdk1.8.0_151.jdk/Contents/Home/jre/bin/java

Also make sure that the argument to -vm is an absolute path; it doesn't seem to understand relative paths.

boydad commented Nov 13, 2019

@jrevels, Thank you!

There are also problems installing the CUDA Toolkit on Catalina. They can be solved by making /Developer a link to any other location. To do that on Catalina, add the line
Developer /Users/userName
to the file /etc/synthetic.conf.

Hello, I am having a problem viewing profiles in the Visual Profiler. Every time I select a file to open, it returns an error saying "The application being profiled returned a non-zero code".

Hello, do you know how to see GPU utilization with nvvp? I only see the duration of each function, but I also want to see the percentage of GPU utilization while that function runs.

keke8273 commented Mar 4, 2020

I have to use the --profile-child-processes option to get profiling to work on Windows.
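
A minimal sketch of such an invocation, not from the original comment: a hypothetical Python helper that builds the nvprof command line. train.py is a placeholder script name. When child processes are profiled and a file is written with -o, the sketch insists on %p (the process ID placeholder) in the file name so that each process gets its own file; check this detail against the documentation of your nvprof version.

```python
def nvprof_command(app_args, output=None, child_processes=False):
    """Build an nvprof argument list for the given application command.

    Without an output file the sketch falls back to --print-gpu-trace.
    """
    cmd = ["nvprof"]
    if child_processes:
        cmd.append("--profile-child-processes")
    if output is not None:
        # One output file per process: require the %p placeholder.
        if child_processes and "%p" not in output:
            raise ValueError("use %p in the output name, e.g. prof.%p.nvvp")
        cmd += ["-o", output]
    else:
        cmd.append("--print-gpu-trace")
    return cmd + list(app_args)


print(nvprof_command(["python", "train.py"], child_processes=True))
print(nvprof_command(["python", "train.py"], output="prof.%p.nvvp", child_processes=True))
```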

I get no output from nvprof, whether or not I pass arguments.
It doesn't work with an .exe file built with VS2019, nor with nvprof python.

Can you please help?

On Mac, nvvp prof.nvvp doesn't work.
An absolute path (instead of a relative path like "prof.nvvp") has to be provided.
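
A small Python sketch of the absolute-path tips in this thread, offered as an illustration only: it resolves both the profile file and, optionally, the Java binary passed to -vm before building the nvvp command. The helper name and the order of the arguments (-vm first, profile file last) are assumptions, not something confirmed by the comments above.

```python
import os


def nvvp_command(profile, java=None, nvvp="nvvp"):
    """Build an nvvp argument list that uses absolute paths only."""
    cmd = [nvvp]
    if java is not None:
        # -vm reportedly does not understand relative paths.
        cmd += ["-vm", os.path.abspath(java)]
    # On Mac, a relative profile path reportedly fails as well.
    cmd.append(os.path.abspath(profile))
    return cmd


print(nvvp_command("prof.nvvp"))
```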

My recommendation to anyone who wants to install nvvp on an up-to-date Mac (Catalina) is not to even try.
First, you have to disable Gatekeeper:
sudo spctl --master-disable
But then, after you download nvvp, it won't run, because it wants an outdated Java runtime (JRE) that is hard to get or install.
I'm still trying to figure out how to do this, and it looks very painful.

Thanks man! Great help

ccjjs commented Jun 12, 2021
