@XinDongol
Last active March 28, 2022 11:26
How to profile your PyTorch code

Inside profiler

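import torch
import torchvision.models as models

# Load a pretrained DenseNet-121; the model and input stay on the CPU here.
model = models.densenet121(pretrained=True)
x = torch.randn((1, 3, 224, 224), requires_grad=True)

# use_cuda=True additionally times each op with CUDA events
# (this needs a CUDA-enabled setup).
with torch.autograd.profiler.profile(use_cuda=True) as prof:
    model(x)
# Print one row per recorded op call.
print(prof)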

The result looks something like this:

-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------
Name                                        CPU time        CUDA time            Calls        CPU total       CUDA total
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------
conv2d                                    9976.544us       9972.736us                1       9976.544us       9972.736us
convolution                               9958.778us       9958.400us                1       9958.778us       9958.400us
_convolution                              9946.712us       9947.136us                1       9946.712us       9947.136us
contiguous                                   6.692us          6.976us                1          6.692us          6.976us
empty                                       11.927us         12.032us                1         11.927us         12.032us
mkldnn_convolution                        9880.452us       9889.792us                1       9880.452us       9889.792us
batch_norm                                1214.791us       1213.440us                1       1214.791us       1213.440us
native_batch_norm                         1190.496us       1193.056us                1       1190.496us       1193.056us
threshold_                                 158.258us        159.584us                1        158.258us        159.584us
max_pool2d_with_indices                  28837.682us      28836.834us                1      28837.682us      28836.834us
max_pool2d_with_indices_forward          28813.804us      28822.530us                1      28813.804us      28822.530us
batch_norm                                1780.373us       1778.690us                1       1780.373us       1778.690us
native_batch_norm                         1756.774us       1759.327us                1       1756.774us       1759.327us
threshold_                                  64.665us         66.368us                1         64.665us         66.368us
conv2d                                    6103.544us       6102.142us                1       6103.544us       6102.142us
convolution                               6089.946us       6089.600us                1       6089.946us       6089.600us
_convolution                              6076.506us       6076.416us                1       6076.506us       6076.416us
contiguous                                   7.306us          7.938us                1          7.306us          7.938us
empty                                        9.037us          8.194us                1          9.037us          8.194us
mkldnn_convolution                        6015.653us       6021.408us                1       6015.653us       6021.408us
batch_norm                                 700.129us        699.394us        

You may find more details here
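The per-call listing above grows long quickly, so it often helps to aggregate repeated ops and sort them. The following is a minimal sketch of that, not part of the original gist: it assumes a small stand-in model instead of `densenet121` (to keep the run fast) and CPU-only profiling, and uses `key_averages().table(...)` to produce the summary.

```python
import torch
import torch.nn as nn

# Small stand-in model (an assumption for brevity; the gist profiles densenet121).
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.BatchNorm2d(8),
    nn.ReLU(),
    nn.MaxPool2d(2),
).eval()
x = torch.randn(1, 3, 32, 32)

# CPU-only profiling; no_grad keeps autograd bookkeeping out of the trace.
with torch.no_grad(), torch.autograd.profiler.profile() as prof:
    model(x)

# Group events by op name and sort by time spent in the op itself.
averages = prof.key_averages()
summary = averages.table(sort_by="self_cpu_time_total")
print(summary)
```

Sorting by self time, rather than total time, tends to surface the leaf kernels that do the work instead of the wrappers that merely call them.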

Inside Bottleneck

link
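`torch.utils.bottleneck` is run against a whole script rather than called from inside it, and it summarizes the run with both the Python profiler and the autograd profiler. Below is a sketch of a script that is suitable for it: short, self-terminating, and with its work behind a `__main__` guard. The file name `bottleneck_demo.py` and the function `run_steps` are hypothetical names chosen for illustration.

```python
# bottleneck_demo.py (hypothetical file name)
# Run under the tool with:
#   python -m torch.utils.bottleneck bottleneck_demo.py
import torch
import torch.nn as nn


def run_steps(num_steps: int = 3) -> float:
    """Run a few training steps on a tiny model and return the last loss."""
    torch.manual_seed(0)
    model = nn.Linear(16, 4)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    loss = torch.tensor(0.0)
    for _ in range(num_steps):
        x = torch.randn(8, 16)
        loss = model(x).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item()


if __name__ == "__main__":
    print(f"final loss: {run_steps():.4f}")
```

Keeping the number of steps small matters here, because the tool runs the script to completion before it prints its reports.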

Python profiler
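For function-level timings of ordinary Python code, the standard library's `cProfile` is usually the first thing to try. The sketch below assumes a toy function, `slow_sum`, invented purely for illustration; it profiles one call and formats the most expensive entries with `pstats`.

```python
import cProfile
import io
import pstats


def slow_sum(n: int) -> int:
    """Deliberately naive loop so that it shows up in the profile."""
    total = 0
    for i in range(n):
        total += i * i
    return total


profiler = cProfile.Profile()
profiler.enable()
result = slow_sum(100_000)
profiler.disable()

# Collect the ten most expensive entries, sorted by cumulative time.
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(10)
report = buffer.getvalue()
print(report)
```

The same profiler can also be applied to an unmodified script from the command line with `python -m cProfile -s cumulative your_script.py`.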

line_profiler
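`line_profiler` reports time per source line of selected functions, which helps once a function-level profile has pointed at a hot spot. Here is a minimal sketch using its `LineProfiler` class directly. It assumes the third-party `line_profiler` package is installed and falls back to a plain call if it is not; the `normalize` function is a made-up example.

```python
try:
    from line_profiler import LineProfiler  # third-party: pip install line_profiler
except ImportError:
    LineProfiler = None


def normalize(values):
    """Scale a list of numbers to the range [0, 1]."""
    lo = min(values)
    hi = max(values)
    span = (hi - lo) or 1.0  # avoid dividing by zero for constant input
    return [(v - lo) / span for v in values]


data = [3.0, 7.0, 5.0, 11.0]

if LineProfiler is not None:
    lp = LineProfiler()
    profiled_normalize = lp(normalize)  # wrap the function to time each line
    scaled = profiled_normalize(data)
    lp.print_stats()                    # per-line hit counts and timings
else:
    scaled = normalize(data)            # plain call when the package is missing

print(scaled)
```

An alternative workflow is to decorate functions with `@profile` and run the script through `kernprof -l -v your_script.py`, which avoids editing the call sites.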

@ald2004

ald2004 commented Sep 23, 2020

Why is CUDA slower than CPU?

@MitchellX

Can I see per-layer memory usage, instead of a summary like the one in your image?

@nightlessbaron

Why is CUDA slower than CPU?

This is probably due to the fixed overhead of running on a GPU, such as kernel launches and host-to-device transfers. On small models that overhead can dominate, so a CPU may run faster than a GPU. GPUs pay off on large, complex models.

@brando90

Does this work if the DataLoader is involved too?
