
@mkolod
Last active April 28, 2020 18:48
import torch
from time import sleep, time
# Initialize CUDA up front so context-creation overhead isn't counted later
foo = torch.ones(1).cuda()
MB = 1 << 20
# 27 MB tensor
NUM_MB = 27
# Floats
SIZEOF_DTYPE = 4
TENSOR_SIZE = int(NUM_MB * MB / SIZEOF_DTYPE)
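# Theoretical PCIe 3.0 x16 bandwidth: ~15.75 GB/s (8 GT/s x 16 lanes, 128b/130b encoding), expressed in MB/s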
THEORETICAL_V3_X16 = 15.75 * (1 << 30) / MB
PIN_MEMORY = True
data = torch.randn(TENSOR_SIZE)
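# Pinned (page-locked) host memory lets the copy use DMA directly, typically raising H2D throughput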
if PIN_MEMORY:
    data = data.pin_memory()
# Not strictly necessary here, but kept as a reminder in case
# asynchronous work is scheduled before the timed copy
torch.cuda.synchronize()
sleep(1)
start = time()
# The default copy (non_blocking=False) blocks the host, so wall-clock timing is valid here
data = data.cuda()
duration = time() - start
print("Copy duration: {:.2f} ms".format(duration * 1000))
effective_bw = NUM_MB / duration
print("Effective Bandwidth: {:.2f} MB/s".format(effective_bw))
pct_theoretical = effective_bw / THEORETICAL_V3_X16 * 100
print("Percent theoretical PCIe v3 x16 bandwidth: {:.2f}".format(pct_theoretical))

mkolod commented Apr 26, 2020

Sample run:

Copy duration: 3.03 ms
Effective Bandwidth: 8922.64 MB/s
Percent theoretical PCIe v3 x16 bandwidth: 55.32
