@joshlk
Last active March 29, 2024 06:06
Benchmark bandwidth and latency of P2P (peer-to-peer) NVIDIA GPUs (NVLink vs PCIe)

Using the `p2pBandwidthLatencyTest` sample from https://github.com/NVIDIA/cuda-samples

You can also view the GPU topology using `nvidia-smi topo -m`.
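The `nvidia-smi topo -m` matrix is tab-separated and easy to post-process. A minimal sketch, assuming a simplified sample matrix (the text below is illustrative, not from the machine benchmarked here, and columns such as CPU Affinity are omitted):

```python
# Sketch: parse `nvidia-smi topo -m` style output into a dict of link types
# per GPU pair. SAMPLE is a hypothetical, simplified 3-GPU topology.
SAMPLE = """\
\tGPU0\tGPU1\tGPU2
GPU0\t X \tNV4\tSYS
GPU1\tNV4\t X \tSYS
GPU2\tSYS\tSYS\t X
"""

def parse_topology(text):
    lines = [line.split("\t") for line in text.strip().splitlines()]
    header = [c.strip() for c in lines[0] if c.strip().startswith("GPU")]
    links = {}
    for row in lines[1:]:
        src = row[0].strip()
        if not src.startswith("GPU"):
            continue
        for dst, cell in zip(header, (c.strip() for c in row[1:])):
            if src != dst:
                # e.g. "NV4" = NVLink (4 links), "SYS" = traversal via PCIe/SMP
                links[(src, dst)] = cell
    return links

links = parse_topology(SAMPLE)
print(links[("GPU0", "GPU1")])  # NV4
```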

  1. Clone the repo: `git clone https://github.com/NVIDIA/cuda-samples.git`
  2. Check out the tag that corresponds with your CUDA version, e.g. `git checkout tags/v11.1`
  3. You might need to install some additional packages: `sudo apt-get install freeglut3-dev build-essential libx11-dev libxmu-dev libxi-dev libgl1-mesa-glx libglu1-mesa libglu1-mesa-dev libglfw3-dev libgles2-mesa-dev`
  4. Either build everything by executing `make` in the root directory, or build just this test: `cd Samples/p2pBandwidthLatencyTest && make`
  5. Execute: `cd cuda-samples/bin/x86_64/linux/release && ./p2pBandwidthLatencyTest`
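The steps above can be sketched as one script. The `DRY_RUN` guard is my addition, not part of cuda-samples: by default it only prints the commands, so you can review them before running for real (building also assumes the CUDA toolkit and `make` are installed):

```shell
#!/usr/bin/env bash
# Sketch of the build-and-run steps above. DRY_RUN=1 (default) prints the
# commands instead of executing them; set DRY_RUN=0 to actually run.
set -euo pipefail
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"          # dry run: show the command only
    else
        "$@"                 # real run
    fi
}

run git clone https://github.com/NVIDIA/cuda-samples.git
run git -C cuda-samples checkout tags/v11.1   # match your CUDA version
run make -C cuda-samples/Samples/p2pBandwidthLatencyTest
run ./cuda-samples/bin/x86_64/linux/release/p2pBandwidthLatencyTest
```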
```
root@neox-sid-0:~/cuda-samples/bin/x86_64/linux/release# ./p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, A100-PCIE-40GB, pciBusID: 3b, pciDeviceID: 0, pciDomainID:0
Device: 1, A100-PCIE-40GB, pciBusID: 60, pciDeviceID: 0, pciDomainID:0
Device: 2, A100-PCIE-40GB, pciBusID: 61, pciDeviceID: 0, pciDomainID:0
Device: 3, A100-PCIE-40GB, pciBusID: 86, pciDeviceID: 0, pciDomainID:0
Device: 4, A100-PCIE-40GB, pciBusID: da, pciDeviceID: 0, pciDomainID:0
Device: 5, A100-PCIE-40GB, pciBusID: db, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CAN Access Peer Device=2
Device=0 CAN Access Peer Device=3
Device=0 CAN Access Peer Device=4
Device=0 CAN Access Peer Device=5
Device=1 CAN Access Peer Device=0
Device=1 CAN Access Peer Device=2
Device=1 CAN Access Peer Device=3
Device=1 CAN Access Peer Device=4
Device=1 CAN Access Peer Device=5
Device=2 CAN Access Peer Device=0
Device=2 CAN Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=2 CAN Access Peer Device=4
Device=2 CAN Access Peer Device=5
Device=3 CAN Access Peer Device=0
Device=3 CAN Access Peer Device=1
Device=3 CAN Access Peer Device=2
Device=3 CAN Access Peer Device=4
Device=3 CAN Access Peer Device=5
Device=4 CAN Access Peer Device=0
Device=4 CAN Access Peer Device=1
Device=4 CAN Access Peer Device=2
Device=4 CAN Access Peer Device=3
Device=4 CAN Access Peer Device=5
Device=5 CAN Access Peer Device=0
Device=5 CAN Access Peer Device=1
Device=5 CAN Access Peer Device=2
Device=5 CAN Access Peer Device=3
Device=5 CAN Access Peer Device=4
***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
   D\D     0     1     2     3     4     5
     0     1     1     1     1     1     1
     1     1     1     1     1     1     1
     2     1     1     1     1     1     1
     3     1     1     1     1     1     1
     4     1     1     1     1     1     1
     5     1     1     1     1     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D       0       1       2       3       4       5
     0 1160.85   11.01   11.00   11.07   11.08   11.06
     1   11.21 1165.18    9.72   11.19   11.14   11.20
     2   11.23    9.88 1262.12   11.13   11.08   11.11
     3   11.32   11.24   11.35 1293.46   11.12   11.14
     4   11.23   11.26   11.24   11.35 1292.39    9.86
     5   11.27   11.27   11.25   11.36    9.89 1291.32
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D       0       1       2       3       4       5
     0 1277.60   10.05   10.05  275.34    8.55    8.55
     1   10.04 1171.29  196.25    8.69    8.56    8.72
     2   10.04  216.16 1218.80    9.86    8.70    8.50
     3  200.30    8.93    8.93 1306.44   10.06   10.06
     4    9.48    8.90    8.90   10.05 1304.26  189.87
     5    9.61    8.90    8.90   10.05  275.24 1304.26
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D       0       1       2       3       4       5
     0 1190.02   15.46   15.45   15.94   15.82   15.81
     1   15.88 1306.98   10.44   15.90   15.71   15.74
     2   15.85   10.41 1303.17   15.87   15.72   15.72
     3   15.94   15.83   15.83 1305.89   15.83   15.72
     4   15.86   15.70   15.71   15.88 1306.98   10.37
     5   15.90   15.76   15.75   15.90   10.37 1308.08
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D       0       1       2       3       4       5
     0 1305.89   19.35   19.34  516.02   18.80   18.85
     1   19.34 1304.26  515.66   18.90   19.01   19.02
     2   19.35  517.03 1306.98   18.90   19.01   19.01
     3  515.34   18.91   18.92 1308.08   19.28   19.28
     4   18.81   19.01   19.00   19.28 1305.89  515.37
     5   18.82   19.00   19.02   19.28  517.31 1305.89
P2P=Disabled Latency Matrix (us)
   GPU      0      1      2      3      4      5
     0   2.37  19.64  19.84  18.85  19.41  19.69
     1  16.54   2.46  20.35  13.72  21.74  21.14
     2  18.13  20.62   2.26  18.76  21.45  21.75
     3  18.45  20.22  20.38   2.28  19.36  19.29
     4  19.62  19.32  17.96  18.82   2.30  20.61
     5  19.24  20.98  20.66  20.54  20.40   2.28
   CPU      0      1      2      3      4      5
     0   2.75   9.85   9.74  10.10  10.18  10.31
     1   9.95   2.67   9.42   9.97  10.04   9.84
     2   9.50   9.69   2.82   9.96   9.96  10.04
     3  10.10   9.92   9.98   3.15  10.49  10.32
     4  10.07   9.87  10.24  10.34   3.03  10.51
     5  10.29   9.89  10.07  10.25  10.54   3.16
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU      0      1      2      3      4      5
     0   2.38   2.22   2.22   2.31   2.27   2.26
     1   2.22   2.45   2.57   2.27   2.53   2.54
     2   2.27   2.58   2.28   2.26   2.51   2.53
     3   2.25   2.28   2.32   2.28   2.22   2.22
     4   2.32   2.55   2.54   2.22   2.30   2.53
     5   2.27   2.54   2.55   2.22   2.53   2.27
   CPU      0      1      2      3      4      5
     0   2.89   2.27   2.16   2.30   2.55   2.51
     1   2.31   2.83   2.32   2.34   2.37   2.39
     2   2.45   2.41   2.93   2.39   2.38   2.40
     3   2.69   2.60   2.61   3.10   2.47   2.50
     4   2.60   2.42   2.49   2.42   3.05   2.47
     5   2.52   2.49   2.61   2.51   2.59   3.11
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
```
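The matrices above can be post-processed programmatically. A minimal sketch that parses the unidirectional P2P=Enabled matrix from the run above and flags the GPU pairs whose P2P bandwidth is far above the PCIe numbers (~10 GB/s here), i.e. the NVLink-bridged pairs; the 50 GB/s threshold is my own arbitrary cutoff:

```python
# Sketch: extract a bandwidth matrix from p2pBandwidthLatencyTest output and
# flag the high-bandwidth (NVLink) GPU pairs. OUTPUT is an excerpt of the
# run shown above.
OUTPUT = """\
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\\D       0       1       2       3       4       5
     0 1277.60   10.05   10.05  275.34    8.55    8.55
     1   10.04 1171.29  196.25    8.69    8.56    8.72
     2   10.04  216.16 1218.80    9.86    8.70    8.50
     3  200.30    8.93    8.93 1306.44   10.06   10.06
     4    9.48    8.90    8.90   10.05 1304.26  189.87
     5    9.61    8.90    8.90   10.05  275.24 1304.26
"""

def parse_matrix(text):
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if parts and parts[0].isdigit():   # matrix rows start with the GPU index
            rows.append([float(v) for v in parts[1:]])
    return rows

def nvlink_pairs(matrix, threshold=50.0):
    # Off-diagonal entries well above PCIe bandwidth indicate an NVLink bridge.
    # The diagonal is local (device-to-itself) copy bandwidth, so skip it.
    pairs = set()
    for i, row in enumerate(matrix):
        for j, bw in enumerate(row):
            if i != j and bw > threshold:
                pairs.add(tuple(sorted((i, j))))
    return sorted(pairs)

m = parse_matrix(OUTPUT)
print(nvlink_pairs(m))  # [(0, 3), (1, 2), (4, 5)]
```

Consistent with the matrices above, GPUs 0-3, 1-2, and 4-5 are bridged pairs, while everything else goes over PCIe.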
@dctanner commented Feb 8, 2024:

From 11.8 the test is in `Samples/5_Domain_Specific/p2pBandwidthLatencyTest`
