root@neox-sid-0:~/cuda-samples/bin/x86_64/linux/release# ./p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, A100-PCIE-40GB, pciBusID: 3b, pciDeviceID: 0, pciDomainID:0
Device: 1, A100-PCIE-40GB, pciBusID: 60, pciDeviceID: 0, pciDomainID:0
Device: 2, A100-PCIE-40GB, pciBusID: 61, pciDeviceID: 0, pciDomainID:0
Device: 3, A100-PCIE-40GB, pciBusID: 86, pciDeviceID: 0, pciDomainID:0
Device: 4, A100-PCIE-40GB, pciBusID: da, pciDeviceID: 0, pciDomainID:0
Device: 5, A100-PCIE-40GB, pciBusID: db, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CAN Access Peer Device=2
Device=0 CAN Access Peer Device=3
Device=0 CAN Access Peer Device=4
Device=0 CAN Access Peer Device=5
Device=1 CAN Access Peer Device=0
Device=1 CAN Access Peer Device=2
Device=1 CAN Access Peer Device=3
Device=1 CAN Access Peer Device=4
Device=1 CAN Access Peer Device=5
Device=2 CAN Access Peer Device=0
Device=2 CAN Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=2 CAN Access Peer Device=4
Device=2 CAN Access Peer Device=5
Device=3 CAN Access Peer Device=0
Device=3 CAN Access Peer Device=1
Device=3 CAN Access Peer Device=2
Device=3 CAN Access Peer Device=4
Device=3 CAN Access Peer Device=5
Device=4 CAN Access Peer Device=0
Device=4 CAN Access Peer Device=1
Device=4 CAN Access Peer Device=2
Device=4 CAN Access Peer Device=3
Device=4 CAN Access Peer Device=5
Device=5 CAN Access Peer Device=0
Device=5 CAN Access Peer Device=1
Device=5 CAN Access Peer Device=2
Device=5 CAN Access Peer Device=3
Device=5 CAN Access Peer Device=4

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
D\D   0   1   2   3   4   5
0     1   1   1   1   1   1
1     1   1   1   1   1   1
2     1   1   1   1   1   1
3     1   1   1   1   1   1
4     1   1   1   1   1   1
5     1   1   1   1   1   1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D        0        1        2        3        4        5
0    1160.85    11.01    11.00    11.07    11.08    11.06
1      11.21  1165.18     9.72    11.19    11.14    11.20
2      11.23     9.88  1262.12    11.13    11.08    11.11
3      11.32    11.24    11.35  1293.46    11.12    11.14
4      11.23    11.26    11.24    11.35  1292.39     9.86
5      11.27    11.27    11.25    11.36     9.89  1291.32
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
D\D        0        1        2        3        4        5
0    1277.60    10.05    10.05   275.34     8.55     8.55
1      10.04  1171.29   196.25     8.69     8.56     8.72
2      10.04   216.16  1218.80     9.86     8.70     8.50
3     200.30     8.93     8.93  1306.44    10.06    10.06
4       9.48     8.90     8.90    10.05  1304.26   189.87
5       9.61     8.90     8.90    10.05   275.24  1304.26
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D        0        1        2        3        4        5
0    1190.02    15.46    15.45    15.94    15.82    15.81
1      15.88  1306.98    10.44    15.90    15.71    15.74
2      15.85    10.41  1303.17    15.87    15.72    15.72
3      15.94    15.83    15.83  1305.89    15.83    15.72
4      15.86    15.70    15.71    15.88  1306.98    10.37
5      15.90    15.76    15.75    15.90    10.37  1308.08
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D        0        1        2        3        4        5
0    1305.89    19.35    19.34   516.02    18.80    18.85
1      19.34  1304.26   515.66    18.90    19.01    19.02
2      19.35   517.03  1306.98    18.90    19.01    19.01
3     515.34    18.91    18.92  1308.08    19.28    19.28
4      18.81    19.01    19.00    19.28  1305.89   515.37
5      18.82    19.00    19.02    19.28   517.31  1305.89
P2P=Disabled Latency Matrix (us)
GPU      0      1      2      3      4      5
0     2.37  19.64  19.84  18.85  19.41  19.69
1    16.54   2.46  20.35  13.72  21.74  21.14
2    18.13  20.62   2.26  18.76  21.45  21.75
3    18.45  20.22  20.38   2.28  19.36  19.29
4    19.62  19.32  17.96  18.82   2.30  20.61
5    19.24  20.98  20.66  20.54  20.40   2.28

CPU      0      1      2      3      4      5
0     2.75   9.85   9.74  10.10  10.18  10.31
1     9.95   2.67   9.42   9.97  10.04   9.84
2     9.50   9.69   2.82   9.96   9.96  10.04
3    10.10   9.92   9.98   3.15  10.49  10.32
4    10.07   9.87  10.24  10.34   3.03  10.51
5    10.29   9.89  10.07  10.25  10.54   3.16
P2P=Enabled Latency (P2P Writes) Matrix (us)
GPU      0      1      2      3      4      5
0     2.38   2.22   2.22   2.31   2.27   2.26
1     2.22   2.45   2.57   2.27   2.53   2.54
2     2.27   2.58   2.28   2.26   2.51   2.53
3     2.25   2.28   2.32   2.28   2.22   2.22
4     2.32   2.55   2.54   2.22   2.30   2.53
5     2.27   2.54   2.55   2.22   2.53   2.27

CPU      0      1      2      3      4      5
0     2.89   2.27   2.16   2.30   2.55   2.51
1     2.31   2.83   2.32   2.34   2.37   2.39
2     2.45   2.41   2.93   2.39   2.38   2.40
3     2.69   2.60   2.61   3.10   2.47   2.50
4     2.60   2.42   2.49   2.42   3.05   2.47
5     2.52   2.49   2.61   2.51   2.59   3.11

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
From 11.8 onward, the sample is located in Samples/5_Domain_Specific/p2pBandwidthLatencyTest
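
To read a matrix like the ones above without scanning it by eye, here is a minimal Python sketch. It parses the "Unidirectional P2P=Enabled" bandwidth matrix (pasted in verbatim) and lists the GPU pairs whose bandwidth is well above the rest in both directions. The helper names `parse_matrix` and `fast_pairs` and the 100 GB/s cutoff are assumptions made for illustration; they are not part of the CUDA sample. The log does not say what kind of link connects the fast pairs, so the sketch only reports them.

```python
# Unidirectional P2P=Enabled bandwidth matrix (GB/s), copied from the log above.
# A raw string is used so the "D\D" header needs no escaping.
P2P_WRITE_BW = r"""
D\D 0 1 2 3 4 5
0 1277.60 10.05 10.05 275.34 8.55 8.55
1 10.04 1171.29 196.25 8.69 8.56 8.72
2 10.04 216.16 1218.80 9.86 8.70 8.50
3 200.30 8.93 8.93 1306.44 10.06 10.06
4 9.48 8.90 8.90 10.05 1304.26 189.87
5 9.61 8.90 8.90 10.05 275.24 1304.26
"""


def parse_matrix(text):
    """Return the matrix as a list of float rows, skipping the header line."""
    rows = []
    for line in text.strip().splitlines():
        fields = line.split()
        # Data rows start with the integer device index; the header does not.
        if not fields or not fields[0].isdigit():
            continue
        rows.append([float(value) for value in fields[1:]])
    return rows


def fast_pairs(matrix, threshold_gbps=100.0):
    """List (i, j) pairs with i < j whose bandwidth exceeds the cutoff in both directions.

    The default cutoff is an arbitrary assumption chosen to separate the
    ~10 GB/s entries from the ~200 GB/s entries in this particular log.
    """
    n = len(matrix)
    return [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if min(matrix[i][j], matrix[j][i]) > threshold_gbps
    ]


print(fast_pairs(parse_matrix(P2P_WRITE_BW)))  # prints [(0, 3), (1, 2), (4, 5)]
```

For this log the fast pairs are 0-3, 1-2 and 4-5; every other off-diagonal entry stays around 8 to 10 GB/s. The same two helpers work on the bidirectional matrix if its text is pasted in instead.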