The AMD EPYC machines have 8 memory channels per socket. On a 32-core CPU like the EPYC 7502, 8 consecutive cores share 2 channels, and on a 64-core CPU like the EPYC 7742, 16 consecutive cores share 2 channels.
To reach the ideal memory bandwidth, we must first populate the RAM slots so that all 8 channels are used, with about 32 GB of memory per channel, as elaborated in this document.
However, populating the RAM slots correctly is not enough on its own to reach the ideal bandwidth. The AMD chips have an NPS (NUMA nodes per socket) setting that lets the user control how many NUMA nodes the system memory is partitioned into. This document elaborates on the various NPS configurations for the AMD EPYC.
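As a rough illustration of what the NPS setting changes, the sketch below computes how many cores and memory channels end up in each NUMA node. It assumes 8 channels per socket (as stated above), that NPS is one of 1, 2 or 4, and that cores and channels divide evenly among the nodes. The function name `numa_layout` is made up for this example.

```python
def numa_layout(cores_per_socket, nps, channels_per_socket=8):
    """Return the per-socket NUMA layout for a given NPS setting.

    Assumption: cores and channels are split evenly across the NUMA nodes.
    """
    if nps not in (1, 2, 4):
        raise ValueError("this sketch only covers NPS1, NPS2 and NPS4")
    return {
        "numa_nodes": nps,
        "cores_per_node": cores_per_socket // nps,
        "channels_per_node": channels_per_socket // nps,
    }


# EPYC 7502 (32 cores) and EPYC 7742 (64 cores) under NPS4 and NPS1
for cores in (32, 64):
    for nps in (4, 1):
        print(cores, "cores, NPS%d:" % nps, numa_layout(cores, nps))
```

Under NPS4 this reproduces the grouping described earlier: 8 cores per 2 channels on the 32-core part and 16 cores per 2 channels on the 64-core part.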
If you look at the DGX A100 CPU, you can see that the RAM is partitioned into 4 NUMA nodes. Using the thread affinity configuration
export GOMP_CPU_AFFINITY=0-63:2
leads to the following STREAM results:
size(KB)  SIZE       TIME(s)    CYCLES/VL  CYCLES  BANDWIDTH(GB/s)  THREADS
1048576   134217728  0.142958   8.52098    8       27.9802          1
1048576   134217728  0.121546   7.24473    8       32.9093          2
1048576   134217728  0.123639   7.36945    8       32.3523          4
1048576   134217728  0.107425   6.40305    4       37.2352          8
1048576   134217728  0.0859541  5.12327    4       46.5364          10
1048576   134217728  0.0716892  4.27301    4       55.7964          12
1048576   134217728  0.0537597  3.20432    0       74.4052          16
1048576   134217728  0.0430324  2.56493    0       92.9532          20
1048576   134217728  0.0359085  2.14031    0       111.394          24
1048576   134217728  0.0270584  1.6128     0       147.829          32
You can see that about 148 GB/s is reached with 32 threads in this configuration. With a peak of about 160 GB/s, this is pretty good.
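To make the affinity setting used above concrete, here is a small sketch that expands a GOMP_CPU_AFFINITY string into the list of CPU ids it names. It is a simplified parser written for illustration, not libgomp's own code: it handles only single CPU numbers, `M-N` ranges and `M-N:S` strided ranges, separated by spaces or commas. The function name `expand_gomp_cpu_affinity` is hypothetical.

```python
import re


def expand_gomp_cpu_affinity(spec):
    """Expand a string such as '0-63:2' into a list of CPU ids.

    Simplified: no validation beyond what int() performs.
    """
    cpus = []
    for item in re.split(r"[,\s]+", spec.strip()):
        if not item:
            continue
        cpu_range, _, stride = item.partition(":")
        start, _, end = cpu_range.partition("-")
        first = int(start)
        last = int(end) if end else first
        step = int(stride) if stride else 1
        cpus.extend(range(first, last + 1, step))
    return cpus


cpus = expand_gomp_cpu_affinity("0-63:2")
print(len(cpus), cpus[:4], cpus[-1])
```

With `0-63:2`, threads are bound to every second CPU from 0 to 62, which gives 32 binding slots spread across the first 64 CPU ids.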
On the other hand, the EPYC 7502 node is set up to use NPS1, which means that the entire RAM is a single NUMA node. With such a configuration, numactl or NUMA-aware memory management cannot tell where to place memory relative to the cores that use it.
The STREAM TRIAD results with the thread affinity configuration
export GOMP_CPU_AFFINITY=0-31:1
are as follows:
size(KB)  SIZE       TIME(s)    CYCLES/VL  CYCLES  BANDWIDTH(GB/s)  THREADS
1048576   134217728  0.134427   8.01247    8       29.7559          1
1048576   134217728  0.102203   6.0918     4       39.1376          2
1048576   134217728  0.10135    6.04093    4       39.4672          4
1048576   134217728  0.102667   6.11941    4       38.961           8
1048576   134217728  0.083575   4.98146    4       47.8612          10
1048576   134217728  0.071121   4.23914    4       56.2422          12
1048576   134217728  0.0560699  3.34203    4       71.3395          16
1048576   134217728  0.0477662  2.84709    0       83.7412          20
1048576   134217728  0.0419875  2.50265    0       95.2663          24
1048576   134217728  0.0351879  2.09736    0       113.675          32
Even though both machines have the same peak memory bandwidth, the DGX A100 reaches about 148 GB/s with 32 threads while the EPYC 7502 node reaches only about 114 GB/s. The difference is due to the NPS4 configuration.
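The BANDWIDTH column in both tables can be cross-checked from the TIME column. The sketch below assumes each TRIAD pass moves four arrays' worth of data (two reads, one write, plus a write-allocate read) and that a gigabyte is counted as 2^30 bytes. This is inferred from the numbers in the tables, not confirmed by the benchmark source, and the function name `triad_bandwidth` is made up for the example.

```python
GIB = 1024 ** 3


def triad_bandwidth(size_kb, time_s, arrays_moved=4):
    """Estimate bandwidth from the per-array size in KB and the run time.

    Assumption: arrays_moved=4 (2 reads + 1 write + 1 write-allocate read).
    """
    bytes_moved = arrays_moved * size_kb * 1024
    return bytes_moved / time_s / GIB


# 32-thread rows from the two tables
dgx_a100 = triad_bandwidth(1048576, 0.0270584)
epyc_7502 = triad_bandwidth(1048576, 0.0351879)
ratio = dgx_a100 / epyc_7502

print(round(dgx_a100, 2), round(epyc_7502, 2), round(ratio, 2))
```

Under these assumptions the recomputed values agree with the reported 147.829 and 113.675 to within rounding, and the NPS4 machine comes out roughly 30% ahead at 32 threads.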
To close this gap, change the BIOS settings of the EPYC 7502 server to use the NPS4 configuration.
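After changing the BIOS setting, one way to check how many NUMA nodes the OS actually sees is to read the Linux sysfs node directory. The sketch below assumes a Linux system that exposes `/sys/devices/system/node`. The function name `numa_nodes` and its `sysfs_root` parameter are introduced here for illustration only.

```python
import glob
import os


def numa_nodes(sysfs_root="/sys/devices/system/node"):
    """Return {node_id: cpulist string} for each NUMA node found in sysfs."""
    nodes = {}
    for path in glob.glob(os.path.join(sysfs_root, "node[0-9]*")):
        cpulist = os.path.join(path, "cpulist")
        if not os.path.isfile(cpulist):
            continue  # skip anything that is not a populated node directory
        node_id = int(os.path.basename(path)[len("node"):])
        with open(cpulist) as f:
            nodes[node_id] = f.read().strip()
    return dict(sorted(nodes.items()))


for node_id, cpus in numa_nodes().items():
    print("node%d: CPUs %s" % (node_id, cpus))
```

A single-socket machine in NPS1 would be expected to list one node here, and four nodes once NPS4 is active. The same information is also available from `numactl --hardware`.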