Skip to content

Instantly share code, notes, and snippets.

@v0dro
Created July 12, 2021 00:31
Show Gist options
  • Save v0dro/0559e1d9b205ca9ac0ad1a29fe10377b to your computer and use it in GitHub Desktop.
Save v0dro/0559e1d9b205ca9ac0ad1a29fe10377b to your computer and use it in GitHub Desktop.
AMD NUMA configuration.

Table of Contents

  1. AMD NUMA configuration
  2. Solution

AMD NUMA configuration

The AMD EPYC machines have 8 memory channels. On a 32 core CPU like EPYC 7502, 8 consecutive cores share 2 channels and on a 64 core CPU like EPYC 7742, 16 cores share 2 channels.

For the ideal memory bandwidth to be reached, we must first configure the RAM sticks to use all the 8 channels by using about 32 GB memory per channel as is elaborated in this document.

However, using RAM slots in the ideal manner is not the ideal of making sure the ideal bandwidth is achieved. The AMD chips have the NPS* configuration that allows the user to control how many NUMA nodes the system memory is partitioned into. This document elaborates on the various NPS configurations for the AMD EPYC.

If you look at the DGX A100 CPU, you can see that the RAM is partitioned into 4 NUMA nodes, and using the thread affinity configuration of export GOMP_CPU_AFFINITY=0-63:2 leads to the following STREAM results:

size(KB)	SIZE			TIME(s)		CYCLES/VL	CYCLES		BANDWIDTH	THREADS
1048576		134217728		0.142958	8.52098		8		27.9802		1
1048576		134217728		0.121546	7.24473		8		32.9093		2
1048576		134217728		0.123639	7.36945		8		32.3523		4
1048576		134217728		0.107425	6.40305		4		37.2352		8
1048576		134217728		0.0859541	5.12327		4		46.5364		10
1048576		134217728		0.0716892	4.27301		4		55.7964		12
1048576		134217728		0.0537597	3.20432		0		74.4052		16
1048576		134217728		0.0430324	2.56493		0		92.9532		20
1048576		134217728		0.0359085	2.14031		0		111.394		24
1048576		134217728		0.0270584	1.6128		0		147.829		32

You can see that about 148 GBPS is reached for 32 threads using this configuration. The peak being about 160, this is pretty good.

On the other hand, the EPYC-7502 node is setup to use NPS1, which means that the entire RAM is a single NUMA node. This means that numactl or NUMA-aware memory management is not able to detect where to place the memory due to such a configuration. The STREAM TRIAD using export GOMP_CPU_AFFINITY=0-31:1 is as follows:

size(KB)	SIZE			TIME(s)		CYCLES/VL	CYCLES		BANDWIDTH	THREADS
1048576		134217728		0.134427	8.01247		8		29.7559		1
1048576		134217728		0.102203	6.0918		4		39.1376		2
1048576		134217728		0.10135		6.04093		4		39.4672		4
1048576		134217728		0.102667	6.11941		4		38.961		8
1048576		134217728		0.083575	4.98146		4		47.8612		10
1048576		134217728		0.071121	4.23914		4		56.2422		12
1048576		134217728		0.0560699	3.34203		4		71.3395		16
1048576		134217728		0.0477662	2.84709		0		83.7412		20
1048576		134217728		0.0419875	2.50265		0		95.2663		24
1048576		134217728		0.0351879	2.09736		0		113.675		32

Even though the peak memory bandwidth of both the machines is the same, the DGX A100 can reach more bandwidth due to the NPS4 configuration.

Solution

Change the BIOS settings of the EPYC 7502 server in order to use the NPS4 configuration.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment