@zfb132
Forked from zrruziev/NUMA node problem.md
Created June 12, 2023 06:38
Fixing "successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero" problem

What is NUMA (Non-Uniform Memory Access)

Non-Uniform Memory Access (NUMA) is a memory design used in multiprocessor systems in which the time to access memory depends on where that memory sits relative to the processor. Local memory is memory connected to the processor itself, and remote memory is memory connected to another processor. A processor accesses its local memory faster than it accesses remote memory. In other words, NUMA is a technique for keeping memory access efficient when multiple processors share one motherboard. If all processors shared a single memory bus, one processor could monopolize the bus and leave the others waiting. NUMA avoids this by dividing memory among the processors and designating the region each processor should prefer; each such group of processor and local memory is called a NUMA node.
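
To see how many NUMA nodes the kernel knows about on a given machine, one option is to count the `nodeN` directories that Linux exposes under `/sys/devices/system/node`. The sketch below assumes that layout; the function name `count_numa_nodes` and its optional directory argument are hypothetical, added only so the function can be pointed at a test directory.

```shell
# Count NUMA nodes by counting nodeN directories.
# The optional argument (directory to scan) is a hypothetical convenience;
# by default the function reads the kernel's sysfs tree.
count_numa_nodes() {
    dir="${1:-/sys/devices/system/node}"
    n=0
    for d in "$dir"/node[0-9]*; do
        # An unmatched glob stays literal, so check that the entry exists.
        [ -d "$d" ] && n=$((n + 1))
    done
    echo "$n"
}
```

Calling `count_numa_nodes` with no argument prints the node count for the running system. A typical single-socket desktop is likely to report 1.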

1. Check Nodes

lspci | grep -i nvidia
  
01:00.0 VGA compatible controller: NVIDIA Corporation TU106 [GeForce RTX 2060 12GB] (rev a1)
01:00.1 Audio device: NVIDIA Corporation TU106 High Definition Audio Controller (rev a1)

The first line shows the address of the VGA compatible device, the NVIDIA GeForce card, as 01:00.0. The address differs from machine to machine, so substitute your own value in the commands that follow.

2. Check and change NUMA setting values

If you go to /sys/bus/pci/devices/, you can see the following list:

ls /sys/bus/pci/devices/
  
0000:00:00.0  0000:00:06.0  0000:00:15.0  0000:00:1c.0  0000:00:1f.3  0000:00:1f.6  0000:02:00.0
0000:00:01.0  0000:00:14.0  0000:00:16.0  0000:00:1d.0  0000:00:1f.4  0000:01:00.0
0000:00:02.0  0000:00:14.2  0000:00:17.0  0000:00:1f.0  0000:00:1f.5  0000:01:00.1

The 01:00.0 address found above appears here as 0000:01:00.0. The 0000: prefix is the PCI domain, which lspci leaves out by default on machines that have only domain 0.
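
The mapping from the short address to the sysfs directory name can be sketched as a small helper. The function name `to_sysfs_addr` is hypothetical, and it assumes the device is in domain 0000 whenever the address has no domain. Running `lspci -D` instead prints the domain explicitly, which avoids that assumption.

```shell
# Convert a PCI address as printed by lspci into the sysfs directory name.
# Assumption: an address without a domain belongs to domain 0000.
to_sysfs_addr() {
    case "$1" in
        ????:??:??.?) printf '%s\n' "$1" ;;        # domain already present
        ??:??.?)      printf '0000:%s\n' "$1" ;;   # prepend the default domain
        *)            echo "unrecognized PCI address: $1" >&2; return 1 ;;
    esac
}
```

For example, `to_sysfs_addr 01:00.0` prints `0000:01:00.0`, which is the directory name under /sys/bus/pci/devices/.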

3. Check if it is connected.

cat /sys/bus/pci/devices/0000\:01\:00.0/numa_node
  
-1

-1 means the kernel has no NUMA node recorded for the device (no connection), and 0 means the device is attached to NUMA node 0.
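
If the machine has more than one NVIDIA device, the same check can be run over all of them. The following is a minimal sketch, not part of the original procedure: it matches devices by NVIDIA's PCI vendor ID (0x10de) in each device's `vendor` file. The function name `list_nvidia_numa` and its optional directory argument are hypothetical.

```shell
# Print "<address> <numa_node>" for every NVIDIA PCI function.
# 0x10de is NVIDIA's PCI vendor ID. The optional argument (directory to
# scan) is a hypothetical convenience for trying the function on a copy.
list_nvidia_numa() {
    dir="${1:-/sys/bus/pci/devices}"
    for dev in "$dir"/*; do
        [ -r "$dev/vendor" ] || continue
        [ "$(cat "$dev/vendor")" = "0x10de" ] || continue
        printf '%s %s\n' "${dev##*/}" "$(cat "$dev/numa_node" 2>/dev/null || echo unknown)"
    done
}
```

On the example machine, this would list both 0000:01:00.0 and 0000:01:00.1, since the GPU's audio controller is also an NVIDIA function.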

4. Fix it with the command below.

echo 0 | sudo tee /sys/bus/pci/devices/0000\:01\:00.0/numa_node
  
0

tee prints the value it wrote, 0, which means the device is now assigned to NUMA node 0.

5. Check again:

cat /sys/bus/pci/devices/0000\:01\:00.0/numa_node
  
0

That's it!
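
The value lives in sysfs, so it is reset on every reboot and the write has to be repeated. One way to automate this is sketched below. It assumes every NVIDIA function that still reports -1 should be assigned to the same node. The function name `fix_nvidia_numa` and its two optional arguments (target node, directory to scan) are hypothetical. Writing to the real sysfs tree requires root.

```shell
# Assign a NUMA node to every NVIDIA PCI function that reports -1.
# Arguments (both hypothetical conveniences): target node, default 0;
# directory to scan, default the kernel's PCI device tree.
fix_nvidia_numa() {
    node="${1:-0}"
    dir="${2:-/sys/bus/pci/devices}"
    for dev in "$dir"/*; do
        [ -r "$dev/vendor" ] || continue
        [ "$(cat "$dev/vendor")" = "0x10de" ] || continue      # NVIDIA only
        [ "$(cat "$dev/numa_node")" = "-1" ] || continue       # already set
        echo "$node" > "$dev/numa_node" && echo "${dev##*/}: numa_node set to $node"
    done
}
```

Saved in a script that ends with a call to `fix_nvidia_numa 0` and run as root at boot, this would re-apply the setting. Which boot mechanism to use is left open here.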
