@zrruziev
Last active November 19, 2024 03:24
Fixing "successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero" problem

What is NUMA (Non-Uniform Memory Access)?

Non-Uniform Memory Access (NUMA) is a memory design used in multiprocessor systems in which memory access time depends on where the memory sits relative to the processor. A processor reaches its local memory (memory attached to that processor) faster than remote memory (memory attached to another processor). In other words, NUMA is a way to keep memory access efficient when several processors share one motherboard. If one processor monopolizes a shared memory bus, the other processors have to sit idle; NUMA avoids this by dividing memory among the processors, designating each region as 'access only here', and calling each such region a NUMA node.

1. Check Nodes

lspci | grep -i nvidia
  
01:00.0 VGA compatible controller: NVIDIA Corporation TU106 [GeForce RTX 2060 12GB] (rev a1)
01:00.1 Audio device: NVIDIA Corporation TU106 High Definition Audio Controller (rev a1)

The first line shows that the VGA compatible device, the NVIDIA GeForce, has the address 01:00.0. The address differs from machine to machine, so substitute your own value in the commands below.
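As a small sketch of how the address could be pulled out automatically, the helper below filters lspci-style text for NVIDIA VGA controllers and prints only the first field. The function name nvidia_vga_addresses is made up for illustration, and the match string assumes the GPU is listed as "VGA compatible controller: NVIDIA", as in the output above.

```shell
# Hypothetical helper: read lspci-style output on stdin and print the PCI
# address (first space-separated field) of each NVIDIA VGA controller.
# Assumes the GPU line contains "VGA compatible controller: NVIDIA".
nvidia_vga_addresses() {
  grep -i 'VGA compatible controller: NVIDIA' | cut -d' ' -f1
}

# Usage on a real system:
#   lspci | nvidia_vga_addresses
```

The audio function of the card (01:00.1 above) is skipped on purpose, because only the VGA function is used in the following steps.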

2. Check and change NUMA setting values

If you go to /sys/bus/pci/devices/, you can see the following list:

ls /sys/bus/pci/devices/
  
0000:00:00.0  0000:00:06.0  0000:00:15.0  0000:00:1c.0  0000:00:1f.3  0000:00:1f.6  0000:02:00.0
0000:00:01.0  0000:00:14.0  0000:00:16.0  0000:00:1d.0  0000:00:1f.4  0000:01:00.0
0000:00:02.0  0000:00:14.2  0000:00:17.0  0000:00:1f.0  0000:00:1f.5  0000:01:00.1

The 01:00.0 address found above appears in the list, with the PCI domain 0000: prepended.
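The mapping from an lspci address to its sysfs file can be sketched as follows. The function name numa_node_path is hypothetical, and defaulting to domain 0000 is an assumption that holds for the listing above but may not hold on every machine; running lspci -D prints the domain explicitly if you want to check.

```shell
# Hypothetical helper: build the sysfs numa_node path for a PCI address.
# Accepts the short form (01:00.0) or the full form (0000:01:00.0).
numa_node_path() {
  addr=$1
  case $addr in
    # Already starts with a four-hex-digit domain: keep it as is.
    [0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F]:*) ;;
    # No domain given: assume the default domain 0000.
    *) addr="0000:$addr" ;;
  esac
  printf '/sys/bus/pci/devices/%s/numa_node\n' "$addr"
}

# Usage:
#   numa_node_path 01:00.0
```

Accepting both forms avoids prepending the domain twice when the address already carries one.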

3. Check if it is connected.

cat /sys/bus/pci/devices/0000\:01\:00.0/numa_node
  
-1

-1 means the kernel does not know which NUMA node the device is attached to; 0 means it is attached to NUMA node 0.
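To see the value for every PCI device at once rather than one file at a time, a loop along these lines could be used. The function name list_numa_nodes is hypothetical; the directory argument exists only so the sketch can be pointed at a test directory, and it defaults to the real sysfs location used above.

```shell
# Hypothetical helper: print "<pci-address> <numa_node>" for every device
# directory that has a readable numa_node file.
# $1: directory to scan (defaults to /sys/bus/pci/devices).
list_numa_nodes() {
  devices_dir=${1:-/sys/bus/pci/devices}
  for dev in "$devices_dir"/*; do
    # Skip entries without a readable numa_node file.
    [ -r "$dev/numa_node" ] || continue
    printf '%s %s\n' "$(basename "$dev")" "$(cat "$dev/numa_node")"
  done
}

# Usage on a real system:
#   list_numa_nodes
```

On a single-socket desktop it is common for every device to report -1, since the kernel ABI documentation describes -1 as "node unknown".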

4. Fix it with the command below.

echo 0 | sudo tee /sys/bus/pci/devices/0000\:01\:00.0/numa_node
  
0

tee prints the value it wrote, 0, which means the device is now assigned to NUMA node 0!

5. Check again:

cat /sys/bus/pci/devices/0000\:01\:00.0/numa_node
  
0

That's it!
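Note that sysfs is rebuilt by the kernel at every boot, so the value goes back to -1 after a restart, as several commenters below report. One possible workaround, sketched here under assumptions, is to re-apply the write at boot. The function name fix_numa_node and the script path in the comments are hypothetical, and the @reboot crontab entry assumes a cron implementation that supports it.

```shell
# Hypothetical helper: write 0 to a numa_node file only when it holds -1.
# $1: path to the numa_node file.
# Returns 1 when the file does not exist, 0 otherwise.
fix_numa_node() {
  file=$1
  [ -f "$file" ] || return 1
  if [ "$(cat "$file")" = "-1" ]; then
    echo 0 > "$file"
  fi
}

# Usage (as root), with the address from the steps above:
#   fix_numa_node /sys/bus/pci/devices/0000:01:00.0/numa_node
#
# To repeat this at every boot, one option is to save the function plus the
# call above in a script (hypothetical path: /usr/local/sbin/fix-numa-node.sh)
# and add this line to root's crontab with "sudo crontab -e":
#   @reboot /usr/local/sbin/fix-numa-node.sh
```

Checking for -1 first leaves any value that the platform already reports correctly untouched.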

@DarkShadowxx

Thank you!

@AlexLandauer

Thanks for the script -- here's a version for a multi-GPU system:

#!/bin/bash
# Set numa_node to 0 for every NVIDIA VGA controller that currently reports -1.
if [[ "$EUID" -ne 0 ]]; then
  echo "Please run as root."
  exit 1
fi
# Short PCI addresses (e.g. 01:00.0) of all NVIDIA VGA controllers.
PCI_ID=$(lspci | grep "VGA compatible controller: NVIDIA Corporation" | cut -d' ' -f1)
for item in $PCI_ID
do
  # lspci usually omits the PCI domain on single-domain machines, so prepend 0000:.
  item="0000:$item"
  FILE="/sys/bus/pci/devices/$item/numa_node"
  echo "Checking $FILE for NUMA connection status..."
  if [[ -f "$FILE" ]]; then
    CURRENT_VAL=$(cat "$FILE")
    if [[ "$CURRENT_VAL" -eq -1 ]]; then
      echo "Setting connection value from -1 to 0."
      echo 0 > "$FILE"
    else
      echo "Current connection value of $CURRENT_VAL is not -1."
    fi
  else
    echo "$FILE does not exist to update."
  fi
done

@AlexLandauer

AlexLandauer commented Sep 25, 2023 via email

@Eugeniusz-Gienek

For some reason, on my system (Gentoo) the path was slightly different, so I had to comment out this line:
#item="0000:$item"
Hope that helps someone.

@onlineapps-cloud

Hi, I have a problem: after rebooting, my value is restored to -1.
Here is what I did:

lspci | grep -i nvidia
06:10.0 VGA compatible controller: NVIDIA Corporation GP104 [GeForce GTX 1070] (rev a1)
06:10.1 Audio device: NVIDIA Corporation GP104 High Definition Audio Controller (rev a1)

Then I made the change:
sudo echo 0 | sudo tee -a /sys/bus/pci/devices/0000\:06\:10.0/numa_node

Then I checked:
cat /sys/bus/pci/devices/0000\:06\:10.0/numa_node returns 0

After rebooting and running cat /sys/bus/pci/devices/0000\:06\:10.0/numa_node again, I get -1. How can I make this value persist across reboots?

OS: Ubuntu 22.04. Thanks, best regards

@ZeroX29a

ZeroX29a commented Apr 6, 2024

Thank you so muuch
This should be first in google,
for SEO
[solved] tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:880] could not open file to read NUMA node: /sys/bus/pci/devices/0000:00:00.0/numa_node

@mikechen66

mikechen66 commented May 3, 2024

It helps just one time, i.e., it shows 0 once after making the change; however, it is not a persistent solution. After rebooting, it still shows "successful NUMA node read from SysFS had negative value (-1),..."

python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

2024-05-03 17:14:58.945528: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-05-03 17:15:00.791206: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-05-03 17:15:02.618108: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-05-03 17:15:03.056970: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-05-03 17:15:03.057654: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

After checking it again, it shows the same error as follows.

cat /sys/bus/pci/devices/0000\:01\:00.0/numa_node

-1

Please help to solve the problem.

Thanks

@wyike

wyike commented May 8, 2024

Thanks a lot for the manual fix!

If this value is -1 in the file, will it cause any problem?
