@augustin-laurent
Last active February 12, 2025 15:57
ROCm Installation guide on Arch
Date of the guide: December 17, 2023

Introduction

In this post, I will describe the solution that worked on my system for installing Radeon Open Compute (ROCm) on Arch (linux-6.6.7.arch1-1) for an RX 6900 XT (it should also work on other 6000-series cards). ROCm is an open-source software platform for GPU-accelerated computation, and it is a prerequisite for GPU acceleration in TensorFlow or PyTorch. In this guide I will use Paru as my AUR helper; feel free to use any other (https://wiki.archlinux.org/title/AUR_helpers). I will assume you have a working operating system and know your way around it (otherwise Arch will be painful for you).

Prerequisites

  • A computer running Arch
  • Access to AUR
  • A compatible AMD GPU

Step-by-Step Guide

  1. Update Your System: First, make sure your system is up-to-date. Open your terminal and run:

    sudo pacman -Syu
    paru -Syu (or yay -Syu, depending on which AUR helper you use)
  2. Install prerequisites: You will need some packages to fetch and compile ROCm. In the terminal, run:

    paru -S wget make curl gperftools
  3. Install PyEnv: I chose to install PyEnv to manage my Python versions; you can install Python directly if the version you use is compatible with ROCm.

    curl https://pyenv.run | bash

    Add these lines to your .bashrc (located in your /home/username folder), using the editor of your choice (vim ~/.bashrc or nano ~/.bashrc):

    export PATH="$HOME/.pyenv/bin:$PATH"
    eval "$(pyenv init --path)"
    eval "$(pyenv virtualenv-init -)"

    Now refresh the shell, either by closing and reopening your terminal or by executing:

    exec $SHELL

    To make sure PyEnv is installed, run:

    pyenv

    If the list of commands is printed in your terminal, the installation succeeded; otherwise, see the PyEnv wiki to troubleshoot the install: https://github.com/pyenv/pyenv/wiki

  4. Install Python:

    Now that we have PyEnv, we can install Python. In this guide I will use Python 3.10.13 (the latest version supported at the time of writing).

    pyenv install 3.10.13

    PyEnv has installed version 3.10.13; now we need to tell the system to use it.

    pyenv global 3.10.13

    To ensure we have the right version, execute:

    python --version

    If the command returns 3.10.13, your system is using the version you just installed.

  5. Install ROCm:

    Now that everything is set up, we can install ROCm. In your terminal, run:

    paru -S rocm-hip-sdk rocm-opencl-sdk

    You now have ROCm installed, but a few more steps are needed to make it work.

  6. Configure the environment:

    You will need to add your user to the render and video groups.

    sudo gpasswd -a username render
    sudo gpasswd -a username video

    Then edit .bashrc again and add:

    export ROCM_PATH=/opt/rocm
    export HSA_OVERRIDE_GFX_VERSION=10.3.0

    If your GPU is from the 7000 series, change the 10.3.0 value to 11.0.0. From there you should have a working ROCm environment; the next step is to try it out.
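The two exports above can be combined into a small snippet that picks the override value from the GPU generation. This is a sketch under the assumption that 10.3.0 covers the 6000 series and 11.0.0 the 7000 series, as described above; the series variable is a hypothetical placeholder, not part of ROCm.

```shell
# Sketch: pick HSA_OVERRIDE_GFX_VERSION based on the GPU series.
# "series" is a placeholder -- set it to your card's generation.
series=6000   # use 7000 for RX 7000-series cards

export ROCM_PATH=/opt/rocm
if [ "$series" -ge 7000 ]; then
    export HSA_OVERRIDE_GFX_VERSION=11.0.0
else
    export HSA_OVERRIDE_GFX_VERSION=10.3.0
fi
echo "$HSA_OVERRIDE_GFX_VERSION"
```

You can paste the resulting export lines into your .bashrc as shown above.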

  7. Testing ROCm with Tensorflow:

    Now install the library with pip and clone the test repository.

    pip install --user tensorflow-rocm
    git clone https://github.com/mpeschel10/test-tensorflow-rocm.git

    cd into the folder you just cloned.

    cd test-tensorflow-rocm

    And run the Python file:

    python test_tensorflow.py

    If it runs 5 epochs and prints the time your GPU took to pass the test, congratulations! You now have a working ROCm environment. For reference, here are my results with an RX 6900 XT:

    313/313 - 0s - loss: 0.0657 - accuracy: 0.9808 - 249ms/epoch - 795us/step
    Your run took 9.433697 seconds.
    // The stats below are from the author of the test.
    My GPU takes 14 seconds.
    My CPU takes 74 seconds.
    Your mileage may vary.
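As an extra sanity check (a sketch, assuming tensorflow-rocm was installed with pip as above), you can ask TensorFlow directly which GPUs it sees; a working ROCm setup should list at least one device, while an empty list means TensorFlow fell back to the CPU.

```shell
# List the GPUs TensorFlow can see; an empty list means ROCm is not being used.
python - <<'EOF'
try:
    import tensorflow as tf
    print(tf.config.list_physical_devices('GPU'))
except ImportError:
    print("tensorflow-rocm is not installed in this environment")
EOF
```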
@yjcb22

yjcb22 commented Oct 24, 2024

@augustin-laurent Greetings from Tampa-Florida! You have a typo here:

paru -S rocm-hip-sdk rocm-opencl-sdkk

You have an extra k in the word sdk, which prevents the package manager from finding the package.

@augustin-laurent
Author


You are right, a correction will be made following this response. Does the guide still work for you?

@yjcb22

yjcb22 commented Oct 24, 2024

Yes! I realized the error, corrected it, and it worked perfectly!

@hyprbased

Just curious if you know how to install ROCm 6.3 with HIP support (from source, I guess). Your method looks like it's for the older 6.2.

thanks

@augustin-laurent
Author


I suggest waiting for the AUR maintainer to update from 6.2 to 6.3, as Arch isn't officially supported by AMD.
Updating manually could cause system instability and even break your install; contact the package maintainer on the AUR to see if they have something.
Since I no longer have an AMD GPU, I can't offer specific troubleshooting.

@yaayimanalien

Running the Python script caused an error:

[winter@catchy test-tensorflow-rocm]$ python test_tensorflow.py
2025-02-01 14:43:06.812871: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.

A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.2.2 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last):  File "/home/winter/git/test-tensorflow-rocm/test_tensorflow.py", line 9, in <module>
    import tensorflow as tf
  File "/home/winter/.local/lib/python3.10/site-packages/tensorflow/__init__.py", line 38, in <module>
    from tensorflow.python.tools import module_util as _module_util
  File "/home/winter/.local/lib/python3.10/site-packages/tensorflow/python/__init__.py", line 42, in <module>
    from tensorflow.python.saved_model import saved_model
  File "/home/winter/.local/lib/python3.10/site-packages/tensorflow/python/saved_model/saved_model.py", line 20, in <module>
    from tensorflow.python.saved_model import builder
  File "/home/winter/.local/lib/python3.10/site-packages/tensorflow/python/saved_model/builder.py", line 23, in <module>
    from tensorflow.python.saved_model.builder_impl import _SavedModelBuilder
  File "/home/winter/.local/lib/python3.10/site-packages/tensorflow/python/saved_model/builder_impl.py", line 26, in <module>
    from tensorflow.python.framework import dtypes
  File "/home/winter/.local/lib/python3.10/site-packages/tensorflow/python/framework/dtypes.py", line 35, in <module>
    from tensorflow.tsl.python.lib.core import pywrap_ml_dtypes
AttributeError: _ARRAY_API not found
ImportError: numpy.core._multiarray_umath failed to import
ImportError: numpy.core.umath failed to import
Traceback (most recent call last):
  File "/home/winter/git/test-tensorflow-rocm/test_tensorflow.py", line 9, in <module>
    import tensorflow as tf
  File "/home/winter/.local/lib/python3.10/site-packages/tensorflow/__init__.py", line 38, in <module>
    from tensorflow.python.tools import module_util as _module_util
  File "/home/winter/.local/lib/python3.10/site-packages/tensorflow/python/__init__.py", line 42, in <module>
    from tensorflow.python.saved_model import saved_model
  File "/home/winter/.local/lib/python3.10/site-packages/tensorflow/python/saved_model/saved_model.py", line 20, in <module>
    from tensorflow.python.saved_model import builder
  File "/home/winter/.local/lib/python3.10/site-packages/tensorflow/python/saved_model/builder.py", line 23, in <module>
    from tensorflow.python.saved_model.builder_impl import _SavedModelBuilder
  File "/home/winter/.local/lib/python3.10/site-packages/tensorflow/python/saved_model/builder_impl.py", line 26, in <module>
    from tensorflow.python.framework import dtypes
  File "/home/winter/.local/lib/python3.10/site-packages/tensorflow/python/framework/dtypes.py", line 37, in <module>
    _np_bfloat16 = pywrap_ml_dtypes.bfloat16()
TypeError: Unable to convert function return value to a Python type! The signature was
        () -> handle

@leboural

leboural commented Feb 1, 2025


Exactly the same here for me!

@augustin-laurent
Author

@leboural @yaayimanalien Could you please provide more information, like the TensorFlow version you installed and the ROCm version? If you can, try downgrading to the previous version of TensorFlow.
I will try to assist you as much as I can.

@leboural

leboural commented Feb 1, 2025


ROCm is 6.2.2-1, tensorflow-rocm 2.14.0.600

@augustin-laurent
Author


https://rocm.docs.amd.com/projects/radeon/en/latest/docs/install/native_linux/install-tensorflow.html

Note
The latest version of Python module numpy v2.0 is incompatible with the TensorFlow wheels for this version. Downgrade to an older version is required.
Example: pip3 install numpy==1.26.4

Try this; I linked the source explaining why you should downgrade numpy.
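A quick way to check whether your install is affected (a sketch; the only version numbers used are the ones from this thread, 1.26.4 and 2.x, and only the major component of the version string is inspected):

```shell
# Report whether the installed NumPy is a 1.x release, i.e. compatible with
# the tensorflow-rocm 2.14 wheels; any 2.x release needs the downgrade above.
version=$(python -c 'import numpy; print(numpy.__version__)' 2>/dev/null || echo "none")
major=${version%%.*}
if [ "$version" = "none" ]; then
    echo "numpy is not installed"
elif [ "$major" -ge 2 ]; then
    echo "numpy $version: downgrade with: pip3 install numpy==1.26.4"
else
    echo "numpy $version: compatible"
fi
```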

@augustin-laurent
Author

It works! Thanks a lot

No problem, good afternoon and have a good weekend

@leboural

leboural commented Feb 1, 2025

Now I have this:
"2025-02-01 16:27:59.572852: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2025-02-01 16:27:59.674860: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-02-01 16:28:00.695643: I tensorflow/compiler/xla/stream_executor/rocm/rocm_gpu_executor.cc:756] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-02-01 16:28:00.695685: I tensorflow/compiler/xla/stream_executor/rocm/rocm_gpu_executor.cc:756] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
python: /usr/src/debug/hip-runtime/clr-rocm-6.2.4/hipamd/src/hip_code_object.cpp:1152: hip::FatBinaryInfo** hip::StatCO::addFatBinary(const void*, bool): Assertion `err == hipSuccess' failed."

@augustin-laurent
Author


The first entries are completely normal; they are just logs. You can silence them by setting an environment variable, which I'll let you look up.

The second one is weird. I suggest first trying to reinstall the ROCm packages, then rebooting your PC to make sure everything is up.

@leboural

leboural commented Feb 1, 2025


It's weird, because I can execute rocminfo without trouble!

@nils-affentranger

nils-affentranger commented Feb 9, 2025

I'm having the exact same problem as @leboural

2025-02-09 18:25:16.932652: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-02-09 18:25:16.957448: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-02-09 18:25:17.817533: I tensorflow/compiler/xla/stream_executor/rocm/rocm_gpu_executor.cc:756] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
python: /usr/src/debug/hip-runtime/clr-rocm-6.2.4/hipamd/src/hip_code_object.cpp:1152: hip::FatBinaryInfo** hip::StatCO::addFatBinary(const void*, bool): Assertion `err == hipSuccess' failed.
Aborted (core dumped)

GPU: RX 7900 XTX
rocm-hip-sdk 6.2.2-1
rocm-opencl-sdk 6.2.2-1

@ThomasMartin83

For some reason I got the same error while trying to run OpenSeeFace through ROCm.
Error:
*************** EP Error ***************
EP Error /onnxruntime/onnxruntime/core/session/provider_bridge_ort.cc:1636 onnxruntime::Provider& onnxruntime::ProviderLibrary::Get() [ONNXRuntimeError] : 1 : FAIL : Failed to load library libonnxruntime_providers_migraphx.so with error: libmigraphx_c.so.3: cannot open shared object file: No such file or directory
when using ['MIGraphXExecutionProvider', 'ROCMExecutionProvider', 'CPUExecutionProvider']
Falling back to ['ROCMExecutionProvider', 'CPUExecutionProvider'] and retrying.


python: /usr/src/debug/hip-runtime/clr-rocm-6.2.4/hipamd/src/hip_code_object.cpp:1152: hip::FatBinaryInfo** hip::StatCO::addFatBinary(const void*, bool): Assertion `err == hipSuccess' failed.
Aborted (core dumped)

Found this on arch forum:
https://gitlab.archlinux.org/archlinux/packaging/packages/hip-runtime/-/issues/1

They said it was solved by disabling assertions in the old hip-runtime-amd package.

But I'm relatively new to all this stuff, so I can't make sense of what to do exactly.

@augustin-laurent
Author

augustin-laurent commented Feb 12, 2025

@ThomasMartin83 Sadly, I can't reproduce your error since I'm now using an Nvidia GPU. I suggest downgrading to a version known to work without issues and waiting for a fix from the maintainer of the AUR package.
I will follow this issue closely and update this guide if new steps are needed.
