@guilt
Last active July 25, 2024 05:09
ROCm Setup Steps
#!/bin/sh
# build-pytorch.sh: build the PyTorch image on top of the rocm-examples image.
set -e
mkdir -p pytorch-examples pytorch-cache
docker build pytorch-examples \
--rm \
-t rocm-examples-pytorch \
-f post-rocm-python-ubuntu.Dockerfile
#!/bin/sh
# build-rocm.sh: build the ROCm image, passing in the host's render group GID.
set -e
LOCAL_GID=$(getent group render | cut -d: -f3)
mkdir -p rocm-examples
exec docker build rocm-examples \
--build-arg GID="${LOCAL_GID}" \
--rm \
-t rocm-examples \
-f hip-libraries-rocm-ubuntu.Dockerfile
# hip-libraries-rocm-ubuntu.Dockerfile: Ubuntu-based Docker image with the ROCm HIP libraries
FROM ubuntu:20.04
# Base packages that are required for the installation
RUN export DEBIAN_FRONTEND=noninteractive; \
apt-get update -qq \
&& apt-get install --no-install-recommends -y \
ca-certificates \
git \
locales-all \
make \
python3 \
python3-venv \
python3-dev \
ssh \
sudo \
wget \
pkg-config \
glslang-tools \
libvulkan-dev \
vulkan-validationlayers \
libglfw3-dev \
neovim \
&& rm -rf /var/lib/apt/lists/*
ENV LANG=en_US.utf8
# Install ROCM HIP and libraries using the installer script
RUN export DEBIAN_FRONTEND=noninteractive; \
wget https://repo.radeon.com/amdgpu-install/5.4.3/ubuntu/focal/amdgpu-install_5.4.50403-1_all.deb \
&& apt-get update -qq \
&& apt-get install -y ./amdgpu-install_5.4.50403-1_all.deb \
&& rm ./amdgpu-install_5.4.50403-1_all.deb \
&& amdgpu-install -y --usecase=hiplibsdk --no-dkms \
&& apt-get install -y libnuma-dev \
&& rm -rf /var/lib/apt/lists/*
# Install CMake
RUN wget https://github.com/Kitware/CMake/releases/download/v3.21.7/cmake-3.21.7-linux-x86_64.sh \
&& mkdir /cmake \
&& sh cmake-3.21.7-linux-x86_64.sh --skip-license --prefix=/cmake \
&& rm cmake-3.21.7-linux-x86_64.sh
ENV PATH="/cmake/bin:/opt/rocm/bin:${PATH}"
RUN echo "/opt/rocm/lib" >> /etc/ld.so.conf.d/rocm.conf \
&& ldconfig
# Use render group as an argument from user
ARG GID=109
# Add the render group and a user with sudo permissions for the container
RUN groupadd --system --gid ${GID} render \
&& useradd -Um -G sudo,video,render developer \
&& echo developer ALL=\(root\) NOPASSWD:ALL > /etc/sudoers.d/developer \
&& chmod 0440 /etc/sudoers.d/developer
RUN mkdir /workspaces && chown developer:developer /workspaces
WORKDIR /workspaces
VOLUME /workspaces
USER developer
#!/bin/sh
# launch-pytorch.sh: run the PyTorch image with the GPU devices passed through.
# shellcheck disable=SC2086
set -e
DOCKER_ARGS=${DOCKER_ARGS:--it}
CMD=${1:-bash}
exec docker run \
--rm \
${DOCKER_ARGS} \
--name rocm-examples-pytorch \
-h rocm-examples-pytorch \
--device /dev/kfd --device /dev/dri \
-v "$(pwd -P)/pytorch-examples":/workspaces/pytorch-examples \
-v "$(pwd -P)/pytorch-cache":/home/developer/.cache \
rocm-examples-pytorch "$CMD"
#!/bin/sh
# launch-rocm.sh: run the ROCm image with the GPU devices passed through.
# shellcheck disable=SC2086
set -e
DOCKER_ARGS=${DOCKER_ARGS:--it}
CMD=${1:-bash}
exec docker run \
--rm \
${DOCKER_ARGS} \
--name rocm-examples \
-h rocm-examples \
--device /dev/kfd --device /dev/dri \
-v "$(pwd -P)/rocm-examples":/workspaces/rocm-examples \
rocm-examples "$CMD"
# post-rocm-python-ubuntu.Dockerfile: PyTorch image built on the ROCm image
FROM rocm-examples:latest
# Set Root User
USER root
# Create VEnv Directory
RUN mkdir -p /venv && chown developer:developer /venv
# Set User
USER developer
# Install VEnv and PyTorch
RUN python3.8 -m venv /venv && \
. /venv/bin/activate && \
python3.8 -m pip install --upgrade \
pip setuptools wheel six && \
python3.8 -m pip install \
--index-url https://download.pytorch.org/whl/rocm5.4.2 \
--pre torch torchvision torchaudio pillow && \
python3.8 -m pip cache purge
VOLUME /venv
CMD ["/venv/bin/python3.8"]

ROCm Setup Steps

ROCm Docker Image

  1. Install Docker and ensure you can run docker ps correctly; add yourself to the docker group if necessary.
  2. Run build-rocm.sh to build a ROCm Docker image for your Linux system. It picks up the GID of the render group configured in your Linux distribution. Ensure that /dev/kfd and /dev/dri are writable by members of the render group, and add yourself to the render group if necessary.
  3. Run launch-rocm.sh if you wish to only use ROCm with the docker image you built.
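
The render group lookup in step 2 can be sketched on its own. This is a minimal sketch, not part of the gist: the helper names gid_from_entry and render_gid are hypothetical, and the fallback value simply mirrors the ARG GID=109 default in the Dockerfile. It may be useful because the plain getent pipeline in build-rocm.sh yields an empty string when the host has no render group.

```shell
#!/bin/sh
# Sketch: resolve the GID that build-rocm.sh passes as --build-arg GID.
# FALLBACK_GID mirrors the Dockerfile default (ARG GID=109).
# gid_from_entry and render_gid are illustrative names, not part of the gist.
FALLBACK_GID=109

# Extract the numeric GID (third field) from one /etc/group-style entry.
gid_from_entry() {
    printf '%s\n' "$1" | cut -d: -f3
}

# Print the host's render GID, or the fallback when the group is missing.
render_gid() {
    entry=$(getent group render 2>/dev/null || true)
    if [ -n "$entry" ]; then
        gid_from_entry "$entry"
    else
        echo "warning: no render group found, using ${FALLBACK_GID}" >&2
        echo "$FALLBACK_GID"
    fi
}

echo "render GID: $(render_gid)"
```

The printed value is what would be handed to docker build as --build-arg GID=...; whether the fallback is appropriate depends on the host.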

PyTorch Docker Image

  1. Run build-pytorch.sh if you wish to build a PyTorch image for your Linux System. It is built as a separate docker image, on top of the ROCm docker image you built earlier.
  2. Run launch-pytorch.sh if you wish to run PyTorch with the second image you just built.
  3. Run source /venv/bin/activate within the container and you should be able to run all the cool PyTorch things you need.
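
After step 3, a quick check that PyTorch actually sees the GPU can save time. The following is a sketch under assumptions: check_torch_gpu is a hypothetical helper, and the interpreter path defaults to the venv created by the PyTorch Dockerfile. ROCm builds of PyTorch report the GPU through the torch.cuda API, so torch.cuda.is_available() is the call to look at.

```shell
#!/bin/sh
# Sketch: ask a Python interpreter whether PyTorch can see a GPU.
# PYTHON defaults to the venv interpreter created by the PyTorch Dockerfile.
# check_torch_gpu is an illustrative name, not part of the gist.
PYTHON=${PYTHON:-/venv/bin/python3.8}

# Returns 127 if the interpreter is missing; otherwise returns Python's status.
check_torch_gpu() {
    py=$1
    if ! command -v "$py" >/dev/null 2>&1; then
        echo "interpreter not found: $py" >&2
        return 127
    fi
    "$py" -c '
import torch
# ROCm builds of PyTorch expose the GPU through the torch.cuda API.
print("torch", torch.__version__, "hip", torch.version.hip)
print("GPU available:", torch.cuda.is_available())
'
}

check_torch_gpu "$PYTHON" || echo "check did not complete (status $?)" >&2
```

If the last line printed is "GPU available: False", the device permissions on /dev/kfd and /dev/dri are a reasonable first thing to inspect.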
@jessecambon

From the groups $USER command I confirmed I'm part of the render and video groups. The sticking point seems to be the access/permissions to /dev/kfd:

$ sudo chmod 777 /dev/kfd

$ /dev/kfd
bash: /dev/kfd: Permission denied

I did try restarting docker, but this didn't seem to help. How do I check if the docker process user can access the device?
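
One way to approach the access question above, as a sketch: typing /dev/kfd at the prompt asks the shell to execute the device node, which fails with "Permission denied" regardless of its mode, so it does not say much about read/write access. Testing readability and writability directly is more telling. The helper name can_rw and the render node path /dev/dri/renderD128 are assumptions; the node name may differ per machine.

```shell
#!/bin/sh
# Sketch: report whether the current user can open a device node read/write.
# can_rw is an illustrative helper, not part of the gist.
can_rw() {
    [ -r "$1" ] && [ -w "$1" ]
}

# /dev/dri/renderD128 is a typical render node name; adjust if yours differs.
for dev in /dev/kfd /dev/dri/renderD128; do
    if can_rw "$dev"; then
        echo "$dev: read/write OK for $(id -un)"
    else
        echo "$dev: not accessible for $(id -un) (groups: $(id -Gn))"
    fi
done
```

Saved under rocm-examples (for example as a hypothetical check-access.sh), the same script can be run inside the container, since that directory is mounted at /workspaces/rocm-examples. That shows what the container user sees rather than the host user.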

@guilt
Author

guilt commented Jul 24, 2024

Please see the updated instructions on the System76 documentation - they seem to have done a more up-to-date job of packaging it. With ROCm nearing 6.2.0 now, I think it's high time newer versions are tested and documented too.

They seem to have a working apt install rocm step which is in many ways much simpler than this approach. The rest of the PyTorch steps will be similar.

Did your setup fail with a reboot and a sudo? Would be interesting to debug this; I currently do not even have this setup to test. Feel free to email me and set up a screen share session if you want to go down this path. Best thing I can offer, may not be needed for your case though. 💁

@jessecambon

Thanks, @guilt. You're referring to these instructions, right? I ran the commands on there, but that didn't resolve the issue. I also tried editing the udev rules per this comment (edited /etc/udev/rules.d/70-amdgpu.rules and then ran sudo update-initramfs -c -k $(uname -r) ). This involved just adding TAG+="uaccess" to the end so the file now looks like this:

KERNEL=="kfd", GROUP=="video", MODE="0660",TAG+="uaccess"

However, this didn't fix it either. I normally use Docker Desktop and was having some trouble getting the docker daemon started as the root user, but I'll try running with sudo once I get that working.

@guilt
Author

guilt commented Jul 24, 2024

If you look at the group of that file per the ls, it is render and not video. So check for the correct group. If you see something, document it for everyone else. Best.
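
The group check suggested here can be sketched as follows. This is an illustration under assumptions, not the author's method: in_group_list is a hypothetical helper, and stat -c assumes GNU coreutils, as found on Ubuntu and Pop!_OS.

```shell
#!/bin/sh
# Sketch: compare a device's owning group with the current user's groups.
# in_group_list is an illustrative helper; stat -c assumes GNU coreutils.

# Succeeds if group $1 appears in the space-separated list $2.
in_group_list() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

dev=${1:-/dev/kfd}
if [ -e "$dev" ]; then
    owner_group=$(stat -c '%G' "$dev")
    if in_group_list "$owner_group" "$(id -Gn)"; then
        echo "$dev belongs to '$owner_group' and $(id -un) is a member"
    else
        echo "$dev belongs to '$owner_group' but $(id -un) is not a member"
    fi
else
    echo "$dev does not exist on this machine"
fi
```

Note that id -Gn reflects the groups of the current login session, so a group added with usermod only shows up after logging out and back in.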

@jessecambon

I tried a few different user:group configurations (chown) for /dev/kfd. I did change the group back to video, but that didn't work either.

I got sudo docker working, but I seem to have broken my rocm setup in the process (rocminfo now says "ROCk module is NOT loaded, possibly no GPU devices") . I'll try to do some more debugging, but a screen share might be helpful. I'll send you an email.

@jessecambon

I ended up needing to use a live USB to repair my Pop OS install (I believe one of the initramfs commands I ran messed something up, because I was unable to log in after a reboot). After that I installed ROCm again via these instructions. Then, instead of using Docker Desktop, I installed docker.io via these commands:

sudo apt install docker.io
sudo usermod -aG docker $USER

Docker now only works via sudo, but I was able to get the rocm/pytorch image to run successfully by running these commands with sudo:

sudo docker pull rocm/pytorch:latest
sudo docker run -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 8G rocm/pytorch:latest

For reference this is what /dev/kfd looks like:

$ ls -l /dev/kfd
crw-rw---- 1 root render 235, 0 Jul 24 10:19 /dev/kfd

Haven't tried stable diffusion or running anything in pytorch yet, but at least I'm past the previous error message. Thanks again for your help.
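
The docker run command above passes --group-add video, while the ls -l output shows /dev/kfd owned by the render group. A possible refinement, sketched under assumptions, is to derive the --group-add values from the numeric GIDs that own the device nodes, so the result does not depend on group names inside the image. The helper group_add_flags and the node names renderD128 and card0 are hypothetical; stat -c assumes GNU coreutils.

```shell
#!/bin/sh
# Sketch: build --group-add flags from the numeric GIDs that own the devices.
# group_add_flags is an illustrative helper; stat -c assumes GNU coreutils.

# Print one "--group-add GID" pair per distinct owning GID of existing paths.
group_add_flags() {
    seen=" "
    flags=""
    for path in "$@"; do
        [ -e "$path" ] || continue
        gid=$(stat -c '%g' "$path")
        case "$seen" in
            *" $gid "*) ;;
            *) seen="$seen$gid "; flags="$flags --group-add $gid" ;;
        esac
    done
    echo "${flags# }"
}

# Print the command rather than running it, so the result can be reviewed.
# Word splitting of the helper's output is intentional here.
# shellcheck disable=SC2046
echo docker run -it --device=/dev/kfd --device=/dev/dri \
    $(group_add_flags /dev/kfd /dev/dri/renderD128 /dev/dri/card0) \
    rocm/pytorch:latest
```

On a machine without these device nodes the helper prints nothing, and the echoed command simply has no --group-add flags.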

@guilt
Author

guilt commented Jul 24, 2024

You're welcome. Please update that System76 ticket as well with what happened to you. Have a wonderful day.

@jessecambon

I made a PR to add a note to the System76 docs here: system76/docs#1242. I can link it in the prior ticket.
