@briansp2020
Created November 19, 2023 04:58
Build TensorFlow r2.14
#!/bin/bash
cd
export HSA_OVERRIDE_GFX_VERSION=11.0.0
export PYTORCH_ROCM_ARCH="gfx1100"
export HIP_VISIBLE_DEVICES=0
export ROCM_PATH=/opt/rocm
echo "export HSA_OVERRIDE_GFX_VERSION=11.0.0" | tee --append .bashrc
echo "export PYTORCH_ROCM_ARCH=\"gfx1100\"" | tee --append .bashrc
echo "export HIP_VISIBLE_DEVICES=0" | tee --append .bashrc
echo "export ROCM_PATH=/opt/rocm" | tee --append .bashrc
python3 -m venv tf
source ~/tf/bin/activate
# install bazel 6.1
apt install apt-transport-https curl gnupg -y
curl -fsSL https://bazel.build/bazel-release.pub.gpg | gpg --dearmor >bazel-archive-keyring.gpg
mv bazel-archive-keyring.gpg /usr/share/keyrings
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/bazel-archive-keyring.gpg] https://storage.googleapis.com/bazel-apt stable jdk1.8" | tee /etc/apt/sources.list.d/bazel.list
apt update && apt install -y bazel-6.1.0
ln -s /usr/bin/bazel-6.1.0 /usr/bin/bazel
apt install rocm-dev patchelf libboost-filesystem-dev -y
# Install updated rocPRIM
#cd ~
#git clone https://github.com/briansp2020/rocPRIM
#cd rocPRIM; mkdir build; cd build
#CXX=hipcc cmake -DBUILD_BENCHMARK=ON ../.
#make -j $(nproc)
#make install
# Build TensorFlow 2.14
cd ~
git clone https://github.com/ROCmSoftwarePlatform/tensorflow-upstream
cd tensorflow-upstream
git checkout r2.14-rocm-enhanced
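#sed -i 's/5.7.0/5.7.1/g' build_rocm_python3
# Patch device_description.h: replace '"gfx1030" ' with '"gfx1030",' (adds a comma after the gfx1030 entry)
sed -i 's/"gfx1030" /"gfx1030",/g' tensorflow/compiler/xla/stream_executor/device_description.h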
./build_rocm_python3
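
Note on the `.bashrc` lines near the top of the script: each run appends the same four exports again. A minimal sketch of an idempotent alternative is below. `append_once` is a hypothetical helper name, not part of the original gist, and it only assumes a `grep` that supports `-q`, `-x` and `-F`.

```shell
# append_once LINE FILE: append LINE to FILE unless an identical line is already there.
# (append_once is a hypothetical helper, not part of the original script.)
append_once() {
    touch "$2"
    # -q: quiet, -x: match the whole line, -F: fixed string (no regex)
    grep -qxF -- "$1" "$2" || printf '%s\n' "$1" >> "$2"
}

# Example usage, mirroring the exports in the script above:
# append_once 'export HSA_OVERRIDE_GFX_VERSION=11.0.0' ~/.bashrc
# append_once 'export PYTORCH_ROCM_ARCH="gfx1100"' ~/.bashrc
# append_once 'export HIP_VISIBLE_DEVICES=0' ~/.bashrc
# append_once 'export ROCM_PATH=/opt/rocm' ~/.bashrc
```

With this, re-running the script after a failed build leaves `.bashrc` unchanged instead of growing it.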
@jcnelson30

Hey Brian, I appreciate all of the posts of yours I've seen about enabling TensorFlow with the ROCm repository.

I have confirmed PyTorch does run properly in a docker container and is executing on the GPU.

I'm now trying to adapt this r2.14-rocm-enhanced branch to work in a TensorFlow Docker container. I heavily prefer Docker since I can rip down the containers after I screw them up.

What I have so far is:

FROM rocm/tensorflow:rocm5.7-tf2.13-dev

# Below branch already has the gfx1100 changes
RUN sudo git clone --depth 1 -b r2.14-rocm-enhanced https://github.com/ROCmSoftwarePlatform/tensorflow-upstream.git
WORKDIR tensorflow-upstream
RUN ./build_rocm_python3

I'm using one of those containers since all of the dependencies needed to build it, i.e. ROCm 5.7, are included (or allegedly are). When I build the Dockerfile, I run into:

Loading: 0 packages loaded
INFO: Repository local_config_rocm instantiated at:
  /root/tensorflow-upstream/WORKSPACE:80:14: in <toplevel>
  /root/tensorflow-upstream/tensorflow/workspace2.bzl:1016:19: in workspace
  /root/tensorflow-upstream/tensorflow/workspace2.bzl:111:19: in _tf_toolchains
Repository rule rocm_configure defined at:
  /root/tensorflow-upstream/third_party/gpus/rocm_configure.bzl:858:33: in <toplevel>
ERROR: An error occurred during the fetch of repository 'local_config_rocm':
   Traceback (most recent call last):
	File "/root/tensorflow-upstream/third_party/gpus/rocm_configure.bzl", line 839, column 38, in _rocm_autoconf_impl
		_create_local_rocm_repository(repository_ctx)
	File "/root/tensorflow-upstream/third_party/gpus/rocm_configure.bzl", line 574, column 35, in _create_local_rocm_repository
		rocm_config = _get_rocm_config(repository_ctx, bash_bin, find_rocm_config_script)
	File "/root/tensorflow-upstream/third_party/gpus/rocm_configure.bzl", line 425, column 41, in _get_rocm_config
		amdgpu_targets = _amdgpu_targets(repository_ctx, rocm_toolkit_path, bash_bin),
	File "/root/tensorflow-upstream/third_party/gpus/rocm_configure.bzl", line 236, column 32, in _amdgpu_targets
		auto_configure_fail("Invalid AMDGPU target: %s" % amdgpu_target)
	File "/root/tensorflow-upstream/third_party/gpus/rocm_configure.bzl", line 154, column 9, in auto_configure_fail
		fail("\n%sROCm Configuration Error:%s %s\n" % (red, no_color, msg))
Error in fail: 
ROCm Configuration Error: Invalid AMDGPU target: 
ERROR: /root/tensorflow-upstream/WORKSPACE:80:14: fetching rocm_configure rule //external:local_config_rocm: Traceback (most recent call last):
	File "/root/tensorflow-upstream/third_party/gpus/rocm_configure.bzl", line 839, column 38, in _rocm_autoconf_impl
		_create_local_rocm_repository(repository_ctx)
	File "/root/tensorflow-upstream/third_party/gpus/rocm_configure.bzl", line 574, column 35, in _create_local_rocm_repository
		rocm_config = _get_rocm_config(repository_ctx, bash_bin, find_rocm_config_script)
	File "/root/tensorflow-upstream/third_party/gpus/rocm_configure.bzl", line 425, column 41, in _get_rocm_config
		amdgpu_targets = _amdgpu_targets(repository_ctx, rocm_toolkit_path, bash_bin),
	File "/root/tensorflow-upstream/third_party/gpus/rocm_configure.bzl", line 236, column 32, in _amdgpu_targets
		auto_configure_fail("Invalid AMDGPU target: %s" % amdgpu_target)
	File "/root/tensorflow-upstream/third_party/gpus/rocm_configure.bzl", line 154, column 9, in auto_configure_fail
		fail("\n%sROCm Configuration Error:%s %s\n" % (red, no_color, msg))
Error in fail: 

However, when I run the following:

FROM rocm/tensorflow:rocm5.7-tf2.13-dev
RUN sudo git clone --depth 1 -b r2.14-rocm-enhanced https://github.com/ROCmSoftwarePlatform/tensorflow-upstream.git
WORKDIR tensorflow-upstream

Then run

sudo docker run -it --network=host --device=/dev/kfd --device=/dev/dri --ipc=host --shm-size 16G --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined my-test-container

Then run ./build_rocm_python3 from tensorflow-upstream inside the container, that seems to build okay. So I think I need some way to pass the GPU to the build script during the Docker build?

I was wondering if you might have ideas on how I might do that, since you're much better versed in this than I am. I have a lot of legacy TensorFlow projects, so I need to be able to use both PyTorch and TensorFlow going forward.

Thanks!
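
The empty value after "Invalid AMDGPU target:" in the log suggests that no GPU was detected during `docker build`, where `/dev/kfd` and `/dev/dri` are not exposed, whereas the interactive `docker run` passes them in. A rough sketch of one possible workaround is to name the target explicitly instead of relying on detection. It assumes that the configure step honours a `TF_ROCM_AMDGPU_TARGETS` environment variable (worth confirming in `third_party/gpus/rocm_configure.bzl` on this branch) and that `rocm_agent_enumerator` prints one `gfx` target per line. `resolve_amdgpu_targets` is a hypothetical helper name.

```shell
# resolve_amdgpu_targets: print the comma-separated AMDGPU target list to build for.
# Assumptions (not confirmed by the gist): TF_ROCM_AMDGPU_TARGETS is read by the
# ROCm configure step, and $ROCM_PATH/bin/rocm_agent_enumerator lists detected targets.
resolve_amdgpu_targets() {
    # An explicit setting wins, so the build also works where no GPU is visible.
    if [ -n "${TF_ROCM_AMDGPU_TARGETS:-}" ]; then
        printf '%s\n' "$TF_ROCM_AMDGPU_TARGETS"
        return 0
    fi
    enumerator="${ROCM_PATH:-/opt/rocm}/bin/rocm_agent_enumerator"
    targets=""
    if [ -x "$enumerator" ]; then
        # Keep gfx entries, drop the gfx000 placeholder and duplicates, join with commas.
        targets="$("$enumerator" 2>/dev/null | grep '^gfx' | grep -v '^gfx000$' | sort -u | paste -s -d, - || true)"
    fi
    if [ -z "$targets" ]; then
        echo "No AMDGPU target detected; set TF_ROCM_AMDGPU_TARGETS (e.g. gfx1100)" >&2
        return 1
    fi
    printf '%s\n' "$targets"
}

# Example usage before the build step:
# export TF_ROCM_AMDGPU_TARGETS="$(resolve_amdgpu_targets)" && ./build_rocm_python3
```

In a Dockerfile, the equivalent would be a line such as `ENV TF_ROCM_AMDGPU_TARGETS=gfx1100` before the `RUN ./build_rocm_python3` step, under the same assumption about the variable.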

@briansp2020
Author

What GPUs do you have in your system? I noticed that TF seems to compile code differently depending on the hardware it detects. Do you have multiple GPUs in your system? Do you have an AMD iGPU? I personally have not seen the error, so I'm not sure I can help you any further.

@dmikushin

Hi, thanks for the script. For gfx1100 there will be no KDB for MIOpen. Without the KDB, the best choices of parameters won't be known. In your experience, how much of a performance impact does that have?
