
How to compile TensorFlow 2.0 from source in an HPC environment that uses EasyBuild for CentOS7

Some background about why another Gist on compiling TensorFlow without root

If you're like me, you don't have root access to an HPC system, and your system administrators use EasyBuild combined with Lmod, you are in for a bad time. Our CentOS 7 install did not ship out of the box capable of compiling TensorFlow 2.0 from source without problems. I scoured several closed and recently opened issues to write a script that produces a fully functional .whl for TensorFlow 2.0. Note that our cluster has only two different sets of GPUs: one set has compute capability 7.0 (V100) and the other has compute capability 3.5 (K20m). Several issues came up:

  1. Bazel can't handle symlinks for gcc (a quick check is sketched after this list), leading to errors like
ERROR: /home/user/tests/tensorflowTest/fromsource/tensorflow/tensorflow/lite/c/BUILD:6:1: undeclared inclusion(s) in rule '//tensorflow/lite/c:c_api_internal':
this rule is missing dependency declarations for the following files included by 'tensorflow/lite/c/c_api_internal.c':
  '/usr/ebuild/software/GCCcore/9.2.0/lib/gcc/x86_64-pc-linux-gnu/9.2.0/include/stdbool.h'
  '/usr/ebuild/software/GCCcore/9.2.0/lib/gcc/x86_64-pc-linux-gnu/9.2.0/include/stddef.h'
  '/usr/ebuild/software/GCCcore/9.2.0/lib/gcc/x86_64-pc-linux-gnu/9.2.0/include/stdint.h'
Target //tensorflow/tools/pip_package:build_pip_package failed to build
INFO: Elapsed time: 13.962s, Critical Path: 6.44s
INFO: 3 processes: 3 local.
FAILED: Build did NOT complete successfully
  2. The TensorFlow Bazel rule downloads a version of SWIG which is statically compiled against /lib64/, and our system libstdc++.so.1 was far too old for SWIG.
  3. Everything has to be done in a Miniconda environment, as I don't have root access.
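
To see whether the symlink problem affects you too, compare where gcc appears to live with where it actually resolves to (a quick diagnostic, not part of the build script):

# If these two commands print different directories, Bazel's
# include-path detection can trip over the symlinked toolchain
which gcc
readlink -f "$(which gcc)"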

Prerequisites beyond the script

There is a certain level of expectation I have for this to work for you.

  1. I expect that you have a module system, Lmod, and perhaps even EasyBuild on your HPC system. You should be able to do things like
module load GCC

and so on.
  2. You must have gcc 8.2.0 installed, either at the system level or as a module.
  3. Obviously, you have to have git.
  4. You'll need wget if you don't already have the following (a quick sanity check is sketched after this list):

  1. CUDA toolkit,
  2. miniconda,
  3. NVIDIA User Drivers
  4. Bazel (0.29.1 used in the script)
  5. NCCL (optional, of course)
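
A quick sanity check for these prerequisites might look like this (a minimal sketch; adjust the module name to whatever your site provides):

# Each check is independent; missing pieces are reported, not fatal
module load GCC/8.2.0-2.31.1 2>/dev/null || echo "GCC module missing"
gcc --version | head -n 1            # should report 8.2.0
command -v git  >/dev/null || echo "git not found"
command -v wget >/dev/null || echo "wget not found"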

Some helpful things to install with EasyBuild

If your system administrators are willing to install some things on request, you should of course request some packages and remove them from the install script. Here are my suggested packages:

  1. GCC https://github.com/easybuilders/easybuild-easyconfigs/blob/develop/easybuild/easyconfigs/g/GCC/GCC-8.2.0-2.31.1.eb
  2. CUDA 10.1 https://github.com/easybuilders/easybuild-easyconfigs/blob/develop/easybuild/easyconfigs/c/CUDA/CUDA-10.1.243.eb
  3. cuDNN 7 compatible with above CUDA https://github.com/easybuilders/easybuild-easyconfigs/blob/develop/easybuild/easyconfigs/c/cuDNN/cuDNN-7.6.2.24-CUDA-10.1.243.eb

Walking through the script

Feel free to remove any module loads or downloading parts if you already have all of that taken care of and the correct variables set up in your environment.

  1. Set up the compile environment by loading git and GCC.
# Tensorflow only compiles with GCC == 8
module load git
module load GCC/8.2.0-2.31.1
# Compilers required
export CC=gcc
export CXX=g++
export FC=gfortran
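
A quick way to confirm the right compiler is active (not part of the original script):

# Should report gcc (GCC) 8.2.0; any other major version will
# fail partway through the Bazel build
$CC --version | head -n 1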
  2. On our system the /tmp/ directory is very small, so I use an NFS directory for tmp, though Bazel will complain A LOT.
# Change these as needed. Don't do this in /tmp/ or you'll be sorry!
export TMPDIR=/home/user/NFSdir/.tmp/
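
The build can consume tens of gigabytes of scratch space, so it's worth checking the chosen directory up front:

mkdir -p $TMPDIR
df -h $TMPDIR    # make sure plenty of space is free here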
  3. Record the current working directory; this is important for moving, downloading, and whatnot.
CWD=$(pwd -P)
  4. Set up a fodder environment for TensorFlow.
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
chmod 0770 ./Miniconda3-latest-Linux-x86_64.sh
./Miniconda3-latest-Linux-x86_64.sh -b -p$CWD/miniconda3
export CONDA_PREFIX=$CWD/miniconda3/
export PYTHONPATH=$CONDA_PREFIX/bin
export PATH=$CONDA_PREFIX/bin:$PATH
$PYTHONPATH/pip install -U pip six numpy wheel setuptools mock 'future>=0.17.1'
$PYTHONPATH/pip install -U keras_applications --no-deps
$PYTHONPATH/pip install -U keras_preprocessing --no-deps
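
Before going further, it doesn't hurt to confirm the Miniconda environment is the one being picked up (a quick check using the variables set above; both should point inside $CWD/miniconda3):

$PYTHONPATH/python --version
$PYTHONPATH/pip --version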
  5. cuDNN is behind a login wall on the NVIDIA website unless you use EasyBuild, so I assume you've already downloaded it into $CWD. Additionally, the NVIDIA user drivers are missing a soft link for some reason. This link (and possibly many more; I'll update them as I come across them) is required for the configure step to pick up the correct options.
# Setup cuDNN
# Download it prior to running this
tar -xf cudnn-10.1-linux-x64-v7.6.4.38.tgz
mv cuda cudnn-10.1-linux-x64-v7.6.4.38
export TMP=$TMPDIR
CUDNN=$CWD/cudnn-10.1-linux-x64-v7.6.4.38/
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:$CUDNN/lib64/"

# Setup NVIDIA Drivers 
wget  http://us.download.nvidia.com/tesla/418.87/NVIDIA-Linux-x86_64-418.87.01.run
chmod 0770 NVIDIA-Linux-x86_64-418.87.01.run
./NVIDIA-Linux-x86_64-418.87.01.run -x
DRIVERDIR=$CWD/NVIDIA-Linux-x86_64-418.87.01/
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:$DRIVERDIR" 
# Missing a soft link for some reason
ln -s $DRIVERDIR/libcuda.so.418.87.01 $DRIVERDIR/libcuda.so.1
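
At this point you can verify that the driver libraries and the new soft link are in place (a quick check using the same variables as above):

# libcuda.so.1 should now exist alongside the versioned driver library
ls -l $DRIVERDIR/libcuda.so*
ls $CUDNN/lib64/libcudnn.so*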
  6. Lines 45 through 66 of the script download the toolkit and set up the environment for CUDA. The lines up to 75 download TensorFlow from source.
  7. This is a fix proposed by a maintainer of EasyBuild to make sure that the symlinks associated with compiler directories are used. The other patch just below it allows the use of a local copy of SWIG rather than the one Bazel downloads. I wrote a one-line Python script that finds the right spot and drops in the relevant patch; changing a single character will, of course, break the patching process.
# Implement fix from EasyBuild GitHub for CUDA
# Seen here:
# https://github.com/tensorflow/tensorflow/issues/33975
# A single line python script
fixFile="from sys import argv; filename = argv[1]; get = argv[2]; rep = argv[3]; h = open(filename, 'r'); cont = h.read(); h.close(); cont = cont.split('\n'); ind = cont.index(get); cont[ind] = rep; h = open(filename, 'w'); h.write('\n'.join(cont)); h.close();print('Sub Completed')"
REPLACE="    cuda_defines = {}"
SUBIN="# fix include path by also including paths where resolved symlink is replaced by
# original path
    cc_topdir = str(repository_ctx.path(repository_ctx.path(cc_fullpath).dirname).dirname)
    cc_topdir_resolved = str(repository_ctx.path(str(cc_topdir)).realpath)
    if cc_topdir_resolved != cc_topdir:
        original_host_compiler_includes = [p.replace(cc_topdir_resolved, cc_topdir) for p in host_compiler_includes]
        host_compiler_includes = host_compiler_includes + original_host_compiler_includes

    cuda_defines = {}"
EDITME=./tensorflow/third_party/gpus/cuda_configure.bzl
# A one-line Python command to update the file with the EasyBuild patch
$PYTHONPATH/python -c "$fixFile" $EDITME "$REPLACE" "$SUBIN"
# Modify tensorflow to use the local version of SWIG
# Fix found here:
# https://github.com/bazelbuild/bazel/issues/4053
EDITME=./tensorflow/tensorflow/tensorflow.bzl
REPLACE="    ctx.actions.run("
SUBIN="    ctx.actions.run(
        use_default_shell_env=True,"
$PYTHONPATH/python -c "$fixFile" $EDITME "$REPLACE" "$SUBIN"
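
If you want to confirm that both patches landed, a grep on each edited file should show the inserted lines:

grep -n "cc_topdir_resolved" ./tensorflow/third_party/gpus/cuda_configure.bzl
grep -n "use_default_shell_env" ./tensorflow/tensorflow/tensorflow.bzl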
  8. Lines 111 through 122 set up SWIG.
  9. Lines 127 through 147 establish environment variables before configure. That way no user interaction is required, and bash variables can be used. The CC_OPT_FLAGS are for the three architectures on our cluster. Similarly, TF_CUDA_COMPUTE_CAPABILITIES is for our two sets of GPU nodes.
TF_CUDA_COMPUTE_CAPABILITIES=3.5,7.0 \
CC_OPT_FLAGS="--copt=-mavx --copt=-mavx2 --copt=-mfma --copt=-mfpmath=both --copt=-march=skylake-avx512,IVYBRIDGE,SSE4.2" \
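
If you don't know the compute capabilities on your own cluster, nvidia-smi can at least list the GPU models, which you can then look up on https://developer.nvidia.com/cuda-gpus (a quick diagnostic; it assumes you can run nvidia-smi on a GPU node):

# Lists the GPU models on this node; map each model to its
# compute capability on NVIDIA's CUDA GPUs page
nvidia-smi --query-gpu=name --format=csv,noheader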
  10. Configure TensorFlow, build it with Bazel, and extract the .whl file.
./configure 

# Hyper paranoid about getting things where they need to be
export TEST_TMPDIR=$TMPDIR

# Just an amalgamation from various sources. 
bazel --output_user_root=$TMPDIR/bazel --output_base=$TMPDIR/ \
    build --verbose_failures \
    --config=cuda \
    --spawn_strategy=standalone --genrule_strategy=standalone \
    //tensorflow/tools/pip_package:build_pip_package
    
./bazel-bin/tensorflow/tools/pip_package/build_pip_package $TMP/tensorflow_pkg 

mv $TMP/tensorflow_pkg/tensorflow-*.whl $CWD/

bazel shutdown

The line

--spawn_strategy=standalone --genrule_strategy=standalone \

allows for the use of the local SWIG that was downloaded and compiled earlier.
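
You can confirm the locally built SWIG is the one on PATH before the build (assuming the $CWD/swig prefix used in the script):

which swig      # should point into $CWD/swig/bin/
swig -version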

Closing Remarks

I hope this gives you enough to be able to do this on your cluster without having to nag your system administrators, or, even better, to allow a fully automated build process for whatever you may need.
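
Once the build finishes, installing and smoke-testing the wheel might look like this (a minimal sketch; the exact .whl filename depends on your build):

# Install the freshly built wheel into the same Miniconda environment
$PYTHONPATH/pip install $CWD/tensorflow-*.whl
# TF 2.0 keeps this under experimental; it should list your GPUs
$PYTHONPATH/python -c "import tensorflow as tf; print(tf.config.experimental.list_physical_devices('GPU'))"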

#!/bin/sh
# Tensorflow only compiles with GCC == 8
module load git
module load GCC/8.2.0-2.31.1
# Compilers required
export CC=gcc
export CXX=g++
export FC=gfortran
#export LD_SHARED="icc -shared"
# Change these as needed. Don't do this in /tmp/ or you'll be sorry!
export TMPDIR=/home/user/NFSdir/.tmp/
CWD=$(pwd -P)
# Get a miniconda distribution
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
chmod 0770 ./Miniconda3-latest-Linux-x86_64.sh
./Miniconda3-latest-Linux-x86_64.sh -b -p$CWD/miniconda3
export CONDA_PREFIX=$CWD/miniconda3/
export PYTHONPATH=$CONDA_PREFIX/bin
export PATH=$CONDA_PREFIX/bin:$PATH
$PYTHONPATH/pip install -U pip six numpy wheel setuptools mock 'future>=0.17.1'
$PYTHONPATH/pip install -U keras_applications --no-deps
$PYTHONPATH/pip install -U keras_preprocessing --no-deps
# Setup cuDNN
# Download it prior to running this
tar -xf cudnn-10.1-linux-x64-v7.6.4.38.tgz
mv cuda cudnn-10.1-linux-x64-v7.6.4.38
export TMP=$TMPDIR
CUDNN=$CWD/cudnn-10.1-linux-x64-v7.6.4.38/
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:$CUDNN/lib64/"
# Setup NVIDIA Drivers
wget http://us.download.nvidia.com/tesla/418.87/NVIDIA-Linux-x86_64-418.87.01.run
chmod 0770 NVIDIA-Linux-x86_64-418.87.01.run
./NVIDIA-Linux-x86_64-418.87.01.run -x
DRIVERDIR=$CWD/NVIDIA-Linux-x86_64-418.87.01/
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:$DRIVERDIR"
# Missing a soft link for some reason
ln -s $DRIVERDIR/libcuda.so.418.87.01 $DRIVERDIR/libcuda.so.1
# Setup CUDA 10.1
wget http://developer.download.nvidia.com/compute/cuda/10.1/Prod/local_installers/cuda_10.1.243_418.87.00_linux.run
chmod 0770 cuda_10.1.243_418.87.00_linux.run
mkdir extractHere
./cuda_10.1.243_418.87.00_linux.run \
--tmpdir=$TMPDIR \
--extract=$CWD/extractHere/
mv extractHere/cuda-toolkit ./cuda-toolkit-10/
rm -rf extractHere
TOOLKITDIR=$CWD/cuda-toolkit-10/
export PATH="$PATH:$TOOLKITDIR/bin"
export CUDADIR=$TOOLKITDIR
export CUDA_HOME="$TOOLKITDIR" && \
export CUDA_PATH="$TOOLKITDIR" && \
export CUDA_ROOT="$TOOLKITDIR" && \
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:$TOOLKITDIR/lib64" && \
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:$TOOLKITDIR/extras/CUPTI/lib64" && \
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:$TOOLKITDIR/nvvm/lib64" && \
export LIBRARY_PATH="$LIBRARY_PATH:$TOOLKITDIR/lib64" && \
export LIBRARY_PATH="$LIBRARY_PATH:$TOOLKITDIR/lib64/stubs" && \
export PATH="$PATH:$TOOLKITDIR/bin" && \
export PATH="$PATH:$TOOLKITDIR/nvvm/bin"
# NCCL
tar -xf nccl_2.4.8-1+cuda10.1_x86_64.txz
NCCL=nccl_2.4.8-1+cuda10.1_x86_64/
# Download TensorFlow
git clone --recurse-submodules https://github.com/tensorflow/tensorflow
# Implement fix from EasyBuild GitHub for CUDA
# Seen here:
# https://github.com/tensorflow/tensorflow/issues/33975
# A single line python script
fixFile="from sys import argv; filename = argv[1]; get = argv[2]; rep = argv[3]; h = open(filename, 'r'); cont = h.read(); h.close(); cont = cont.split('\n'); ind = cont.index(get); cont[ind] = rep; h = open(filename, 'w'); h.write('\n'.join(cont)); h.close();print('Sub Completed')"
REPLACE=" cuda_defines = {}"
SUBIN="# fix include path by also including paths where resolved symlink is replaced by
# original path
cc_topdir = str(repository_ctx.path(repository_ctx.path(cc_fullpath).dirname).dirname)
cc_topdir_resolved = str(repository_ctx.path(str(cc_topdir)).realpath)
if cc_topdir_resolved != cc_topdir:
original_host_compiler_includes = [p.replace(cc_topdir_resolved, cc_topdir) for p in host_compiler_includes]
host_compiler_includes = host_compiler_includes + original_host_compiler_includes
cuda_defines = {}"
EDITME=./tensorflow/third_party/gpus/cuda_configure.bzl
# A one-line Python command to update the file with the EasyBuild patch
$PYTHONPATH/python -c "$fixFile" $EDITME "$REPLACE" "$SUBIN"
# Modify tensorflow to use the local version of SWIG
# Fix found here:
# https://github.com/bazelbuild/bazel/issues/4053
EDITME=./tensorflow/tensorflow/tensorflow.bzl
REPLACE=" ctx.actions.run("
SUBIN=" ctx.actions.run(
use_default_shell_env=True,"
$PYTHONPATH/python -c "$fixFile" $EDITME "$REPLACE" "$SUBIN"
# Download Bazel
wget https://github.com/bazelbuild/bazel/releases/download/0.29.1/bazel-0.29.1-linux-x86_64
mkdir bazel/
mv bazel-* bazel/bazel
chmod 0770 bazel/bazel
export PATH="$PATH:$(pwd -P)/bazel/"
# Now you need to download and compile SWIG because the default one links to a
# newer version of libstdc++.so
wget https://sourceforge.net/projects/swig/files/swig/swig-4.0.1/swig-4.0.1.tar.gz/download -O swig.tar.gz
tar xf swig.tar.gz
mkdir swig
cd swig-4.0.1
./configure --prefix=$CWD/swig
make
make install
cd ../
export PATH="$PATH:$CWD/swig/bin/"
rm -rf swig-4.0.1/ swig.tar.gz
cd tensorflow
# Set all the variables for tensorflow so that no user input is required
export TF_NEED_CUDA=1 \
TF_CUDA_PATHS=$TOOLKITDIR,$CUDNN \
TF_CUDA_VERSION=10.1 \
GCC_HOST_COMPILER_PATH=$(which gcc) \
CUDNN_INSTALL_PATH=$CUDNN \
TF_CUDA_COMPUTE_CAPABILITIES=3.5,7.0 \
CC_OPT_FLAGS="--copt=-mavx --copt=-mavx2 --copt=-mfma --copt=-mfpmath=both --copt=-march=skylake-avx512,IVYBRIDGE,SSE4.2" \
PYTHON_BIN_PATH=$PYTHONPATH/python3 \
USE_DEFAULT_PYTHON_LIB_PATH=0 \
PYTHON_LIB_PATH=$CONDA_PREFIX/lib/ \
NCCL_INSTALL_PATH=$NCCL \
TF_NEED_JEMALLOC=1 \
TF_NEED_GCP=0 \
TF_NEED_HDFS=0 \
TF_ENABLE_XLA=1 \
TF_NEED_OPENCL=0 \
TF_NEED_ROCM=0 \
TF_NEED_TENSORRT=0 \
TF_NEED_OPENCL_SYCL=0 \
TF_CUDA_CLANG=0 \
TF_SET_ANDROID_WORKSPACE=0
./configure
# Hyper paranoid about getting things where they need to be
export TEST_TMPDIR=$TMPDIR
# Just an amalgamation from various sources.
bazel --output_user_root=$TMPDIR/bazel --output_base=$TMPDIR/ \
    build --verbose_failures \
    --config=cuda \
    --spawn_strategy=standalone --genrule_strategy=standalone \
    //tensorflow/tools/pip_package:build_pip_package
./bazel-bin/tensorflow/tools/pip_package/build_pip_package $TMP/tensorflow_pkg
mv $TMP/tensorflow_pkg/tensorflow-*.whl $CWD/
bazel shutdown