Installing Tensorflow on CENTOS 6.8 Cluster without Root Access

Environment:

OS: CENTOS 6.8 (No root access)

GCC: locally installed 5.2.0 (Cluster default is 4.4.7)

Bazel: 0.4.0-2016-11-06 (@fa407e5)

Tensorflow: v0.11.0rc2

CUDA: 8.0

CUDNN: 5.1.5

Steps:

You should be able to modify the script (buildtf.sh) below to do these steps automatically, but I list out details here as well.

Installing Java Locally:

Follow this Tutorial, or download your preferred version of JDK 8 and set the environment variables as described in the tutorial.
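For reference, the environment setup for a locally extracted JDK might look like the sketch below. The function name `use_local_jdk` and the example path are assumptions for illustration, not part of the tutorial; point it at wherever your JDK tarball extracted.

```shell
# Hypothetical helper: point the current shell at a JDK unpacked in user space.
# $1 = directory the JDK tarball extracted to (e.g. $HOME/jdk1.8.0_102).
use_local_jdk() {
    if [ ! -d "$1" ]; then
        echo "JDK directory not found: $1" >&2
        return 1
    fi
    JAVA_HOME=$(cd "$1" && pwd)          # store as an absolute path
    export JAVA_HOME
    export PATH="$JAVA_HOME/bin:$PATH"   # local java/javac shadow any system copy
}

# Example (path is an assumption):
#   use_local_jdk "$HOME/jdk1.8.0_102" && java -version
```

Putting the local `bin` directory first on `PATH` means the Bazel bootstrap picks up this JDK rather than an older system one.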

Compiling Bazel, Compiling and Installing Tensorflow:

Great Tutorial that got me to the error below!

Note: After changing the linker line to your local or module GCC, if you get errors about finding ld or other executables that are stored in /usr/bin, here is the workaround I used (it isn't pretty and you might not need it, but just in case):

  1. Copy your compiler directory (/opt/gcc/5.2.0) to a local directory that you have permissions to modify.

  2. Then run:

cp `which ld` /opt/gcc/5.2.0/bin/ld

(Substitute the path of your local copy of the compiler directory, and repeat for any command listed in the CROSSTOOL file that doesn't already reside in your GCC bin directory.)
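The copy step can be scripted so that existing tools are not overwritten. This is only a sketch: the function name `copy_missing_tools` and the example directory are assumptions, and the tool list should come from your own CROSSTOOL file.

```shell
# Hypothetical helper: copy tools found on PATH into a writable compiler bin
# directory, skipping any that are already present there.
# $1 = target bin directory, remaining args = tool names (ld, as, nm, ...).
copy_missing_tools() {
    target=$1
    shift
    for tool in "$@"; do
        if [ -e "$target/$tool" ]; then
            continue                      # the compiler directory already has this tool
        fi
        src=$(command -v "$tool") || { echo "not on PATH: $tool" >&2; continue; }
        cp "$src" "$target/$tool"
    done
}

# Example (directory is an assumption; use your local copy of the compiler):
#   copy_missing_tools "$HOME/gcc/5.2.0/bin" ld as nm
```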

Note 2: I downloaded newer releases of Bazel and TensorFlow, as noted above, and the latest versions of the crosstool require fewer changes than described in the tutorial.

  1. Modify /tensorflow/third_party/gpus/crosstool/CROSSTOOL.tpl as described in the tutorial above.

  2. Modify /tensorflow/third_party/gpus/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc.tpl as described in the tutorial above. I did not change the first line: #!/usr/bin/env python (but the tutorial does!)

Again, these steps led to the error below, which took me forever to get past:

GLIBCXX_3.4.18 not found error
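One way to confirm the cause is to check which GLIBCXX version tags each libstdc++ actually contains. The sketch below assumes the error comes from a freshly built binary loading the older system libstdc++ instead of the one shipped with the local GCC. The function names and both example paths are hypothetical; `strings` piped to `grep GLIBCXX` gives the same information if binutils is available.

```shell
# Hypothetical helper: list the GLIBCXX version tags embedded in a libstdc++.
# $1 = path to a libstdc++.so.6 file.
list_glibcxx_versions() {
    grep -a -o 'GLIBCXX_[0-9][0-9.]*' "$1" | sort -u
}

# Succeeds if the library at $1 contains the exact version tag $2.
has_glibcxx_version() {
    list_glibcxx_versions "$1" | grep -q -x -F "$2"
}

# Example (both paths are assumptions about a typical layout):
#   has_glibcxx_version /usr/lib64/libstdc++.so.6 GLIBCXX_3.4.18 || echo "system lib lacks it"
#   has_glibcxx_version /opt/gcc/5.2.0/lib64/libstdc++.so.6 GLIBCXX_3.4.18 && echo "local GCC has it"
```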

Getting Past GLIBCXX_3.4.18 Error:

As described in gbkedar's comment from Jul 12, you have to find this file:

$INSTALL_PATH/tensorflow/bazel-tensorflow/external/protobuf/protobuf.bzl

But until the compile fails, this file is hard to find. (buildtf.sh re-runs the compile after patching the file following the first failure.) The failed build creates the bazel-tensorflow shortcut in the /tensorflow directory. I ran into issues when re-attempting the compile and had to run ./configure almost every time, so I had to find this file before the first failure of my compile attempt. After running ./configure from the /tensorflow directory, the file should be located somewhere similar to this:

~/.cache/bazel/_bazel_YOURUSERNAME/YOURHASH(e.g. f81f1107f96c7515450fc43e0dbb6ed5)/external/protobuf/protobuf.bzl

If you have several hashes, check the files that were modified at the time corresponding to your ./configure run.
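The search can be sketched with `find`, listing candidates newest first so that the copy touched by your latest ./configure run comes out on top. The function name `find_protobuf_bzl` is hypothetical, and the default root simply mirrors the path pattern above; pass your own cache location if it differs.

```shell
# Hypothetical helper: list candidate protobuf.bzl files under a Bazel cache
# root, most recently modified first.
# $1 = cache root (defaults to the location described above).
find_protobuf_bzl() {
    root=${1:-"$HOME/.cache/bazel/_bazel_$USER"}
    find "$root" -path '*/external/protobuf/protobuf.bzl' -exec ls -t {} +
}

# Example: take the most recently modified candidate.
#   PROTOFILE=$(find_protobuf_bzl | head -n 1)
```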

As described in the error link above, search for ctx.action and add env=ctx.configuration.default_shell_env, at the bottom of the call like so:

  if args:
    ctx.action(
        inputs=inputs,
        outputs=ctx.outputs.outs,
        arguments=args + import_flags + [s.path for s in srcs],
        executable=ctx.executable.protoc,
        mnemonic="ProtoCompile",
        env=ctx.configuration.default_shell_env,
    )
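If you would rather not edit the file by hand, the same change can be applied with `sed`, as a sketch. The function name `patch_protobuf_bzl` is hypothetical. It assumes the `mnemonic="ProtoCompile",` line appears exactly as shown above, and it puts the new argument on the same line rather than on its own line.

```shell
# Hypothetical helper: add the env= argument to the ProtoCompile action once.
# $1 = path to protobuf.bzl. A backup is kept next to it as protobuf.bzl.orig.
patch_protobuf_bzl() {
    if grep -q 'env=ctx.configuration.default_shell_env' "$1"; then
        echo "already patched: $1"
        return 0
    fi
    # "&" re-inserts the matched text, so the new argument follows the mnemonic.
    sed -i.orig \
        's/mnemonic="ProtoCompile",/& env=ctx.configuration.default_shell_env,/' "$1"
}

# Example (resolve symlinks first so the real file in the Bazel cache is edited):
#   patch_protobuf_bzl "$(readlink -f bazel-tensorflow/external/protobuf/protobuf.bzl)"
```

The guard makes the helper safe to run again after a repeated ./configure, since an already patched file is left alone.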

You will then likely hit this error: error trying to exec 'as': execvp: No such file or directory. Since I am a self-confessed Linux noob, I reused the one trick I knew (I didn't follow gbkedar's 2nd comment):

cp `which as` /opt/gcc/5.2.0/bin/as

After this change, tensorflow finally compiled successfully for me!

Building .whl file:

Going back to our tutorial I ran this command:

bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg

and received the bdist_wheel not found error... I solved this by using pip install to install a new version of wheel locally:

pip install --target=/home/thpaul/python27-packages wheel

and then added that directory to my $PYTHONPATH variable:

export PYTHONPATH=/home/thpaul/python27-packages/:$PYTHONPATH

Re-running the command builds the proper .whl file which you can install via pip.
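Since the wheel's file name depends on the version and platform, a small helper can pick it up for you. This is a sketch; the function name `newest_wheel` is hypothetical and the directory in the example is the one passed to build_pip_package above.

```shell
# Hypothetical helper: print the most recently built wheel in a directory.
# $1 = output directory passed to build_pip_package.
newest_wheel() {
    ls -t "$1"/*.whl 2>/dev/null | head -n 1
}

# Example (directory matches the command above):
#   pip install "$(newest_wheel /tmp/tensorflow_pkg)"
```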

Hope this helps anyone trying to compile tensorflow from source!

#installs TF and all required dependencies except CUDNN* without root!
#*Requires signing up for account to download! (Pretty easy, but do this first!)
#https://developer.nvidia.com/cudnn
#Original Environment: CENTOS 6.8, non-standard GCC = 5.2.0
#To note, I copied every binary (ld, as, etc..) required by BAZEL (see tensorflow CROSSTOOL.tpl)
#into my GCC_DIR!
#TODO: There are a couple TODO's listed that will be system specific!
# Ensure we can load CUDA drivers.
module load cuda/8.0 || { echo 'Failed to load CUDA drivers. Are you not on a compute node?' ; exit 1; }
#TODO: GCC_DIR/LOCAL_INCLUDE/LOCAL_LIBRARY if not standard system gcc (which gcc)
STARTDIR=`pwd`/tf_tools
GCC_DIR=/work/thpaul/gcc/5.2.0
BAZEL_BIN_DIR=/work/thpaul/bin #/bin where to copy bazel binary
PYTHON_INSTALL_DIR=python27
JAVA_DIR=jdk1.8.0_102 #Directory you jdk.tar file extracts too (depends on which version you DL)
LOCAL_INCLUDE=$STARTDIR/include
LOCAL_LIBRARY=$STARTDIR/lib
#VERSIONS
PYTHON_VERSION=2.7.12
JAVA_FILE=jdk-8u102-linux-x64 #Update Java version in DOWNLOADS too...
BAZEL_VERSION=0.4.0 #TAG from github https://github.com/bazelbuild/bazel, don't use if latest release
TF_VERSION=v0.11.0rc2 #TAG from https://github.com/tensorflow/tensorflow/releases, don't use if latest release
#DOWNLOADS
#https://www.python.org/ftp/python/2.7.12/Python-2.7.12.tgz
wget https://www.python.org/ftp/python/$PYTHON_VERSION/Python-$PYTHON_VERSION.tgz
wget --no-check-certificate https://pypi.python.org/packages/source/s/setuptools/setuptools-1.4.2.tar.gz -O setuptools-1.4.2.tar.gz
wget --no-check-certificate --no-cookies --header "Cookie: oraclelicense=accept-securebackup-cookie" http://download.oracle.com/otn-pub/java/jdk/8u102-b14/jdk-8u102-linux-x64.tar.gz
wget https://sqlite.org/2016/sqlite-autoconf-3150100.tar.gz #TODO: update if newer version needed (3.8.6)
echo "Building Directories"
mkdir -p $STARTDIR
cd $STARTDIR
mkdir -p $PYTHON_INSTALL_DIR
cd $PYTHON_INSTALL_DIR
PYTHON_INSTALL_DIR=`pwd`
cd ..
# Set tmp directory to userspace
mkdir -p tmp
cd tmp
TMPDIR=`pwd`
cd ..
#Unzip archives
echo "Decompressing archives"
tar zxvf ../Python-$PYTHON_VERSION.tgz
tar --totals -xvf ../setuptools-1.4.2.tar.gz
tar --totals -xvf ../$JAVA_FILE.tar.gz
tar --totals -xvf ../sqlite-autoconf-3150100.tar.gz
cd sqlite-autoconf-3150100
SQLITE_INSTALL_DIR=`pwd`
echo "Installing sqlite3 libs at `pwd`!"
./configure --enable-shared --prefix=$SQLITE_INSTALL_DIR
make
make install
mkdir -p $LOCAL_INCLUDE $LOCAL_LIBRARY #these do not exist yet on a fresh run
cp -r ./include/* $LOCAL_INCLUDE
cp -r ./lib/* $LOCAL_LIBRARY
cd ..
cd Python-$PYTHON_VERSION
echo "Installing python at $PYTHON_INSTALL_DIR"
#TODO: Have to change setup.py to look in local include file for sqlite3 libraries
sed -i 's#/usr/local/include/sqlite3#'$LOCAL_INCLUDE'#g' ./setup.py
./configure --enable-shared --prefix=$PYTHON_INSTALL_DIR --enable-loadable-sqlite-extensions #TODO: need sqlite3 for nltk and others
make
make altinstall
export PATH=$PYTHON_INSTALL_DIR/bin:$PATH
cd ..
echo "----- Installing Pip"
cd setuptools-1.4.2
export LD_LIBRARY_PATH=$PYTHON_INSTALL_DIR/lib:$LD_LIBRARY_PATH
$PYTHON_INSTALL_DIR/bin/python2.7 setup.py install
curl https://bootstrap.pypa.io/get-pip.py | $PYTHON_INSTALL_DIR/bin/python2.7 -
pip install --no-cache-dir numpy
pip install -U nltk
cd ..
cd $JAVA_DIR
echo "Installing JAVA at `pwd`"
#Save JAVA variables:
JAVA_INSTALL_DIR=`pwd`
export JAVA_HOME=$JAVA_INSTALL_DIR
#JAVA_INSTALL_DIR already points inside the extracted jdk directory
export JAVA_JRE=$JAVA_INSTALL_DIR/jre
export PATH=$PATH:$JAVA_INSTALL_DIR/bin:$JAVA_INSTALL_DIR/jre/bin
cd ..
echo "Compiling bazel in `pwd`/bazel"
git clone https://github.com/bazelbuild/bazel.git
cd bazel
git checkout $BAZEL_VERSION #TODO: Format Specific to your git version, only need if not using bazel latest-release
./compile.sh
wait #TODO: Seems to want to compile twice???
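mkdir -p $BAZEL_BIN_DIR
cp ./output/bazel $BAZEL_BIN_DIR/bazel
export PATH=$BAZEL_BIN_DIR:$PATH #so ./configure and the builds below can find bazel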
cd ..
echo "Compiling tensorflow in `pwd`/tensorflow"
git clone https://github.com/tensorflow/tensorflow.git
cd tensorflow
git checkout $TF_VERSION #TODO: Format Specific to your git version if not latest-release
TF_INSTALL_DIR=`pwd`
# TODO: Adjust the configure file only if .cache is on an NFS and clean fails:
cp configure configure_orig #just in case
sed -i 's/bazel clean --expunge/bazel clean --expunge_async/g' configure
#Modify tensorflow CROSSTOOL.tpl file:
cp $TF_INSTALL_DIR/third_party/gpus/crosstool/CROSSTOOL.tpl ./third_party/gpus/crosstool/CROSSTOOL_ORIG.tpl
sed -i 's#/usr/bin#'$GCC_DIR'/bin#g' $TF_INSTALL_DIR/third_party/gpus/crosstool/CROSSTOOL.tpl
#Modify tensorflow crosstool_wrapper_driver_is_not_gcc.tpl file
cp $TF_INSTALL_DIR/third_party/gpus/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc.tpl \
$TF_INSTALL_DIR/third_party/gpus/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc_ORIG.tpl
sed -i 's#/usr/bin/gcc#'$GCC_DIR'/bin/gcc#g' \
$TF_INSTALL_DIR/third_party/gpus/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc.tpl
./configure
wait
#Can't just use the basic call from tensorflow.org install directions:
bazel build -c opt --config=cuda --genrule_strategy=standalone --spawn_strategy=standalone //tensorflow/tools/pip_package:build_pip_package
wait
#TODO: When/If fails with GLIBCXX... error, run this afterwards:
PROTOFILE=$(readlink -f -- "$TF_INSTALL_DIR/bazel-tensorflow/external/protobuf/protobuf.bzl")
cp $TF_INSTALL_DIR/bazel-tensorflow/external/protobuf/protobuf.bzl $TF_INSTALL_DIR/bazel-tensorflow/external/protobuf/ORIG_protobuf.bzl
sed -i 's/mnemonic="ProtoCompile",/mnemonic="ProtoCompile", env=ctx.configuration.default_shell_env,/g' \
$PROTOFILE
bazel build -c opt --config=cuda --genrule_strategy=standalone --spawn_strategy=standalone //tensorflow/tools/pip_package:build_pip_package
wait
bazel-bin/tensorflow/tools/pip_package/build_pip_package $TMPDIR/tensorflow_pkg
#Get name of the created whl file:
for filename in $TMPDIR/tensorflow_pkg/*;
do
export TF_WHEEL_FILE=$filename
done
#Finally install TF!
pip install $TF_WHEEL_FILE
echo "====================CAVEATS=============================="
echo "Don't forget to update the necessary Environment Variables in .bash_profile!"
echo 'echo "export PATH='$JAVA_INSTALL_DIR'/bin:'$JAVA_INSTALL_DIR'/jre/bin:$PATH" >> ~/.bash_profile'
echo 'echo "export PATH='$PYTHON_INSTALL_DIR'/bin:'$BAZEL_BIN_DIR':$PATH" >> ~/.bash_profile'
echo 'echo "export JAVA_HOME='$JAVA_INSTALL_DIR'" >> ~/.bash_profile'
echo 'echo "export JAVA_JRE='$JAVA_INSTALL_DIR'/jre" >> ~/.bash_profile'

Worked for me with minor tweaks on Scientific Linux 6.6 -- thank you!

i3v commented Dec 1, 2016

This answer describes another possible approach to fixing the problem with "as" - to hardlink "as","ld", and "nm" when building gcc. There's also a link to a related issue on TF github, where you can find a link to an issue on bazel github. For now - it is still open, so maybe they would fix it sometime.

mrdivine commented Feb 9, 2017

I have been fighting with this for the past two days. Looking through your script, I begin to recount the hardships I have endured. Quick question: would this work for CUDA 7.5? The reason I ask is because that's what's installed on the cluster already, and I can't seem to install updated drivers, i.e. CUDA 8, without root privileges. Any suggestions?

Owner

taylorpaul commented Mar 2, 2017 edited

@mrdivine. Sorry for taking so long to get back! I asked the maintainers of my cluster to install cuda 8.0. They were willing since we already had 7.5 installed. That being said, I am pretty sure this should work with older versions of cuda. You just have to provide the location of your cuda library anytime the tensorflow install asks for it. And be sure to install CUDNN locally following the link at the top of the script.
