First, set an environment variable for the directory where we'll do our installation and computing.
export WORK=/vega/<group>/users/<username>
Now, install and set up your own copy of Python (2 or 3); I'll use 2.7.12.
cd $WORK
mkdir applications
cd applications
mkdir python
cd python
wget https://www.python.org/ftp/python/2.7.12/Python-2.7.12.tgz
tar -xvzf Python-2.7.12.tgz
find Python-2.7.12 -type d | xargs chmod 0755
cd Python-2.7.12
./configure --prefix=$WORK/applications/python --enable-shared
make && make install
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:$WORK/applications/python/lib"
You can add this Python to your path, but I am just going to work entirely out of virtual environments and will leave the default path as-is. If you're particular about folder structure, you can install each Python version under its own prefix, e.g. $WORK/applications/python/Python-2.7.12, to keep separate versions well-organized and easily available.
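If you do keep one prefix per version, a couple of lines in your ~/.profile let you switch the active interpreter. The layout below is hypothetical, matching the versioned structure suggested above:

```shell
# Hypothetical layout: one install prefix per interpreter version.
PYROOT=$WORK/applications/python/Python-2.7.12
# Put the chosen version's binaries and shared libraries on the search paths.
export PATH=$PYROOT/bin:$PATH
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:$PYROOT/lib"
```

Swapping to another version is then just a matter of pointing PYROOT at a different prefix.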
Now, we'll install pip.
cd $WORK/applications
wget https://bootstrap.pypa.io/get-pip.py
$WORK/applications/python/bin/python get-pip.py
Now to install and set up a virtualenv:
$WORK/applications/python/bin/pip install virtualenv
cd $WORK/applications
$WORK/applications/python/bin/virtualenv pythonenv
Now, create an alias in your ~/.profile to allow easy access to the virtualenv.
alias pythonenv="source $WORK/applications/pythonenv/bin/activate"
There you have it! Your own local Python installation, in a virtualenv, just a pythonenv command away. You can also install multiple Python versions and pick which one you want for a particular virtualenv. Nice and self-contained.
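If you end up with several environments, the same alias trick scales: one activation alias per virtualenv in ~/.profile. The second environment name here is hypothetical, just to illustrate:

```shell
# In ~/.profile -- one activation alias per environment.
alias pythonenv="source $WORK/applications/pythonenv/bin/activate"
alias py27sci="source $WORK/applications/py27sci/bin/activate"   # hypothetical second env
```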
The TensorFlow binary requires GLIBC 2.14, but Yeti runs RHEL 6.7, which ships with GLIBC 2.12. Installing a new GLIBC from source will lead you down a rabbit hole of system dependencies and compilation errors, but we have another option. Installing Bazel will let us compile TensorFlow from source. Bazel requires OpenJDK 8:
# Do this in an interactive session because submit queues don't have enough memory.
cd $WORK/applications
wget --no-check-certificate --no-cookies --header "Cookie: oraclelicense=accept-securebackup-cookie" http://download.oracle.com/otn-pub/java/jdk/8u112-b15/jdk-8u112-linux-x64.tar.gz
tar -xzf jdk-8u112-linux-x64.tar.gz
Add these two lines to your ~/.profile:
export PATH=$WORK/applications/jdk1.8.0_112/bin:$PATH
export JAVA_HOME=$WORK/applications/jdk1.8.0_112
Now, get a copy of Bazel. We also need to load a newer copy of gcc to compile Bazel:
wget https://github.com/bazelbuild/bazel/releases/download/0.4.2/bazel-0.4.2-dist.zip
unzip bazel-0.4.2-dist.zip -d bazel
cd bazel
module load gcc/4.9.1
./compile.sh
Add the following to your ~/.profile:
export PATH=$WORK/applications/bazel/output:$PATH
We're going to install TensorFlow from source using Bazel.
Make sure numpy is installed in your pythonenv (pip install numpy).
Clone the TensorFlow repository.
cd $WORK/applications
git clone https://github.com/tensorflow/tensorflow.git
cd tensorflow
git checkout r0.12
We also need to install swig:
cd $WORK/applications
# Get swig-3.0.10.tar.gz from SourceForge.
tar -xzf swig-3.0.10.tar.gz
mkdir swig
cd swig-3.0.10
./configure --prefix=$WORK/applications/swig
make
make install
Add the following to your ~/.profile:
export PATH=$WORK/applications/swig/bin:$PATH
We need to set the following environment variables. Add them to your ~/.profile:
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda-7.5/extras/CUPTI/lib64"
export CUDA_HOME=/usr/local/cuda-7.5
Note that /usr/local/cuda-7.5/lib64 is automatically added to $LD_LIBRARY_PATH when you run module load cuda, so we only need to add the other directories. Also note that /usr/local/cuda is symlinked to /usr/local/cuda-7.5, so you don't need the version numbers in the paths, but I'm including them to be explicit.
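Collected in one place, here is everything this guide has added to ~/.profile so far (assuming the paths used throughout):

```shell
# ~/.profile additions, accumulated from the steps above.
export WORK=/vega/<group>/users/<username>
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:$WORK/applications/python/lib"
export PATH=$WORK/applications/jdk1.8.0_112/bin:$PATH
export JAVA_HOME=$WORK/applications/jdk1.8.0_112
export PATH=$WORK/applications/bazel/output:$PATH
export PATH=$WORK/applications/swig/bin:$PATH
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda-7.5/extras/CUPTI/lib64"
export CUDA_HOME=/usr/local/cuda-7.5
alias pythonenv="source $WORK/applications/pythonenv/bin/activate"
```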
To install TensorFlow, we just need to grab some GPU nodes and load a few libraries, which we can do in an interactive session. Running module load cuda loads CUDA 7.5 and cuDNN. Then we can install with Bazel:
# This gives you a 1-hour interactive session with GPU support.
# It may take a while to start the interactive session, depending on current wait times.
qsub -I -W group_list=<yetigroup> -l walltime=01:00:00,nodes=1:gpus=1:exclusive_process
# Use latest available gcc for compatibility.
# CUDA loads 7.5 by default.
# Load the proxy to allow TF to download and install protobuf and other dependencies.
module load gcc/4.9.1 cuda proxy
pythonenv
cd $WORK/applications/tensorflow
./configure
# I used all the default settings except for CUDA compute capabilities, which I set to 3.5 for our k20 and k40 GPUs.
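As an aside, the r0.12 configure script only prompts for values it can't find in the environment, so repeat builds can pre-seed the answers. This is an unverified sketch; the variable names are taken from that era's configure script and are worth double-checking against your checkout:

```shell
# Pre-answer the CUDA questions; anything left unset is still prompted for.
export TF_NEED_CUDA=1
export TF_CUDA_COMPUTE_CAPABILITIES=3.5   # k20 and k40 GPUs
./configure
```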
Once that is done, make the following change to third_party/gpus/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc.tpl to add the -fno-use-linker-plugin compiler flag:
index 20449a1..48a4e60 100755
--- a/third_party/gpus/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc.tpl
+++ b/third_party/gpus/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc.tpl
@@ -309,6 +309,7 @@ def main():
   # TODO(eliben): rename to a more descriptive name.
   cpu_compiler_flags.append('-D__GCUDACC_HOST__')
+  cpu_compiler_flags.append('-fno-use-linker-plugin')
   return subprocess.call([CPU_COMPILER] + cpu_compiler_flags)
 if __name__ == '__main__':
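If you'd rather not edit the file by hand, the same change can be scripted. This is a sketch: it assumes the -D__GCUDACC_HOST__ line appears exactly as in the diff above, indented two spaces, and it is guarded so it does nothing if run outside the TensorFlow source root:

```shell
# Insert the -fno-use-linker-plugin append right after the -D__GCUDACC_HOST__ line.
TPL=third_party/gpus/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc.tpl
if [ -f "$TPL" ]; then
  sed -i "s|cpu_compiler_flags.append('-D__GCUDACC_HOST__')|&\n  cpu_compiler_flags.append('-fno-use-linker-plugin')|" "$TPL"
fi
```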
Now we can build with Bazel:
bazel build -c opt --config=cuda --verbose_failures //tensorflow/cc:tutorials_example_trainer
The build should fail with an error that goes something like undefined reference to symbol 'ceil@@GLIBC_2.2.5' or undefined reference to symbol 'clock_gettime@@GLIBC_2.2.5'. To fix this, modify LINK_OPTS in bazel-tensorflow/external/protobuf/BUILD by adding the -lm and -lrt flags to //conditions:default:
LINK_OPTS = select({
":android": [],
"//conditions:default": ["-lpthread", "-lm", "-lrt"],
})
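This edit can also be scripted. A sketch: it assumes the default entry in the generated BUILD file is exactly ["-lpthread"], and is a no-op if the file isn't there yet:

```shell
# Add -lm and -lrt to protobuf's default link options.
BUILD=bazel-tensorflow/external/protobuf/BUILD
if [ -f "$BUILD" ]; then
  sed -i 's|"//conditions:default": \["-lpthread"\]|"//conditions:default": ["-lpthread", "-lm", "-lrt"]|' "$BUILD"
fi
```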
Restart the build and run the sample trainer:
bazel build -c opt --config=cuda --verbose_failures //tensorflow/cc:tutorials_example_trainer
bazel-bin/tensorflow/cc/tutorials_example_trainer --use_gpu
If everything goes okay, build the pip wheel:
bazel build -c opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
bazel-bin/tensorflow/tools/pip_package/build_pip_package $WORK/applications/tensorflow
pip install $WORK/applications/tensorflow/tensorflow-0.12.0-cp27-cp27m-linux_x86_64.whl
Try training on MNIST data to see if your installation works:
cd tensorflow/models/image/mnist
python convolutional.py
Finally, some errors you may run into along the way, and their fixes:

undefined reference to symbol 'clock_gettime@@GLIBC_2.2.5'
- Add the --linkopt=-lrt flag to bazel build, or see the LINK_OPTS fix above.

Undefined reference to symbol 'ceil@@GLIBC_2.2.5'
- See the LINK_OPTS fix above (the -lm flag).

(directory not empty) error during ./configure; Linking of rule '@protobuf//:protoc' failed: crosstool_wrapper_driver_is_not_gcc failed: with /usr/bin/ld: unrecognized option '-plugin'; Highwayhash issues
- Compile with gcc 4.9.1, or make changes here.

ERROR: no such package '@local_config_cuda//crosstool': BUILD file not found on package path.
- Re-run ./configure before re-running bazel build.

tensorflow/stream_executor/cuda/cuda_driver.cc:383] Check failed: CUDA_SUCCESS == dynload::cuCtxSetCurrent(context) (0 vs. 216)
- Set the compute context to DEFAULT or EXCLUSIVE_PROCESS by adding the exclusive_process flag to the qsub call (see above), and set CUDA_VISIBLE_DEVICES to the GPU in exclusive_process mode.