@tnachen
Created October 15, 2018 00:49
Salus build instructions
You have to do the build steps manually, but I think this serves as a good starting point. The image is available on Docker Hub: https://hub.docker.com/r/qi437103/salus/tags/
You will need to start the container with the NVIDIA runtime (see https://github.com/NVIDIA/nvidia-docker).
docker run --runtime nvidia -it qi437103/salus:latest
# After starting the docker container, go to /salus
cd /salus
# get sources
git clone https://github.com/SymbioticLab/tensorflow-salus.git tensorflow
git clone https://github.com/SymbioticLab/Salus.git salus
# install dependencies
## There's currently an issue with the download URL for zeromq, so you need to edit its package file first
apt update && apt install -y vim
spack edit zeromq
## Change line 38 from "version('4.2.2', '52499909b29604c1e47a86f1cb6a9115')"
## to "version('4.2.2', '52499909b29604c1e47a86f1cb6a9115', url='https://github.com/zeromq/libzmq/releases/download/v4.2.2/zeromq-4.2.2.tar.gz')"
## Add missing tools
spack view -d false -v add /salus/packages pkgconf
## Then install all dependencies
spack install cppzmq@4.2.2 zeromq@4.2.2 protobuf@3.4.1~shared boost@1.66.0 nlohmann-json@3.1.2 gperftools@2.7
## Install bazel
curl -JOL "https://github.com/bazelbuild/bazel/releases/download/0.5.4/bazel_0.5.4-linux-x86_64.deb"
apt install -y bash-completion
dpkg -i bazel_0.5.4-linux-x86_64.deb
# create a virtualenv for tensorflow
apt install -y python-virtualenv
virtualenv /salus/tfbuild
source /salus/tfbuild/bin/activate
# build tensorflow
cd tensorflow
## map dependencies into source tree
spack view -d false -v add spack-packages cppzmq libsodium zeromq
## install python dependencies
pip install six numpy wheel mock
## initialize the build; don't answer yes when asked whether to edit the file, as there's an error that needs to be fixed first
inv init
## Instead, manually edit the file.
## Check that the following variables are set correctly:
## PYTHON_BIN_PATH: /salus/tfbuild/bin/python
## TF_CUDA_VERSION: 9.1
## CUDA_TOOLKIT_PATH: /usr/local/cuda
## TF_CUDNN_VERSION: 7
## CUDNN_INSTALL_PATH: /usr/lib/x86_64-linux-gnu
## GCC_HOST_COMPILER_PATH: /usr/bin/gcc-5
## TF_CUDA_COMPUTE_CAPABILITIES: <set according to your device>
vim invoke.yml
## configure the build system; the command should not ask any questions
## if you set the variables correctly in the previous step
inv cf
## build, install and save the wheel package to ~/downloads
inv bbi --save
# build salus
cd /salus/salus
git checkout develop
## map dependencies
spack view -d false -v add spack-packages cppzmq zeromq boost nlohmann-json protobuf gperftools
## some python dependencies for testing
pip install -r requirements.txt
## configure & install
mkdir -p build/Release && cd build/Release
export CC=gcc-7 CXX=g++-7
cmake -DCMAKE_BUILD_TYPE=Release -DTENSORFLOW_ROOT=/salus/tensorflow ../..
make -j
If everything goes well, you should have a binary at src/executor. It listens on port 5501 after startup; Ctrl-C stops it.
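As an optional sanity check (my own snippet, not part of the repo), the short Python script below just confirms that a freshly started executor is accepting connections; 127.0.0.1 and port 5501 are assumptions based on the default described above.

import socket

# Try to open a TCP connection to the running executor (assumed defaults).
sock = socket.create_connection(("127.0.0.1", 5501), timeout=5)
print("Salus executor is accepting connections on port 5501")
sock.close()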
Run test workloads
Before you run your own workloads, you can try some test workloads to verify that the system compiled correctly. The helper script I used to run experiments for my paper is also included in the repo.
It expects a certain layout of workload scripts, which you can set up as below:
cd /salus
git clone https://github.com/Aetf/tf_benchmarks.git
Then you can go back to the salus folder. The script assumes a certain hardware layout of the system, which you can override by setting CUDA_VISIBLE_DEVICES=0.
cd /salus/salus
export CUDA_VISIBLE_DEVICES=0
# 25 is the batch size, 20 is the number of batches
./bc one vgg16 25 20 --force_preset MostEfficient
Run user scripts
Using the tfbuild virtualenv, you can now run user scripts. Instead of creating a local session, create a session with the target "zrpc://tcp://127.0.0.1:5501" so that it connects to Salus. For now, you have to mark each training iteration with a no-op operation named "salus-marker"; this requirement will be removed later. Just create the operation before the iterations and add it to session.run. Something like:
session.run([train_op, marker_op])
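A minimal sketch of such a script, assuming the tensorflow-salus build from above and an executor running on the default port; the tiny counter graph is only a placeholder for your own model and train_op:

import tensorflow as tf

# Stand-in "model": a counter variable and an op that increments it.
# Replace these with your real training graph and train_op.
step = tf.Variable(0, name="global_step")
train_op = tf.assign_add(step, 1)

# The no-op marker Salus currently expects, as described above.
marker_op = tf.no_op(name="salus-marker")

# Connect to the Salus executor instead of creating a local session.
with tf.Session("zrpc://tcp://127.0.0.1:5501") as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(20):
        sess.run([train_op, marker_op])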