Skip to content

Instantly share code, notes, and snippets.

@jorgemf
Last active July 19, 2018 14:07
Show Gist options
  • Save jorgemf/c791841f769bff96718fd54bbdecfd4e to your computer and use it in GitHub Desktop.
Save jorgemf/c791841f769bff96718fd54bbdecfd4e to your computer and use it in GitHub Desktop.
Dockerfile to compile TensorFlow Serving 1.6 using GPU
# docker build --pull -t tf/tensorflow-serving --label 1.6 -f Dockerfile .
# export TF_SERVING_PORT=9000
# export TF_SERVING_MODEL_PATH=/tf_models/mymodel
# export CONTAINER_NAME=tf_serving_1_6
# CUDA_VISIBLE_DEVICES=0 docker run --runtime=nvidia -it -p $TF_SERVING_PORT:$TF_SERVING_PORT -v $TF_SERVING_MODEL_PATH:/root/tf_model --name $CONTAINER_NAME tf/tensorflow-serving /usr/local/bin/tensorflow_model_server --port=$TF_SERVING_PORT --enable_batching=true --model_base_path=/root/tf_model/
# docker start -ai $CONTAINER_NAME
FROM nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04
# CUDA and CUDNN versions (must match the image source)
ENV TF_CUDA_VERSION=9.0 \
TF_CUDNN_VERSION=7 \
TF_SERVING_COMMIT=tags/1.6.0 \
BAZEL_VERSION=0.11.1
# Set up ubuntu packages
RUN apt-get update && apt-get install -y \
build-essential \
curl \
git \
libfreetype6-dev \
libpng12-dev \
libzmq3-dev \
mlocate \
pkg-config \
python-dev \
python-numpy \
python-pip \
software-properties-common \
swig \
zip \
zlib1g-dev \
libcurl3-dev \
openjdk-8-jdk\
openjdk-8-jre-headless \
wget \
&& \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
# Set up grpc
RUN pip install mock grpcio
# Set up Bazel.
# Running bazel inside a `docker build` command causes trouble, cf: https://github.com/bazelbuild/bazel/issues/134
RUN echo "startup --batch" >>/root/.bazelrc
# Similarly, we need to workaround sandboxing issues: https://github.com/bazelbuild/bazel/issues/418
RUN echo "build --spawn_strategy=standalone --genrule_strategy=standalone" >>/root/.bazelrc
ENV BAZELRC /root/.bazelrc
# Install the most recent bazel release.
WORKDIR /bazel
RUN curl -fSsL -O https://github.com/bazelbuild/bazel/releases/download/$BAZEL_VERSION/bazel-$BAZEL_VERSION-installer-linux-x86_64.sh && \
chmod +x bazel-*.sh && \
./bazel-$BAZEL_VERSION-installer-linux-x86_64.sh
# Fix paths so that CUDNN can be found: https://github.com/tensorflow/tensorflow/issues/8264
WORKDIR /
RUN mkdir /usr/lib/x86_64-linux-gnu/include/ && \
ln -s /usr/lib/x86_64-linux-gnu/include/cudnn.h /usr/lib/x86_64-linux-gnu/include/cudnn.h && \
ln -s /usr/include/cudnn.h /usr/local/cuda/include/cudnn.h && \
ln -s /usr/lib/x86_64-linux-gnu/libcudnn.so /usr/local/cuda/lib64/libcudnn.so && \
ln -s /usr/lib/x86_64-linux-gnu/libcudnn.so.$TF_CUDNN_VERSION /usr/local/cuda/lib64/libcudnn.so.$TF_CUDNN_VERSION
# Enable CUDA support
ENV TF_NEED_CUDA=1 \
TF_CUDA_COMPUTE_CAPABILITIES="3.0,3.5,5.2,6.0,6.1" \
LD_LIBRARY_PATH=/usr/local/cuda/extras/CUPTI/lib64:$LD_LIBRARY_PATH
# Download TensorFlow Serving
WORKDIR /tensorflow
RUN git clone --recurse-submodules https://github.com/tensorflow/serving
WORKDIR /tensorflow/serving
RUN git checkout $TF_SERVING_COMMIT
# Build TensorFlow Serving
WORKDIR /tensorflow/serving
RUN bazel build -c opt --copt=-mavx --copt=-mavx2 --copt=-mfma --copt=-mfpmath=both --copt=-msse4.2 --config=cuda -k --verbose_failures --crosstool_top=@local_config_cuda//crosstool:toolchain tensorflow_serving/model_servers:tensorflow_model_server
# Install tensorflow_model_server and clean bazel
RUN cp bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server /usr/local/bin/ && \
bazel clean --expunge
CMD ["/bin/bash"]
@discordianfish
Copy link

Looks like the is no nvcc 9.0 available anymore, so I've updated cuda to 9.2 but now it fails to link the binary:

INFO: Analysed target //tensorflow_serving/model_servers:tensorflow_model_server (127 packages loaded).
INFO: Found 1 target...
[4,598 / 4,599] Linking .../model_servers/tensorflow_model_server; 7s local

[4,598 / 4,599] Linking .../model_servers/tensorflow_model_server; 8s local
ERROR: /serving/tensorflow_serving/model_servers/BUILD:270:1: Linking of rule '//tensorflow_serving/model_servers:tensorflow_model_server' failed (Exit 1)
bazel-out/k8-opt/bin/external/org_tensorflow/tensorflow/contrib/nccl/libnccl_kernels.lo(nccl_manager.o): In function `std::_Function_handler<void (), tensorflow::NcclManager::LoopKernelLaunches(tensorflow::NcclManager::NcclStream*)::{lambda()#1}>::_M_invoke(std::_Any_data const&)':
nccl_manager.cc:(.text._ZNSt17_Function_handlerIFvvEZN10tensorflow11NcclManager18LoopKernelLaunchesEPNS2_10NcclStreamEEUlvE_E9_M_invokeERKSt9_Any_data+0x141): undefined reference to `ncclGetErrorString'
bazel-out/k8-opt/bin/external/org_tensorflow/tensorflow/contrib/nccl/libnccl_kernels.lo(nccl_manager.o): In function `tensorflow::NcclManager::LoopKernelLaunches(tensorflow::NcclManager::NcclStream*)':
nccl_manager.cc:(.text._ZN10tensorflow11NcclManager18LoopKernelLaunchesEPNS0_10NcclStreamE+0x23d): undefined reference to `ncclAllReduce'
nccl_manager.cc:(.text._ZN10tensorflow11NcclManager18LoopKernelLaunchesEPNS0_10NcclStreamE+0x31f): undefined reference to `ncclReduce'
nccl_manager.cc:(.text._ZN10tensorflow11NcclManager18LoopKernelLaunchesEPNS0_10NcclStreamE+0x383): undefined reference to `ncclBcast'
bazel-out/k8-opt/bin/external/org_tensorflow/tensorflow/contrib/nccl/libnccl_kernels.lo(nccl_manager.o): In function `void std::vector<std::unique_ptr<tensorflow::NcclManager::Communicator, std::default_delete<tensorflow::NcclManager::Communicator> >, std::allocator<std::unique_ptr<tensorflow::NcclManager::Communicator, std::default_delete<tensorflow::NcclManager::Communicator> > > >::_M_realloc_insert<tensorflow::NcclManager::Communicator*>(__gnu_cxx::__normal_iterator<std::unique_ptr<tensorflow::NcclManager::Communicator, std::default_delete<tensorflow::NcclManager::Communicator> >*, std::vector<std::unique_ptr<tensorflow::NcclManager::Communicator, std::default_delete<tensorflow::NcclManager::Communicator> >, std::allocator<std::unique_ptr<tensorflow::NcclManager::Communicator, std::default_delete<tensorflow::NcclManager::Communicator> > > > >, tensorflow::NcclManager::Communicator*&&)':
nccl_manager.cc:(.text._ZNSt6vectorISt10unique_ptrIN10tensorflow11NcclManager12CommunicatorESt14default_deleteIS3_EESaIS6_EE17_M_realloc_insertIJPS3_EEEvN9__gnu_cxx17__normal_iteratorIPS6_S8_EEDpOT_[_ZNSt6vectorISt10unique_ptrIN10tensorflow11NcclManager12CommunicatorESt14default_deleteIS3_EESaIS6_EE17_M_realloc_insertIJPS3_EEEvN9__gnu_cxx17__normal_iteratorIPS6_S8_EEDpOT_]+0x159): undefined reference to `ncclCommDestroy'
bazel-out/k8-opt/bin/external/org_tensorflow/tensorflow/contrib/nccl/libnccl_kernels.lo(nccl_manager.o): In function `tensorflow::NcclManager::GetCommunicator(tensorflow::NcclManager::Collective*)':
nccl_manager.cc:(.text._ZN10tensorflow11NcclManager15GetCommunicatorEPNS0_10CollectiveE+0x565): undefined reference to `ncclGetUniqueId'
nccl_manager.cc:(.text._ZN10tensorflow11NcclManager15GetCommunicatorEPNS0_10CollectiveE+0x574): undefined reference to `ncclGroupStart'
nccl_manager.cc:(.text._ZN10tensorflow11NcclManager15GetCommunicatorEPNS0_10CollectiveE+0x66b): undefined reference to `ncclCommInitRank'
nccl_manager.cc:(.text._ZN10tensorflow11NcclManager15GetCommunicatorEPNS0_10CollectiveE+0x6ed): undefined reference to `ncclGetErrorString'
nccl_manager.cc:(.text._ZN10tensorflow11NcclManager15GetCommunicatorEPNS0_10CollectiveE+0xc73): undefined reference to `ncclCommInitAll'
nccl_manager.cc:(.text._ZN10tensorflow11NcclManager15GetCommunicatorEPNS0_10CollectiveE+0xe46): undefined reference to `ncclGroupEnd'
collect2: error: ld returned 1 exit status
Target //tensorflow_serving/model_servers:tensorflow_model_server failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 48.712s, Critical Path: 19.38s
FAILED: Build did NOT complete successfully

My Dockerfile is pretty much the same a s https://raw.githubusercontent.com/tensorflow/serving/c8cc43b/tensorflow_serving/tools/docker/Dockerfile.devel-gpu but based on cuda 9.2 and using libpng-dev instead of libpng12-dev because my cuda 9.2 image is based on ubuntu 18.04.

I'd test with 9.1 and other ubuntu and/or serving versions, but the build is so aweful slow (>2h on my rather beefy laptop)..

@jorgemf
Copy link
Author

jorgemf commented Jun 12, 2018

@discordianfish any change you make in the dockerfile can make it not work. For example, I think latest ubuntu version doesn't have cuda 9.0, which is required to compile TF serving (because of some dirvers issues I think). So if you changed that it wont work unless you add the necessary workarounds.

@jorgemf
Copy link
Author

jorgemf commented Jun 12, 2018

@ps-account
Copy link

ps-account commented Jul 19, 2018

For some reason this works better than the official docker file for my case. the "latest-devel-gpu" TF serving pulled from the docker repo doesn't recognize my CUDA device, whereas this one does. Even when building the official latest-devel-gpu from github.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment