
@jorgemf
Last active July 19, 2018 14:07
Dockerfile to compile TensorFlow Serving 1.6 with GPU support
# docker build --pull -t tf/tensorflow-serving --label 1.6 -f Dockerfile .
# export TF_SERVING_PORT=9000
# export TF_SERVING_MODEL_PATH=/tf_models/mymodel
# export CONTAINER_NAME=tf_serving_1_6
# CUDA_VISIBLE_DEVICES=0 docker run --runtime=nvidia -it -p $TF_SERVING_PORT:$TF_SERVING_PORT -v $TF_SERVING_MODEL_PATH:/root/tf_model --name $CONTAINER_NAME tf/tensorflow-serving /usr/local/bin/tensorflow_model_server --port=$TF_SERVING_PORT --enable_batching=true --model_base_path=/root/tf_model/
# docker start -ai $CONTAINER_NAME
FROM nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04
# CUDA and CUDNN versions (must match the image source)
ENV TF_CUDA_VERSION=9.0 \
    TF_CUDNN_VERSION=7 \
    TF_SERVING_COMMIT=tags/1.6.0 \
    BAZEL_VERSION=0.11.1
# Set up ubuntu packages
RUN apt-get update && apt-get install -y \
        build-essential \
        curl \
        git \
        libfreetype6-dev \
        libpng12-dev \
        libzmq3-dev \
        mlocate \
        pkg-config \
        python-dev \
        python-numpy \
        python-pip \
        software-properties-common \
        swig \
        zip \
        zlib1g-dev \
        libcurl3-dev \
        openjdk-8-jdk \
        openjdk-8-jre-headless \
        wget \
        && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*
# Set up grpc
RUN pip install mock grpcio
# Set up Bazel.
# Running bazel inside a `docker build` command causes trouble, cf: https://github.com/bazelbuild/bazel/issues/134
RUN echo "startup --batch" >>/root/.bazelrc
# Similarly, we need to workaround sandboxing issues: https://github.com/bazelbuild/bazel/issues/418
RUN echo "build --spawn_strategy=standalone --genrule_strategy=standalone" >>/root/.bazelrc
ENV BAZELRC /root/.bazelrc
# Install the most recent bazel release.
WORKDIR /bazel
RUN curl -fSsL -O https://github.com/bazelbuild/bazel/releases/download/$BAZEL_VERSION/bazel-$BAZEL_VERSION-installer-linux-x86_64.sh && \
    chmod +x bazel-*.sh && \
    ./bazel-$BAZEL_VERSION-installer-linux-x86_64.sh
# Fix paths so that CUDNN can be found: https://github.com/tensorflow/tensorflow/issues/8264
WORKDIR /
RUN mkdir /usr/lib/x86_64-linux-gnu/include/ && \
    ln -s /usr/include/cudnn.h /usr/lib/x86_64-linux-gnu/include/cudnn.h && \
    ln -s /usr/include/cudnn.h /usr/local/cuda/include/cudnn.h && \
    ln -s /usr/lib/x86_64-linux-gnu/libcudnn.so /usr/local/cuda/lib64/libcudnn.so && \
    ln -s /usr/lib/x86_64-linux-gnu/libcudnn.so.$TF_CUDNN_VERSION /usr/local/cuda/lib64/libcudnn.so.$TF_CUDNN_VERSION
# Enable CUDA support
ENV TF_NEED_CUDA=1 \
    TF_CUDA_COMPUTE_CAPABILITIES="3.0,3.5,5.2,6.0,6.1" \
    LD_LIBRARY_PATH=/usr/local/cuda/extras/CUPTI/lib64:$LD_LIBRARY_PATH
# Download TensorFlow Serving
WORKDIR /tensorflow
RUN git clone --recurse-submodules https://github.com/tensorflow/serving
WORKDIR /tensorflow/serving
RUN git checkout $TF_SERVING_COMMIT
# Build TensorFlow Serving
WORKDIR /tensorflow/serving
RUN bazel build -c opt --copt=-mavx --copt=-mavx2 --copt=-mfma --copt=-mfpmath=both --copt=-msse4.2 --config=cuda -k --verbose_failures --crosstool_top=@local_config_cuda//crosstool:toolchain tensorflow_serving/model_servers:tensorflow_model_server
# Install tensorflow_model_server and clean bazel
RUN cp bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server /usr/local/bin/ && \
    bazel clean --expunge
CMD ["/bin/bash"]
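
Once the image builds, a quick sanity check that needs no model (just a smoke test that the binary runs; it should print the server's flags or usage text):

docker run --rm tf/tensorflow-serving /usr/local/bin/tensorflow_model_server --help
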
@qiaohaijun

good job

@taojian1989

I successfully built the image, but when I run it, copy the model into the running container, and run the command "/root/serving/bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server --port=9001 --model_name="default" --model_base_path="/data/serving_model/"",
there is an error log:
2018-03-01 02:23:39.972222: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:406] failed call to cuInit: CUresult(-1)
2018-03-01 02:23:39.972285: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_diagnostics.cc:152] no NVIDIA GPU device is present: /dev/nvidia0 does not exist
It looks like serving cannot find the GPU?
PS: my OS is Ubuntu 16.04, CUDA is 9.0, and cuDNN is 7.0.

@jorgemf
Author

jorgemf commented Mar 20, 2018

@taojian1989 You need the nvidia runtime for docker; take a look at the first lines, which describe how to build and run the container.
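
A quick way to verify that the nvidia runtime is set up (a minimal smoke test; any CUDA base image works, nvidia/cuda:9.0-base is just an example):

docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi

If nvidia-smi lists your GPU there, the same --runtime=nvidia flag in the run command at the top of this gist will expose it to tensorflow_model_server.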

@samuelswu

Thanks for putting this together. I built it as well, but I get this error when running serving. The host OS is Ubuntu 16.04 with CUDA 9.0, nvidia-docker2 version 2.0.3+docker18.03.0-1, and docker-ce version 18.03.0ce-0ubuntu. It seems to have trouble finding libcuda.so, and the GPU does not seem to be used.

2018-04-06 18:31:11.489706: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:236] Loading SavedModel from: /root/tf_model/1
2018-04-06 18:31:11.490992: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:406] failed call to cuInit: CUresult(-1)
2018-04-06 18:31:11.491030: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_diagnostics.cc:158] retrieving CUDA diagnostic information for host: edc550abaa6b
2018-04-06 18:31:11.491045: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_diagnostics.cc:165] hostname: edc550abaa6b
2018-04-06 18:31:11.491143: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program
2018-04-06 18:31:11.491198: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_diagnostics.cc:369] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module 384.111 Tue Dec 19 23:51:45 PST 2017

@jorgemf
Author

jorgemf commented Apr 9, 2018

@samuelswu There is a Dockerfile in the serving repo; they make some changes for the CUDA issue: https://github.com/tensorflow/serving/blob/master/tensorflow_serving/tools/docker/Dockerfile.devel-gpu

I will try to apply them and test it.
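
To see exactly what they changed, one way is to diff this gist's Dockerfile against the official one (the local file name is just this gist's copy):

curl -fsSL https://raw.githubusercontent.com/tensorflow/serving/master/tensorflow_serving/tools/docker/Dockerfile.devel-gpu -o Dockerfile.devel-gpu
diff -u Dockerfile Dockerfile.devel-gpu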

@maorzalt

Remove this line from the Dockerfile:
ln -s /usr/local/cuda/lib64/stubs/libcuda.so /usr/local/cuda/lib64/libcuda.so.1
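
A likely explanation for why removing it helps (my assumption based on how nvidia-docker works, not something verified in this thread): the stub libcuda.so only provides link-time symbols, so if it shadows the real driver library that the nvidia runtime mounts into the container, cuInit fails at run time. To check which libcuda the dynamic linker resolves inside a running container:

docker exec $CONTAINER_NAME ldconfig -p | grep libcuda

It should point at the library injected by the nvidia runtime, not at anything under /usr/local/cuda/lib64/stubs.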

@jorgemf
Author

jorgemf commented Apr 11, 2018

Updated to 1.6 with GPU support.

@maorzalt The problem wasn't that line. I added it to fix an issue, but it wasn't enough to make it work with the GPU.

@pimp89

pimp89 commented Apr 24, 2018

I've tried to build TF Serving from your file, but I got this error during bazel compilation:

ERROR: /root/.cache/bazel/_bazel_root/d9c8385ec38b40593868ab263ecdc773/external/org_tensorflow/tensorflow/contrib/nccl/BUILD:68:1:
Couldn't build file external/org_tensorflow/tensorflow/contrib/nccl/_objs/nccl_kernels/external/org_tensorflow/tensorflow/contrib/nccl/kernels/nccl_ops.o:
C++ compilation of rule '@org_tensorflow//tensorflow/contrib/nccl:nccl_kernels' failed (Exit 1):
crosstool_wrapper_driver_is_not_gcc failed: error executing command

Any idea how to deal with that, or what could have changed so that nccl is not building properly?
I've researched the problem, and one workaround is to comment out the dep for nccl in tensorflow/tensorflow/contrib/BUILD, but it requires changing the @org_tensorflow variable.
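
A hedged sketch of that workaround (the dep label and path are from memory and may differ by version; treat them as placeholders): with @org_tensorflow pointing at the local tensorflow submodule, comment out the nccl dep in its contrib/BUILD before building, e.g. from /tensorflow/serving:

sed -i 's|"//tensorflow/contrib/nccl:nccl_py",|# &|' tensorflow/tensorflow/contrib/BUILD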

@wojss

wojss commented Apr 25, 2018

My build always fails; it gets to this step and then exits. I don't know why, and there is no error message.

external/org_tensorflow/tensorflow/core/kernels/cwise_ops.h(169): warning: __device__ annotation on a defaulted function("scalar_left") is ignored
external/org_tensorflow/tensorflow/core/kernels/cwise_ops.h(199): warning: __host__ annotation on a defaulted function("scalar_right") is ignored
external/org_tensorflow/tensorflow/core/kernels/cwise_ops.h(199): warning: __device__ annotation on a defaulted function("scalar_right") is ignored

[4,435 / 4,442] 3 actions running
Target //tensorflow_serving/model_servers:tensorflow_model_server failed to build
INFO: Elapsed time: 813.215s, Critical Path: 552.69s
FAILED: Build did NOT complete successfully

The problem has been solved.
The build failed because the repo had moved on to the latest version, 1.7.

@kondrashov-do

kondrashov-do commented May 1, 2018

@jorgemf Thank you for the script.
I managed to build it on AWS EC2, p2.xlarge, using the DL AMI with CUDA 9.
The only difference: I checked out the specific 1.6.0 tag of the tf/serving repository:

RUN git clone --recurse-submodules https://github.com/tensorflow/serving
WORKDIR /tensorflow/serving
RUN git checkout tags/1.6.0

Hope it will help somebody!
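
A slightly shorter equivalent, in case it's useful (git clone -b accepts tags as well as branches):

RUN git clone -b 1.6.0 --recurse-submodules https://github.com/tensorflow/serving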

@jorgemf
Author

jorgemf commented May 3, 2018

Thanks @kondrashov-do, I guess I forgot that!

@Venkrishna

"python-pip" (line #32) depends on "software-properties-common" (line #33), and this is breaking the pip installation... I swapped the two lines and pip installed gracefully. This may need to be changed.
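
For reference, the swap looks like this in the apt-get block (only the two package lines trade places, the rest of the list is unchanged):

    ...
    python-numpy \
    software-properties-common \
    python-pip \
    swig \
    ...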

@discordianfish

Looks like there is no nvcc 9.0 available anymore, so I've updated CUDA to 9.2, but now it fails to link the binary:

INFO: Analysed target //tensorflow_serving/model_servers:tensorflow_model_server (127 packages loaded).
INFO: Found 1 target...
[4,598 / 4,599] Linking .../model_servers/tensorflow_model_server; 7s local

[4,598 / 4,599] Linking .../model_servers/tensorflow_model_server; 8s local
ERROR: /serving/tensorflow_serving/model_servers/BUILD:270:1: Linking of rule '//tensorflow_serving/model_servers:tensorflow_model_server' failed (Exit 1)
bazel-out/k8-opt/bin/external/org_tensorflow/tensorflow/contrib/nccl/libnccl_kernels.lo(nccl_manager.o): In function `std::_Function_handler<void (), tensorflow::NcclManager::LoopKernelLaunches(tensorflow::NcclManager::NcclStream*)::{lambda()#1}>::_M_invoke(std::_Any_data const&)':
nccl_manager.cc:(.text._ZNSt17_Function_handlerIFvvEZN10tensorflow11NcclManager18LoopKernelLaunchesEPNS2_10NcclStreamEEUlvE_E9_M_invokeERKSt9_Any_data+0x141): undefined reference to `ncclGetErrorString'
bazel-out/k8-opt/bin/external/org_tensorflow/tensorflow/contrib/nccl/libnccl_kernels.lo(nccl_manager.o): In function `tensorflow::NcclManager::LoopKernelLaunches(tensorflow::NcclManager::NcclStream*)':
nccl_manager.cc:(.text._ZN10tensorflow11NcclManager18LoopKernelLaunchesEPNS0_10NcclStreamE+0x23d): undefined reference to `ncclAllReduce'
nccl_manager.cc:(.text._ZN10tensorflow11NcclManager18LoopKernelLaunchesEPNS0_10NcclStreamE+0x31f): undefined reference to `ncclReduce'
nccl_manager.cc:(.text._ZN10tensorflow11NcclManager18LoopKernelLaunchesEPNS0_10NcclStreamE+0x383): undefined reference to `ncclBcast'
bazel-out/k8-opt/bin/external/org_tensorflow/tensorflow/contrib/nccl/libnccl_kernels.lo(nccl_manager.o): In function `void std::vector<std::unique_ptr<tensorflow::NcclManager::Communicator, std::default_delete<tensorflow::NcclManager::Communicator> >, std::allocator<std::unique_ptr<tensorflow::NcclManager::Communicator, std::default_delete<tensorflow::NcclManager::Communicator> > > >::_M_realloc_insert<tensorflow::NcclManager::Communicator*>(__gnu_cxx::__normal_iterator<std::unique_ptr<tensorflow::NcclManager::Communicator, std::default_delete<tensorflow::NcclManager::Communicator> >*, std::vector<std::unique_ptr<tensorflow::NcclManager::Communicator, std::default_delete<tensorflow::NcclManager::Communicator> >, std::allocator<std::unique_ptr<tensorflow::NcclManager::Communicator, std::default_delete<tensorflow::NcclManager::Communicator> > > > >, tensorflow::NcclManager::Communicator*&&)':
nccl_manager.cc:(.text._ZNSt6vectorISt10unique_ptrIN10tensorflow11NcclManager12CommunicatorESt14default_deleteIS3_EESaIS6_EE17_M_realloc_insertIJPS3_EEEvN9__gnu_cxx17__normal_iteratorIPS6_S8_EEDpOT_[_ZNSt6vectorISt10unique_ptrIN10tensorflow11NcclManager12CommunicatorESt14default_deleteIS3_EESaIS6_EE17_M_realloc_insertIJPS3_EEEvN9__gnu_cxx17__normal_iteratorIPS6_S8_EEDpOT_]+0x159): undefined reference to `ncclCommDestroy'
bazel-out/k8-opt/bin/external/org_tensorflow/tensorflow/contrib/nccl/libnccl_kernels.lo(nccl_manager.o): In function `tensorflow::NcclManager::GetCommunicator(tensorflow::NcclManager::Collective*)':
nccl_manager.cc:(.text._ZN10tensorflow11NcclManager15GetCommunicatorEPNS0_10CollectiveE+0x565): undefined reference to `ncclGetUniqueId'
nccl_manager.cc:(.text._ZN10tensorflow11NcclManager15GetCommunicatorEPNS0_10CollectiveE+0x574): undefined reference to `ncclGroupStart'
nccl_manager.cc:(.text._ZN10tensorflow11NcclManager15GetCommunicatorEPNS0_10CollectiveE+0x66b): undefined reference to `ncclCommInitRank'
nccl_manager.cc:(.text._ZN10tensorflow11NcclManager15GetCommunicatorEPNS0_10CollectiveE+0x6ed): undefined reference to `ncclGetErrorString'
nccl_manager.cc:(.text._ZN10tensorflow11NcclManager15GetCommunicatorEPNS0_10CollectiveE+0xc73): undefined reference to `ncclCommInitAll'
nccl_manager.cc:(.text._ZN10tensorflow11NcclManager15GetCommunicatorEPNS0_10CollectiveE+0xe46): undefined reference to `ncclGroupEnd'
collect2: error: ld returned 1 exit status
Target //tensorflow_serving/model_servers:tensorflow_model_server failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 48.712s, Critical Path: 19.38s
FAILED: Build did NOT complete successfully

My Dockerfile is pretty much the same as https://raw.githubusercontent.com/tensorflow/serving/c8cc43b/tensorflow_serving/tools/docker/Dockerfile.devel-gpu but based on CUDA 9.2 and using libpng-dev instead of libpng12-dev, because my CUDA 9.2 image is based on Ubuntu 18.04.

I'd test with 9.1 and other Ubuntu and/or serving versions, but the build is awfully slow (>2h on my rather beefy laptop).
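
Those undefined ncclAllReduce/ncclGetErrorString references look like the NCCL library simply isn't on the link line. Two hedged guesses rather than verified fixes: install NVIDIA's NCCL packages in the image (package names assume NVIDIA's apt repo is configured in the base image), and point TensorFlow's configure at them (TF_NCCL_VERSION and NCCL_INSTALL_PATH are read by TF's configure in the 1.8+ era; whether serving's build honors them depends on the version):

RUN apt-get update && apt-get install -y libnccl2 libnccl-dev && \
    rm -rf /var/lib/apt/lists/*
ENV TF_NCCL_VERSION=2 \
    NCCL_INSTALL_PATH=/usr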

@jorgemf
Author

jorgemf commented Jun 12, 2018

@discordianfish Any change you make in the Dockerfile can make it not work. For example, I think the latest Ubuntu version doesn't have CUDA 9.0, which is required to compile TF Serving (because of some driver issues, I think). So if you changed that, it won't work unless you add the necessary workarounds.

@ps-account

ps-account commented Jul 19, 2018

For some reason this works better than the official Dockerfile in my case. The "latest-devel-gpu" TF Serving image pulled from the Docker repo doesn't recognize my CUDA device, whereas this one does, even when building the official latest-devel-gpu from GitHub.
