mattiasarro/tensorflow_1_6_high_sierra_gpu.md

## tensorflow_1_6_high_sierra_gpu.md

      
Tensorflow 1.6 on macOS High Sierra 10.13.3 with GPU Acceleration (without disabling SIP)

This gist (based on a blog post at byai.io) documents how to set up TensorFlow 1.6 with (e)GPU support without the need to disable SIP. Following the original gist got me a system in which training TF on an eGPU was successful, but there were various visual glitches due to the newer, less stable version of the driver.
As pointed out by ronchigram, many people are having issues with newer NVIDIA drivers, so it's worth using the nvidia-update script by Benjamin Dobell, which installs the latest stable NVIDIA web driver and, if necessary, patches it to run on your system. We also don't need to disable SIP when using nvidia-update.
I have also uploaded the wheel files I built to Google Drive. You can skip installing Apple Command-Line-Tools and Bazel and try installing a pre-built wheel - if that fails, try compiling as described below.
Requirements


NVIDIA Web-Drivers (387.10.10.10.25.106, the latest stable version at the time of writing)
CUDA-Drivers
CUDA 9.1 Toolkit
cuDNN 7
Python 2.7 or 3.6
Apple Command-Line-Tools 8.3.2
bazel 0.8.1

Note that the install procedure is very sensitive to the specific versions of these packages.
NVIDIA Web Drivers

Install the latest stable NVIDIA Web Drivers.
$ bash <(curl -s https://raw.githubusercontent.com/Benjamin-Dobell/nvidia-update/master/nvidia-update.sh)
eGPU Support (optional)

Check your macOS build version:
$ system_profiler SPSoftwareDataType
Download and install the NVIDIAEGPUSupport package that matches your build version from egpu.io. The table there lists newer NVIDIA driver versions, but an older driver should also work: in my case, nvidia-egpu-v7.zip, which is meant for driver 387.10.10.10.25.158, worked with driver 387.10.10.10.25.106.
Further Dependencies

Homebrew & Coreutils

$ /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
$ brew install coreutils

bazel 0.8.1


Do not install bazel with Homebrew

$ mkdir ~/temp && cd ~/temp
$ curl -L https://github.com/bazelbuild/bazel/releases/download/0.8.1/bazel-0.8.1-installer-darwin-x86_64.sh -o bazel-0.8.1-installer-darwin-x86_64.sh
$ chmod +x bazel-0.8.1-installer-darwin-x86_64.sh 
$ ./bazel-0.8.1-installer-darwin-x86_64.sh

Downgrading Command-Line-Tools to Version 8.3.2

Because we need an older version of clang, we unfortunately have to downgrade the Apple Command-Line-Tools.
You can download the older version 8.3.2 directly from the Apple Developer Portal. Move the current tools out of the way, install the downloaded 8.3.2 package, and then point xcode-select at the new installation.
$ sudo mv /Library/Developer/CommandLineTools /Library/Developer/CommandLineTools_backup
    # install the downloaded Command-Line-Tools 8.3.2 package at this point
$ sudo xcode-select --switch /Library/Developer/CommandLineTools

Install CUDA Toolkit 9.1 with Cuda Drivers

Download and install CUDA-9.1, then add the CUDA paths to your shell profile.
$ vim ~/.bash_profile
    # add to .bash_profile
    export CUDA_HOME=/usr/local/cuda
    export PATH=/usr/local/cuda/bin:/Developer/NVIDIA/CUDA-9.1/bin${PATH:+:${PATH}}
    export DYLD_LIBRARY_PATH=/usr/local/cuda/lib:/Developer/NVIDIA/CUDA-9.1/lib:/usr/local/cuda/extras/CUPTI/lib
    export LD_LIBRARY_PATH=$DYLD_LIBRARY_PATH
$ source ~/.bash_profile
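A typo in one of these exports tends to show up only much later, as a failed build or a library that cannot be loaded. The following is a minimal sketch, not part of the original procedure, that lists the entries of a colon-separated path variable that do not exist on disk. The helper name `missing_dirs` is hypothetical.

```python
import os


def missing_dirs(path_value):
    """Return the entries of a colon-separated path list that are not existing directories."""
    entries = [entry for entry in path_value.split(":") if entry]
    return [entry for entry in entries if not os.path.isdir(entry)]


if __name__ == "__main__":
    # Report missing directories for the variables exported in .bash_profile.
    for name in ("PATH", "DYLD_LIBRARY_PATH", "LD_LIBRARY_PATH"):
        print(name, missing_dirs(os.environ.get(name, "")))
```

An empty list for each variable means every exported directory exists. A missing `/usr/local/cuda/...` entry usually means the CUDA toolkit is not installed yet.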

Let's check whether the driver is loaded.
$ kextstat | grep -i cuda
164    0 0xffffff7f83c65000 0x2000     0x2000     com.nvidia.CUDA (1.1.0) 4329B052-6C8A-3900-8E83-744487AEDEF1 <4 1>
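If you want to script this check instead of reading the `kextstat` output by eye, here is a small sketch. It assumes the kernel extension is reported under the identifier `com.nvidia.CUDA`, as in the output above. The function names are my own.

```python
import subprocess


def cuda_kext_loaded(kextstat_output):
    """Return True if any line of kextstat output mentions com.nvidia.CUDA."""
    return any("com.nvidia.CUDA" in line for line in kextstat_output.splitlines())


def check_cuda_kext():
    """Run the real kextstat binary (macOS only) and look for the CUDA kext."""
    output = subprocess.check_output(["kextstat"], universal_newlines=True)
    return cuda_kext_loaded(output)


# The sample line from this gist, used here instead of calling kextstat.
SAMPLE = ("164    0 0xffffff7f83c65000 0x2000     0x2000     "
          "com.nvidia.CUDA (1.1.0) 4329B052-6C8A-3900-8E83-744487AEDEF1 <4 1>")
print(cuda_kext_loaded(SAMPLE))
```

On a Mac, calling `check_cuda_kext()` performs the same test against the live system.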

Compile Samples

Compile one of the CUDA samples to check that the GPU is correctly recognized and supported.
$ cd /Developer/NVIDIA/CUDA-9.1/samples
$ make -C 1_Utilities/deviceQuery
$ /Developer/NVIDIA/CUDA-9.1/samples/bin/x86_64/darwin/release/deviceQuery

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX 1060 6GB"
  CUDA Driver Version / Runtime Version          9.1 / 9.1
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 6144 MBytes (6442254336 bytes)
  (10) Multiprocessors, (128) CUDA Cores/MP:     1280 CUDA Cores
  GPU Max Clock rate:                            1709 MHz (1.71 GHz)
  Memory Clock rate:                             4004 Mhz
  Memory Bus Width:                              192-bit
  L2 Cache Size:                                 1572864 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 195 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
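The value you need later for `./configure` is the compute capability (6.1 for the GTX 1060 above). As a rough sketch, assuming deviceQuery prints the "CUDA Capability Major/Minor version number" line in the format shown above, it can be pulled out of the output like this. The function name is hypothetical.

```python
import re


def compute_capabilities(device_query_output):
    """Return the compute capability of each device found in deviceQuery output."""
    pattern = re.compile(r"CUDA Capability Major/Minor version number:\s*(\d+)\.(\d+)")
    return ["{}.{}".format(major, minor)
            for major, minor in pattern.findall(device_query_output)]


# A few lines copied from the deviceQuery output in this gist.
SAMPLE = """Device 0: "GeForce GTX 1060 6GB"
  CUDA Driver Version / Runtime Version          9.1 / 9.1
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 6144 MBytes (6442254336 bytes)
"""
print(",".join(compute_capabilities(SAMPLE)))
```

With several GPUs, the joined string is already in the comma-separated form that `./configure` asks for.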

NVIDIA cuDNN - Deep Learning Primitives

If not already done, register at https://developer.nvidia.com/cudnn
Download cuDNN 7.0.5¹
Change into your download directory and follow the post installation steps.
$ tar -xzvf cudnn-9.1-osx-x64-v7-ga.tgz
$ sudo cp cuda/include/cudnn.h /usr/local/cuda/include
$ sudo cp cuda/lib/libcudnn* /usr/local/cuda/lib
$ sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib/libcudnn*
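To confirm that the copied header really is cuDNN 7, you can read the version macros out of `cudnn.h`. This is a sketch under the assumption that the header defines `CUDNN_MAJOR`, `CUDNN_MINOR` and `CUDNN_PATCHLEVEL`, which is the case for the cuDNN 7 releases I have seen. The function name is my own.

```python
import re


def cudnn_version(header_text):
    """Return (major, minor, patchlevel) parsed from cudnn.h text, or None if a macro is missing."""
    values = []
    for name in ("MAJOR", "MINOR", "PATCHLEVEL"):
        match = re.search(r"^\s*#define\s+CUDNN_%s\s+(\d+)" % name,
                          header_text, re.MULTILINE)
        if match is None:
            return None
        values.append(int(match.group(1)))
    return tuple(values)


# Stand-in for the relevant lines of /usr/local/cuda/include/cudnn.h.
SAMPLE_HEADER = """#define CUDNN_MAJOR 7
#define CUDNN_MINOR 0
#define CUDNN_PATCHLEVEL 5
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)
"""
print(cudnn_version(SAMPLE_HEADER))
```

On a real system, pass in the contents of `/usr/local/cuda/include/cudnn.h` instead of the sample text.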

Python 2.7 / 3.6

I recommend using pyenv for installing Python. On top of that, I will use pyenv-virtualenv to create a virtual environment for the custom build.
$ brew update
$ brew install pyenv pyenv-virtualenv

    # add to bottom of `.bash_profile`
    if command -v pyenv 1>/dev/null 2>&1; then
      eval "$(pyenv init -)"
      eval "$(pyenv virtualenv-init -)"
    fi

$ source ~/.bash_profile
$ pyenv install 3.6.0
$ pyenv install 2.7.0

# create virtualenv
$ pyenv virtualenv 3.6.0 p3gpu
$ pyenv virtualenv 2.7.0 p2gpu

Clone TensorFlow from Repository

$ git clone https://github.com/tensorflow/tensorflow
$ cd tensorflow
$ git checkout v1.6.0

Apply Patch

Unfortunately, with the repo untouched, it will fail to build. Grab the patch for tensorflow 1.6 and apply it.
$ git apply tensorflow_v1.6.0osx.patch
Python dependencies

We first compile for Python 3.6. To also install for Python 2.7, run all the following steps, substituting p3gpu with p2gpu.
$ pyenv activate p3gpu
$ pip install six numpy wheel

Install pre-built packages

Compiling TensorFlow can be time-consuming, so it's worth trying a pre-built package first. I have uploaded the wheel files I built to Google Drive, so download the relevant file, activate the relevant Python environment, and install the wheel:
$ pip install /path/to/tensorflow-1.6.0-cp36-cp36m-macosx_10_13_x86_64.whl # Python 3.6
$ pip install /path/to/tensorflow-1.6.0-cp27-cp27m-macosx_10_13_x86_64.whl # Python 2.7
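The tags in the file name say which interpreter a wheel was built for: `cp36` is CPython 3.6 and `cp27` is CPython 2.7. The sketch below splits a wheel file name into its tags and compares the Python tag with the running interpreter. It only looks at the Python tag, whereas pip also checks the ABI and platform tags, so treat it as a quick sanity check. Both function names are hypothetical.

```python
import sys


def wheel_tags(filename):
    """Split a wheel filename into its python, ABI and platform tags."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel file: %r" % filename)
    parts = filename[:-len(".whl")].split("-")
    if len(parts) < 5:
        raise ValueError("unexpected wheel filename: %r" % filename)
    python_tag, abi_tag, platform_tag = parts[-3:]
    return {"python": python_tag, "abi": abi_tag, "platform": platform_tag}


def matches_interpreter(filename):
    """True if the wheel's python tag equals the CPython tag of the running interpreter."""
    running = "cp%d%d" % sys.version_info[:2]
    return wheel_tags(filename)["python"] == running


print(wheel_tags("tensorflow-1.6.0-cp36-cp36m-macosx_10_13_x86_64.whl"))
```

Running `matches_interpreter(...)` inside the activated environment shows whether you picked the file for the right Python version.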

Prepare Build

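Apart from enabling CUDA support, setting the CUDA SDK version (9.1) and the CUDA compute capability (6.1), and declining Google Cloud Platform, Hadoop and Amazon S3 support, I left the settings at their defaults.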
$ ./configure
You have bazel 0.8.1 installed.
Please specify the location of python. [Default is /Users/user/.pyenv/versions/p3gpu/bin/python]: 


Found possible Python library paths:
  /Users/user/.pyenv/versions/p3gpu/lib/python3.6/site-packages
Please input the desired Python library path to use.  Default is [/Users/user/.pyenv/versions/p3gpu/lib/python3.6/site-packages]

Do you wish to build TensorFlow with Google Cloud Platform support? [Y/n]: n
No Google Cloud Platform support will be enabled for TensorFlow.

Do you wish to build TensorFlow with Hadoop File System support? [Y/n]: n
No Hadoop File System support will be enabled for TensorFlow.

Do you wish to build TensorFlow with Amazon S3 File System support? [Y/n]: n
No Amazon S3 File System support will be enabled for TensorFlow.

Do you wish to build TensorFlow with Apache Kafka Platform support? [y/N]: n
No Apache Kafka Platform support will be enabled for TensorFlow.

Do you wish to build TensorFlow with XLA JIT support? [y/N]: n
No XLA JIT support will be enabled for TensorFlow.

Do you wish to build TensorFlow with GDR support? [y/N]: n
No GDR support will be enabled for TensorFlow.

Do you wish to build TensorFlow with VERBS support? [y/N]: n
No VERBS support will be enabled for TensorFlow.

Do you wish to build TensorFlow with OpenCL SYCL support? [y/N]: n
No OpenCL SYCL support will be enabled for TensorFlow.

Do you wish to build TensorFlow with CUDA support? [y/N]: y
CUDA support will be enabled for TensorFlow.

Please specify the CUDA SDK version you want to use, e.g. 7.0. [Leave empty to default to CUDA 9.0]: 9.1


Please specify the location where CUDA 9.1 toolkit is installed. Refer to README.md for more details. [Default is /usr/local/cuda]: 


Please specify the cuDNN version you want to use. [Leave empty to default to cuDNN 7.0]: 


Please specify the location where cuDNN 7 library is installed. Refer to README.md for more details. [Default is /usr/local/cuda]:


Please specify a list of comma-separated Cuda compute capabilities you want to build with.
You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.
Please note that each additional compute capability significantly increases your build time and binary size. [Default is: 3.5,5.2]6.1


Do you want to use clang as CUDA compiler? [y/N]: n
nvcc will be used as CUDA compiler.

Please specify which gcc should be used by nvcc as the host compiler. [Default is /usr/bin/gcc]: 


Do you wish to build TensorFlow with MPI support? [y/N]: 
No MPI support will be enabled for TensorFlow.

Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native]: 


Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]: 
Not configuring the WORKSPACE for Android builds.

Preconfigured Bazel build configs. You can use any of the below by adding "--config=<>" to your build command. See tools/bazel.rc for more details.
	--config=mkl         	# Build with MKL support.
	--config=monolithic  	# Config for mostly static monolithic build.
Configuration finished

Build Process

$ bazel build --config=cuda --config=opt --action_env PATH --action_env LD_LIBRARY_PATH --action_env DYLD_LIBRARY_PATH //tensorflow/tools/pip_package:build_pip_package
Create wheel file and install it

$ bazel-bin/tensorflow/tools/pip_package/build_pip_package ~/
$ pip install ~/tensorflow-1.6.0-cp36-cp36m-macosx_10_13_x86_64.whl

It's useful to leave the .whl file lying around in case you want to install it for another environment.
Test Installation

Give it a shot and open the Python interpreter:
>>> import tensorflow as tf
>>> tf.__version__
'1.6.0'

>>> tf.Session()
...
tensorflow/core/common_runtime/gpu/gpu_device.cc:1331] Found device 0 with properties: 
name: GeForce GTX 1060 6GB major: 6 minor: 1 memoryClockRate(GHz): 1.7085
pciBusID: 0000:c3:00.0
totalMemory: 6.00GiB freeMemory: 5.91GiB 
...
tensorflow/core/common_runtime/gpu/gpu_device.cc:1021] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 5699 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:c3:00.0, compute capability: 6.1)
...
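Instead of scanning the session log, you can also ask TensorFlow which devices it sees. This is a sketch that uses `device_lib.list_local_devices()` from `tensorflow.python.client`. Note that this call initializes the devices, so it may reserve GPU memory for the lifetime of the process. The wrapper name `list_gpu_devices` is my own, and it returns None when TensorFlow cannot be imported.

```python
def list_gpu_devices():
    """Return the names of the GPU devices TensorFlow sees, or None if TensorFlow is unavailable."""
    try:
        from tensorflow.python.client import device_lib
    except ImportError:
        return None
    return [device.name
            for device in device_lib.list_local_devices()
            if device.device_type == "GPU"]


if __name__ == "__main__":
    print(list_gpu_devices())
```

An empty list means TensorFlow imported fine but found no usable GPU, which usually points at the driver or the CUDA libraries and not at the wheel itself.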

Test GPU Acceleration

$ pip install keras
$ git clone https://github.com/fchollet/keras.git
$ cd keras/examples
$ python imdb_cnn.py

I use iStat Menus to see CPU/GPU utilization. It doesn't report eGPU memory correctly, but the GPU processor utilization it shows is pretty much real-time.
Footnotes


¹ Detailed installation instructions are available at: cuDNN-Installation-Guide.pdf


## tensorflow_v1.6.0osx.patch
diff --git a/tensorflow/core/kernels/concat_lib_gpu_impl.cu.cc b/tensorflow/core/kernels/concat_lib_gpu_impl.cu.cc
index 0f7adaf24a..8d89c66f3f 100644
--- a/tensorflow/core/kernels/concat_lib_gpu_impl.cu.cc
+++ b/tensorflow/core/kernels/concat_lib_gpu_impl.cu.cc
@@ -69,7 +69,7 @@ __global__ void concat_variable_kernel(
   IntType num_inputs = input_ptr_data.size;

   // verbose declaration needed due to template
-  extern __shared__ __align__(sizeof(T)) unsigned char smem[];
+  extern __shared__ unsigned char smem[];
   IntType* smem_col_scan = reinterpret_cast<IntType*>(smem);

   if (useSmem) {
diff --git a/tensorflow/core/kernels/depthwise_conv_op_gpu.cu.cc b/tensorflow/core/kernels/depthwise_conv_op_gpu.cu.cc
index 126b64f73d..e5f3fd4e9f 100644
--- a/tensorflow/core/kernels/depthwise_conv_op_gpu.cu.cc
+++ b/tensorflow/core/kernels/depthwise_conv_op_gpu.cu.cc
@@ -164,7 +164,7 @@ __global__ __launch_bounds__(1024, 2) void DepthwiseConv2dGPUKernelNHWCSmall(
     const DepthwiseArgs args, const T* input, const T* filter, T* output) {
   assert(CanLaunchDepthwiseConv2dGPUSmall(args));
   // Holds block plus halo and filter data for blockDim.x depths.
-  extern __shared__ __align__(sizeof(T)) unsigned char shared_memory[];
+  extern __shared__ unsigned char shared_memory[];
   T* const shared_data = reinterpret_cast<T*>(shared_memory);

   const int batches = args.batch;
@@ -434,7 +434,7 @@ __global__ __launch_bounds__(1024, 2) void DepthwiseConv2dGPUKernelNCHWSmall(
     const DepthwiseArgs args, const T* input, const T* filter, T* output) {
   assert(CanLaunchDepthwiseConv2dGPUSmall(args));
   // Holds block plus halo and filter data for blockDim.z depths.
-  extern __shared__ __align__(sizeof(T)) unsigned char shared_memory[];
+  extern __shared__ unsigned char shared_memory[];
   T* const shared_data = reinterpret_cast<T*>(shared_memory);

   const int batches = args.batch;
@@ -1054,7 +1054,7 @@ __launch_bounds__(1024, 2) void DepthwiseConv2dBackpropFilterGPUKernelNHWCSmall(
     const DepthwiseArgs args, const T* output, const T* input, T* filter) {
   assert(CanLaunchDepthwiseConv2dBackpropFilterGPUSmall(args, blockDim.z));
   // Holds block plus halo and filter data for blockDim.x depths.
-  extern __shared__ __align__(sizeof(T)) unsigned char shared_memory[];
+  extern __shared__ unsigned char shared_memory[];
   T* const shared_data = reinterpret_cast<T*>(shared_memory);

   const int batches = args.batch;
@@ -1313,7 +1313,7 @@ __launch_bounds__(1024, 2) void DepthwiseConv2dBackpropFilterGPUKernelNCHWSmall(
     const DepthwiseArgs args, const T* output, const T* input, T* filter) {
   assert(CanLaunchDepthwiseConv2dBackpropFilterGPUSmall(args, blockDim.x));
   // Holds block plus halo and filter data for blockDim.z depths.
-  extern __shared__ __align__(sizeof(T)) unsigned char shared_memory[];
+  extern __shared__ unsigned char shared_memory[];
   T* const shared_data = reinterpret_cast<T*>(shared_memory);

   const int batches = args.batch;
diff --git a/tensorflow/core/kernels/split_lib_gpu.cu.cc b/tensorflow/core/kernels/split_lib_gpu.cu.cc
index 9f234fc093..5115a96d17 100644
--- a/tensorflow/core/kernels/split_lib_gpu.cu.cc
+++ b/tensorflow/core/kernels/split_lib_gpu.cu.cc
@@ -119,7 +119,7 @@ __global__ void split_v_kernel(const T* input_ptr,
   int num_outputs = output_ptr_data.size;

   // verbose declaration needed due to template
-  extern __shared__ __align__(sizeof(T)) unsigned char smem[];
+  extern __shared__ unsigned char smem[];
   IntType* smem_col_scan = reinterpret_cast<IntType*>(smem);

   if (useSmem) {
diff --git a/tensorflow/workspace.bzl b/tensorflow/workspace.bzl
index 70cb65f3e7..f662831447 100644
--- a/tensorflow/workspace.bzl
+++ b/tensorflow/workspace.bzl
@@ -120,11 +120,11 @@ def tf_workspace(path_prefix="", tf_repo_name=""):
   tf_http_archive(
       name = "eigen_archive",
       urls = [
-          "https://mirror.bazel.build/bitbucket.org/eigen/eigen/get/14e1418fcf12.tar.gz",
-          "https://bitbucket.org/eigen/eigen/get/14e1418fcf12.tar.gz",
+          "https://mirror.bazel.build/bitbucket.org/dtrebbien/eigen/get/374842a18727.tar.gz",
+          "https://bitbucket.org/dtrebbien/eigen/get/374842a18727.tar.gz",
       ],
-      sha256 = "2b526c6888639025323fd4f2600533c0f982d304ea48e4f1663e8066bd9f6368",
-      strip_prefix = "eigen-eigen-14e1418fcf12",
+      sha256 = "fa26e9b9ff3a2692b092d154685ec88d6cb84d4e1e895006541aff8603f15c16",
+      strip_prefix = "dtrebbien-eigen-374842a18727",
       build_file = str(Label("//third_party:eigen.BUILD")),
   )

@@ -353,11 +353,11 @@ def tf_workspace(path_prefix="", tf_repo_name=""):
   tf_http_archive(
       name = "protobuf_archive",
       urls = [
-          "https://mirror.bazel.build/github.com/google/protobuf/archive/396336eb961b75f03b25824fe86cf6490fb75e3a.tar.gz",
-          "https://github.com/google/protobuf/archive/396336eb961b75f03b25824fe86cf6490fb75e3a.tar.gz",
+          "https://mirror.bazel.build/github.com/dtrebbien/protobuf/archive/50f552646ba1de79e07562b41f3999fe036b4fd0.tar.gz",
+          "https://github.com/dtrebbien/protobuf/archive/50f552646ba1de79e07562b41f3999fe036b4fd0.tar.gz",
       ],
-      sha256 = "846d907acf472ae233ec0882ef3a2d24edbbe834b80c305e867ac65a1f2c59e3",
-      strip_prefix = "protobuf-396336eb961b75f03b25824fe86cf6490fb75e3a",
+      sha256 = "eb16b33431b91fe8cee479575cee8de202f3626aaf00d9bf1783c6e62b4ffbc7",
+      strip_prefix = "protobuf-50f552646ba1de79e07562b41f3999fe036b4fd0",
   )

   # We need to import the protobuf library under the names com_google_protobuf
diff --git a/third_party/gpus/cuda/BUILD.tpl b/third_party/gpus/cuda/BUILD.tpl
index 2a37c65bc7..61b203e005 100644
--- a/third_party/gpus/cuda/BUILD.tpl
+++ b/third_party/gpus/cuda/BUILD.tpl
@@ -110,7 +110,7 @@ cc_library(
         ".",
         "cuda/include",
     ],
-    linkopts = ["-lgomp"],
+    # linkopts = ["-lgomp"],
     linkstatic = 1,
     visibility = ["//visibility:public"],
 )