pavelmalik/tensorflow_1_7_high_sierra_gpu.md

## tensorflow_1_7_high_sierra_gpu.md

      
    Tensorflow 1.7 with CUDA on macOS High Sierra 10.13.3 and default python 2.7

Largely based on the Tensorflow 1.6 gist, this should hopefully simplify things a bit. Mixing homebrew python2/python3 with pip ends up being a mess, so here's an approach that uses the built-in python 2.7.
Requirements


NVIDIA Web-Drivers 387.10.10.10.25.156 for 10.13.3
CUDA-Drivers 387.178
CUDA 9.1 Toolkit
cuDNN 7.0.5 (latest release for mac os)
Python 2.7
XCode 8.3.2
bazel 0.10.0
Tensorflow 1.7

NVIDIA Graphics driver

Download and install from http://www.nvidia.com/download/driverResults.aspx/130460/en-us
NVIDIA Cuda driver

Download and install from http://www.nvidia.com/object/macosx-cuda-387.178-driver.html
Downgrade to XCode 8.3.2

I was able to compile all of it on XCode9, but tensorflow promptly segfaults if you actually try to do anything on the gpu.
You may need a developer account to grab the old version https://developer.apple.com/download/more/
If you have newer Xcode installed, rename the XCode.app to something like Xcode9.app
Unpack XCode 8.3.2 and switch the tool chain over to it:
sudo xcode-select -s /Applications/Xcode.app

Install Bazel 0.10

Download the binary here
chmod 755 bazel-0.10.0-installer-darwin-x86_64.sh
./bazel-0.10.0-installer-darwin-x86_64.sh

Install CUDA Toolkit 9.1

Download CUDA-9.1
It should be something along the lines of cuda_9.1.128_mac.dmg
Set up your env paths

Edit ~/.bash_profile and add the following:
export CUDA_HOME=/usr/local/cuda
export DYLD_LIBRARY_PATH=/usr/local/cuda/lib:/usr/local/cuda/extras/CUPTI/lib 
export LD_LIBRARY_PATH=$DYLD_LIBRARY_PATH
export PATH=$DYLD_LIBRARY_PATH:$PATH:/Developer/NVIDIA/CUDA-9.1/bin


You may have to run source ~/.bash_profile for the changes to take effect. Then verify that the LD paths are set:
source ~/.bash_profile
echo $LD_LIBRARY_PATH

pmalik@MacPro:~$ echo $LD_LIBRARY_PATH
/Users/pmalik/lib:/usr/local/opt/libomp/lib:/usr/local/cuda/lib:/usr/local/cuda/extras/CUPTI/lib
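The check above can also be scripted. The following is a minimal sketch, not part of the original setup: the helper name `missing_cuda_dirs` and the `REQUIRED_DIRS` list are made up for illustration and simply mirror the two directories exported in ~/.bash_profile.

```python
# Sketch: confirm the CUDA library directories from ~/.bash_profile are present
# in a colon-separated path value. REQUIRED_DIRS is an assumption that mirrors
# the DYLD_LIBRARY_PATH export above.
REQUIRED_DIRS = ["/usr/local/cuda/lib", "/usr/local/cuda/extras/CUPTI/lib"]


def missing_cuda_dirs(path_value, required=REQUIRED_DIRS):
    """Return the required directories that are absent from path_value."""
    entries = [p.rstrip("/") for p in path_value.split(":") if p]
    return [d for d in required if d not in entries]


# The value echoed on the author's machine:
sample = ("/Users/pmalik/lib:/usr/local/opt/libomp/lib:"
          "/usr/local/cuda/lib:/usr/local/cuda/extras/CUPTI/lib")
print(missing_cuda_dirs(sample))      # nothing missing
print(missing_cuda_dirs("/usr/lib"))  # both CUDA directories missing
```

In practice you would pass `os.environ.get("LD_LIBRARY_PATH", "")` instead of the sample string.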


Compile Samples

We want to compile a CUDA sample to check that the GPU is correctly recognized and supported.
cd /Developer/NVIDIA/CUDA-9.1/samples
chown -R YOURUSERNAMEHERE *
make -C 1_Utilities/deviceQuery
/Developer/NVIDIA/CUDA-9.1/samples/bin/x86_64/darwin/release/deviceQuery

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX 1060 6GB"
  CUDA Driver Version / Runtime Version          9.1 / 9.1
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 6144 MBytes (6442254336 bytes)
  (10) Multiprocessors, (128) CUDA Cores/MP:     1280 CUDA Cores
  GPU Max Clock rate:                            1709 MHz (1.71 GHz)
  Memory Clock rate:                             4004 Mhz
  Memory Bus Width:                              192-bit
  L2 Cache Size:                                 1572864 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 195 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
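The compute capability reported here is the value that ./configure asks for later. As a rough sketch (the function name `parse_device_query` is hypothetical, and the regular expressions only assume the output layout shown above), it can be pulled out of the deviceQuery text like this:

```python
import re


def parse_device_query(text):
    """Extract (device name, compute capability) pairs from deviceQuery output."""
    names = re.findall(r'^Device \d+: "([^"]+)"', text, flags=re.M)
    caps = re.findall(
        r"CUDA Capability Major/Minor version number:\s*(\d+\.\d+)", text)
    return list(zip(names, caps))


# First lines of the output shown above.
sample = '''Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX 1060 6GB"
  CUDA Driver Version / Runtime Version          9.1 / 9.1
  CUDA Capability Major/Minor version number:    6.1
'''
devices = parse_device_query(sample)
print(devices)
# Comma-separated list in the form ./configure expects.
print(",".join(sorted(set(cap for _, cap in devices))))
```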

NVIDIA cuDNN - Deep Learning Primitives

If not already done, register at https://developer.nvidia.com/cudnn
Download cuDNN 7.0.5
Change into your download directory and follow the post installation steps.
tar -xzvf cudnn-9.1-osx-x64-v7-ga.tgz
sudo cp cuda/include/cudnn.h /usr/local/cuda/include
sudo cp cuda/lib/libcudnn* /usr/local/cuda/lib
sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib/libcudnn*
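To double-check which cuDNN version ended up in /usr/local/cuda/include, you can read the version macros from cudnn.h. This is a small sketch under the assumption that the header defines `CUDNN_MAJOR`, `CUDNN_MINOR` and `CUDNN_PATCHLEVEL` as plain integer `#define`s; the helper name `cudnn_version` and the sample excerpt are illustrative only.

```python
import re


def cudnn_version(header_text):
    """Return (major, minor, patch) from the CUDNN_* #defines in cudnn.h."""
    parts = []
    for key in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(r"^#define\s+%s\s+(\d+)" % key, header_text,
                          flags=re.M)
        if match is None:
            raise ValueError("%s not found in header" % key)
        parts.append(int(match.group(1)))
    return tuple(parts)


# Illustrative excerpt; values chosen to match cuDNN 7.0.5.
sample_header = """
#define CUDNN_MAJOR 7
#define CUDNN_MINOR 0
#define CUDNN_PATCHLEVEL 5
"""
print(cudnn_version(sample_header))
```

On a real install, pass `open("/usr/local/cuda/include/cudnn.h").read()` instead of the sample excerpt.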

Install pip for python 2.7

Download get-pip and run it in python. More info here
python get-pip.py

If I remember correctly, pip will automatically install the tensorflow dependencies (wheel, six etc)
Clone TensorFlow from Repository

git clone https://github.com/tensorflow/tensorflow
cd tensorflow
git checkout v1.7.0

Apply Patch

Apply the following patch to fix a couple build issues:
git apply xtensorflow17macos.patch
Configure Build

Except for enabling CUDA support and setting the CUDA SDK version and the CUDA compute capabilities, I left the other settings untouched.
./configure
You have bazel 0.10.0 installed.
Please specify the location of python. [Default is /usr/bin/python]: 


Found possible Python library paths:
  /Library/Python/2.7/site-packages
Please input the desired Python library path to use.  Default is [/Library/Python/2.7/site-packages]

Do you wish to build TensorFlow with Google Cloud Platform support? [Y/n]: n
No Google Cloud Platform support will be enabled for TensorFlow.

Do you wish to build TensorFlow with Hadoop File System support? [Y/n]: n
No Hadoop File System support will be enabled for TensorFlow.

Do you wish to build TensorFlow with Amazon S3 File System support? [Y/n]: n
No Amazon S3 File System support will be enabled for TensorFlow.

Do you wish to build TensorFlow with Apache Kafka Platform support? [y/N]: n
No Apache Kafka Platform support will be enabled for TensorFlow.

Do you wish to build TensorFlow with XLA JIT support? [y/N]: n
No XLA JIT support will be enabled for TensorFlow.

Do you wish to build TensorFlow with GDR support? [y/N]: n
No GDR support will be enabled for TensorFlow.

Do you wish to build TensorFlow with VERBS support? [y/N]: n
No VERBS support will be enabled for TensorFlow.

Do you wish to build TensorFlow with OpenCL SYCL support? [y/N]: n
No OpenCL SYCL support will be enabled for TensorFlow.

Do you wish to build TensorFlow with CUDA support? [y/N]: y
CUDA support will be enabled for TensorFlow.

Please specify the CUDA SDK version you want to use, e.g. 7.0. [Leave empty to default to CUDA 9.0]: 9.1


Please specify the location where CUDA 9.1 toolkit is installed. Refer to README.md for more details. [Default is /usr/local/cuda]: 


Please specify the cuDNN version you want to use. [Leave empty to default to cuDNN 7.0]: 


Please specify the location where cuDNN 7 library is installed. Refer to README.md for more details. [Default is /usr/local/cuda]:


Please specify a list of comma-separated Cuda compute capabilities you want to build with.
You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.
Please note that each additional compute capability significantly increases your build time and binary size. [Default is: 3.5,5.2]6.1


Do you want to use clang as CUDA compiler? [y/N]: n
nvcc will be used as CUDA compiler.

Please specify which gcc should be used by nvcc as the host compiler. [Default is /usr/bin/gcc]: 


Do you wish to build TensorFlow with MPI support? [y/N]: n
No MPI support will be enabled for TensorFlow.

Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native]: 


Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]: n
Not configuring the WORKSPACE for Android builds.

Preconfigured Bazel build configs. You can use any of the below by adding "--config=<>" to your build command. See tools/bazel.rc for more details.
	--config=mkl         	# Build with MKL support.
	--config=monolithic  	# Config for mostly static monolithic build.
Configuration finished
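If you rebuild often, the CUDA-related answers can be supplied up front. The sketch below assumes that TensorFlow's configure script reads the environment variables `TF_NEED_CUDA`, `TF_CUDA_VERSION`, `CUDA_TOOLKIT_PATH`, `TF_CUDNN_VERSION`, `CUDNN_INSTALL_PATH` and `TF_CUDA_COMPUTE_CAPABILITIES` before prompting; I have not verified this against the v1.7.0 checkout, so treat it as a starting point. The helper names are made up.

```python
def configure_env(cuda_version, cudnn_version, capabilities,
                  cuda_path="/usr/local/cuda"):
    """Build the CUDA-related environment for ./configure (assumed names)."""
    return {
        "TF_NEED_CUDA": "1",
        "TF_CUDA_VERSION": cuda_version,
        "CUDA_TOOLKIT_PATH": cuda_path,
        "TF_CUDNN_VERSION": cudnn_version,
        "CUDNN_INSTALL_PATH": cuda_path,
        "TF_CUDA_COMPUTE_CAPABILITIES": ",".join(capabilities),
    }


def as_shell_prefix(env):
    """Render the variables as a KEY=value prefix for a shell command."""
    return " ".join("%s=%s" % (key, env[key]) for key in sorted(env))


# Same answers as in the transcript above: CUDA 9.1, cuDNN 7, capability 6.1.
env = configure_env("9.1", "7", ["6.1"])
print(as_shell_prefix(env) + " ./configure")
```

Any question not covered by a variable is still asked interactively.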


Build Process

Takes about 20 minutes on my machine
bazel build --config=cuda --config=opt --action_env PATH --action_env LD_LIBRARY_PATH --action_env DYLD_LIBRARY_PATH //tensorflow/tools/pip_package:build_pip_package

Create wheel file and install it

bazel-bin/tensorflow/tools/pip_package/build_pip_package ~/
pip install ~/tensorflow-1.7.0-cp27-cp27m-macosx_10_13_intel.whl

It's useful to leave the .whl file lying around in case you want to install it for another environment.
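Whether the wheel fits another environment can be read off its file name, which follows the wheel naming convention name-version-python tag-abi tag-platform tag. A minimal sketch (the helper `parse_wheel_name` is hypothetical and ignores the optional build tag's value):

```python
def parse_wheel_name(filename):
    """Split a wheel file name into its naming-convention components."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel: %s" % filename)
    parts = filename[:-len(".whl")].split("-")
    # Five fields, or six when an optional build tag is present.
    if len(parts) not in (5, 6):
        raise ValueError("unexpected wheel name: %s" % filename)
    return {
        "name": parts[0],
        "version": parts[1],
        "python": parts[-3],
        "abi": parts[-2],
        "platform": parts[-1],
    }


info = parse_wheel_name("tensorflow-1.7.0-cp27-cp27m-macosx_10_13_intel.whl")
print(info["python"], info["abi"], info["platform"])
```

Here `cp27`/`cp27m` means the wheel targets CPython 2.7, so it will not install into a python3 environment.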
Test Installation

See if everything got linked correctly
>>> import tensorflow as tf
>>> tf.Session()
2018-04-05 23:04:20.457912: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:859] OS X does not support NUMA - returning NUMA node zero
2018-04-05 23:04:20.458122: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 0 with properties: 
name: GeForce GTX 1050 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.392
pciBusID: 0000:05:00.0
totalMemory: 4.00GiB freeMemory: 2.75GiB
2018-04-05 23:04:20.458143: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0
2018-04-05 23:04:20.821699: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-04-05 23:04:20.821728: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917]      0 
2018-04-05 23:04:20.821736: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0:   N 
2018-04-05 23:04:20.821856: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2467 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1050 Ti, pci bus id: 0000:05:00.0, compute capability: 6.1)
<tensorflow.python.client.session.Session object at 0x10e186990>
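The line to look for is "Created TensorFlow device ... -> physical GPU". If you capture the log to a file, a check along these lines can confirm it; this is only a sketch, the function name is made up, and the regular expression assumes the exact message format printed above.

```python
import re

DEVICE_RE = re.compile(
    r"Created TensorFlow device \((\S+) with (\d+) MB memory\) -> "
    r"physical GPU \(device: (\d+), name: ([^,]+), pci bus id: ([^,]+), "
    r"compute capability: (\d+\.\d+)\)")


def created_gpu_devices(log_text):
    """Return one dict per 'Created TensorFlow device' line in the log."""
    devices = []
    for match in DEVICE_RE.finditer(log_text):
        tf_name, mem, index, name, bus, cap = match.groups()
        devices.append({"tf_device": tf_name, "memory_mb": int(mem),
                        "index": int(index), "name": name,
                        "pci_bus_id": bus, "compute_capability": cap})
    return devices


# The relevant line from the session log above.
log = ("2018-04-05 23:04:20.821856: I tensorflow/core/common_runtime/gpu/"
       "gpu_device.cc:1041] Created TensorFlow device "
       "(/job:localhost/replica:0/task:0/device:GPU:0 with 2467 MB memory) "
       "-> physical GPU (device: 0, name: GeForce GTX 1050 Ti, "
       "pci bus id: 0000:05:00.0, compute capability: 6.1)")
found = created_gpu_devices(log)
print(found[0]["name"], found[0]["memory_mb"])
```

An empty result means TensorFlow did not create a GPU device and is running on the CPU only.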

Test GPU Acceleration

pip install keras
git clone https://github.com/fchollet/keras.git
cd keras/examples
python mnist_cnn.py
Using TensorFlow backend.
x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples
Train on 60000 samples, validate on 10000 samples
Epoch 1/12
2018-04-05 22:38:30.156464: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:859] OS X does not support NUMA - returning NUMA node zero
2018-04-05 22:38:30.156645: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 0 with properties: 
name: GeForce GTX 1050 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.392
pciBusID: 0000:05:00.0
totalMemory: 4.00GiB freeMemory: 2.98GiB
2018-04-05 22:38:30.156672: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0
2018-04-05 22:38:30.519346: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-04-05 22:38:30.519376: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917]      0 
2018-04-05 22:38:30.519383: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0:   N 
2018-04-05 22:38:30.519499: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2697 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1050 Ti, pci bus id: 0000:05:00.0, compute capability: 6.1)
2018-04-05 22:38:30.649987: E tensorflow/core/grappler/clusters/utils.cc:127] Not found: TF GPU device with id 0 was not registered
2018-04-05 22:38:30.693399: E tensorflow/core/grappler/clusters/utils.cc:127] Not found: TF GPU device with id 0 was not registered
2018-04-05 22:38:30.761824: E tensorflow/core/grappler/clusters/utils.cc:127] Not found: TF GPU device with id 0 was not registered
59648/60000 [============================>.] - ETA: 0s - loss: 0.2698 - acc: 0.91682018-04-05 22:38:42.071923: E tensorflow/core/grappler/clusters/utils.cc:127] Not found: TF GPU device with id 0 was not registered


You can use cuda-smi to watch the GPU memory usage. In the case of the mnist example in keras, you should see the free memory drop down to maybe 2% and the fans spin up. Not quite sure what the grappler/clusters/utils.cc:127 error message is about, however.
pmalik@MacPro:~/cuda-smi$ ./cuda-smi 
Device 0 [PCIe 0:5:0.0]: GeForce GTX 1050 Ti (CC 6.1): 2901.6 of 4095.8 MB (i.e. 70.8%) Free
pmalik@MacPro:~/cuda-smi$ ./cuda-smi 
Device 0 [PCIe 0:5:0.0]: GeForce GTX 1050 Ti (CC 6.1): 2893.1 of 4095.8 MB (i.e. 70.6%) Free
pmalik@MacPro:~/cuda-smi$ ./cuda-smi 
Device 0 [PCIe 0:5:0.0]: GeForce GTX 1050 Ti (CC 6.1): 223.86 of 4095.8 MB (i.e. 5.47%) Free
pmalik@MacPro:~/cuda-smi$ ./cuda-smi 
Device 0 [PCIe 0:5:0.0]: GeForce GTX 1050 Ti (CC 6.1): 97.852 of 4095.8 MB (i.e. 2.39%) Free
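To track the drop over time rather than eyeballing it, the cuda-smi lines can be parsed. The sketch below assumes the line format shown above; `parse_cuda_smi` is a hypothetical helper, not something cuda-smi ships.

```python
import re

SMI_RE = re.compile(
    r"Device (\d+) \[[^\]]*\]: (.+?) \(CC [\d.]+\): "
    r"([\d.]+) of ([\d.]+) MB")


def parse_cuda_smi(line):
    """Return device index, name, free/total MB and percent free."""
    match = SMI_RE.search(line)
    if match is None:
        raise ValueError("unrecognized cuda-smi line: %r" % line)
    index, name, free_mb, total_mb = match.groups()
    free_mb, total_mb = float(free_mb), float(total_mb)
    return {"index": int(index), "name": name,
            "free_mb": free_mb, "total_mb": total_mb,
            "free_percent": 100.0 * free_mb / total_mb}


# Last reading from the run above, taken while mnist_cnn.py was training.
line = ("Device 0 [PCIe 0:5:0.0]: GeForce GTX 1050 Ti (CC 6.1): "
        "97.852 of 4095.8 MB (i.e. 2.39%) Free")
reading = parse_cuda_smi(line)
print("%s: %.2f%% free" % (reading["name"], reading["free_percent"]))
```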

Tested on a 2010 Mac Pro (Mid 2010) 10.13.3 (17D47) 2 x 2.93 GHz 6-Core Intel Xeon and NVIDIA GeForce GTX 1050 Ti 4 GB
Misc

If you'd like to build tensorflow with OpenMP (multi-cpu support),
grab the OpenMP library via homebrew
brew install cliutils/apple/libomp

and uncomment the -lgomp line in third_party/gpus/cuda/BUILD.tpl
You can also build the binary for your specific cpu architecture:
bazel build --config=cuda  --config=opt --copt=-march=native --action_env PATH --action_env LD_LIBRARY_PATH --action_env DYLD_LIBRARY_PATH //tensorflow/tools/pip_package:build_pip_package

You can run this command to see what instruction sets are getting built
echo | clang -E - -march=native -###
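The output of that command is long. As a sketch, the interesting parts (the `-target-cpu` value and the enabled `-target-feature` flags on the cc1 line) can be filtered out as follows. The sample line is hypothetical and shortened, not captured from the test machine, and the function name is made up.

```python
import shlex


def target_cpu_and_features(driver_output):
    """Return (cpu, enabled features) from `clang -###` driver output."""
    cpu, enabled = None, []
    for line in driver_output.splitlines():
        if "-cc1" not in line:
            continue  # only the compiler invocation carries target flags
        tokens = shlex.split(line)
        for flag, value in zip(tokens, tokens[1:]):
            if flag == "-target-cpu":
                cpu = value
            elif flag == "-target-feature" and value.startswith("+"):
                enabled.append(value[1:])
    return cpu, enabled


# Hypothetical, shortened cc1 line for illustration only.
sample = ('"/usr/bin/clang" "-cc1" "-triple" "x86_64-apple-macosx10.13.0" '
          '"-E" "-target-cpu" "westmere" "-target-feature" "+sse2" '
          '"-target-feature" "+sse4.2" "-target-feature" "-avx"')
print(target_cpu_and_features(sample))
```

Note that clang writes the `-###` output to stderr, so redirect with `2>&1` when capturing it.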


## xtensorflow17macos.patch
diff --git a/tensorflow/core/kernels/concat_lib_gpu_impl.cu.cc b/tensorflow/core/kernels/concat_lib_gpu_impl.cu.cc
index 0f7adaf24a..934ccbada6 100644
--- a/tensorflow/core/kernels/concat_lib_gpu_impl.cu.cc
+++ b/tensorflow/core/kernels/concat_lib_gpu_impl.cu.cc
@@ -69,7 +69,7 @@ __global__ void concat_variable_kernel(
   IntType num_inputs = input_ptr_data.size;

   // verbose declaration needed due to template
-  extern __shared__ __align__(sizeof(T)) unsigned char smem[];
+  extern __shared__ __align__(sizeof(T) > 16 ? sizeof(T) : 16) unsigned char smem[];
   IntType* smem_col_scan = reinterpret_cast<IntType*>(smem);

   if (useSmem) {
diff --git a/tensorflow/core/kernels/depthwise_conv_op_gpu.cu.cc b/tensorflow/core/kernels/depthwise_conv_op_gpu.cu.cc
index 94989089ec..1d26d4bacb 100644
--- a/tensorflow/core/kernels/depthwise_conv_op_gpu.cu.cc
+++ b/tensorflow/core/kernels/depthwise_conv_op_gpu.cu.cc
@@ -172,7 +172,7 @@ __global__ __launch_bounds__(1024, 2) void DepthwiseConv2dGPUKernelNHWCSmall(
     const DepthwiseArgs args, const T* input, const T* filter, T* output) {
   assert(CanLaunchDepthwiseConv2dGPUSmall(args));
   // Holds block plus halo and filter data for blockDim.x depths.
-  extern __shared__ __align__(sizeof(T)) unsigned char shared_memory[];
+  extern __shared__ __align__(sizeof(T) > 16 ? sizeof(T) : 16) unsigned char shared_memory[];
   T* const shared_data = reinterpret_cast<T*>(shared_memory);

   const int num_batches = args.batch;
@@ -452,7 +452,7 @@ __global__ __launch_bounds__(1024, 2) void DepthwiseConv2dGPUKernelNCHWSmall(
     const DepthwiseArgs args, const T* input, const T* filter, T* output) {
   assert(CanLaunchDepthwiseConv2dGPUSmall(args));
   // Holds block plus halo and filter data for blockDim.z depths.
-  extern __shared__ __align__(sizeof(T)) unsigned char shared_memory[];
+  extern __shared__ __align__(sizeof(T) > 16 ? sizeof(T) : 16) unsigned char shared_memory[];
   T* const shared_data = reinterpret_cast<T*>(shared_memory);

   const int num_batches = args.batch;
@@ -1118,7 +1118,7 @@ __launch_bounds__(1024, 2) void DepthwiseConv2dBackpropFilterGPUKernelNHWCSmall(
     const DepthwiseArgs args, const T* output, const T* input, T* filter) {
   assert(CanLaunchDepthwiseConv2dBackpropFilterGPUSmall(args, blockDim.z));
   // Holds block plus halo and filter data for blockDim.x depths.
-  extern __shared__ __align__(sizeof(T)) unsigned char shared_memory[];
+  extern __shared__ __align__(sizeof(T) > 16 ? sizeof(T) : 16) unsigned char shared_memory[];
   T* const shared_data = reinterpret_cast<T*>(shared_memory);

   const int num_batches = args.batch;
@@ -1388,7 +1388,7 @@ __launch_bounds__(1024, 2) void DepthwiseConv2dBackpropFilterGPUKernelNCHWSmall(
     const DepthwiseArgs args, const T* output, const T* input, T* filter) {
   assert(CanLaunchDepthwiseConv2dBackpropFilterGPUSmall(args, blockDim.x));
   // Holds block plus halo and filter data for blockDim.z depths.
-  extern __shared__ __align__(sizeof(T)) unsigned char shared_memory[];
+  extern __shared__ __align__(sizeof(T) > 16 ? sizeof(T) : 16) unsigned char shared_memory[];
   T* const shared_data = reinterpret_cast<T*>(shared_memory);

   const int num_batches = args.batch;
diff --git a/tensorflow/core/kernels/split_lib_gpu.cu.cc b/tensorflow/core/kernels/split_lib_gpu.cu.cc
index 393818730b..58a1294005 100644
--- a/tensorflow/core/kernels/split_lib_gpu.cu.cc
+++ b/tensorflow/core/kernels/split_lib_gpu.cu.cc
@@ -121,7 +121,7 @@ __global__ void split_v_kernel(const T* input_ptr,
   int num_outputs = output_ptr_data.size;

   // verbose declaration needed due to template
-  extern __shared__ __align__(sizeof(T)) unsigned char smem[];
+  extern __shared__ __align__(sizeof(T) > 16 ? sizeof(T) : 16) unsigned char smem[];
   IntType* smem_col_scan = reinterpret_cast<IntType*>(smem);

   if (useSmem) {
diff --git a/tensorflow/workspace.bzl b/tensorflow/workspace.bzl
index 0ce5cda517..d4dc2235ac 100644
--- a/tensorflow/workspace.bzl
+++ b/tensorflow/workspace.bzl
@@ -361,11 +361,11 @@ def tf_workspace(path_prefix="", tf_repo_name=""):
   tf_http_archive(
       name = "protobuf_archive",
       urls = [
-          "https://mirror.bazel.build/github.com/google/protobuf/archive/396336eb961b75f03b25824fe86cf6490fb75e3a.tar.gz",
-          "https://github.com/google/protobuf/archive/396336eb961b75f03b25824fe86cf6490fb75e3a.tar.gz",
+          "https://mirror.bazel.build/github.com/dtrebbien/protobuf/archive/50f552646ba1de79e07562b41f3999fe036b4fd0.tar.gz",
+          "https://github.com/dtrebbien/protobuf/archive/50f552646ba1de79e07562b41f3999fe036b4fd0.tar.gz",
       ],
-      sha256 = "846d907acf472ae233ec0882ef3a2d24edbbe834b80c305e867ac65a1f2c59e3",
-      strip_prefix = "protobuf-396336eb961b75f03b25824fe86cf6490fb75e3a",
+      sha256 = "eb16b33431b91fe8cee479575cee8de202f3626aaf00d9bf1783c6e62b4ffbc7",
+      strip_prefix = "protobuf-50f552646ba1de79e07562b41f3999fe036b4fd0",
   )

   # We need to import the protobuf library under the names com_google_protobuf
diff --git a/third_party/gpus/cuda/BUILD.tpl b/third_party/gpus/cuda/BUILD.tpl
index 2a37c65bc7..43446dd99b 100644
--- a/third_party/gpus/cuda/BUILD.tpl
+++ b/third_party/gpus/cuda/BUILD.tpl
@@ -110,7 +110,7 @@ cc_library(
         ".",
         "cuda/include",
     ],
-    linkopts = ["-lgomp"],
+    #linkopts = ["-lgomp"],
     linkstatic = 1,
     visibility = ["//visibility:public"],
 )