jeanjerome/mnist_test.py

## mnist_test.py
import tensorflow as tf
import keras

# Allow at most one GPU and one CPU device, and make Keras use this session.
config = tf.ConfigProto(device_count={'GPU': 1, 'CPU': 1})
sess = tf.Session(config=config)
keras.backend.set_session(sess)

from keras.datasets import mnist
from autokeras import ImageClassifier

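if __name__ == '__main__':
    # Load MNIST and add a trailing channel axis: (N, 28, 28) -> (N, 28, 28, 1).
    (x_train, y_train), (x_test, y_test) = mnist.load_data()
    x_train = x_train.reshape(x_train.shape + (1,))
    x_test = x_test.reshape(x_test.shape + (1,))

    # Search for an architecture for up to 12 hours, then retrain the best model.
    clf = ImageClassifier(verbose=True, augment=False)
    clf.fit(x_train, y_train, time_limit=12 * 60 * 60)
    clf.final_fit(x_train, y_train, x_test, y_test, retrain=True)
    # Evaluate on the test set and print the score as a percentage.
    y = clf.evaluate(x_test, y_test)
    print(y * 100)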


## tensorflow_1_8_high_sierra_gpu.md

      
Tensorflow 1.8 with CUDA GPU and Python 3 on macOS High Sierra 10.13.6

-- software versions up-to-date at 2018-08-18 --

Largely based on the Tensorflow 1.8 gist, Tensorflow 1.6 gist,
and Tensorflow 1.7 gist for xcode, this should hopefully simplify things a bit.
Requirements


NVIDIA Web-Drivers 387.10.10.10.40.105 for 10.13.6
CUDA-Drivers 396.148
CUDA 9.2 Toolkit (9.2.148 + patch 1)
cuDNN 7.2.1.38 (latest for macOS)
NCCL 2.2.13_1 (latest for macOS)
Python 3.6.5
XCode 9.2
bazel stable 0.15.2 (latest on HomeBrew)
Tensorflow 1.8 Source Code

Prepare

Install Homebrew and wget

Homebrew is used here for package management. Skip this step if you already have your own Python and wget, or if you prefer to download everything manually.
/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
brew install wget
Install Python 3

brew install python
More details in The Hitchhiker's Guide to Python: https://docs.python-guide.org/starting/install3/osx/.
NVIDIA Graphics driver (QUADRO & GEFORCE MACOS DRIVER RELEASE 387.10.10.10.40.105)

Make sure to install the web driver, not the native driver that NVIDIA also provides.
Download and install from https://www.nvidia.com/download/driverResults.aspx/136062/en-us
NVIDIA Cuda driver (NVIDIA CUDA 396.148 FOR MACOS RELEASE)

Download and install from https://www.nvidia.com/object/macosx-cuda-396.148-driver.html

NOTE: the CUDA driver is installed into /usr/local/cuda

Install XCode 9.2 and Command Line Tool 9.2


Download and install from https://download.developer.apple.com/Developer_Tools/Xcode_9.2/Xcode_9.2.xip.
Or find Xcode 9.2 on https://developer.apple.com/download/more/


Unarchive it and rename Xcode.app to Xcode9.2.app so that you can keep it around for later builds.


Install Bazel

If you have Homebrew installed, run:
brew install bazel
Or download the installer from https://github.com/bazelbuild/bazel/releases/download/0.15.2/bazel-0.15.2-installer-darwin-x86_64.sh and run:
chmod 755 bazel-0.15.2-installer-darwin-x86_64.sh
./bazel-0.15.2-installer-darwin-x86_64.sh
Install CUDA Toolkit 9.2.1


Download the CUDA Toolkit https://developer.nvidia.com/cuda-downloads?target_os=MacOSX&target_arch=x86_64&target_version=1013&target_type=dmglocal from NVIDIA.

There should be 2 packages to download, the core cuda_9.2.148_mac.dmg and a
patch cuda_9.2.148.1_mac.dmg.

Install both, in that order, with the samples option (we will need the samples later).


NOTE: the toolkit is installed into /Developer/NVIDIA/CUDA-9.2

Install NCCL


Download NCCL 2.2.13 (the O/S agnostic build for CUDA 9.2) from https://developer.nvidia.com/nccl/nccl-download.


Unarchive it manually:


tar -xvf ./nccl_2.2.13-1+cuda9.2_x86_64.txz

And move it to a permanent place e.g. /usr/local/nccl

sudo mkdir -p /usr/local/nccl
cd nccl_2.2.13-1+cuda9.2_x86_64
sudo mv * /usr/local/nccl
sudo mkdir -p /usr/local/include/third_party/nccl
sudo ln -s /usr/local/nccl/include/nccl.h /usr/local/include/third_party/nccl
Set up your env paths

Edit ~/.bash_profile and add the following according to your existing configuration
export PATH=/usr/local/bin:/usr/local/sbin:$PATH

export CUDA_HOME=/usr/local/cuda
export CUDA_TOOLKIT_HOME=/Developer/NVIDIA/CUDA-9.2

export DYLD_LIBRARY_PATH=$CUDA_TOOLKIT_HOME/lib:$CUDA_HOME/lib:$CUDA_HOME/extras/CUPTI/lib:$DYLD_LIBRARY_PATH
export LD_LIBRARY_PATH=$DYLD_LIBRARY_PATH
export PATH=$DYLD_LIBRARY_PATH:$CUDA_TOOLKIT_HOME/bin:$PATH
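Before moving on, it can help to confirm that a new shell actually picked up these exports. The following is a minimal sketch, not part of the original guide: the helper name `check_cuda_env` is hypothetical, and it only checks the two locations and the `lib` entries named in the exports above.

```python
import os

# Variable names and default locations are taken from the exports above.
EXPECTED_VARS = ("CUDA_HOME", "CUDA_TOOLKIT_HOME")


def check_cuda_env(env, dir_exists=os.path.isdir):
    """Return a list of human-readable problems found in the mapping `env`."""
    problems = []
    for name in EXPECTED_VARS:
        value = env.get(name)
        if value is None:
            problems.append(f"{name} is not set")
        elif not dir_exists(value):
            problems.append(f"{name} points to a missing directory: {value}")

    # DYLD_LIBRARY_PATH should contain the lib directory of each location.
    entries = [p for p in env.get("DYLD_LIBRARY_PATH", "").split(":") if p]
    for name in EXPECTED_VARS:
        root = env.get(name)
        if root and root.rstrip("/") + "/lib" not in entries:
            problems.append(f"DYLD_LIBRARY_PATH lacks {root.rstrip('/')}/lib")
    return problems
```

Calling `check_cuda_env(os.environ)` in a fresh terminal returns an empty list when both variables are set, both directories exist, and both `lib` directories are on `DYLD_LIBRARY_PATH`.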
Check your installation

We want to compile some CUDA samples to check if the GPU is correctly recognized and supported.
cd /Developer/NVIDIA/CUDA-9.2/samples
sudo chown -R $(whoami) *
sudo make -C 1_Utilities/deviceQuery
./bin/x86_64/darwin/release/deviceQuery
It should return something similar to this:
CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GT 750M"
  CUDA Driver Version / Runtime Version          9.2 / 9.2
  CUDA Capability Major/Minor version number:    3.0
  Total amount of global memory:                 2048 MBytes (2147024896 bytes)
  ( 2) Multiprocessors, (192) CUDA Cores/MP:     384 CUDA Cores
  GPU Max Clock rate:                            926 MHz (0.93 GHz)
  Memory Clock rate:                             2508 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 262144 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            No
  Supports Cooperative Kernel Launch:            No
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.2, CUDA Runtime Version = 9.2, NumDevs = 1
Result = PASS
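The compute capability printed by deviceQuery is needed again when configuring the TensorFlow build. As a small sketch (the function name `compute_capabilities` is my own, and the regular expressions assume the output format shown above), the value can be pulled out of the saved output like this:

```python
import re


def compute_capabilities(device_query_output):
    """Map each device name to its 'major.minor' compute capability string."""
    result = {}
    current = None
    for line in device_query_output.splitlines():
        device = re.match(r'\s*Device \d+: "(.+)"', line)
        if device:
            current = device.group(1)
            continue
        cap = re.search(
            r"CUDA Capability Major/Minor version number:\s*(\d+\.\d+)", line
        )
        if cap and current is not None:
            result[current] = cap.group(1)
    return result


sample = '''Detected 1 CUDA Capable device(s)

Device 0: "GeForce GT 750M"
  CUDA Driver Version / Runtime Version          9.2 / 9.2
  CUDA Capability Major/Minor version number:    3.0
'''
print(compute_capabilities(sample))  # {'GeForce GT 750M': '3.0'}
```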


NVIDIA cuDNN - Deep Learning Primitives

If not already done, register at https://developer.nvidia.com/cudnn
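Download cuDNN 7.2.1 for CUDA 9.2, the version listed in the requirements.
Change into your download directory and follow the post-installation steps. The archive name below is only an example; use the name of the file you actually downloaded.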
tar -xzvf cudnn-9.1-osx-x64-v7-ga.tgz
sudo cp cuda/include/cudnn.h /usr/local/cuda/include
sudo cp cuda/lib/libcudnn* /usr/local/cuda/lib
sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib/libcudnn*
Compile

Clone TensorFlow from Repository

cd /tmp
git clone https://github.com/tensorflow/tensorflow
cd tensorflow
git checkout v1.8.0
Apply Patch

Apply the following patch to fix a couple of build issues.
NOTE: the second command, which adds a newline at the end of the patch, is very important!
wget https://gist.github.com/jeanjerome/b3b722bb1632e67251f42b19fcafb65d/raw/xtensorflow18macos.patch
sed -i '' -e '$a\' xtensorflow18macos.patch
git apply xtensorflow18macos.patch
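The sed command above exists only to make sure the patch file ends with a newline, since git apply tends to reject a patch whose last line is unterminated. For anyone who prefers not to rely on sed, here is a rough Python equivalent; the helper name `ensure_trailing_newline` is hypothetical and not part of any tool mentioned here.

```python
def ensure_trailing_newline(path):
    """Append a newline to the file at `path` if its last byte is not one.

    Returns True when the file was modified, False otherwise.
    """
    with open(path, "rb+") as handle:
        handle.seek(0, 2)          # move to the end of the file
        if handle.tell() == 0:
            return False           # empty file: nothing to terminate
        handle.seek(-1, 2)         # step back to the last byte
        if handle.read(1) == b"\n":
            return False           # already terminated
        handle.seek(0, 2)          # reposition explicitly before writing
        handle.write(b"\n")
        return True
```

Running `ensure_trailing_newline("xtensorflow18macos.patch")` before `git apply` has the same intent as the sed step.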
Configure Build

You need the following information for the configuration:

Python path e.g. /usr/local/bin/python3,
CUDA SDK version 9.2,
cuDNN version 7.2,
CUDA compute capabilities e.g. 3.0 for my poor NVIDIA GeForce GT 750M.

Pay attention to the CUDA compute capability: look up the value for your own GPU in the output of the deviceQuery sample above.
Now run:
./configure
You should see a series of prompts:
Extracting Bazel installation...
WARNING: --batch mode is deprecated. Please instead explicitly shut down your Bazel server using the command "bazel shutdown".
You have bazel 0.15.2-homebrew installed.
Please specify the location of python. [Default is /usr/bin/python]: /usr/local/bin/python3


Found possible Python library paths:
  /usr/local/Cellar/python/3.6.5_1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages
Please input the desired Python library path to use.  Default is [/usr/local/Cellar/python/3.6.5_1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages]

Do you wish to build TensorFlow with Google Cloud Platform support? [Y/n]: n
No Google Cloud Platform support will be enabled for TensorFlow.

Do you wish to build TensorFlow with Hadoop File System support? [Y/n]: n
No Hadoop File System support will be enabled for TensorFlow.

Do you wish to build TensorFlow with Amazon S3 File System support? [Y/n]: n
No Amazon S3 File System support will be enabled for TensorFlow.

Do you wish to build TensorFlow with Apache Kafka Platform support? [Y/n]: n
No Apache Kafka Platform support will be enabled for TensorFlow.

Do you wish to build TensorFlow with XLA JIT support? [y/N]:
No XLA JIT support will be enabled for TensorFlow.

Do you wish to build TensorFlow with GDR support? [y/N]:
No GDR support will be enabled for TensorFlow.

Do you wish to build TensorFlow with VERBS support? [y/N]: 
No VERBS support will be enabled for TensorFlow.

Do you wish to build TensorFlow with OpenCL SYCL support? [y/N]: 
No OpenCL SYCL support will be enabled for TensorFlow.

Do you wish to build TensorFlow with CUDA support? [y/N]: y
CUDA support will be enabled for TensorFlow.

Please specify the CUDA SDK version you want to use, e.g. 7.0. [Leave empty to default to CUDA 9.0]: 9.2


Please specify the location where CUDA 9.2 toolkit is installed. Refer to README.md for more details. [Default is /usr/local/cuda]: 


Please specify the cuDNN version you want to use. [Leave empty to default to cuDNN 7.0]: 7.2


Please specify the location where cuDNN 7 library is installed. Refer to README.md for more details. [Default is /usr/local/cuda]:


Please specify a list of comma-separated Cuda compute capabilities you want to build with.
You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.
Please note that each additional compute capability significantly increases your build time and binary size. [Default is: 3.5,5.2]3.0


Do you want to use clang as CUDA compiler? [y/N]:  
nvcc will be used as CUDA compiler.

Please specify which gcc should be used by nvcc as the host compiler. [Default is /usr/bin/gcc]: 


Do you wish to build TensorFlow with MPI support? [y/N]: 
No MPI support will be enabled for TensorFlow.

Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native]: 


Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]: 
Not configuring the WORKSPACE for Android builds.

Preconfigured Bazel build configs. You can use any of the below by adding "--config=<>" to your build command. See tools/bazel.rc for more details.
	--config=mkl         	# Build with MKL support.
	--config=monolithic  	# Config for mostly static monolithic build.
Configuration finished
Build Process

The build takes a bit more than 1.5 hours on my machine.
bazel clean
bazel build --config=cuda --config=opt --action_env PATH --action_env LD_LIBRARY_PATH --action_env DYLD_LIBRARY_PATH //tensorflow/tools/pip_package:build_pip_package
Create wheel file and install it

bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
ls /tmp/tensorflow_pkg
tensorflow-1.8.0-cp36-cp36m-macosx_10_13_x86_64.whl
If you want to use virtualenv or a similar tool, now is the time to activate it. Otherwise just run:
pip install /tmp/tensorflow_pkg/tensorflow-1.8.0-cp36-cp36m-macosx_10_13_x86_64.whl
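The wheel's file name encodes which interpreter and platform it was built for, which is why this file only installs under CPython 3.6 on macOS 10.13 or later. A minimal sketch of how those tags are laid out, following the standard wheel naming convention (the function name `parse_wheel_filename` is hypothetical):

```python
def parse_wheel_filename(filename):
    """Split a wheel file name into its name, version and compatibility tags."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel file: " + filename)
    parts = filename[: -len(".whl")].split("-")
    # name-version[-build]-python-abi-platform: 5 fields, or 6 with a build tag.
    if len(parts) not in (5, 6):
        raise ValueError("unexpected wheel file name: " + filename)
    python_tag, abi_tag, platform_tag = parts[-3:]
    return {
        "name": parts[0],
        "version": parts[1],
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }


info = parse_wheel_filename("tensorflow-1.8.0-cp36-cp36m-macosx_10_13_x86_64.whl")
print(info["python"], info["platform"])  # cp36 macosx_10_13_x86_64
```

If pip reports that the wheel is not supported on this platform, comparing these tags with the active interpreter is a reasonable first check.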
Back up your wheel (optional)

Files in /tmp are removed on reboot.
cp /tmp/tensorflow_pkg/*.whl ~/
It's useful to keep the .whl file around in case you want to install it in another environment.
Test Installation

See if everything got linked correctly

I am using virtualenv (and you should too), so activate it first:
source ./bin/activate
python
Python 3.6.5 (default, Jun 17 2018, 12:13:06) 
[GCC 4.2.1 Compatible Apple LLVM 9.1.0 (clang-902.0.39.2)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> tf.Session()
2018-08-18 18:56:16.456756: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:859] OS X does not support NUMA - returning NUMA node zero
2018-08-18 18:56:16.457496: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties: 
name: GeForce GT 750M major: 3 minor: 0 memoryClockRate(GHz): 0.9255
pciBusID: 0000:01:00.0
totalMemory: 2.00GiB freeMemory: 1.51GiB
2018-08-18 18:56:16.457532: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-08-18 18:56:17.305242: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-18 18:56:17.305278: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929]      0 
2018-08-18 18:56:17.305288: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0:   N 
2018-08-18 18:56:17.305963: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1280 MB memory) -> physical GPU (device: 0, name: GeForce GT 750M, pci bus id: 0000:01:00.0, compute capability: 3.0)
<tensorflow.python.client.session.Session object at 0x1035d9c50>
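Reading the log by eye works, but the same check can be scripted. The sketch below assumes TensorFlow 1.x, where `device_lib.list_local_devices()` returns entries with `name` and `device_type` fields; the two helper names are my own. TensorFlow is imported inside the function so that the filtering helper stays usable without it.

```python
def gpu_device_names(devices):
    """Return the names of all entries whose device_type is 'GPU'."""
    return [d.name for d in devices if d.device_type == "GPU"]


def list_tensorflow_gpus():
    """Ask TensorFlow 1.x which local devices it sees and keep the GPUs."""
    # Imported lazily: this requires the wheel built above to be installed.
    from tensorflow.python.client import device_lib

    return gpu_device_names(device_lib.list_local_devices())
```

Inside the activated virtualenv, `list_tensorflow_gpus()` should return a non-empty list if the build linked against CUDA correctly.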
Test GPU Acceleration

Try the new AutoKeras framework.
It runs on TensorFlow with Python 3. Lucky us!

pip install autokeras
wget https://gist.github.com/jeanjerome/b3b722bb1632e67251f42b19fcafb65d/raw/mnist_test.py
python mnist_test.py
Using TensorFlow backend.
2018-08-18 19:22:52.374988: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:859] OS X does not support NUMA - returning NUMA node zero
2018-08-18 19:22:52.375213: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties: 
name: GeForce GT 750M major: 3 minor: 0 memoryClockRate(GHz): 0.9255
pciBusID: 0000:01:00.0
totalMemory: 2.00GiB freeMemory: 1.50GiB
2018-08-18 19:22:52.375236: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-08-18 19:22:52.670486: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-18 19:22:52.670521: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929]      0 
2018-08-18 19:22:52.670525: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0:   N 
2018-08-18 19:22:52.670607: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1157 MB memory) -> physical GPU (device: 0, name: GeForce GT 750M, pci bus id: 0000:01:00.0, compute capability: 3.0)
Initializing search.
Initialization finished.
Training model  0
Using TensorFlow backend.
2018-08-18 19:22:56.373450: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:859] OS X does not support NUMA - returning NUMA node zero
2018-08-18 19:22:56.373636: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties: 
name: GeForce GT 750M major: 3 minor: 0 memoryClockRate(GHz): 0.9255
pciBusID: 0000:01:00.0
totalMemory: 2.00GiB freeMemory: 198.35MiB
2018-08-18 19:22:56.373662: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-08-18 19:22:56.680835: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-18 19:22:56.680870: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929]      0 
2018-08-18 19:22:56.680875: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0:   N 
2018-08-18 19:22:56.680956: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 161 MB memory) -> physical GPU (device: 0, name: GeForce GT 750M, pci bus id: 0000:01:00.0, compute capability: 3.0)
...........................................
Epoch 1: loss 2.106443405151367, metric_value 0.9856
...........................................
Epoch 2: loss 1.945971131324768, metric_value 0.984
...........................................
Epoch 3: loss 1.328037977218628, metric_value 0.9886
...........................................
Epoch 4: loss 1.2873637676239014, metric_value 0.99

Metrics

With my GPU, each epoch takes 2:28 min.
With the CPU only, it takes 2:35 min.
Damn! All that fuss for what?
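To put a number on that disappointment, here is the arithmetic for the two epoch times quoted above (2 min 28 s on the GPU, 2 min 35 s on the CPU). The helper names are mine; the only inputs are those two timings.

```python
def to_seconds(mmss):
    """Convert a 'mm:ss' string to a number of seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def speedup(cpu_time, gpu_time):
    """Ratio of CPU epoch time to GPU epoch time (greater than 1 means the GPU wins)."""
    return to_seconds(cpu_time) / to_seconds(gpu_time)


ratio = speedup("02:35", "02:28")
print(f"speed-up: {ratio:.3f}x ({(ratio - 1) * 100:.1f}% faster)")
# speed-up: 1.047x (4.7% faster)
```

So on this GT 750M the GPU build saves 7 seconds per epoch, a gain of under 5%.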

Well, I couldn't be more pleased with my GeForce GT 750M. But those who have an NVIDIA GeForce GTX 10xx with an eGPU should see a sharp acceleration.

Tested on a MacBook Pro (15-inch, Mid 2014) running macOS 10.13.6, with a 2.5 GHz Intel Core i7 and an NVIDIA GeForce GT 750M :(

  
## xtensorflow18macos.patch
diff --git a/tensorflow/core/kernels/concat_lib_gpu_impl.cu.cc b/tensorflow/core/kernels/concat_lib_gpu_impl.cu.cc
index 0f7adaf24a..934ccbada6 100644
--- a/tensorflow/core/kernels/concat_lib_gpu_impl.cu.cc
+++ b/tensorflow/core/kernels/concat_lib_gpu_impl.cu.cc
@@ -69,7 +69,7 @@ __global__ void concat_variable_kernel(
   IntType num_inputs = input_ptr_data.size;

   // verbose declaration needed due to template
-  extern __shared__ __align__(sizeof(T)) unsigned char smem[];
+  extern __shared__ __align__(sizeof(T) > 16 ? sizeof(T) : 16) unsigned char smem[];
   IntType* smem_col_scan = reinterpret_cast<IntType*>(smem);

   if (useSmem) {
diff --git a/tensorflow/core/kernels/depthwise_conv_op_gpu.cu.cc b/tensorflow/core/kernels/depthwise_conv_op_gpu.cu.cc
index 94989089ec..1d26d4bacb 100644
--- a/tensorflow/core/kernels/depthwise_conv_op_gpu.cu.cc
+++ b/tensorflow/core/kernels/depthwise_conv_op_gpu.cu.cc
@@ -172,7 +172,7 @@ __global__ __launch_bounds__(1024, 2) void DepthwiseConv2dGPUKernelNHWCSmall(
     const DepthwiseArgs args, const T* input, const T* filter, T* output) {
   assert(CanLaunchDepthwiseConv2dGPUSmall(args));
   // Holds block plus halo and filter data for blockDim.x depths.
-  extern __shared__ __align__(sizeof(T)) unsigned char shared_memory[];
+  extern __shared__ __align__(sizeof(T) > 16 ? sizeof(T) : 16) unsigned char shared_memory[];
   T* const shared_data = reinterpret_cast<T*>(shared_memory);

   const int num_batches = args.batch;
@@ -452,7 +452,7 @@ __global__ __launch_bounds__(1024, 2) void DepthwiseConv2dGPUKernelNCHWSmall(
     const DepthwiseArgs args, const T* input, const T* filter, T* output) {
   assert(CanLaunchDepthwiseConv2dGPUSmall(args));
   // Holds block plus halo and filter data for blockDim.z depths.
-  extern __shared__ __align__(sizeof(T)) unsigned char shared_memory[];
+  extern __shared__ __align__(sizeof(T) > 16 ? sizeof(T) : 16) unsigned char shared_memory[];
   T* const shared_data = reinterpret_cast<T*>(shared_memory);

   const int num_batches = args.batch;
@@ -1118,7 +1118,7 @@ __launch_bounds__(1024, 2) void DepthwiseConv2dBackpropFilterGPUKernelNHWCSmall(
     const DepthwiseArgs args, const T* output, const T* input, T* filter) {
   assert(CanLaunchDepthwiseConv2dBackpropFilterGPUSmall(args, blockDim.z));
   // Holds block plus halo and filter data for blockDim.x depths.
-  extern __shared__ __align__(sizeof(T)) unsigned char shared_memory[];
+  extern __shared__ __align__(sizeof(T) > 16 ? sizeof(T) : 16) unsigned char shared_memory[];
   T* const shared_data = reinterpret_cast<T*>(shared_memory);

   const int num_batches = args.batch;
@@ -1388,7 +1388,7 @@ __launch_bounds__(1024, 2) void DepthwiseConv2dBackpropFilterGPUKernelNCHWSmall(
     const DepthwiseArgs args, const T* output, const T* input, T* filter) {
   assert(CanLaunchDepthwiseConv2dBackpropFilterGPUSmall(args, blockDim.x));
   // Holds block plus halo and filter data for blockDim.z depths.
-  extern __shared__ __align__(sizeof(T)) unsigned char shared_memory[];
+  extern __shared__ __align__(sizeof(T) > 16 ? sizeof(T) : 16) unsigned char shared_memory[];
   T* const shared_data = reinterpret_cast<T*>(shared_memory);

   const int num_batches = args.batch;
diff --git a/tensorflow/core/kernels/split_lib_gpu.cu.cc b/tensorflow/core/kernels/split_lib_gpu.cu.cc
index 393818730b..58a1294005 100644
--- a/tensorflow/core/kernels/split_lib_gpu.cu.cc
+++ b/tensorflow/core/kernels/split_lib_gpu.cu.cc
@@ -121,7 +121,7 @@ __global__ void split_v_kernel(const T* input_ptr,
   int num_outputs = output_ptr_data.size;

   // verbose declaration needed due to template
-  extern __shared__ __align__(sizeof(T)) unsigned char smem[];
+  extern __shared__ __align__(sizeof(T) > 16 ? sizeof(T) : 16) unsigned char smem[];
   IntType* smem_col_scan = reinterpret_cast<IntType*>(smem);

   if (useSmem) {
diff --git a/tensorflow/workspace.bzl b/tensorflow/workspace.bzl
index 0ce5cda517..d4dc2235ac 100644
--- a/tensorflow/workspace.bzl
+++ b/tensorflow/workspace.bzl
@@ -361,11 +361,11 @@ def tf_workspace(path_prefix="", tf_repo_name=""):
   tf_http_archive(
       name = "protobuf_archive",
       urls = [
-          "https://mirror.bazel.build/github.com/google/protobuf/archive/396336eb961b75f03b25824fe86cf6490fb75e3a.tar.gz",
-          "https://github.com/google/protobuf/archive/396336eb961b75f03b25824fe86cf6490fb75e3a.tar.gz",
+          "https://mirror.bazel.build/github.com/dtrebbien/protobuf/archive/50f552646ba1de79e07562b41f3999fe036b4fd0.tar.gz",
+          "https://github.com/dtrebbien/protobuf/archive/50f552646ba1de79e07562b41f3999fe036b4fd0.tar.gz",
       ],
-      sha256 = "846d907acf472ae233ec0882ef3a2d24edbbe834b80c305e867ac65a1f2c59e3",
-      strip_prefix = "protobuf-396336eb961b75f03b25824fe86cf6490fb75e3a",
+      sha256 = "eb16b33431b91fe8cee479575cee8de202f3626aaf00d9bf1783c6e62b4ffbc7",
+      strip_prefix = "protobuf-50f552646ba1de79e07562b41f3999fe036b4fd0",
   )

   # We need to import the protobuf library under the names com_google_protobuf
diff --git a/third_party/gpus/cuda/BUILD.tpl b/third_party/gpus/cuda/BUILD.tpl
index 2a37c65bc7..43446dd99b 100644
--- a/third_party/gpus/cuda/BUILD.tpl
+++ b/third_party/gpus/cuda/BUILD.tpl
@@ -110,7 +110,7 @@ cc_library(
         ".",
         "cuda/include",
     ],
-    linkopts = ["-lgomp"],
+    #linkopts = ["-lgomp"],
     linkstatic = 1,
     visibility = ["//visibility:public"],
 )