How to compile TensorFlow for CPUs with AVX2/FMA instructions or using Intel MKL?
In this tutorial we compile TF for EC2 Xeon processors and get a +76% speedup for training CIFAR10 with ResNet. In practice this can be useful for speeding up CPU inference on AWS EC2, where a GPU instance would be too costly.
Brought to you by Rossum. We automate data extraction from documents using deep learning.
- m4.xlarge
- 2.3 GHz Intel Xeon® E5-2686 v4 (Broadwell) processors (cat /proc/cpuinfo)
- flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm fsgsbase bmi1 avx2 smep bmi2 erms invpcid xsaveopt
- 2.4 GHz Intel Xeon® E5-2676 v3 (Haswell) processors
Basic docs: https://www.tensorflow.org/install/install_sources
TF says that since 1.6 it is precompiled with AVX instructions, but in practice it seems it is not:
2018-05-29 10:23:59.541587: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
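Before building, it is worth confirming what the CPU actually supports by looking at the flags line from `cat /proc/cpuinfo`. A minimal sketch (the helper name is mine; the flags excerpt is taken from the m4.xlarge flags line above):

```python
def missing_simd_flags(cpuinfo_flags, wanted=("avx", "avx2", "fma")):
    # Return the wanted instruction-set flags absent from a cpuinfo "flags" line.
    present = set(cpuinfo_flags.split())
    return [f for f in wanted if f not in present]

# Excerpt of the m4.xlarge flags line shown above:
flags = "sse sse2 ssse3 fma sse4_1 sse4_2 avx f16c avx2 bmi1 bmi2"
print(missing_simd_flags(flags))  # -> [] on this Broadwell Xeon
```

If the list is non-empty, a build with those instructions enabled will crash with illegal-instruction errors on that machine.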
Compiling on AWS EC2 m4.xlarge
with Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
for Python 2.7.
Note: an 8 GB disk can be a bit too small for building & testing; 16 GB should be fine.
Each build takes around 2 hours.
git clone https://github.com/tensorflow/tensorflow
cd tensorflow
git checkout r1.8
Docs: https://docs.bazel.build/versions/master/install-ubuntu.html
TF 1.8 was tested with Bazel 0.10.0.
sudo apt-get -y install pkg-config zip g++ zlib1g-dev unzip python
BAZEL_VERSION="0.10.0/bazel-0.10.0-installer-linux-x86_64"
wget https://github.com/bazelbuild/bazel/releases/download/${BAZEL_VERSION}.sh
wget https://github.com/bazelbuild/bazel/releases/download/${BAZEL_VERSION}.sh.sha256
sha256sum -c bazel-0.10.0-installer-linux-x86_64.sh.sha256
bash bazel-0.10.0-installer-linux-x86_64.sh --user
export PATH="$PATH:$HOME/bin"
source /home/ubuntu/.bazel/bin/bazel-complete.bash
Alternatively, install Bazel from its APT repository:
sudo apt-get -y install openjdk-8-jdk
echo "deb [arch=amd64] http://storage.googleapis.com/bazel-apt stable jdk1.8" | sudo tee /etc/apt/sources.list.d/bazel.list
curl https://bazel.build/bazel-release.pub.gpg | sudo apt-key add -
sudo apt-get update && sudo apt-get install bazel
sudo apt-get -y install python-numpy python-dev python-pip python-wheel
To avoid the errors "ImportError: No module named enum" and "ImportError: No module named mock":
sudo pip install --upgrade pip enum34 mock
./configure
Google Cloud Platform support? [Y/n]: n
Hadoop File System support? [Y/n]: n
Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native]:
Preconfigured Bazel build configs. You can use any of the below by adding "--config=<>" to your build command. See tools/bazel.rc for more details.
--config=mkl # Build with MKL support.
--config=monolithic # Config for mostly static monolithic build.
Configuration finished
When cross-compiling from another machine, use: -march=broadwell
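For reference, a tiny lookup of GCC -march targets for the EC2 Xeon models listed at the top. The mapping is assumed from the microarchitecture names AWS publishes, not an official table:

```python
# GCC -march targets for the EC2 Xeon models listed above
# (assumed from the Haswell/Broadwell microarchitecture names).
MARCH = {
    "Intel Xeon E5-2676 v3": "haswell",    # 2.4 GHz Haswell
    "Intel Xeon E5-2686 v4": "broadwell",  # 2.3 GHz Broadwell
}
print(MARCH["Intel Xeon E5-2686 v4"])  # -> broadwell
```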
bazel build --config=opt //tensorflow/tools/pip_package:build_pip_package
INFO: Elapsed time: 5461.730s, Critical Path: 92.59s # missing enum & mock
2nd try:
INFO: Elapsed time: 1841.308s, Critical Path: 70.27s
INFO: Build completed successfully, 1732 total actions
Make pip package:
bazel-bin/tensorflow/tools/pip_package/build_pip_package ~/tensorflow_pkg
-> ~/tensorflow_pkg/tensorflow-1.8.0-cp27-cp27mu-linux_x86_64.whl
bazel clean
# https://github.com/tensorflow/tensorflow/issues/13928#issuecomment-355756988
rm -rd ~/.cache
# it automatically downloads and installs MKL
bazel build --config=opt --config=mkl //tensorflow/tools/pip_package:build_pip_package
# save to other directory, since the package name is the same
bazel-bin/tensorflow/tools/pip_package/build_pip_package ~/tensorflow_pkg_mkl
sudo pip install ~/tensorflow_pkg/tensorflow-1.8.0-cp27-cp27mu-linux_x86_64.whl
Make sure not to be inside the ~/tensorflow dir, otherwise it fails!
cd
python
import tensorflow as tf
hello = tf.constant('Hello, TensorFlow!')
sess = tf.Session()
print(sess.run(hello))
On the 4 cores of an m4.xlarge:
sudo pip install Keras
wget https://raw.githubusercontent.com/keras-team/keras/master/examples/mnist_cnn.py
python mnist_cnn.py
build | steps | epoch time (s) | ms/step | speedup |
---|---|---|---|---|
default | 60000 | 109 | 1.817 | 0% |
with AVX2/FMA | 60000 | 89 | 1.483 | +22.5% |
MKL | 60000 | 136 | 2.267 | -19.9% |
-> +22.5% speedup on MNIST, but a slowdown with MKL!
wget https://raw.githubusercontent.com/keras-team/keras/master/examples/cifar10_resnet.py
python cifar10_resnet.py
modified to: 100 batches, batch_size=32, time at epoch 2 (274,442 params)
build | steps | epoch time (s) | ms/step | speedup |
---|---|---|---|---|
default | 100 | 54 | 540 | 0% |
with AVX2/FMA | 100 | 49 | 492 | +9.75% |
MKL | 100 | 31 | 306 | +76.5% |
-> in this case MKL is significantly faster!
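The speedup column in both tables is just the ratio of per-step times; a quick sanity check:

```python
def speedup_pct(ms_per_step_default, ms_per_step_build):
    # Speedup of a build relative to the default build, in percent.
    return (ms_per_step_default / ms_per_step_build - 1.0) * 100.0

print(round(speedup_pct(1.817, 1.483), 1))  # MNIST, AVX2/FMA -> 22.5
print(round(speedup_pct(540, 306), 1))      # CIFAR10 ResNet, MKL -> 76.5
```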
We should measure this with our model; both builds look promising.
git clone git@github.com:tensorflow/benchmarks.git tf_benchmarks
cd tf_benchmarks
# download CIFAR10 data
python -c 'from keras.datasets import cifar10; cifar10.load_data()'
# measure
python scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet56 --data_name cifar10 --data_dir ~/.keras/datasets/cifar-10-batches-py --num_batches 100
"InvalidArgumentError (see above for traceback): Default AvgPoolingOp only supports NHWC on device type CPU" :( (the benchmark apparently defaults to the NCHW data format, which the CPU pooling kernels do not support; passing --data_format=NHWC may help)
Set up ~/.pypirc:
[distutils]
index-servers =
pypi
private
[pypi]
[private]
repository=http://pypi.example.com
username=...
password=...
sudo apt install -y twine python3-setuptools
sudo pip install setuptools
Rename and upload our packages. We use a version suffix to differentiate the builds; in the wheel filename it needs underscores, not hyphens.
cd ~/tensorflow_pkg/
mv tensorflow-1.8.0-cp27-cp27mu-linux_x86_64.whl tensorflow-1.8.0_broadwell-cp27-cp27mu-linux_x86_64.whl
twine upload -r private tensorflow-1.8.0_broadwell-cp27-cp27mu-linux_x86_64.whl
cd ~/tensorflow_pkg_mkl/
mv tensorflow-1.8.0-cp27-cp27mu-linux_x86_64.whl tensorflow-1.8.0_broadwell_mkl-cp27-cp27mu-linux_x86_64.whl
twine upload -r private tensorflow-1.8.0_broadwell_mkl-cp27-cp27mu-linux_x86_64.whl
Set up ~/.pip/pip.conf:
[global]
extra-index-url=https://user:password@pypi.example.com/simple/
# if using HTTP, enable this
#trusted-host=pypi.example.com
Broadwell AVX2/FMA:
pip install tensorflow==1.8.0-broadwell
Broadwell MKL:
pip install tensorflow==1.8.0-broadwell-mkl
Note that in the file name the version suffix needs underscores, while in pip it needs hyphens :facepalm_tone1:.
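The underscore requirement comes from the wheel filename format itself, which is hyphen-delimited (name-version-pythontag-abitag-platform.whl), so a hyphen inside the version would split the fields:

```python
# Wheel filenames are hyphen-delimited: name-version-pythontag-abitag-platform.whl,
# so the version suffix must use underscores to keep the field split intact.
fname = "tensorflow-1.8.0_broadwell-cp27-cp27mu-linux_x86_64.whl"
name, version, python_tag, abi_tag, platform = fname[:-len(".whl")].split("-")
print(version)  # -> 1.8.0_broadwell
```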