Compiling TensorFlow 1.8 with AVX2/FMA instructions and with Intel MKL

How to compile TensorFlow for CPUs with AVX2/FMA instructions or using Intel MKL?

In this tutorial we compile TF for EC2 Xeon processors and get a +76 % speedup training CIFAR10 with ResNet. In practice this can be useful for speeding up CPU inference on AWS EC2, where a GPU instance would be too costly.

Brought to you by Rossum. We automate data extraction from documents using deep learning.

Which machines?

  • m4.xlarge
    • 2.3 GHz Intel Xeon® E5-2686 v4 (Broadwell) processors (cat /proc/cpuinfo)
      • flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm fsgsbase bmi1 avx2 smep bmi2 erms invpcid xsaveopt
    • 2.4 GHz Intel Xeon® E5-2676 v3 (Haswell) processors
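To quickly check whether a given machine supports these instructions, you can grep the CPU flags (a one-liner equivalent of reading /proc/cpuinfo as above):

grep -m1 flags /proc/cpuinfo | grep -o -w -e avx2 -e fma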

Basic docs: https://www.tensorflow.org/install/install_sources

TensorFlow says that since 1.6 its binaries are precompiled with AVX instructions, but in practice that does not seem to be the case:

2018-05-29 10:23:59.541587: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA

Preparing build

Compiling on AWS EC2 m4.xlarge with Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz for Python 2.7.

Note: an 8 GB disk can be a bit small for the build & testing; 16 GB should be fine.

Each build takes around 2 hours.

Get TensorFlow code

git clone https://github.com/tensorflow/tensorflow
cd tensorflow
git checkout r1.8

Install bazel

Docs: https://docs.bazel.build/versions/master/install-ubuntu.html

TF 1.8 was tested with Bazel 0.10.0.

sudo apt-get -y install pkg-config zip g++ zlib1g-dev unzip python
BAZEL_VERSION="0.10.0/bazel-0.10.0-installer-linux-x86_64"
wget https://github.com/bazelbuild/bazel/releases/download/${BAZEL_VERSION}.sh
wget https://github.com/bazelbuild/bazel/releases/download/${BAZEL_VERSION}.sh.sha256
sha256sum -c bazel-0.10.0-installer-linux-x86_64.sh.sha256 
bash bazel-0.10.0-installer-linux-x86_64.sh --user
export PATH="$PATH:$HOME/bin"
source /home/ubuntu/.bazel/bin/bazel-complete.bash
Alternatively, install Bazel from its APT repository (note this may pull a newer version than 0.10.0):

sudo apt-get -y install openjdk-8-jdk
echo "deb [arch=amd64] http://storage.googleapis.com/bazel-apt stable jdk1.8" | sudo tee /etc/apt/sources.list.d/bazel.list
curl https://bazel.build/bazel-release.pub.gpg | sudo apt-key add -
sudo apt-get update && sudo apt-get install bazel
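It is worth verifying which Bazel ends up on the PATH, since the APT package may shadow the 0.10.0 installer version:

which bazel
bazel version   # TF 1.8 was tested with 0.10.0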

Install TensorFlow Python dependencies

sudo apt-get -y install python-numpy python-dev python-pip python-wheel

Avoid the errors "ImportError: No module named enum" and "ImportError: No module named mock":

sudo pip install --upgrade pip enum34 mock
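A quick sanity check that the modules import cleanly, so you don't fail deep into the 2-hour build:

python -c "import enum, mock; print('deps ok')"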

Configure the installation

./configure
Google Cloud Platform support? [Y/n]: n
Hadoop File System support? [Y/n]: n
Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native]:

Preconfigured Bazel build configs. You can use any of the below by adding "--config=<>" to your build command. See tools/bazel.rc for more details.
        --config=mkl            # Build with MKL support.
        --config=monolithic     # Config for mostly static monolithic build.
Configuration finished

When cross-compiling from another machine, use -march=broadwell instead of the default -march=native.
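If you are unsure what -march=native resolves to on the build machine, GCC can tell you (this queries the compiler, not TensorFlow):

gcc -march=native -Q --help=target | grep -- '-march='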

Building the pip package

With AVX2/FMA

bazel build --config=opt //tensorflow/tools/pip_package:build_pip_package

First try (with enum & mock missing):

INFO: Elapsed time: 5461.730s, Critical Path: 92.59s

Second try:

INFO: Elapsed time: 1841.308s, Critical Path: 70.27s
INFO: Build completed successfully, 1732 total actions
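If the build exhausts memory on a smaller instance, one workaround (at the cost of a longer build) is to cap Bazel's parallelism with its standard --jobs flag:

bazel build --config=opt --jobs=2 //tensorflow/tools/pip_package:build_pip_package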

Make pip package:

bazel-bin/tensorflow/tools/pip_package/build_pip_package ~/tensorflow_pkg

-> ~/tensorflow_pkg/tensorflow-1.8.0-cp27-cp27mu-linux_x86_64.whl

With MKL

bazel clean

# https://github.com/tensorflow/tensorflow/issues/13928#issuecomment-355756988
rm -rf ~/.cache

# it automatically downloads and installs MKL
bazel build --config=opt --config=mkl //tensorflow/tools/pip_package:build_pip_package

# save to other directory, since the package name is the same
bazel-bin/tensorflow/tools/pip_package/build_pip_package ~/tensorflow_pkg_mkl
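Once the MKL wheel is installed (see Testing below), you can sanity-check that it really links against the MKL libraries. The install path below is a guess for a default system-wide pip install on Python 2.7; adjust it to your environment:

# path is an assumption for a default pip install
ldd /usr/local/lib/python2.7/dist-packages/tensorflow/libtensorflow_framework.so | grep -i mkl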

Testing

Installing the pip package

sudo pip install ~/tensorflow_pkg/tensorflow-1.8.0-cp27-cp27mu-linux_x86_64.whl

Validating your installation

Make sure you are not inside the ~/tensorflow source dir, otherwise the import fails!

cd
python
import tensorflow as tf
hello = tf.constant('Hello, TensorFlow!')
sess = tf.Session()
print(sess.run(hello))
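To confirm the optimized build took effect, check that the cpu_feature_guard warning from above no longer appears (it is printed to stderr when the first Session is created):

python -c "import tensorflow as tf; tf.Session()" 2>&1 | grep cpu_feature_guard || echo "OK: no unused CPU feature warning"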

Measurements

On 4 cores at m4.xlarge.

Test and measure with Keras - MNIST CNN

sudo pip install Keras
wget https://raw.githubusercontent.com/keras-team/keras/master/examples/mnist_cnn.py
python mnist_cnn.py
build           steps   epoch time (s)   ms/step   speedup
default         60000   109              1.817     0 %
with AVX2/FMA   60000   89               1.483     +22.5 %
MKL             60000   136              2.267     -19.9 %

-> +22.5 % speedup on MNIST with AVX2/FMA, but a slowdown with MKL!
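The MKL slowdown on a small model like this is often a threading configuration issue; Intel's guidance for TF with MKL suggests tuning the OpenMP environment variables. The values below are common starting points, not tuned for m4.xlarge:

export OMP_NUM_THREADS=4   # number of physical cores
export KMP_BLOCKTIME=0     # release idle OpenMP threads immediately
export KMP_AFFINITY=granularity=fine,compact,1,0
python mnist_cnn.py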

Test and measure with Keras - CIFAR10 ResNet

wget https://raw.githubusercontent.com/keras-team/keras/master/examples/cifar10_resnet.py
python cifar10_resnet.py

The script was modified to run 100 batches with batch_size=32; time measured at epoch 2 (274,442 params).

build           steps   epoch time (s)   ms/step   speedup
default         100     54               540       0 %
with AVX2/FMA   100     49               492       +9.75 %
MKL             100     31               306       +76.5 %

-> in this case MKL is significantly faster!

We should measure this with our own model; both builds look promising.

Other performance measurements (not working)

git clone git@github.com:tensorflow/benchmarks.git tf_benchmarks
cd tf_benchmarks
# download CIFAR10 data
python -c 'from keras.datasets import cifar10; cifar10.load_data()'
# measure
python scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet56 --data_name cifar10 --data_dir ~/.keras/datasets/cifar-10-batches-py --num_batches 100

"InvalidArgumentError (see above for traceback): Default AvgPoolingOp only supports NHWC on device type CPU" :(

Publish to a private PyPI repository

Set up ~/.pypirc:

[distutils]
index-servers =
    pypi
    private_pypi

[pypi]

[private_pypi]
repository=http://pypi.example.com
username=...
password=...

Install the upload tools:

sudo apt install -y twine python3-setuptools
sudo pip install setuptools

Rename and upload our packages. We use a version suffix to differentiate the builds; in the file name it needs underscores, not hyphens.

cd ~/tensorflow_pkg/
mv tensorflow-1.8.0-cp27-cp27mu-linux_x86_64.whl tensorflow-1.8.0_broadwell-cp27-cp27mu-linux_x86_64.whl 
twine upload -r private_pypi tensorflow-1.8.0_broadwell-cp27-cp27mu-linux_x86_64.whl

cd ~/tensorflow_pkg_mkl/
mv tensorflow-1.8.0-cp27-cp27mu-linux_x86_64.whl tensorflow-1.8.0_broadwell_mkl-cp27-cp27mu-linux_x86_64.whl 
twine upload -r private_pypi tensorflow-1.8.0_broadwell_mkl-cp27-cp27mu-linux_x86_64.whl

Installing via pip

Set up ~/.pip/pip.conf:

[global]
extra-index-url=https://user:password@pypi.example.com/simple/
# if using HTTP, enable this
#trusted-host=pypi.example.com

Broadwell AVX2/FMA:

pip install tensorflow==1.8.0-broadwell

Broadwell MKL:

pip install tensorflow==1.8.0-broadwell-mkl

Note that in the file name the version suffix needs underscores, while in pip it needs hyphens :facepalm_tone1:.
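To double-check which wheel pip actually resolves for a given version suffix, you can fetch it without installing (pip download is a standard pip subcommand):

pip download tensorflow==1.8.0-broadwell --no-deps -d /tmp/tf_wheel
ls /tmp/tf_wheel   # should show the _broadwell wheel filename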
