How to compile TensorFlow for CPUs with AVX2/FMA instructions or using Intel MKL?
In this tutorial we compile TF for EC2 Xeon processors and get a +76% speedup for training CIFAR10 with ResNet. In practice this can be useful for speeding up CPU inference on AWS EC2, where a GPU instance would be too costly.
Brought to you by Rossum. We automate data extraction from documents using deep learning.
- m4.xlarge
- 2.3 GHz Intel Xeon® E5-2686 v4 (Broadwell) processors (cat /proc/cpuinfo)
- flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm fsgsbase bmi1 avx2 smep bmi2 erms invpcid xsaveopt
- 2.4 GHz Intel Xeon® E5-2676 v3 (Haswell) processors
Basic docs: https://www.tensorflow.org/install/install_sources
TF says that since 1.6 it is precompiled with AVX instructions, but in practice it seems it is not:
2018-05-29 10:23:59.541587: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
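Before building, it is worth confirming what the CPU actually supports by looking at the flags line from `cat /proc/cpuinfo`. A minimal sketch (the helper name is mine; the flags excerpt is taken from the m4.xlarge flags line above):

```python
def missing_simd_flags(cpuinfo_flags, wanted=("avx", "avx2", "fma")):
    # Return the wanted instruction-set flags absent from a cpuinfo "flags" line.
    present = set(cpuinfo_flags.split())
    return [f for f in wanted if f not in present]

# Excerpt of the m4.xlarge flags line shown above:
flags = "sse sse2 ssse3 fma sse4_1 sse4_2 avx f16c avx2 bmi1 bmi2"
print(missing_simd_flags(flags))  # -> [] on this Broadwell Xeon
```

If the list is non-empty, a build with those instructions enabled will crash with illegal-instruction errors on that machine.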
Compiling on AWS EC2 m4.xlarge
with Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
for Python 2.7.
Note: an 8 GB disk can be a bit too small for building & testing; 16 GB should be fine.
Each build takes around 2 hours.
git clone https://github.com/tensorflow/tensorflow
cd tensorflow
git checkout r1.8
Docs: https://docs.bazel.build/versions/master/install-ubuntu.html
TF 1.8 was tested with Bazel 0.10.0.
sudo apt-get -y install pkg-config zip g++ zlib1g-dev unzip python
BAZEL_VERSION="0.10.0/bazel-0.10.0-installer-linux-x86_64"
wget https://github.com/bazelbuild/bazel/releases/download/${BAZEL_VERSION}.sh
wget https://github.com/bazelbuild/bazel/releases/download/${BAZEL_VERSION}.sh.sha256
sha256sum -c bazel-0.10.0-installer-linux-x86_64.sh.sha256
bash bazel-0.10.0-installer-linux-x86_64.sh --user
export PATH="$PATH:$HOME/bin"
source /home/ubuntu/.bazel/bin/bazel-complete.bash
Alternatively, install Bazel from its APT repository:
sudo apt-get -y install openjdk-8-jdk
echo "deb [arch=amd64] http://storage.googleapis.com/bazel-apt stable jdk1.8" | sudo tee /etc/apt/sources.list.d/bazel.list
curl https://bazel.build/bazel-release.pub.gpg | sudo apt-key add -
sudo apt-get update && sudo apt-get install bazel
sudo apt-get -y install python-numpy python-dev python-pip python-wheel
To avoid the errors "ImportError: No module named enum" and "ImportError: No module named mock":
sudo pip install --upgrade pip enum34 mock
./configure
Google Cloud Platform support? [Y/n]: n
Hadoop File System support? [Y/n]: n
Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native]:
Preconfigured Bazel build configs. You can use any of the below by adding "--config=<>" to your build command. See tools/bazel.rc for more details.
--config=mkl # Build with MKL support.
--config=monolithic # Config for mostly static monolithic build.
Configuration finished
When cross-compiling from another machine, use: -march=broadwell
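For reference, a tiny lookup of GCC -march targets for the EC2 Xeon models listed at the top. The mapping is assumed from the microarchitecture names AWS publishes, not an official table:

```python
# GCC -march targets for the EC2 Xeon models listed above
# (assumed from the Haswell/Broadwell microarchitecture names).
MARCH = {
    "Intel Xeon E5-2676 v3": "haswell",    # 2.4 GHz Haswell
    "Intel Xeon E5-2686 v4": "broadwell",  # 2.3 GHz Broadwell
}
print(MARCH["Intel Xeon E5-2686 v4"])  # -> broadwell
```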
bazel build --config=opt //tensorflow/tools/pip_package:build_pip_package
INFO: Elapsed time: 5461.730s, Critical Path: 92.59s # missing enum & mock
2nd try:
INFO: Elapsed time: 1841.308s, Critical Path: 70.27s
INFO: Build completed successfully, 1732 total actions
Make pip package:
bazel-bin/tensorflow/tools/pip_package/build_pip_package ~/tensorflow_pkg
-> ~/tensorflow_pkg/tensorflow-1.8.0-cp27-cp27mu-linux_x86_64.whl
bazel clean
# https://github.com/tensorflow/tensorflow/issues/13928#issuecomment-355756988
rm -rd ~/.cache
# it automatically downloads and installs MKL
bazel build --config=opt --config=mkl //tensorflow/tools/pip_package:build_pip_package
# save to other directory, since the package name is the same
bazel-bin/tensorflow/tools/pip_package/build_pip_package ~/tensorflow_pkg_mkl
sudo pip install ~/tensorflow_pkg/tensorflow-1.8.0-cp27-cp27mu-linux_x86_64.whl
Make sure not to be inside the ~/tensorflow dir, otherwise it fails!
cd
python
import tensorflow as tf
hello = tf.constant('Hello, TensorFlow!')
sess = tf.Session()
print(sess.run(hello))
On the 4 cores of an m4.xlarge:
sudo pip install Keras
wget https://raw.githubusercontent.com/keras-team/keras/master/examples/mnist_cnn.py
python mnist_cnn.py
build | steps | epoch time (s) | ms/step | speedup |
---|---|---|---|---|
default | 60000 | 109 | 1.817 | 0% |
with AVX2/FMA | 60000 | 89 | 1.483 | +22.5% |
MKL | 60000 | 136 | 2.267 | -19.9% |
-> +22.5% speedup on MNIST, but a slowdown with MKL!
wget https://raw.githubusercontent.com/keras-team/keras/master/examples/cifar10_resnet.py
python cifar10_resnet.py
modified to: 100 batches, batch_size=32, time at epoch 2 (274,442 params)
build | steps | epoch time (s) | ms/step | speedup |
---|---|---|---|---|
default | 100 | 54 | 540 | 0% |
with AVX2/FMA | 100 | 49 | 492 | +9.75% |
MKL | 100 | 31 | 306 | +76.5% |
-> in this case MKL is significantly faster!
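The speedup column in both tables is just the ratio of per-step times; a quick sanity check:

```python
def speedup_pct(ms_per_step_default, ms_per_step_build):
    # Speedup of a build relative to the default build, in percent.
    return (ms_per_step_default / ms_per_step_build - 1.0) * 100.0

print(round(speedup_pct(1.817, 1.483), 1))  # MNIST, AVX2/FMA -> 22.5
print(round(speedup_pct(540, 306), 1))      # CIFAR10 ResNet, MKL -> 76.5
```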
We should measure this with our model; both builds look promising.
git clone git@github.com:tensorflow/benchmarks.git tf_benchmarks
cd tf_benchmarks
# download CIFAR10 data
python -c 'from keras.datasets import cifar10; cifar10.load_data()'
# measure
python scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet56 --data_name cifar10 --data_dir ~/.keras/datasets/cifar-10-batches-py --num_batches 100
"InvalidArgumentError (see above for traceback): Default AvgPoolingOp only supports NHWC on device type CPU" :( (the benchmark apparently defaults to the NCHW data format, which the CPU pooling kernels do not support; passing --data_format=NHWC may help)
Set up ~/.pypirc:
[distutils]
index-servers =
pypi
private
[pypi]
[private]
repository=http://pypi.example.com
username=...
password=...
sudo apt install -y twine python3-setuptools
sudo pip install setuptools
Rename and upload our packages. We use a version suffix to differentiate the builds; in the wheel filename it needs underscores, not hyphens.
cd ~/tensorflow_pkg/
mv tensorflow-1.8.0-cp27-cp27mu-linux_x86_64.whl tensorflow-1.8.0_broadwell-cp27-cp27mu-linux_x86_64.whl
twine upload -r private tensorflow-1.8.0_broadwell-cp27-cp27mu-linux_x86_64.whl
cd ~/tensorflow_pkg_mkl/
mv tensorflow-1.8.0-cp27-cp27mu-linux_x86_64.whl tensorflow-1.8.0_broadwell_mkl-cp27-cp27mu-linux_x86_64.whl
twine upload -r private tensorflow-1.8.0_broadwell_mkl-cp27-cp27mu-linux_x86_64.whl
Set up ~/.pip/pip.conf:
[global]
extra-index-url=https://user:password@pypi.example.com/simple/
# if using HTTP, enable this
#trusted-host=pypi.example.com
Broadwell AVX2/FMA:
pip install tensorflow==1.8.0-broadwell
Broadwell MKL:
pip install tensorflow==1.8.0-broadwell-mkl
Note that in the file name the version suffix needs underscores, while in pip it needs hyphens :facepalm_tone1:.
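The underscore requirement comes from the wheel filename format itself, which is hyphen-delimited (name-version-pythontag-abitag-platform.whl), so a hyphen inside the version would split the fields:

```python
# Wheel filenames are hyphen-delimited: name-version-pythontag-abitag-platform.whl,
# so the version suffix must use underscores to keep the field split intact.
fname = "tensorflow-1.8.0_broadwell-cp27-cp27mu-linux_x86_64.whl"
name, version, python_tag, abi_tag, platform = fname[:-len(".whl")].split("-")
print(version)  # -> 1.8.0_broadwell
```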