# Note – this is not a bash script (some of the steps require a reboot)
# I named it .sh just so GitHub does correct syntax highlighting.
#
# This is also available as an AMI in us-east-1 (Virginia): ami-cf5028a5
#
# The CUDA part is mostly based on this excellent blog post:
# http://tleyden.github.io/blog/2014/10/25/cuda-6-dot-5-on-aws-gpu-instance-running-ubuntu-14-dot-04/
# Install various packages
sudo apt-get update
sudo apt-get upgrade -y # choose “install package maintainer's version”
sudo apt-get install -y build-essential python-pip python-dev git python-numpy swig default-jdk zip zlib1g-dev
# Blacklist Nouveau, which conflicts with the NVIDIA driver
echo -e "blacklist nouveau\nblacklist lbm-nouveau\noptions nouveau modeset=0\nalias nouveau off\nalias lbm-nouveau off\n" | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
echo options nouveau modeset=0 | sudo tee -a /etc/modprobe.d/nouveau-kms.conf
sudo update-initramfs -u
sudo reboot # Reboot (annoying you have to do this in 2015!)
# Some other annoying thing we have to do
sudo apt-get install -y linux-image-extra-virtual
sudo reboot # Not sure why this is needed
# Install latest Linux headers
sudo apt-get install -y linux-source "linux-headers-$(uname -r)"
# Install CUDA 7.0 (note – don't use any other version)
wget http://developer.download.nvidia.com/compute/cuda/7_0/Prod/local_installers/cuda_7.0.28_linux.run
chmod +x cuda_7.0.28_linux.run
./cuda_7.0.28_linux.run -extract="$(pwd)/nvidia_installers"
cd nvidia_installers
sudo ./NVIDIA-Linux-x86_64-346.46.run
sudo modprobe nvidia
sudo ./cuda-linux64-rel-7.0.28-19326674.run
cd
# Install CUDNN 6.5 (note – don't use any other version)
# YOU NEED TO SCP THIS ONE FROM SOMEWHERE ELSE – it's not available online.
# You need to register and get approved to get a download link. Very annoying.
tar -xzf cudnn-6.5-linux-x64-v2.tgz
sudo cp cudnn-6.5-linux-x64-v2/libcudnn* /usr/local/cuda/lib64
sudo cp cudnn-6.5-linux-x64-v2/cudnn.h /usr/local/cuda/include/
# At this point the root mount is getting a bit full
# I had a lot of issues where the disk would fill up and then Bazel would end up in this weird state complaining about random things
# Make sure you don't run out of disk space when building TensorFlow!
sudo mkdir /mnt/tmp
sudo chmod 777 /mnt/tmp
sudo rm -rf /tmp
sudo ln -s /mnt/tmp /tmp
# Note that /mnt is not saved when building an AMI, so don't put anything crucial on it
# Install Bazel
cd /mnt/tmp
git clone https://github.com/bazelbuild/bazel.git
cd bazel
git checkout tags/0.1.0
./compile.sh
sudo cp output/bazel /usr/bin
# Install TensorFlow
cd /mnt/tmp
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda/lib64"
export CUDA_HOME=/usr/local/cuda
git clone --recurse-submodules https://github.com/tensorflow/tensorflow
cd tensorflow
# Patch to support older K520 devices on AWS
# wget "https://gist.githubusercontent.com/infojunkie/cb6d1a4e8bf674c6e38e/raw/5e01e5b2b1f7afd3def83810f8373fbcf6e47e02/cuda_30.patch"
# git apply cuda_30.patch
# According to https://github.com/tensorflow/tensorflow/issues/25#issuecomment-156234658 this patch is no longer needed
# Instead, you need to run ./configure like below (not tested yet)
TF_UNOFFICIAL_SETTING=1 ./configure
bazel build -c opt --config=cuda //tensorflow/cc:tutorials_example_trainer
# Build Python package
# Note: you have to specify --config=cuda here - this is not mentioned in the official docs
# https://github.com/tensorflow/tensorflow/issues/25#issuecomment-156173717
bazel build -c opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
# The wheel filename depends on the TensorFlow version you built; adjust it if you built something newer than 0.5.0
sudo pip install /tmp/tensorflow_pkg/tensorflow-0.5.0-cp27-none-linux_x86_64.whl
# Test it!
cd tensorflow/models/image/cifar10/
python cifar10_multi_gpu_train.py
# On a g2.2xlarge: step 100, loss = 4.50 (325.2 examples/sec; 0.394 sec/batch)
# On a g2.8xlarge: step 100, loss = 4.49 (337.9 examples/sec; 0.379 sec/batch)
# doesn't seem like it is able to use the 4 GPU cards unfortunately :(
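The script above warns that Bazel ends up in a confusing state when the disk fills up during the build. A minimal sketch of a pre-build check follows, assuming a hypothetical helper `has_free_kb` (not part of Bazel or TensorFlow) that wraps `df`:

```shell
# Hypothetical helper: succeed only if the filesystem holding directory $1
# has at least $2 kilobytes available.
has_free_kb() (
    dir=$1
    need_kb=$2
    # df -P prints one POSIX-format line per filesystem; with -k, column 4
    # of the second line is the available space in 1024-byte blocks.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge "$need_kb" ]
)
```

For example, `has_free_kb /mnt/tmp 5000000 || echo "low on disk space"` could be run before `bazel build`. The 5 GB threshold is an arbitrary placeholder; the notes do not say how much space the build actually needs.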
Erik, thanks for these notes and the AMI. I wanted to play around with GPU instances on AWS, so this was very useful! Regarding the AMI: I ended up re-running the Bazel installation and re-fetching and building the latest TensorFlow (I wanted to run the convolutional.py example without the final test crashing, for which the latest source with the BFC allocator as default was useful). From this perspective it would be more convenient if the Bazel and TensorFlow trees were left on the AMI rather than being excluded by putting them on /mnt. Also, I guess wfbradley probably tested it too, but TF_UNOFFICIAL_SETTING=1 ./configure works as advertised.
It works for me without blacklisting Nouveau.
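Whether the blacklist step matters on a given image can be checked by looking at which kernel modules are loaded. A minimal sketch, assuming a hypothetical helper `module_loaded` that reads a `/proc/modules`-style listing:

```shell
# Hypothetical helper: succeed if module $1 appears in a /proc/modules-style
# listing (default: /proc/modules). The module name is the first field.
module_loaded() (
    name=$1
    listing=${2:-/proc/modules}
    awk -v m="$name" '$1 == m { found = 1 } END { exit !found }' "$listing"
)
```

`module_loaded nouveau && echo "nouveau is loaded"` reports whether the open-source driver is active. After `sudo modprobe nvidia`, `module_loaded nvidia` should succeed.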
I also wanted to do a git clone and recompile TensorFlow in order to get the latest ImageNet model. I too reinstalled bazel, since the latest version of the TensorFlow code requires bazel 0.1.1 (as described here: https://www.tensorflow.org/versions/master/get_started/os_setup.html), i.e. do a |
Do you need |
Yes, @hammer is right, |
Hi, thanks for the great work. I am trying to compare CPU vs. GPU on different hardware, and I am getting this:
- MacBook Pro i5 2.6 GHz (cifar10_train.py)
- AWS g2.2xlarge, GPU, $2/hour (cifar10_multi_gpu_train.py)
- AWS g2.2xlarge, no GPU, $2/hour (cifar10_multi_gpu_train.py)
Is it normal that my Mac is less than 2x slower than a g2.2xlarge that uses a GPU? I was expecting 10x...
These were very useful, Erik. I finally got around to using TensorFlow today.
I'm seeing the same performance numbers as @marcotrombetti on g2.2xlarge instances, both on GPU and on CPU. This seems to be many times slower than Theano on the same hardware when running on GPU. Is this expected, or is this indicative of some misconfiguration on my side?
To correct Line 88 above, it CAN use all four cores.
Performance attributes (all measured at step = 50)
Notes: To run with 4 cores, call the example with --num_gpus=4. It is very important that you set Compute Capability = 3.0 (thanks @wfbradley). If you pulled the latest TensorFlow version 0.6, you need to change Line 80. However, for me this led to segfaults in the example due to an issue with the Eigen kernel. This has temporarily been resolved. Please see:
Why does Erik install TensorFlow the way he does? Why not use pip?
Thanks @closedLoop for all the information, I get the same numbers. Still strange that @erikbern and @marcotrombetti report much higher speed (32x examples/sec instead of 22x examples/sec).
To build for Python 3.4:
Just for reference: GeForce 970, i7, local machine: 903.3 examples/sec; 0.142 sec/batch
Just got it to run. Thank you for the code. My timing, measured at step 50:
The GPU usage is quite unimpressive, as @marcotrombetti says. I was also expecting an order-of-magnitude improvement.
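The timings quoted in this thread all come from the training script's log lines, such as `step 100, loss = 4.50 (325.2 examples/sec; 0.394 sec/batch)`. To compare runs without copying numbers by hand, a small filter can pull out the throughput figure. This is only a sketch: `examples_per_sec` is a hypothetical name, and it assumes the log format shown in the install notes.

```shell
# Hypothetical helper: read training log lines on stdin and print the
# examples/sec figure from each line that contains one.
examples_per_sec() {
    sed -n 's/.*(\([0-9.][0-9.]*\) examples\/sec.*/\1/p'
}
```

Piping the training output through it, e.g. `python cifar10_multi_gpu_train.py --num_gpus=4 2>&1 | examples_per_sec`, prints one number per logged step, which makes single-GPU and multi-GPU runs easier to line up.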
I have done all the steps, but the last pip install seems to be an issue. I used git "checkout tags/0.1.4" for Bazel instead. Edit: OK, silly me, Line 80 needs to be changed to 0.6.0 instead.
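Because the wheel's filename changes with the TensorFlow version, hard-coding it in the `pip install` line breaks whenever a newer source tree is built. One way around this, sketched here with a hypothetical helper `find_wheel`, is to look the file up by pattern:

```shell
# Hypothetical helper: print the path of the single TensorFlow wheel in
# directory $1, or fail if there is not exactly one match.
find_wheel() (
    dir=$1
    set -- "$dir"/tensorflow-*.whl
    # With no match the glob stays literal, so also check the file exists.
    if [ "$#" -eq 1 ] && [ -f "$1" ]; then
        printf '%s\n' "$1"
    else
        echo "expected exactly one wheel in $dir" >&2
        exit 1
    fi
)
```

`sudo pip install "$(find_wheel /tmp/tensorflow_pkg)"` then installs whichever version was just built.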
You can download CUDNN 6.5: http://developer.download.nvidia.com/compute/redist/cudnn/v2/cudnn-6.5-linux-x64-v2.tgz
Using keras mnist_cnn script to compare the performance of Theano to Tensorflow on |
In case anyone's interested, we documented how we installed TensorFlow along with Python 3.4 and Jupyter on EC2 based on this gist and many of the comments here. Thank you everyone!
Didn't notice all the comments here – GitHub doesn't send notifications on gists, I guess. Anyway, you should check out @chrisconley's link instead!
Nice work! Thanks for the guide ^^
If you for some reason found |
There seems to be a new solution: https://aws.amazon.com/marketplace/pp/B01AOE205O
@AlexJoz's AMI works great.
Published new AMI in N. Virginia with 0.8.0 support: ami-1e19ee73
Thanks. I will look into changing my bash file that installs the CPU version of TensorFlow (with video) to a GPU version on Cloud9: http://c9.io
When I run line 73, I get an error: "Unrecognized option: --host_force_python=py2". Any idea why?
@shamak I'm getting the exact same error.
It seems that cuDNN can be downloaded with: curl -fvSL http://developer.download.nvidia.com/compute/redist/cudnn/v2/cudnn-6.5-linux-x64-v2.tgz -o cudnn-6.5-linux-x64-v2.tgz
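Since the install script copies `cudnn.h` and `libcudnn*` out of the extracted directory, it can help to confirm that a downloaded archive actually contains the header before unpacking it. A minimal sketch, where `archive_has` is a hypothetical helper built on `tar -t`:

```shell
# Hypothetical helper: succeed if the gzip-compressed tar archive $1 lists
# a member whose file name (last path component) is exactly $2.
archive_has() (
    archive=$1
    member=$2
    tar -tzf "$archive" |
        awk -v m="$member" '{ n = split($0, p, "/"); if (p[n] == m) found = 1 }
                            END { exit !found }'
)
```

For instance, `archive_has cudnn-6.5-linux-x64-v2.tgz cudnn.h || echo "unexpected archive contents"` flags a truncated or wrong download before the `sudo cp` steps run.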
Just for reference: GTX 1070, i7 6700k, local machine, TensorFlow inside a Docker container, using nvidia-docker (but I doubt it adds any overhead): 1744.3 examples/sec; 0.073 sec/batch
I've recently prepared a couple of convenience scripts for firing up your AWS instance with Jupyter Notebook on board that you may find useful: |
For stat: Zotac GTX 1080 AMP Extreme, 2560 CUDA cores, 1771 MHz core clock, 10000 MHz mem clock. i7 930 3.8 GHz boost clock. step 100000, loss = 0.72 (1780.0 examples/sec; 0.072 sec/batch); time: 2h 5m.
Thank you for making these notes! A few additions:
Line 37: In particular, don't install CUDNN 7.0 :)
Line 58: I had to run "mkdir /tmp/ubuntu" first.
Line 72: One should enter 3.0 when prompted for compute capability on AWS, i.e.:
[Default is: "3.5,5.2"]: 3.0
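The default list offered by the prompt does not include 3.0, which is why it has to be entered by hand for the K520 cards mentioned in the install notes. A minimal sketch of that membership check, using a hypothetical helper `cap_in_list` (an illustration only, not part of TensorFlow's configure script):

```shell
# Hypothetical helper: succeed if compute capability $1 appears in the
# comma-separated list $2.
cap_in_list() (
    cap=$1
    list=$2
    # Wrap both sides in commas so "3.0" cannot match inside "13.0".
    case ",$list," in
        *",$cap,"*) exit 0 ;;
        *) exit 1 ;;
    esac
)
```

`cap_in_list 3.0 "3.5,5.2"` fails, while `cap_in_list 3.0 "3.0"` succeeds, matching the advice to type 3.0 at the prompt on AWS.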