Horovod Cluster Setup with CUDA 9.0, Miniconda3 and Tensorflow 1.8rc

Setup of a GPU-Enabled VM for Distributed TensorFlow Using Horovod

Update CentOS

Update CentOS and reboot (you will need to login again):

sudo yum update -y
sudo reboot

Install Spark and JRE

As: root user

Install Java and misc:

sudo yum install -y rsync net-tools sbt git bzip2 java-1.8.0-openjdk.x86_64 java-1.8.0-openjdk-devel
java -version

echo export JAVA_HOME=\"$(readlink -f $(which java) | grep -oP '.*(?=/bin)')\" >> /root/.bash_profile
source /root/.bash_profile
$JAVA_HOME/bin/java -version

Download and extract Spark:

 curl http://mirrors.ocf.berkeley.edu/apache/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.7.tgz | tar -zx -C /usr/local --show-transformed --transform='s,/*[^/]*,spark,'

Add Spark to .bash_profile:

echo export SPARK_HOME=\"/usr/local/spark\" >> /root/.bash_profile
echo export SPARK_EXAMPLES_JAR=\"/usr/local/spark/examples/jars/spark-examples_2.11-2.3.0.jar\" >> /root/.bash_profile
echo "export PATH=$PATH:$HOME/.local/bin:$HOME/bin:/usr/local/spark/bin" >> /root/.bash_profile
source /root/.bash_profile

Check that Spark is installed correctly:

spark-shell
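Optionally, run a quick non-interactive sanity check (a sketch; assumes spark-shell is on the PATH and should print a count of 1000):

echo 'println(sc.parallelize(1 to 1000).count())' | spark-shell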

Create Linux User

Create user <username>:

adduser <username>

Set password:

passwd <username>

Add user to wheel group for sudo privileges:

usermod -aG wheel <username>

Login as user:

su - <username>

Copy ~/.bash_profile from root:

sudo cp /root/.bash_profile ~/.bash_profile
source ~/.bash_profile

Or make sure the Spark and Java paths are in the new user's profile:

PATH=$PATH:$HOME/bin:$HOME/.local/bin:/usr/local/spark/bin

export PATH
export JAVA_HOME="/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.161-0.b14.el7_4.x86_64/jre"
export JRE_HOME="/usr/lib/jvm/jre"
export SPARK_HOME="/usr/local/spark"
export SPARK_EXAMPLES_JAR="/usr/local/spark/examples/jars/spark-examples_2.11-2.3.0.jar"

Setup Git SSH

As: non-root user

Add git ssh:

mkdir -p /home/<centos_username>/.ssh/github
ssh-keygen -t rsa -b 4096 -C "your_email@example.com"

When prompted "Enter file in which to save the key (/home/<centos_username>/.ssh/id_rsa):", enter /home/<centos_username>/.ssh/github/id_rsa

Log in to https://www.github.com and add the SSH public key from:

cat ~/.ssh/github/id_rsa.pub

Start ssh agent and add key:

NOTE: You may want to add this to ~/.bash_profile to avoid having to redo this step at every login

eval "$(ssh-agent -s)"
ssh-add ~/.ssh/github/id_rsa
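For example, to append this to ~/.bash_profile so the agent and key are loaded automatically (a sketch; note this starts a fresh agent at each login):

cat >> ~/.bash_profile <<'EOF'
eval "$(ssh-agent -s)" > /dev/null
ssh-add ~/.ssh/github/id_rsa 2>/dev/null
EOF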

Install GCC

Check for existing installation:

which gcc

Install gcc:

sudo yum group install "Development Tools"
whereis gcc

CUDA installation

Install cuda: instructions

sudo yum install pciutils

Check compatibility:

lspci | grep -i nvidia
uname -m && cat /etc/*release
uname -r

Download development packages:

sudo yum install kernel-devel-$(uname -r) kernel-headers-$(uname -r)

Enable epel release

sudo yum install --enablerepo=extras epel-release

Install DKMS and libvdpau

sudo yum install kernel-debug-devel dkms
sudo yum install libvdpau

Download CUDA toolkit 9.0 and patches. IMPORTANT: CUDA toolkit 9.1 is not currently supported by TensorFlow:

wget https://developer.nvidia.com/compute/cuda/9.0/Prod/local_installers/cuda-repo-rhel7-9-0-local-9.0.176-1.x86_64-rpm
wget https://developer.nvidia.com/compute/cuda/9.0/Prod/patches/1/cuda-repo-rhel7-9-0-local-cublas-performance-update-1.0-1.x86_64-rpm
wget https://developer.nvidia.com/compute/cuda/9.0/Prod/patches/2/cuda-repo-rhel7-9-0-local-cublas-performance-update-2-1.0-1.x86_64-rpm

Install the CUDA toolkit and patches:

sudo rpm -i cuda-repo-rhel7-9-0-local-9.0.176-1.x86_64-rpm
sudo rpm -i cuda-repo-rhel7-9-0-local-cublas-performance-update-1.0-1.x86_64-rpm
sudo rpm -i cuda-repo-rhel7-9-0-local-cublas-performance-update-2-1.0-1.x86_64-rpm
sudo yum clean all
sudo yum install cuda-9-0
sudo yum clean expire-cache

Environment Setup:

export PATH=/usr/local/cuda-9.0/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64\
                         ${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

Or, to add these to ~/.bash_profile:
echo 'export PATH=/usr/local/cuda-9.0/bin${PATH:+:${PATH}}' >> ~/.bash_profile
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}' >> ~/.bash_profile
source ~/.bash_profile
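To confirm the toolkit and driver are visible after sourcing the profile (if nvidia-smi fails at this point, it should work after the reboot below):

nvcc --version     # should report Cuda compilation tools, release 9.0
nvidia-smi         # should list the GPU and the installed driver version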

Update /usr/lib/systemd/system/nvidia-persistenced.service using a text editor such as vim:

# NVIDIA Persistence Daemon Init Script
#
# Copyright (c) 2013 NVIDIA Corporation
#
# Permission is hereby granted, free of charge, to any person obtaining a
# copy of this software and associated documentation files (the "Software"),
# to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense,
# and/or sell copies of the Software, and to permit persons to whom the
# Software is furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
# DEALINGS IN THE SOFTWARE.
#
# This is a sample systemd service file, designed to show how the NVIDIA
# Persistence Daemon can be started.
#


# Defaults
#[Unit]
#Description=NVIDIA Persistence Daemon
#Wants=syslog.target

#[Service]
#Type=forking
#ExecStart=/usr/bin/nvidia-persistenced --user root
#ExecStopPost=/bin/rm -rf /var/run/nvidia-persistenced

#[Install]
#WantedBy=multi-user.target



[Unit]
Description=NVIDIA Persistence Daemon
Wants=syslog.target

[Service]
Type=forking
PIDFile=/var/run/nvidia-persistenced/nvidia-persistenced.pid
Restart=always
ExecStart=/usr/bin/nvidia-persistenced --verbose
ExecStopPost=/bin/rm -rf /var/run/nvidia-persistenced

[Install]
WantedBy=multi-user.target

Comment out the memory hotadd line in /lib/udev/rules.d/40-redhat.rules as shown below:

# Memory hotadd request
# SUBSYSTEM=="memory", ACTION=="add", PROGRAM="/bin/uname -p", RESULT!="s390*", ATTR{state}=="offline", ATTR{state}="online"

Reboot machine (from root):

reboot

After the reboot, start the NVIDIA persistence daemon as the root user:

/usr/bin/nvidia-persistenced --verbose
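Alternatively, since the systemd unit was updated above, you can let systemd manage the daemon instead of starting it by hand (a sketch; assumes the unit is named nvidia-persistenced.service):

sudo systemctl daemon-reload
sudo systemctl enable nvidia-persistenced
sudo systemctl start nvidia-persistenced
systemctl status nvidia-persistenced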

Confirm CUDA Installation

Log back in as user and test installation:

cuda-install-samples-9.0.sh ./
cd ~/NVIDIA_CUDA-9.0_Samples
sudo make

cd ~/NVIDIA_CUDA-9.0_Samples/1_Utilities/deviceQueryDrv
./deviceQueryDrv

If CUDA was installed correctly, the output of ./deviceQueryDrv should look something like the following (note the driver version):

CUDA Device Query (Driver API) statically linked version
Detected 1 CUDA Capable device(s)

Device 0: "Tesla P100-PCIE-16GB"
  CUDA Driver Version:                           9.0
  CUDA Capability Major/Minor version number:    6.0
  Total amount of global memory:                 16276 MBytes (17066885120 bytes)
  (56) Multiprocessors, ( 64) CUDA Cores/MP:     3584 CUDA Cores
  GPU Max Clock rate:                            1329 MHz (1.33 GHz)
  Memory Clock rate:                             715 Mhz
  Memory Bus Width:                              4096-bit
  L2 Cache Size:                                 4194304 bytes
  Max Texture Dimension Sizes                    1D=(131072) 2D=(131072, 65536) 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size (x,y,z):    (2147483647, 65535, 65535)
  Texture alignment:                             512 bytes
  Maximum memory pitch:                          2147483647 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Concurrent kernel execution:                   Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 7
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Result = PASS

Install Horovod GPU Dependencies

instructions

Open MPI Installation

Install open MPI: instructions

wget https://www.open-mpi.org/software/ompi/v3.0/downloads/openmpi-3.0.1.tar.gz
tar -xzf openmpi-3.0.1.tar.gz && cd openmpi-3.0.1
./configure --with-cuda
make -j$(nproc)
sudo make install
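To verify the install and confirm that CUDA support was compiled in (this MCA parameter is the usual check for Open MPI):

mpirun --version
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value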

Install NCCL CUDA 9.0

NOTE: These must be downloaded from NVIDIA's website and can't be fetched via wget, so we use an scp transfer from the local machine to the VM. NOTE ALSO: make sure these are uploaded to /usr/local/ on the VM.

Install NCCL2: instructions

Download NCCL 2.1.15 (O/S agnostic, for CUDA 9.0) from: https://developer.nvidia.com/nccl/nccl-download

To transfer locally downloaded NCCL files to VM:

scp -i ~/.ssh/<VM_key> /<local_path_to_nccl_file>/nccl_2.1.15-1+cuda9.0_x86_64.txz <USERNAME>@<VM_IP_ADDRESS>:/usr/local/nccl_2.1.15-1+cuda9.0_x86_64.txz
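If scp cannot write to /usr/local directly because of permissions, one workaround is to copy to the user's home directory and move the file on the VM (a sketch using the same placeholders):

scp -i ~/.ssh/<VM_key> /<local_path_to_nccl_file>/nccl_2.1.15-1+cuda9.0_x86_64.txz <USERNAME>@<VM_IP_ADDRESS>:~/
sudo mv ~/nccl_2.1.15-1+cuda9.0_x86_64.txz /usr/local/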

Extract in /usr/local:

cd /usr/local
sudo tar xvf nccl_2.1.15-1+cuda9.0_x86_64.txz

Add to env variables:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/nccl_2.1.15-1+cuda9.0_x86_64/lib

Or to add directly to ~/.bash_profile (preferred):

echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/nccl_2.1.15-1+cuda9.0_x86_64/lib' >> ~/.bash_profile
source ~/.bash_profile
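To confirm the NCCL libraries are where the loader expects them:

ls /usr/local/nccl_2.1.15-1+cuda9.0_x86_64/lib     # should list libnccl.so* libraries
echo $LD_LIBRARY_PATH                              # should include the NCCL lib directory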

Install cuDNN

Install cuDNN: instructions

Download: https://developer.nvidia.com/compute/machine-learning/cudnn/secure/v7.1.2/prod/9.0_20180316/cudnn-9.0-linux-x64-v7.1 from https://developer.nvidia.com/rdp/cudnn-download

cd /usr/local
sudo tar -xzvf cudnn-9.0-linux-x64-v7.1.tgz
sudo cp cuda/include/cudnn.h /usr/local/cuda/include
sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn.h
sudo chmod a+r /usr/local/cuda/lib64/libcudnn*

Check Installation:

cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2

Install Miniconda3

Install latest miniconda:

wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
When the installer asks:

Do you wish the installer to prepend the Miniconda3 install location
to PATH in your /home/<username>/.bashrc ? [yes|no]

answer yes.

Reload .bashrc

source ~/.bashrc

Check installation

which conda

Remove installer

rm -f Miniconda3-latest-Linux-x86_64.sh

Create Conda Env:

conda create -n <ENV_NAME> python=3.6
conda update conda
conda activate <ENV_NAME>
conda install -c conda-forge pyspark
pip install ipython ipykernel jupyterlab
python -m ipykernel install --user --name <ENV_NAME> --display-name "Python (<ENV_NAME>)"
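A quick check that the environment is active and the expected interpreter is being used:

which python       # should point into the <ENV_NAME> conda environment
python --version   # should report Python 3.6.x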

Install Tensorflow and Keras in Conda Environment

Install TensorFlow. NOTE: TensorFlow 1.7 currently doesn't work with this cuDNN version, so we need the nightly build.

First, install dependencies:

sudo yum -y install gcc gcc-c++ python-pip python-devel atlas atlas-devel gcc-gfortran openssl-devel libffi-devel hdf5-devel

The specific tf-nightly-gpu version installed in this setup was 1.8.0.dev20180329.

Activate the conda environment (conda activate <ENV_NAME>), then:

pip install tf-nightly-gpu
pip install keras
pip install h5py
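To verify that TensorFlow can see the GPU (a sketch; assumes the conda environment is active and the CUDA, cuDNN and NCCL paths above are on LD_LIBRARY_PATH):

python -c "from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())"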

Pip install Horovod with the NCCL home and GPU allreduce options:

HOROVOD_NCCL_HOME=/usr/local/nccl_2.1.15-1+cuda9.0_x86_64 HOROVOD_GPU_ALLREDUCE=NCCL pip install --no-cache-dir horovod
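A minimal single-process check that Horovod built and imports correctly (hvd.init() works without mpirun for a single process):

python -c "import horovod.tensorflow as hvd; hvd.init(); print('rank', hvd.rank(), 'of', hvd.size())"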

Clone horovod repo:

git clone git@github.com:uber/horovod.git

To run examples in horovod/examples: instructions

cd horovod/examples

mpirun -np 1 \
    -H localhost:1 \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
    -mca pml ob1 -mca btl ^openib \
    python tensorflow_mnist.py
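To scale the same example across machines, -H takes a comma-separated host:slots list (node1 and node2 below are placeholder hostnames; passwordless SSH between the nodes and an identical software setup on each are assumed):

mpirun -np 2 \
    -H node1:1,node2:1 \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
    -mca pml ob1 -mca btl ^openib \
    python tensorflow_mnist.py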

Troubleshooting

If you are getting MPI errors when running the examples, you might try the following:

Install mpi4py. NOTE: This may affect relative paths of the system-wide Open MPI install, so make sure to do this inside a conda environment:

conda install -c conda-forge mpi4py

Alternatively, you can try running mpirun using an absolute path:

/usr/local/bin/mpirun -np 1 \
    -H localhost:1 \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
    -mca pml ob1 -mca btl ^openib \
    python tensorflow_mnist.py

You can also change the options; see mpirun --help for more info:

mpirun --allow-run-as-root  \
  -np 4 -H <host>:4 \
  -bind-to none -map-by slot \
  -x LD_LIBRARY_PATH -x NCCL_DEBUG=INFO \
  -x HOROVOD_FUSION_THRESHOLD=4194304 \
   -x NCCL_SOCKET_IFNAME=eth0 \
  python tf_cnn_benchmarks.py --num_batches 100 --display_every 1 \
  --num_gpus 1 --model resnet50 --batch_size 128 \
  --data_name imagenet --data_dir /data0/imagenet_data \
  --variable_update horovod --horovod_device gpu \
  --batch_group_size=4 --num_inter_threads=0 --optimizer momentum

Helpful commands:

Get a list of installed CUDA repos. NOTE: this helps repair problems relating to CUDA versioning (e.g. 9.0 vs 9.1):

yum --disablerepo="*" --enablerepo="cuda*" list available

also:

modinfo nvidia

To remove a previous NVIDIA installation entirely:

sudo yum remove "nvidia*"
rm -rf /etc/yum.repos.d/cuda-<prev_vers>*

For example, for a previous CUDA 9.1 repo:

rm -rf /etc/yum.repos.d/cuda-9-1*