ietz/hummel-tensorflow-setup.md

## hummel-tensorflow-setup.md

      
Setting up Hummel for TensorFlow and Horovod

Configure SSH:

Add the following to your local SSH config (replace <nutzer> with your user name):
Host hg hf hi
	User <nutzer>
	HostName hummel2.rrz.uni-hamburg.de
	IdentityFile ~/.ssh/id_rsa_hummel

Host hf
	RequestTTY force
	RemoteCommand ssh front2

Host hi
	RequestTTY force
	RemoteCommand ssh -tt front2 'bash -l -c "$HOME/si.sh"'


The Hummel gateway hg is used only for uploading files: sftp hg.
The front-end node hf is for preparing and launching jobs: ssh hf.
hi opens an interactive connection to a GPU node: ssh hi.
Install programs etc. in $HOME, which is located at /home/<nutzer>.
Files you work on (e.g. datasets) belong in $WORK, which is located at /work/<nutzer>.
The SSDs available on the GPU nodes under /scratch/ can be used as temporary storage.
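
A minimal sketch of how a dataset could be staged from $WORK onto the node-local SSD before a run. The helper name stage_to_scratch and the example paths are assumptions for illustration and are not part of the original setup:

```shell
# Hypothetical helper: copy a dataset directory to a scratch directory
# and print the new location.
# Usage: stage_to_scratch <source_dir> <scratch_root>
stage_to_scratch() {
    src=$1
    dest_root=$2
    dest="$dest_root/$(basename "$src")"
    mkdir -p "$dest" || return 1
    # The trailing "/." copies the directory contents, including dotfiles.
    cp -R "$src/." "$dest/" || return 1
    printf '%s\n' "$dest"
}

# Example call on a GPU node (both paths are assumptions):
# data_dir=$(stage_to_scratch "$WORK/datasets/mnist" "/scratch/$USER")
```

The function prints the destination path so that it can be passed straight on to a training script.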

Setting up the environment

Helper script for opening an interactive connection to a GPU node. Save it as $HOME/si.sh and make it executable with chmod +x si.sh.
#! /bin/bash
JobName="sponb${1:-0}"
AttachId=$(squeue -u "$USER" --name="$JobName" --Format="jobid" | sed -n '1!p' | shuf -n 1 | xargs)

if [ -z "$AttachId" ]
then
        salloc -p gpu --job-name="$JobName" srun --pty bash
else
        srun --jobid="$AttachId" --pty bash
fi
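
The job lookup in si.sh can be tried out in isolation. The sketch below wraps the same filter chain in a function (pick_job_id is a hypothetical name) and feeds it simulated squeue output, so no Slurm installation is needed; it assumes GNU coreutils for shuf:

```shell
# Same filter chain as in si.sh, reading squeue-style output from stdin:
#   sed -n '1!p'  drops the header line ("JOBID")
#   shuf -n 1     picks one of the remaining lines at random
#   xargs         trims the padding that squeue adds around the ID
pick_job_id() {
    sed -n '1!p' | shuf -n 1 | xargs
}

# Simulated squeue output with a single running job:
printf 'JOBID               \n1234567             \n' | pick_job_id
```

When only the header line is present, the captured result is empty, which is the case in which si.sh requests a new allocation with salloc.
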
And add the following to $HOME/.bashrc:
#!/bin/bash
export PATH=$PATH:$HOME/bin
export KERAS_HOME=$WORK

# Run a Horovod Python script on all allocated nodes
# Example:
# $ shvd python main.py
shvd ()
{
    mpirun \
      -bind-to none -map-by slot \
      -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
      -x NCCL_P2P_DISABLE=1 \
      -mca pml ob1 -mca btl ^openib \
      "$@"
}

# Allocate GPU nodes for interactive work
# Example for 4 nodes:
# $ sa 4
sa ()
{
    salloc -p gpu -N "${1:-1}" --ntasks-per-node=2
}

# Submit a batch job and show its output in the console
# Example:
# $ sb run.sh
sb () {
    output_param="slurm-%j.out"
    exp='^#SBATCH +(-o|--output)=(.+) *$'
    while read -r line; do
        if [[ $line =~ $exp ]]; then
            output_param=$(echo "${BASH_REMATCH[2]}" | xargs)
            break
        fi
    done < "$1"

    jobid=$(sbatch --parsable "$1")
    log_file=${output_param/\%j/$jobid}

    echo "Job #$jobid queued and will be run in background…"
    echo "To reattach later use:"
    echo "$ tail -f \"$log_file\""
    echo

    touch "$log_file"
    tail -f "$log_file"
}

## Automatically load the required environment modules (unless we are only on the gateway, which has no module system)
if type module &> /dev/null; then
    module switch env env/2019Q1-cuda-gcc-openmpi
    module load nano/4.0
fi
Finally, create the directory for binaries that are installed later and activate everything:
$ mkdir ~/bin
$ echo "source ~/.bashrc;" >> ~/.bash_profile
$ source ~/.bashrc
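
The log-file lookup that sb performs can be sketched as a stand-alone function. resolve_log_file is a hypothetical name. The fallback slurm-%j.out is Slurm's default output file name, and only the first %j is replaced, as in sb:

```shell
# Hypothetical stand-alone version of the lookup inside sb:
# print the log file a batch script will write to for a given job ID.
# Usage: resolve_log_file <batch_script> <job_id>
resolve_log_file() {
    script=$1
    jobid=$2
    # Take the value of the first "#SBATCH -o=..." or "#SBATCH --output=..."
    # line; xargs strips surrounding quotes and whitespace.
    pattern=$(sed -n -E 's/^#SBATCH +(-o|--output)=(.+) *$/\2/p' "$script" | head -n 1 | xargs)
    # Fall back to the Slurm default file name when no output option is set.
    [ -n "$pattern" ] || pattern="slurm-%j.out"
    # Replace the %j placeholder with the job ID.
    printf '%s\n' "$pattern" | sed "s/%j/$jobid/"
}

# Example: resolve_log_file run.sh 123456
```

Other Slurm file name patterns such as %x or %N are not handled by this sketch, and neither are they by sb.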
Setting up Python

Create a venv

$ mkdir ~/venvs
$ python3 -m venv --system-site-packages ~/venvs/spon/
$ source ~/venvs/spon/bin/activate
Building TensorFlow

cuDNN and NCCL

On your local machine, download cuDNN v7.3.1 and NCCL v2.2.13, each for CUDA 9.0.
This requires creating an NVIDIA Developer account and answering a short survey.
Then copy both files to Hummel:
$ scp cudnn-9.0-linux-x64-v7.3.1.20.txz nccl_2.2.13-1+cuda9.0_x86_64.txz hg:/home/<nutzer>

Run on the Hummel front-end servers:
$ cd ~
$ tar -xvf cudnn-9.0-linux-x64-v7.3.1.20.txz
$ rm cudnn-9.0-linux-x64-v7.3.1.20.txz
$ echo 'export CPATH=$CPATH:$HOME/cuda/include' >> ~/.bashrc
$ echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/cuda/lib64' >> ~/.bashrc

$ tar xvf nccl_2.2.13-1+cuda9.0_x86_64.txz
$ rm nccl_2.2.13-1+cuda9.0_x86_64.txz
$ echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/nccl_2.2.13-1+cuda9.0_x86_64/lib' >> ~/.bashrc

$ source ~/.bashrc
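
To confirm that both library directories actually ended up on LD_LIBRARY_PATH, a small check along these lines may help. path_contains is a hypothetical helper, and the two directories are the ones appended above:

```shell
# Hypothetical check: is a directory listed in a colon-separated path list?
# Usage: path_contains <dir> <path_list>
path_contains() {
    case ":$2:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

for dir in "$HOME/cuda/lib64" "$HOME/nccl_2.2.13-1+cuda9.0_x86_64/lib"; do
    if path_contains "$dir" "${LD_LIBRARY_PATH:-}"; then
        echo "ok:      $dir"
    else
        echo "missing: $dir"
    fi
done
```

Wrapping both sides in colons makes the match exact, so a directory that is merely a prefix of another entry is not reported as present.
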
bazel: build tool for TensorFlow

$ mkdir $WORK/bazel
$ cd $WORK/bazel
$ wget https://github.com/bazelbuild/bazel/releases/download/0.15.2/bazel-0.15.2-dist.zip
$ unzip -d bazel-dist bazel-0.15.2-dist.zip
$ rm bazel-0.15.2-dist.zip
$ cd bazel-dist/

$ module load java/oracle-jdk8u101
$ module load zip/3.0
$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$LIBRARY_PATH
$ env EXTRA_BAZEL_ARGS="--host_javabase=@local_jdk//:jdk" bash ./compile.sh

$ mkdir -p ~/bin
$ mv $WORK/bazel/bazel-dist/output/bazel ~/bin/
$ cd ~
$ rm -rf $WORK/bazel/
Python dependencies of TensorFlow

$ pip install -U pip six numpy wheel setuptools mock
$ pip install -U keras_applications==1.0.6 --no-deps
$ pip install -U keras_preprocessing==1.0.5 --no-deps
Download and configure TensorFlow

$ cd $WORK/
$ module load git/2.21.0
$ git clone https://github.com/tensorflow/tensorflow.git
$ cd $WORK/tensorflow/
$ git checkout v1.12.2
$ ./configure
During configuration, most default values can be kept, except for:

CUDA support: y
CUDA 9.0 toolkit path: /sw/compiler/cuda-9.0.176
cuDNN 7 library path: /home/<nutzer>/cuda
NCCL 2 library path: /home/<nutzer>/nccl_2.2.13-1+cuda9.0_x86_64
CUDA compute capability: 3.7 (for K80)
MPI support: y
Optimization flags: -march=haswell

Full output of the configuration run
(spon) -bash-4.2$ ./configure
WARNING: Output base '/home/<nutzer>/.cache/bazel/_bazel_<nutzer>/684406771976db6c38244bb3892bbf4d' is on NFS. This may lead to surprising failures and undetermined behavior.
Extracting Bazel installation...
WARNING: --batch mode is deprecated. Please instead explicitly shut down your Bazel server using the command "bazel shutdown".
You have bazel 0.15.2- (@non-git) installed.
Please specify the location of python. [Default is /home/<nutzer>/venvs/spon/bin/python]:


Traceback (most recent call last):
  File "<string>", line 1, in <module>
AttributeError: module 'site' has no attribute 'getsitepackages'
Found possible Python library paths:
  /home/<nutzer>/venvs/spon/lib/python3.6/site-packages
Please input the desired Python library path to use.  Default is [/home/<nutzer>/venvs/spon/lib/python3.6/site-packages]

Do you wish to build TensorFlow with Apache Ignite support? [Y/n]:
Apache Ignite support will be enabled for TensorFlow.

Do you wish to build TensorFlow with XLA JIT support? [Y/n]:
XLA JIT support will be enabled for TensorFlow.

Do you wish to build TensorFlow with OpenCL SYCL support? [y/N]:
No OpenCL SYCL support will be enabled for TensorFlow.

Do you wish to build TensorFlow with ROCm support? [y/N]:
No ROCm support will be enabled for TensorFlow.

Do you wish to build TensorFlow with CUDA support? [y/N]: y
CUDA support will be enabled for TensorFlow.

Please specify the CUDA SDK version you want to use. [Leave empty to default to CUDA 9.0]:


Please specify the location where CUDA 9.0 toolkit is installed. Refer to README.md for more details. [Default is /usr/local/cuda]: /sw/compiler/cuda-9.0.176


Please specify the cuDNN version you want to use. [Leave empty to default to cuDNN 7]:


Please specify the location where cuDNN 7 library is installed. Refer to README.md for more details. [Default is /sw/compiler/cuda-9.0.176]: /home/<nutzer>/cuda


Do you wish to build TensorFlow with TensorRT support? [y/N]:
No TensorRT support will be enabled for TensorFlow.

Please specify the NCCL version you want to use. If NCCL 2.2 is not installed, then you can use version 1.3 that can be fetched automatically but it may have worse performance with multiple GPUs. [Default is 2.2]:


Please specify the location where NCCL 2 library is installed. Refer to README.md for more details. [Default is /sw/compiler/cuda-9.0.176]:/home/<nutzer>/nccl_2.2.13-1+cuda9.0_x86_64


Assuming NCCL header path is /home/<nutzer>/nccl_2.2.13-1+cuda9.0_x86_64/lib/../include/nccl.h
Please specify a list of comma-separated Cuda compute capabilities you want to build with.
You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.
Please note that each additional compute capability significantly increases your build time and binary size. [Default is: 3.5,7.0]: 3.7


Do you want to use clang as CUDA compiler? [y/N]:
nvcc will be used as CUDA compiler.

Please specify which gcc should be used by nvcc as the host compiler. [Default is /sw/compiler/gcc-6.4.0/bin/gcc]:


Do you wish to build TensorFlow with MPI support? [y/N]: y
MPI support will be enabled for TensorFlow.

Please specify the MPI toolkit folder. [Default is /sw/env/cuda-9.0.176_gcc-6.4.0/openmpi/2.1.2]:


Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native]: -march=haswell


Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]:
Not configuring the WORKSPACE for Android builds.

Preconfigured Bazel build configs. You can use any of the below by adding "--config=<>" to your build command. See tools/bazel.rc for more details.
        --config=mkl            # Build with MKL support.
        --config=monolithic     # Config for mostly static monolithic build.
        --config=gdr            # Build with GDR support.
        --config=verbs          # Build with libverbs support.
        --config=ngraph         # Build with Intel nGraph support.
Configuration finished


The -march=... flag determines which CPU architecture TensorFlow is optimized for.
The default, -march=native, selects the architecture of the machine the build runs on, which in our case would be the CPU of the build server.
To find the correct value for the GPU nodes, run gcc -march=native -Q --help=target|grep march directly on a GPU node.
On the node I tested, the result was haswell.
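
If only the bare architecture name is needed, for example to paste it into the configure prompt, the gcc report can be filtered as follows. This is a sketch; parse_march is a hypothetical helper name:

```shell
# Extract the value from the "-march=" line of gcc's target report (stdin).
# Matching the first field exactly skips other lines that merely mention -march.
parse_march() {
    awk '$1 == "-march=" { print $2; exit }'
}

# On a GPU node (assumes gcc is on the PATH):
# gcc -march=native -Q --help=target | parse_march
```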
Build TensorFlow

First, the Bazel build script has to be patched.
Reference: Bazel Issue #4053.
$ mv tensorflow/tensorflow.bzl tensorflow/tensorflow.bzl.old
$ sed '1300 i \        use_default_shell_env = True,' tensorflow/tensorflow.bzl.old > tensorflow/tensorflow.bzl
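
Because the sed call inserts at a fixed line number, it only fits the exact file version checked out above. A quick look at the surrounding lines before patching, and a grep afterwards, can confirm that the attribute landed where intended. show_context is a hypothetical helper:

```shell
# Hypothetical helper: print a few numbered lines around a given line number.
# Usage: show_context <file> <line> [radius]
show_context() {
    file=$1
    line=$2
    radius=${3:-2}
    start=$((line - radius))
    [ "$start" -ge 1 ] || start=1
    end=$((line + radius))
    awk -v s="$start" -v e="$end" \
        'NR >= s && NR <= e { printf "%d: %s\n", NR, $0 }' "$file"
}

# Before patching: show_context tensorflow/tensorflow.bzl.old 1300
# After patching:  grep -n 'use_default_shell_env' tensorflow/tensorflow.bzl
```
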
All that remains is to:

build the build script for the local pip package with Bazel,
build the pip package,
and install it.

$ bazel --output_user_root=$WORK/bazel/ build --local_resources 32768,32.0,1.0 --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
$ mkdir $WORK/tensorflow_pkg
$ ./bazel-bin/tensorflow/tools/pip_package/build_pip_package $WORK/tensorflow_pkg
$ pip install $WORK/tensorflow_pkg/tensorflow-1.12.2-cp36-cp36-linux_x86_64.whl
Depending on the current load on the front-end node, the RAM limit in the bazel command may need to be adjusted.
When I built it, about 45 GB were free and I set the limit to 32 GB (32768 MB in the command).
If too little memory is available at any point while the command is running, the build is aborted.
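
One way to pick the limit is to derive it from the memory that is free at the moment. The sketch below is only an illustration: suggest_ram_mb is a hypothetical helper, the 70 % share is an assumption (it roughly matches 32 GB out of 45 GB), and the "available" column is missing from older versions of free:

```shell
# Hypothetical helper: suggest a RAM limit in MB as a percentage of the
# memory that is currently available. The 70 % default is an assumption.
# Usage: suggest_ram_mb <available_mb> [percent]
suggest_ram_mb() {
    echo $(( $1 * ${2:-70} / 100 ))
}

# Column 7 of the "Mem:" line is "available" in recent versions of free(1).
avail_mb=$(free -m 2>/dev/null | awk '/^Mem:/ { print $7 }')
if [ -n "$avail_mb" ]; then
    echo "--local_resources $(suggest_ram_mb "$avail_mb"),32.0,1.0"
fi
```

The printed option can be pasted into the bazel command in place of the fixed 32768 value.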
Horovod

$ HOROVOD_NCCL_HOME=$HOME/nccl_2.2.13-1+cuda9.0_x86_64 HOROVOD_GPU_ALLREDUCE=NCCL pip install --no-cache-dir horovod
Testing

$ cd ~
$ module load git/2.21.0
$ git clone https://github.com/tensorflow/benchmarks.git
$ cd benchmarks
$ git checkout cnn_tf_v1.12_compatible
$ sa 4
$ source ~/venvs/spon/bin/activate
$ shvd python scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet50 --batch_size 64 --optimizer sgd --variable_update horovod
Expect roughly 50 images/sec per GPU.
With 4 nodes / 8 GPUs, my overall result ("total images/sec") was about 425.
The numbers should be comparable to TensorFlow's reference benchmark values for the K80.
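
As a quick sanity check, the total can be broken down per GPU with integer arithmetic. per_gpu_rate is a hypothetical helper name:

```shell
# Integer images/sec per GPU from the total rate and the number of GPUs.
# Usage: per_gpu_rate <total_images_per_sec> <gpu_count>
per_gpu_rate() {
    echo $(( $1 / $2 ))
}

per_gpu_rate 425 8   # 4 nodes x 2 GPUs; prints 53
```

A result in the low 50s per GPU is in line with the per-GPU expectation stated above.
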
Batch jobs

An example batch job with Horovod:
#!/bin/bash
#SBATCH --job-name=spon
#SBATCH --nodes=4
#SBATCH --partition=gpu
#SBATCH --ntasks-per-node=2
#SBATCH --output="<nutzer>/batch/slurm-%j.out"
#SBATCH --export=NONE

set -e # Stop operation on first error

source /sw/batch/init.sh

# Environment modules
module switch env env/2019Q1-cuda-gcc-openmpi

# Run model
mpirun \
      -bind-to none -map-by slot \
      -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
      -x NCCL_P2P_DISABLE=1 \
      -mca pml ob1 -mca btl ^openib \
      python main.py
Save this script, e.g. as run.sh, in the same folder as the Python script to be executed.
The job can then be submitted from the front-end server:
$ sb run.sh
Execution is managed by Slurm.
It is therefore safe to stop the live output with CTRL+C or to close the terminal entirely; the job keeps running.
Tensorboard

Local port forwarding makes it possible to access the TensorBoard logs stored on Hummel directly.
In a first terminal, connect to the Hummel front-end server with ssh hf and start the TensorBoard server there with tensorboard --port <port> --logdir=<pfad>.
A second terminal can then be used for the actual port forwarding through the Hummel gateway: ssh -L 6060:node002:<port> hg.
<port> must be the same port that the TensorBoard server was started with.
The node (here: node002) must be the host name of the Hummel front-end server in use.
For front1 that is node001; for front2 (as in the SSH configuration at the very top) it is node002.
Both terminals have to stay open. TensorBoard is then reachable locally at http://localhost:6060.
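
The forwarding command can be assembled from the front-end server name and the port. The two helpers below are hypothetical, and the name mapping is only confirmed for front1 and front2. Port 6006 in the example is TensorBoard's default port:

```shell
# Map a front-end server name to its node host name (front1 -> node001).
# Assumption: the pattern is only confirmed for front1 and front2.
front_to_node() {
    printf 'node%03d\n' "${1#front}"
}

# Print the port-forwarding command for a TensorBoard server.
# Usage: tb_tunnel_cmd <front_server> <tensorboard_port> [local_port]
tb_tunnel_cmd() {
    echo "ssh -L ${3:-6060}:$(front_to_node "$1"):$2 hg"
}

tb_tunnel_cmd front2 6006   # prints: ssh -L 6060:node002:6006 hg
```

The functions only print the command, so it can be reviewed before it is run in the second terminal.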