How to Set Up a TensorFlow Jupyter Notebook on the Intel Nervana AI Cluster (Colfax) for Deep Learning

Introduction - TensorFlow Jupyter Notebook on an HPC Cluster

TensorFlow is one of the most popular modern Deep Learning frameworks. It is designed with production deployment in mind and has strong support for both CPU and GPU. The core API is Python (though other languages are also supported). It features strong GPU acceleration for tensor computations and static graph capability. I like Python and Deep Learning, so I decided to give it a go.

At the time of writing, I have access to a MacBook Pro (with an Intel i7 CPU and no Nvidia GPU) and a dual-boot Windows/Linux PC (with an i7 CPU and an Nvidia GeForce GTX 750 Ti with 2 GB RAM). Though these machines are fine for running simple training models for educational purposes, I am keen to explore research and development in a cloud / HPC (High Performance Computing) environment - mainly motivated by scalability, performance, and production-readiness.

It happens that, as part of the Intel Software Innovator Programme, Intel has kindly granted me access to the Intel Nervana AI HPC Cluster (aka the Colfax Cluster). Each cluster node is powered by Intel Xeon / Xeon Phi processors, and distributed computing is also possible. Even in the absence of Nvidia GPUs (which everybody in the Deep Learning world seems to love at the moment), the Intel Xeon Phi CPU option could be a viable alternative. In addition, since TensorFlow is designed with production deployment in mind and has strong Open Source governance, I thought it was worth trying on the HPC cluster environment.

Jupyter Notebook has been the key tool for me to learn Python, Data Science, and Deep Learning easily - it makes rapid iteration super easy. I am very keen to continue using Jupyter Notebook as a starting point in building production-ready applications. Though at some point in the production pipeline we may move away from Jupyter Notebook to other tools (such as a web server), it is a great starting point.

In this post, we will walk through the steps needed to set up and access a Jupyter Notebook that runs on an Intel Xeon Phi cluster node, wrapped inside a Python 3.6 and TensorFlow conda environment. Switching to other Python versions is just a matter of tweaking the conda environment setup steps. Though I may write similar posts for other frameworks, we shall see that the steps are pretty much identical.

GitHub Repository

I've included the files mentioned in this article in this GitHub Repository. This includes:

  • the conda environment file
  • the shell script (that we qsub to a cluster node to start up a Jupyter notebook)
  • a sample Tensorflow Jupyter Notebook

That said, I've also included the content of the files directly within this article.

Instructions

Step 0 - SSH to Remote Login Node

It is assumed that you already have access to the Intel AI Cluster (aka Colfax Cluster), that you can connect to the cluster login node with the command below, and that you have some familiarity navigating the file system.

$ ssh colfax
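
The `ssh colfax` shortcut relies on an entry in your local SSH config. A minimal sketch of such an entry follows - the hostname, user, and key path here are placeholders, not the real Colfax values; substitute the details from your own access instructions:

```
# ~/.ssh/config - placeholder values; substitute your own access details
Host colfax
    HostName <login-node-address>
    User <your-cluster-username>
    IdentityFile ~/.ssh/<your-colfax-access-key>
```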

If not, this article might not be for you (yet). To get (short-term) access and have a play with running Deep Learning frameworks on Intel Xeon / Xeon Phi processors, you can apply through programmes such as the Intel Software Innovator Programme mentioned above.

Step 1 - Create Conda Environment

We can create the conda environment via one of two methods (whichever you prefer).

Step 1 - Option 1 - Do this via commands interactively

Create Conda environment:

[u4443@c001]$ conda create --name tensorflow-36 python=3.6 -c intel

Install Jupyter:

[u4443@c001]$ conda install jupyter -c intel

Install TensorFlow (the Intel distribution):

[u4443@c001]$ conda install tensorflow -c intel

Note that we can keep adding packages with conda install some-cool-package.

Step 1 - Option 2 - Do this in one go via an environment file

Create Conda environment file tensorflow-36.yml (tweak as desired):

name: tensorflow-36
channels:
  - intel
dependencies:
  - python=3.6   # your choice
  - jupyter
  - tensorflow

(Note that we can add more packages by appending to the list under dependencies.)

Create Conda environment:

[u4443@c001]$ conda env create -f tensorflow-36.yml

Step 2 - Check Conda Environment

Activate the conda environment:

[u4443@c001]$ source activate tensorflow-36

Check what packages we have in the conda environment:

(tensorflow-36) [u4443@c001]$ conda list

We shall see something like the following. Note that pretty much all packages come from the intel channel.

# packages in environment at /home/u4443/.conda/envs/tensorflow-36:
#
backports                 1.0                py36_intel_6  [intel]  intel
bleach                    1.5.0              py36_intel_0  [intel]  intel
decorator                 4.0.11             py36_intel_1  [intel]  intel
entrypoints               0.2.2              py36_intel_2  [intel]  intel
get_terminal_size         1.0.0              py36_intel_5  [intel]  intel
html5lib                  0.999              py36_intel_0  [intel]  intel
icc_rt                    16.0.3                 intel_14  [intel]  intel
intelpython               2018.0.0                      3    intel
ipykernel                 4.6.1              py36_intel_0  [intel]  intel
ipython                   6.1.0              py36_intel_0  [intel]  intel
ipython_genutils          0.2.0              py36_intel_0  [intel]  intel
ipywidgets                6.0.0              py36_intel_0  [intel]  intel
jinja2                    2.9.6              py36_intel_0  [intel]  intel
jsonschema                2.6.0              py36_intel_0  [intel]  intel
jupyter                   1.0.0              py36_intel_5  [intel]  intel
jupyter_client            5.1.0              py36_intel_0  [intel]  intel
jupyter_console           5.1.0              py36_intel_0  [intel]  intel
jupyter_core              4.3.0              py36_intel_1  [intel]  intel
libsodium                 1.0.10                  intel_6  [intel]  intel
markupsafe                0.23               py36_intel_6  [intel]  intel
mistune                   0.7.4              py36_intel_1  [intel]  intel
mkl                       2018.0.0                intel_4    intel
mock                      2.0.0              py36_intel_4  [intel]  intel
nbconvert                 5.2.1              py36_intel_0  [intel]  intel
nbformat                  4.3.0              py36_intel_0  [intel]  intel
notebook                  5.0.0              py36_intel_0  [intel]  intel
numpy                     1.13.1            py36_intel_16  [intel]  intel
openmp                    2018.0.0                intel_7    intel
openssl                   1.0.2k                  intel_3  [intel]  intel
pandocfilters             1.4.1              py36_intel_0  [intel]  intel
path.py                   10.3.1             py36_intel_0  [intel]  intel
pbr                       1.10.0             py36_intel_4  [intel]  intel
pexpect                   4.2.1              py36_intel_1  [intel]  intel
pickleshare               0.7.4              py36_intel_1  [intel]  intel
pip                       9.0.1              py36_intel_0  [intel]  intel
prompt_toolkit            1.0.14             py36_intel_0  [intel]  intel
protobuf                  3.2.0              py36_intel_0  [intel]  intel
ptyprocess                0.5.1              py36_intel_5  [intel]  intel
pygments                  2.2.0              py36_intel_1  [intel]  intel
python                    3.6.2                   intel_3  [intel]  intel
python-dateutil           2.6.0              py36_intel_3  [intel]  intel
pyzmq                     16.0.2             py36_intel_3  [intel]  intel
setuptools                27.2.0             py36_intel_0  [intel]  intel
simplegeneric             0.8.1              py36_intel_5  [intel]  intel
six                       1.10.0             py36_intel_8  [intel]  intel
sqlite                    3.13.0                 intel_15  [intel]  intel
tcl                       8.6.4                  intel_17  [intel]  intel
tensorflow                1.2.1               np113py36_1    intel
terminado                 0.6                py36_intel_6  [intel]  intel
testpath                  0.3.1              py36_intel_0  [intel]  intel
tk                        8.6.4                  intel_26  [intel]  intel
tornado                   4.5.1              py36_intel_0  [intel]  intel
traitlets                 4.3.2              py36_intel_1  [intel]  intel
wcwidth                   0.1.7              py36_intel_5  [intel]  intel
weakref                   1.0rc1                   py36_2    intel
werkzeug                  0.12.2                   py36_1    intel
wheel                     0.29.0             py36_intel_5  [intel]  intel
widgetsnbextension        2.0.0                    py36_2    intel
xz                        5.2.2                  intel_16  [intel]  intel
zeromq                    4.1.5                   intel_0  [intel]  intel
zlib                      1.2.11                  intel_3  [intel]  intel

Step 3 - Setup Jupyter Notebook Config

Create a Jupyter notebook config file:

(tensorflow-36) [u4443@c001]$ jupyter notebook --generate-config

This will create a Jupyter config file ~/.jupyter/jupyter_notebook_config.py (all lines commented out by default).

Start a Python interactive console:

(tensorflow-36) [u4443@c001]$ python

Create a password for your Jupyter Notebook (so only you can access it):

>>> from notebook.auth import passwd; passwd()
Enter password: 
Verify password: 
'sha1:fe4d...f34:30cb0e...ag36'

The last line is the hashed password. Copy this.
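
For the curious, the string has the form type:salt:hashed-password. Conceptually it is produced roughly like this - a sketch of the idea only, not the notebook's exact implementation; sketch_passwd and the salt value are made up for illustration (the real passwd() generates a random salt):

```python
import hashlib

def sketch_passwd(password, salt="fe4d"):
    # Hash the UTF-8 password concatenated with the salt,
    # then format the result as 'sha1:salt:hash'.
    digest = hashlib.sha1(password.encode("utf-8") + salt.encode("ascii")).hexdigest()
    return "sha1:{}:{}".format(salt, digest)

print(sketch_passwd("my-secret"))
```

When the notebook server receives your plain-text password, it repeats the same hashing with the stored salt and compares the result, which is why only the hash needs to live in the config file.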

Add the following block at the top of the newly generated Jupyter config file ~/.jupyter/jupyter_notebook_config.py, pasting in the hashed password. Also, pick a port number of your liking (the higher the better, to avoid port collisions):

# The string should be of the form type:salt:hashed-password.
c.NotebookApp.password = 'sha1:fe4d...f34:30cb0e...ag36'
# The port the notebook server will listen on.
c.NotebookApp.port = 9213

Step 4 - Start a Jupyter Notebook on a Cluster Node

Create a shell script jup.sh:

#PBS -l nodes=1:knl:flat
source ~/.conda/envs/tensorflow-36/bin/activate tensorflow-36 && jupyter notebook --no-browser

What this shell script does:

  • line 1: request 1 Knights Landing (KNL) node to run our job on, with flat memory mode (optional).
  • line 2: activate the conda environment and start a Jupyter notebook on the requested cluster node.
  • line 3: make sure there is an empty line at the bottom of the file; apparently it doesn't work otherwise.

Invoke the shell script on a cluster node:

(tensorflow-36) [u4443@c001]$ qsub jup.sh
23550.c001

(We should get a job number back, something like 23550.c001.)

Check that we now have a job running on a cluster node:

(tensorflow-36) [u4443@c001]$ qstat

We should see our job is running:

Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
23550.c001                 jup.sh           u4443                  0 R batch

Find which cluster node the job is running on:

$ qstat 23550 -f | grep exec_host

The command returns the cluster node:

exec_host = c001-n009/0

Make a note of the cluster node name. In this example, it is c001-n009.
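
If you later want to script this step, the node name can be pulled out of the qstat output with a little string parsing. A sketch (parse_exec_host is a made-up helper; the sample line mirrors the output shown above):

```python
def parse_exec_host(line):
    """Extract the node name from a 'exec_host = c001-n009/0' qstat line."""
    # Take the value after '=', then drop the '/<cpu-index>' suffix.
    value = line.split("=", 1)[1].strip()
    return value.split("/")[0]

sample = "    exec_host = c001-n009/0"
print(parse_exec_host(sample))  # c001-n009
```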

Step 5 - Perform Port Forwarding (locally on Laptop)

From a Mac

Start up a new terminal (on your laptop, not the remote cluster) and set up port forwarding like this:

ssh -L 9213:localhost:9213 colfax ssh -L 9213:localhost:9213 c001-n009

Observations:

  • use the port number you specified in Step 3 (in this example: port 9213)
  • use the cluster node name you identified in Step 4 (in this example: c001-n009)

From a Linux Laptop/PC

Probably the same as the Mac solution; yet to test.

From a Windows Laptop/PC

I have not tested this personally, but I am copying and pasting (with tweaks) a solution from a blog post here for reference. Yet to test.

  • Open PuTTY and load your cluster session, but do not open it yet; go to Connection → SSH → Tunnels. Here, set the Source port as 9213 and the Destination as localhost:9213. Click Add. Do not change other settings.
  • You can now start your session, and activate your conda environment. source activate tensorflow-36
  • All that has to be done now is: (tensorflow-36)$ jupyter notebook --no-browser

Step 6 - Access Jupyter Notebook

What we have done so far:

  • Steps 1 to 4: set up the conda environment and started a Jupyter notebook on a cluster node.
  • Step 5: performed port forwarding.

We can now access the notebook via http://localhost:9213/ - change the port number as needed.

When asked for a password, provide the plain-text password you created in Step 3 (not the hashed version).

Snapshot 1 - Jupyter Home Page:

colfax-jupyter-pytorch.png

Snapshot 2 - Create a Jupyter Notebook (at any location you like) to prove that:

  • Jupyter Notebook is indeed running on the remote cluster node (c001-n009 in this example)
  • We can import TensorFlow (import tensorflow) and run some code

tensorflow.png


Some sample code here for you to try out:

# prove that we are running the notebook on the cluster node
import socket
print("Running on Colfax Cluster Node: {}".format(socket.gethostname()))

# test that we can import tensorflow and run some code
import tensorflow as tf

# a simple TensorFlow Hello World to prove that it is working
hello = tf.constant('Hello, TensorFlow!')
sess = tf.Session()
print(sess.run(hello))  # b'Hello, TensorFlow!'

Free Up Cluster Node Resource

Good practice: when you are done, save your Jupyter Notebook and delete the job on the cluster node - this frees up resources and avoids "server hoarding". (Though this may require validation, I tend to do this anyway, just in case!)

qdel 23550

(Just replace the job id 23550 with your own job number.)

What about other Deep Learning Frameworks?

If you have walked through the steps above, you will see that the steps for setting up other frameworks are almost identical: simply tweak the conda setup step to install the other framework instead of tensorflow, activate that conda environment, and run from there. Try out some code for that framework and it should work.
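
For example, a hypothetical environment file for another framework might look like the following - the package name pytorch and its availability on any particular conda channel are assumptions here; check what your chosen channel actually provides:

```yaml
name: pytorch-36
channels:
  - intel
dependencies:
  - python=3.6
  - jupyter
  - pytorch   # assumption: replace with the framework package you need
```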

Conclusion

In this article we have illustrated the steps to (1) create a conda environment with the Intel Distribution for Python, Jupyter, and TensorFlow, (2) submit a shell script to a cluster node via qsub to start a Jupyter notebook, and (3) access the Jupyter Notebook via port forwarding.

References


Deep Learning on Intel Nervana AI Cluster (aka Colfax HPC Cluster)
