TensorFlow is one of the most popular modern Deep Learning frameworks. It is designed with production deployment in mind, has strong support for both CPU and GPU, offers accelerated tensor computation and static graph capability, and its core API is Python (with bindings for other languages). I like Python and Deep Learning, so I decided to give it a go.
At the time of writing, I have access to a MacBook Pro (Intel i7 CPU, no Nvidia GPU) and a dual-boot Windows/Linux PC (i7 CPU, Nvidia GeForce GTX 750 Ti with 2 GB RAM). Although these machines are fine for running simple training models for educational purposes, I am keen to explore research and development on a cloud / HPC (High Performance Computing) environment - mainly motivated by scalability, performance, and production-readiness.
It happens that, as part of the Intel Software Innovator Programme, Intel has kindly granted me access to the Intel Nervana AI HPC Cluster (aka Colfax Cluster). Each cluster node is powered by Intel Xeon / Xeon Phi processors, and distributed computing is also possible. Although it has no Nvidia GPUs (which everybody in the Deep Learning world seems to love at the moment), the Intel Xeon Phi CPU option could be a viable alternative. In addition, since TensorFlow is designed with production deployment in mind and has strong open-source governance, I thought it would be worth trying on the HPC cluster environment.
Jupyter Notebook has been the key tool for me to learn Python, Data Science, and Deep Learning easily - it makes rapid iteration super easy. I am keen to continue using Jupyter Notebook as a starting point for building production-ready applications. Although at some point in the production pipeline we may move away from it to other tools (such as a web server), Jupyter Notebook is a great starting point.
In this post, we will walk through the steps needed to set up and access a Jupyter Notebook running on an Intel Xeon Phi cluster node, wrapped inside a Python 3.6 and TensorFlow conda environment. We will see that switching to another Python version is just a matter of tweaking the conda environment setup. Although I may write similar posts for other frameworks, the steps will be pretty much identical.
I've included the files mentioned in this article in this GitHub Repository. This includes:
- the conda environment file
- the shell script (that we qsub to a cluster node to start up a Jupyter notebook)
- a sample Tensorflow Jupyter Notebook
That said, I've also included the content of these files directly within this article.
It is assumed that you already have access to an Intel AI Cluster (aka Colfax Cluster), are able to connect remotely to the cluster login node with the following command, and have some familiarity navigating the file system.
$ ssh colfax
If not, this article might not be for you (yet). To get (short-term) access and have a play with running Deep Learning frameworks on Intel Xeon / Xeon Phi processors, you can either:
- check out the free Intel AI Nervana Academy Cloud Dev Compute for education and access, or
- check out the free Colfax Deepdive web course for education and access.
We can create the conda environment via one of two methods (whichever you prefer).
Create Conda environment:
[u4443@c001]$ conda create --name tensorflow-36 python=3.6 -c intel
Install Jupyter:
[u4443@c001]$ conda install jupyter -c intel
Install TensorFlow (the Intel distribution):
[u4443@c001]$ conda install tensorflow -c intel
Note that we can keep adding packages by doing conda install some-cool-packages.
Create a conda environment file tensorflow-36.yml (tweak as desired):
name: tensorflow-36
channels:
- intel
dependencies:
- python=3.6 # your choice
- jupyter
- tensorflow
(Note that we can add more packages by appending to the list under dependencies.)
Create Conda environment:
[u4443@c001]$ conda env create -f tensorflow-36.yml
Activate the conda environment:
[u4443@c001]$ source activate tensorflow-36
Check what packages we have in the conda environment:
(tensorflow-36) [u4443@c001]$ conda list
We should see something like the following. Note that pretty much all packages come from the "intel" channel.
# packages in environment at /home/u4443/.conda/envs/tensorflow-36:
#
backports 1.0 py36_intel_6 [intel] intel
bleach 1.5.0 py36_intel_0 [intel] intel
decorator 4.0.11 py36_intel_1 [intel] intel
entrypoints 0.2.2 py36_intel_2 [intel] intel
get_terminal_size 1.0.0 py36_intel_5 [intel] intel
html5lib 0.999 py36_intel_0 [intel] intel
icc_rt 16.0.3 intel_14 [intel] intel
intelpython 2018.0.0 3 intel
ipykernel 4.6.1 py36_intel_0 [intel] intel
ipython 6.1.0 py36_intel_0 [intel] intel
ipython_genutils 0.2.0 py36_intel_0 [intel] intel
ipywidgets 6.0.0 py36_intel_0 [intel] intel
jinja2 2.9.6 py36_intel_0 [intel] intel
jsonschema 2.6.0 py36_intel_0 [intel] intel
jupyter 1.0.0 py36_intel_5 [intel] intel
jupyter_client 5.1.0 py36_intel_0 [intel] intel
jupyter_console 5.1.0 py36_intel_0 [intel] intel
jupyter_core 4.3.0 py36_intel_1 [intel] intel
libsodium 1.0.10 intel_6 [intel] intel
markupsafe 0.23 py36_intel_6 [intel] intel
mistune 0.7.4 py36_intel_1 [intel] intel
mkl 2018.0.0 intel_4 intel
mock 2.0.0 py36_intel_4 [intel] intel
nbconvert 5.2.1 py36_intel_0 [intel] intel
nbformat 4.3.0 py36_intel_0 [intel] intel
notebook 5.0.0 py36_intel_0 [intel] intel
numpy 1.13.1 py36_intel_16 [intel] intel
openmp 2018.0.0 intel_7 intel
openssl 1.0.2k intel_3 [intel] intel
pandocfilters 1.4.1 py36_intel_0 [intel] intel
path.py 10.3.1 py36_intel_0 [intel] intel
pbr 1.10.0 py36_intel_4 [intel] intel
pexpect 4.2.1 py36_intel_1 [intel] intel
pickleshare 0.7.4 py36_intel_1 [intel] intel
pip 9.0.1 py36_intel_0 [intel] intel
prompt_toolkit 1.0.14 py36_intel_0 [intel] intel
protobuf 3.2.0 py36_intel_0 [intel] intel
ptyprocess 0.5.1 py36_intel_5 [intel] intel
pygments 2.2.0 py36_intel_1 [intel] intel
python 3.6.2 intel_3 [intel] intel
python-dateutil 2.6.0 py36_intel_3 [intel] intel
pyzmq 16.0.2 py36_intel_3 [intel] intel
setuptools 27.2.0 py36_intel_0 [intel] intel
simplegeneric 0.8.1 py36_intel_5 [intel] intel
six 1.10.0 py36_intel_8 [intel] intel
sqlite 3.13.0 intel_15 [intel] intel
tcl 8.6.4 intel_17 [intel] intel
tensorflow 1.2.1 np113py36_1 intel
terminado 0.6 py36_intel_6 [intel] intel
testpath 0.3.1 py36_intel_0 [intel] intel
tk 8.6.4 intel_26 [intel] intel
tornado 4.5.1 py36_intel_0 [intel] intel
traitlets 4.3.2 py36_intel_1 [intel] intel
wcwidth 0.1.7 py36_intel_5 [intel] intel
weakref 1.0rc1 py36_2 intel
werkzeug 0.12.2 py36_1 intel
wheel 0.29.0 py36_intel_5 [intel] intel
widgetsnbextension 2.0.0 py36_2 intel
xz 5.2.2 intel_16 [intel] intel
zeromq 4.1.5 intel_0 [intel] intel
zlib 1.2.11 intel_3 [intel] intel
Create a Jupyter notebook config file:
(tensorflow-36) [u4443@c001]$ jupyter notebook --generate-config
This will create a Jupyter config file ~/.jupyter/jupyter_notebook_config.py (all lines commented out by default).
Start a Python interactive console:
(tensorflow-36) [u4443@c001]$ python
Create a password for your Jupyter Notebook (so only you can access it):
>>> from notebook.auth import passwd; passwd()
Enter password:
Verify password:
'sha1:fe4d...f34:30cb0e...ag36'
The last line is the hashed password. Copy this.
Add the following block at the top of the newly generated Jupyter config file ~/.jupyter/jupyter_notebook_config.py, pasting in the hashed password. Also, pick a port number of your liking (higher numbers are less likely to collide with ports already in use):
# The string should be of the form type:salt:hashed-password.
c.NotebookApp.password = 'sha1:fe4d...f34:30cb0e...ag36'
# The port the notebook server will listen on.
c.NotebookApp.port = 9213
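For reference, the `type:salt:hashed-password` string follows a simple salted-hash scheme, and it can be reproduced (and verified) with the Python standard library alone. This is just a sketch of the scheme for understanding - the function names are my own, and in practice you should keep using notebook.auth.passwd as shown above:

```python
import hashlib
import random


def make_password_hash(password, algorithm="sha1"):
    """Build a 'type:salt:hashed-password' string of the kind Jupyter expects."""
    salt = "%012x" % random.getrandbits(48)  # 12 hex digits of random salt
    h = hashlib.new(algorithm)
    h.update(password.encode("utf-8") + salt.encode("ascii"))
    return ":".join((algorithm, salt, h.hexdigest()))


def check_password_hash(password, hashed):
    """Check a plain-text password against a 'type:salt:digest' string."""
    algorithm, salt, digest = hashed.split(":", 2)
    h = hashlib.new(algorithm)
    h.update(password.encode("utf-8") + salt.encode("ascii"))
    return h.hexdigest() == digest
```

This also explains why we later log in with the plain-text password: the notebook server hashes what we type and compares it against the stored digest.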
Create a shell script jup.sh:
#PBS -l nodes=1:knl:flat
source ~/.conda/envs/tensorflow-36/bin/activate tensorflow-36 && jupyter notebook --no-browser
What this shell script does:
- Line 1: requests one Knights Landing (KNL) node to run our job on, in flat memory mode (optional).
- Line 2: activates the conda environment and starts a Jupyter notebook on the requested cluster node.
- Line 3: makes sure we have an empty line at the bottom of the file. Apparently it wouldn't work otherwise.
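If your queue supports additional standard PBS directives, jup.sh can be extended with a job name and a walltime limit. A sketch - the walltime value here is a hypothetical placeholder, so adjust it to your queue's actual limits:

```shell
#PBS -l nodes=1:knl:flat
#PBS -l walltime=04:00:00
#PBS -N jupyter
source ~/.conda/envs/tensorflow-36/bin/activate tensorflow-36 && jupyter notebook --no-browser
```

Naming the job (-N) makes it easier to spot in the qstat output later.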
Invoke the shell script on a cluster node:
(tensorflow-36) [u4443@c001]$ qsub jup.sh
23550.c001
(We should get a job number back - something like 23550.c001.)
Check that we now have a job running on a cluster node:
(tensorflow-36) [u4443@c001]$ qstat
We should see our job is running:
Job ID Name User Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
23550.c001 jup.sh u4443 0 R batch
Find which cluster node the job is running on:
$ qstat 23550 -f | grep exec_host
The command returns the cluster node:
exec_host = c001-n009/0
Make a note of the cluster node name. In this example, it is c001-n009.
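If you find yourself doing this often, the node name can be pulled out of the qstat output programmatically. A small sketch, assuming the `exec_host = c001-n009/0` format shown above (the helper name is my own):

```python
import re


def parse_exec_host(qstat_output):
    """Extract the cluster node name from `qstat <job-id> -f` output."""
    match = re.search(r"exec_host\s*=\s*([\w-]+)", qstat_output)
    return match.group(1) if match else None


sample = "    exec_host = c001-n009/0"
print(parse_exec_host(sample))  # c001-n009
```

In practice you would feed it the output of `qstat 23550 -f` captured via subprocess.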
Start up a new terminal (on your laptop, not the remote cluster) and set up port forwarding like this:
$ ssh -L 9213:localhost:9213 colfax ssh -L 9213:localhost:9213 c001-n009
Observation:
- use the port number you specified (see Step 3). In this example: port 9213
- use the cluster node name you identified (see Step 4). In this example: c001-n009
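Before opening the browser, it can be handy to confirm that the tunnel is actually up. A small standard-library helper for that (the function name is my own invention):

```python
import socket


def port_is_open(port, host="localhost", timeout=1.0):
    """Return True if something is listening on host:port (e.g. our SSH tunnel)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Once the ssh tunnel above is established, this should report True:
print(port_is_open(9213))
```

If it reports False, re-check the two ssh hops and the port number in the Jupyter config.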
This is probably the same as the Mac solution above - yet to be tested.
I have not tested this personally, but I am copying and pasting (with tweaks) a solution from a blog post here for reference.
- Open PuTTY and load your cluster session, but do not open it yet. Go to Connections ==> SSH ==> Tunnels. Set the source port to 9213 and the destination to localhost:9213, click Add, and leave the other settings unchanged.
- You can now start your session and activate your conda environment:
source activate tensorflow-36
- All that has to be done now is:
(tensorflow-36)$ jupyter notebook --no-browser
What we have done so far:
- Step 1 to 4: setup Conda environment and started Jupyter notebook on a cluster node.
- Step 5: performed port forwarding.
We can now access the notebook via http://localhost:9213/ - change the port number as needed.
When asked for a password, provide the plain-text password you created in Step 3 (not the hashed version).
Snapshot 1 - Jupyter Home Page:
Snapshot 2 - Create a Jupyter Notebook (at any location you like), to prove that:
- Jupyter Notebook is indeed running on the remote cluster node (c001-n009 in this example)
- we can import TensorFlow (import tensorflow) and run some code.
Some sample code here for you to try out:
# prove that we are running notebook on the cluster node
import socket
print("Running on Colfax Cluster Node: {}".format(socket.gethostname()))
# test importing tensorflow and running some code
import tensorflow as tf
# a simple Tensorflow Hello World to prove that it is working
hello = tf.constant('Hello, TensorFlow!')
sess = tf.Session()
print(sess.run(hello))
Good practice: when you are done, save your Jupyter Notebook and delete the job on the cluster node - this frees up resources and avoids "server hoarding". (Though this may require validation, I tend to do it anyway just in case!)
qdel 23550
(Just replace the job id 23550 with your job number.)
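If you are scripting this cleanup, the number that qdel expects can be trimmed off the string that qsub printed earlier. A tiny sketch (the helper name is my own):

```python
def job_number(qsub_output):
    """Extract the numeric job id from qsub output like '23550.c001'."""
    return qsub_output.strip().split(".", 1)[0]


print(job_number("23550.c001\n"))  # 23550
```

You could then pass the result straight to a `qdel` call via subprocess.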
If you have walked through the steps above, you will see that the steps for setting up other frameworks are almost identical: simply tweak the conda setup step to install the other framework instead of TensorFlow, then activate that conda environment and run from there. Try out some code for that framework and it should work.
In this article we have illustrated the steps to (1) create a conda environment with Intel Distribution Python, Jupyter, and TensorFlow, (2) submit a shell script to a cluster node via qsub to start a Jupyter notebook, and (3) access the Jupyter Notebook via port forwarding.
- Connecting Jupyter Notebook on compute node: from a Colfax forum.
- How to start with Python on Colfax Cluster: from a Kaggle competition forum that uses the Intel AI Cluster. Covers how to start and access a Jupyter Notebook on the Colfax cluster via SSH tunnelling.
- Notes on Starting with Deep Learning with Python on HPC Cluster: covers environment setup and running Jupyter Notebook on the Colfax cluster.
- How to setup PyTorch Jupyter Notebook on Intel Nervana AI Cluster (Colfax) For Deep Learning
Deep Learning on Intel Nervana AI Cluster (aka Colfax HPC Cluster)