How to Set Up PyTorch Jupyter Notebook on Intel Nervana AI Cluster (Colfax) for Deep Learning

Introduction - PyTorch Jupyter Notebook on HPC Cluster

PyTorch is a modern Deep Learning framework that puts Python first. It features strong GPU acceleration (and moderate CPU support) for tensor computations, and dynamic graph capability. I like Python and Deep Learning, and so I decided to give it a go.

At the time of writing, I have access to a MacBook Pro (which has an Intel i7 CPU and no Nvidia GPU) and a dual-boot Windows/Linux PC (which has an i7 CPU and an Nvidia GeForce GTX 750 Ti with 2 GB RAM). Though these machines are fine for training some simple models for educational purposes, I am keen to explore doing research and development in a cloud / HPC (High Performance Computing) environment - mainly motivated by scalability, performance, and production-readiness.

It happens that as part of the Intel Software Innovator Programme, Intel has kindly granted me access to the Intel Nervana AI HPC Cluster (aka Colfax Cluster). Each cluster node is powered by Intel Xeon / Xeon Phi processors, and distributed computing is also possible. The cluster has no Nvidia GPUs (which everybody seems to love in the Deep Learning world at the moment), but the Intel Xeon Phi CPU option could be a viable alternative. In addition, as PyTorch is still new, I thought it might be worth trying it out on the HPC cluster environment.

Jupyter Notebook has been the key tool for me to learn Python, Data Science, and Deep Learning - it makes rapid iteration super easy. I am very keen to continue using Jupyter Notebook as a starting point for building production-ready applications. At some point in the production pipeline we may move away from Jupyter Notebook to other tools (such as a web server), but it is a great place to start.

In this post, we will walk through the steps needed to set up and access a Jupyter notebook that runs on an Intel Xeon Phi cluster node, inside a conda environment with Python 3.6 and PyTorch. Switching to another Python version is just a matter of tweaking the conda environment setup steps. I may write a similar post for other frameworks such as TensorFlow, but we shall see that the steps are pretty much identical.

GitHub Repository

I've included the files as mentioned in this article in this GitHub Repository. This includes:

  • the conda environment file
  • the shell script (that we qsub to a cluster node to start up a Jupyter notebook)
  • a sample PyTorch Jupyter Notebook

That said I've also included the content of the files within this article directly.

Instruction

Step 0 - SSH to Remote Login Node

It is assumed that you already have access to an Intel AI Cluster (aka Colfax Cluster), that you can connect to the cluster login node with the command below, and that you have some familiarity with navigating around the file system.

$ ssh colfax

If not, this article might not be for you (yet). To get (short term) access and have a play with running Deep Learning frameworks on Intel Xeon / Xeon Phi processors you can either:

Step 1 - Create Conda Environment

We can create the conda environment via two methods (up to your taste).

Step 1 - Option 1 - Do this via commands interactively

Create Conda environment:

[u4443@c001]$ conda create --name pytorch-36 python=3.6 -c intel

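Install Jupyter into the new environment (the -n flag targets the environment by name, since we have not activated it yet):

[u4443@c001]$ conda install -n pytorch-36 jupyter -c intel

Install PyTorch:

[u4443@c001]$ conda install -n pytorch-36 pytorch torchvision -c soumith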

Note that we can keep adding packages by doing conda install some-cool-packages.

Step 1 - Option 2 - Do this in one go via an environment file

Create Conda environment file pytorch-36.yml (tweak as desired):

name: pytorch-36
channels:
  - intel
  - soumith      # pytorch
dependencies:
  - python=3.6   # your choice
  - jupyter
  - pytorch      # pytorch
  - torchvision  # pytorch

(Note that we can add more packages by appending to the list under dependencies.)

Create Conda environment:

[u4443@c001]$ conda env create -f pytorch-36.yml

Step 2 - Check Conda Environment

Activate the conda environment:

[u4443@c001]$ source activate pytorch-36

Check what packages we have in the conda environment:

(pytorch-36) [u4443@c001]$ conda list

We should see something like the following. Note that pretty much all packages come from the "intel" channel. Only the PyTorch-related packages are non-Intel (soumith - a channel by pytorch.org).

# packages in environment at /home/u4443/.conda/envs/pytorch-36:
#
backports                 1.0                py36_intel_6  [intel]  intel
bleach                    1.5.0              py36_intel_0  [intel]  intel
cffi                      1.10.0             py36_intel_0  [intel]  intel
decorator                 4.0.11             py36_intel_1  [intel]  intel
entrypoints               0.2.2              py36_intel_2  [intel]  intel
freetype                  2.8                     intel_0  [intel]  intel
get_terminal_size         1.0.0              py36_intel_5  [intel]  intel
html5lib                  0.999              py36_intel_0  [intel]  intel
icc_rt                    16.0.3                 intel_14  [intel]  intel
intelpython               2018.0.0                      3    intel
ipykernel                 4.6.1              py36_intel_0  [intel]  intel
ipython                   6.1.0              py36_intel_0  [intel]  intel
ipython_genutils          0.2.0              py36_intel_0  [intel]  intel
ipywidgets                6.0.0              py36_intel_0  [intel]  intel
jinja2                    2.9.6              py36_intel_0  [intel]  intel
jpeg                      9b                      intel_0  [intel]  intel
jsonschema                2.6.0              py36_intel_0  [intel]  intel
jupyter                   1.0.0              py36_intel_5  [intel]  intel
jupyter_client            5.1.0              py36_intel_0  [intel]  intel
jupyter_console           5.1.0              py36_intel_0  [intel]  intel
jupyter_core              4.3.0              py36_intel_1  [intel]  intel
libffi                    3.2.1                   intel_4  [intel]  intel
libgcc                    5.2.0                         0
libpng                    1.6.30                  intel_0  [intel]  intel
libsodium                 1.0.10                  intel_6  [intel]  intel
libtiff                   4.0.8                   intel_1  [intel]  intel
markupsafe                0.23               py36_intel_6  [intel]  intel
mistune                   0.7.4              py36_intel_1  [intel]  intel
mkl                       2018.0.0                intel_4    intel
nbconvert                 5.2.1              py36_intel_0  [intel]  intel
nbformat                  4.3.0              py36_intel_0  [intel]  intel
notebook                  5.0.0              py36_intel_0  [intel]  intel
numpy                     1.13.1            py36_intel_16  [intel]  intel
olefile                   0.44               py36_intel_0  [intel]  intel
openmp                    2018.0.0                intel_7    intel
openssl                   1.0.2k                  intel_3  [intel]  intel
pandocfilters             1.4.1              py36_intel_0  [intel]  intel
path.py                   10.3.1             py36_intel_0  [intel]  intel
pexpect                   4.2.1              py36_intel_1  [intel]  intel
pickleshare               0.7.4              py36_intel_1  [intel]  intel
pillow                    4.2.1              py36_intel_0  [intel]  intel
pip                       9.0.1              py36_intel_0  [intel]  intel
prompt_toolkit            1.0.14             py36_intel_0  [intel]  intel
ptyprocess                0.5.1              py36_intel_5  [intel]  intel
pycparser                 2.17               py36_intel_0  [intel]  intel
pygments                  2.2.0              py36_intel_1  [intel]  intel
python                    3.6.2                   intel_3  [intel]  intel
python-dateutil           2.6.0              py36_intel_3  [intel]  intel
pytorch                   0.2.0           py36hf0d2509_4cu75    soumith
pyzmq                     16.0.2             py36_intel_3  [intel]  intel
setuptools                27.2.0             py36_intel_0  [intel]  intel
simplegeneric             0.8.1              py36_intel_5  [intel]  intel
six                       1.10.0             py36_intel_8  [intel]  intel
sqlite                    3.13.0                 intel_15  [intel]  intel
tcl                       8.6.4                  intel_17  [intel]  intel
terminado                 0.6                py36_intel_6  [intel]  intel
testpath                  0.3.1              py36_intel_0  [intel]  intel
tk                        8.6.4                  intel_26  [intel]  intel
torchvision               0.1.9            py36h7584368_1    soumith
tornado                   4.5.1              py36_intel_0  [intel]  intel
traitlets                 4.3.2              py36_intel_1  [intel]  intel
wcwidth                   0.1.7              py36_intel_5  [intel]  intel
wheel                     0.29.0             py36_intel_5  [intel]  intel
widgetsnbextension        2.0.0                    py36_2    intel
xz                        5.2.2                  intel_16  [intel]  intel
zeromq                    4.1.5                   intel_0  [intel]  intel
zlib                      1.2.11                  intel_3  [intel]  intel
(pytorch-36) [u4443@c001]$
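
If you want to double check the "mostly Intel channel" observation without eyeballing the whole listing, here is a minimal sketch that counts packages per channel from the text printed by conda list. The function name count_packages_by_channel is my own, and the parsing assumes the column layout shown above (the channel is the last column, and rows with only three columns have no channel recorded).

```python
from collections import Counter


def count_packages_by_channel(conda_list_output):
    """Count packages per channel in the text printed by `conda list`."""
    counts = Counter()
    for line in conda_list_output.splitlines():
        line = line.strip()
        # Skip blank lines and the "# packages in environment ..." header.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 3:
            # Not a package row (e.g. a stray shell prompt).
            continue
        # Rows look like: name, version, build, [optional tag], channel.
        # A row with only name/version/build has no channel recorded.
        channel = fields[-1] if len(fields) >= 4 else "(no channel)"
        counts[channel] += 1
    return counts


sample = """\
# packages in environment at /home/u4443/.conda/envs/pytorch-36:
#
jupyter                   1.0.0              py36_intel_5  [intel]  intel
libgcc                    5.2.0                         0
numpy                     1.13.1            py36_intel_16  [intel]  intel
pytorch                   0.2.0           py36hf0d2509_4cu75    soumith
torchvision               0.1.9            py36h7584368_1    soumith
"""

counts = count_packages_by_channel(sample)
for channel, n in counts.most_common():
    print(channel, n)
```

On the real cluster you could paste the full conda list output into the sample string, or capture it with subprocess, and feed it to the same function.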

Step 3 - Setup Jupyter Notebook Config

Create a Jupyter notebook config file:

(pytorch-36) [u4443@c001]$ jupyter notebook --generate-config

This will create a Jupyter config file ~/.jupyter/jupyter_notebook_config.py (all lines commented out by default).

Start a Python interactive console:

(pytorch-36) [u4443@c001]$ python

Create a password for your Jupyter Notebook (so only you can access it):

>>> from notebook.auth import passwd; passwd()
Enter password: 
Verify password: 
'sha1:fe4d...f34:30cb0e...ag36'

The last line is the hashed password. Copy this.
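
The returned string has three colon-separated parts: the hash type, a salt, and the digest. To make that less opaque, here is a minimal standard-library sketch of how a string of this shape can be produced and checked. As far as I understand, it mirrors the salted SHA-1 scheme used by notebook.auth.passwd in this generation of Jupyter (digest of the password bytes followed by the salt bytes), but treat that detail as an assumption. The names make_hash and check_hash are hypothetical, and you should keep using passwd() itself to generate the real value.

```python
import hashlib
import secrets


def make_hash(passphrase, algorithm="sha1", salt_len=12):
    """Return 'algorithm:salt:digest' for the given passphrase."""
    # token_hex(n) yields 2*n hex characters, so this is a 12-char salt.
    salt = secrets.token_hex(salt_len // 2)
    h = hashlib.new(algorithm)
    h.update(passphrase.encode("utf-8") + salt.encode("ascii"))
    return ":".join((algorithm, salt, h.hexdigest()))


def check_hash(hashed, passphrase):
    """Return True if passphrase matches an 'algorithm:salt:digest' string."""
    try:
        algorithm, salt, digest = hashed.split(":", 2)
        h = hashlib.new(algorithm)
    except ValueError:
        # Malformed string or unknown hash algorithm.
        return False
    h.update(passphrase.encode("utf-8") + salt.encode("ascii"))
    # Constant-time comparison to avoid leaking how many characters matched.
    return secrets.compare_digest(h.hexdigest(), digest)


hashed = make_hash("my notebook password")
print(hashed.split(":")[0], len(hashed.split(":")))
print(check_hash(hashed, "my notebook password"))
print(check_hash(hashed, "a wrong guess"))
```

The point is simply that the config file never stores the plain-text password, only the salt and the digest derived from it.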

Add the following block at the top of the newly generated Jupyter config file ~/.jupyter/jupyter_notebook_config.py, pasting in your hashed password. Also pick a port number of your liking. A high, uncommon number is better, as it is less likely to collide with a port that another user or service is already using.

# The string should be of the form type:salt:hashed-password.
c.NotebookApp.password = 'sha1:fe4d...f34:30cb0e...ag36'
# The port the notebook server will listen on.
c.NotebookApp.port = 9213
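
If you are unsure whether a port is already taken, a quick sketch like the one below can check whether it can be bound right now. The helper name port_is_free is my own. Note the limitation: it only tests the machine it runs on (for example the login node), at the moment it runs, so it cannot guarantee the port will be free on whichever cluster node your job later lands on.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if (host, port) can be bound on this machine right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "address already in use" or a permission error.
            return False
    return True


print(port_is_free(9213))
```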

Step 4 - Start a Jupyter Notebook on a Cluster Node

Create a shell script jup.sh:

#PBS -l nodes=1:knl:flat
source ~/.conda/envs/pytorch-36/bin/activate pytorch-36 && jupyter notebook --no-browser

What this shell script does:

  • line 1: request 1 Knights Landing (KNL) node to run our job on. The flat property (optional) asks for a node configured in flat memory mode.
  • line 2: activate the conda environment, and start a Jupyter notebook on the requested cluster node.
  • line 3: make sure we have an empty line at the bottom of the file. Apparently it wouldn't work otherwise.

Invoke the shell script on a cluster node:

(pytorch-36) [u4443@c001]$ qsub jup.sh
23548.c001

(We should get a job number back, something like 23548.c001.)

Check that we now have a job running on a cluster node:

(pytorch-36) [u4443@c001]$ qstat

We should see our job is running:

Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
23548.c001                 jup.sh           u4443           00:00:12 R batch

Find which cluster node the job is running on:

$ qstat -f 23548 | grep exec_host

The command returns the cluster node:

exec_host = c001-n009/0

Make a note of the cluster node name. In this example, it is c001-n009.
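
If you would rather not read the node name off by hand, here is a small sketch that extracts it from the full qstat output. The function name parse_exec_host is hypothetical, and the parsing assumes the exec_host value looks like the example above (node name, a slash, then a slot), possibly with several entries joined by "+".

```python
import re


def parse_exec_host(qstat_full_output):
    """Return the first node name from an 'exec_host = node/slot' line."""
    match = re.search(r"exec_host\s*=\s*(\S+)", qstat_full_output)
    if match is None:
        # Job not running yet, or no exec_host line in the text.
        return None
    # Keep the first entry if several are joined with "+",
    # then drop the "/<slot>" suffix: "c001-n009/0" -> "c001-n009".
    first_entry = match.group(1).split("+")[0]
    return first_entry.split("/")[0]


sample = "    job_state = R\n    exec_host = c001-n009/0\n"
node = parse_exec_host(sample)
print(node)
```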

Step 5 - Perform Port Forwarding (locally on Laptop)

From a Mac

Start up a new terminal on your laptop (not on the remote cluster), and set up port forwarding like this:

ssh -L 9213:localhost:9213 colfax ssh -L 9213:localhost:9213 c001-n009

Notes:

  • use the port number you specified (see Step 3). In this example: port 9213
  • use the cluster node name you identified (see Step 4). In this example: c001-n009
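
Since the port and node name change from session to session, it can be handy to generate the command instead of editing it by hand. Below is a minimal sketch. The function name tunnel_command is my own, and it simply reproduces the two-hop pattern shown above (laptop to login node, then login node to cluster node).

```python
import shlex


def tunnel_command(port, login_host, node):
    """Build the two-hop ssh port-forwarding command as a list of arguments."""
    if not 1 <= int(port) <= 65535:
        raise ValueError("port must be between 1 and 65535")
    # The same local:remote mapping is used on both hops.
    forward = "{0}:localhost:{0}".format(int(port))
    return ["ssh", "-L", forward, login_host, "ssh", "-L", forward, node]


cmd = tunnel_command(9213, "colfax", "c001-n009")
print(" ".join(shlex.quote(part) for part in cmd))
# prints: ssh -L 9213:localhost:9213 colfax ssh -L 9213:localhost:9213 c001-n009
```

The list form can be passed directly to subprocess.run(cmd) if you prefer to launch the tunnel from Python rather than copy and paste it into a terminal.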

From a Linux Laptop/PC

Probably the same as the Mac solution, though I have yet to test it.

From a Windows Laptop/PC

I have not tested this personally, but here is a solution copied (with tweaks) from a blog post, for reference:

  • Open PuTTY and load your cluster session, but do not open it yet. Go to Connection > SSH > Tunnels. Set the source port to 9213 and the destination to localhost:9213, then click Add. Do not change other settings.
  • You can now start your session and activate your conda environment: source activate pytorch-36
  • All that has to be done now is: (pytorch-36)$ jupyter notebook --no-browser

Step 6 - Access Jupyter Notebook

What we have done so far:

  • Step 1 to 4: set up the Conda environment and started a Jupyter notebook on a cluster node.
  • Step 5: performed port forwarding.

We can now access the notebook via http://localhost:9213/ - change the port number as needed.

When asked for a password, enter the plain-text password you created in Step 3 (not the hashed version).
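
If the page does not load, a quick way to tell whether the tunnel itself is the problem is to test whether anything is accepting connections on the forwarded port. Here is a minimal sketch to run on the laptop. The helper name notebook_reachable is hypothetical, and it only checks that a TCP connection can be opened, not that the thing answering is actually Jupyter.

```python
import socket


def notebook_reachable(port, host="localhost", timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host not resolvable.
        return False


print(notebook_reachable(9213))
```

If this prints False, revisit Step 5 (is the ssh tunnel still open?) and Step 4 (is the job still running on the node you forwarded to?).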

Snapshot 1 - Jupyter Home Page:

colfax-jupyter-pytorch.png

Snapshot 2 - Create a Jupyter Notebook (at any location you like), to prove that:

  • Jupyter Notebook is indeed running on the remote cluster node (c001-n009 in this example)
  • We can import PyTorch (import torch) and run some code.

colfax-jupyter.png


Some sample code for you to try out:

import socket
print("Running on Colfax Cluster Node: {}".format(socket.gethostname()))

# Check that torch imports, then create and inspect a random 5x7 tensor
import torch
a = torch.randn(5, 7)
print(a)
print(a.size())

Free Up Cluster Node Resource

Good practice: when you are done, save your Jupyter Notebook and delete the job on the cluster node. This frees up resources and avoids "server hoarding". (I have not verified that this is strictly required, but I tend to do it anyway just in case!)

qdel 23548

(Just replace the job ID 23548 with your own job number.)

What about other Deep Learning Frameworks?

If you have walked through the steps above, you will see that the steps for setting up (say) TensorFlow are almost identical. Simply tweak the conda setup step to install tensorflow instead of pytorch, then activate the TensorFlow conda environment and run from there. Try out some TensorFlow code and it should work. (That said, I might write a similar article for TensorFlow at some point, as it has a big user base and is very production-ready.)
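
To illustrate how little changes between frameworks, here is a minimal sketch that renders a conda environment file in the same layout as pytorch-36.yml from a few parameters. The function name render_env_file is my own. The environment name tensorflow-36 and the assumption that a package called tensorflow can be installed from the intel channel are hypothetical, so check what your channels actually offer before relying on them.

```python
def render_env_file(name, python_version, channels, packages):
    """Return the text of a conda environment file."""
    lines = ["name: {}".format(name), "channels:"]
    lines += ["  - {}".format(channel) for channel in channels]
    lines.append("dependencies:")
    lines.append("  - python={}".format(python_version))
    lines += ["  - {}".format(package) for package in packages]
    return "\n".join(lines) + "\n"


# The PyTorch environment from Step 1 ...
pytorch_env = render_env_file(
    "pytorch-36", "3.6", ["intel", "soumith"],
    ["jupyter", "pytorch", "torchvision"],
)

# ... and a hypothetical TensorFlow equivalent (names are assumptions).
tensorflow_env = render_env_file(
    "tensorflow-36", "3.6", ["intel"], ["jupyter", "tensorflow"],
)

print(tensorflow_env)
```

Write the returned text to a .yml file and pass it to conda env create -f as in Step 1, Option 2. The remaining steps (Jupyter config, qsub, port forwarding) stay the same apart from the environment name in jup.sh.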

Conclusion

In this article we have illustrated the steps to (1) create a conda environment with Intel Distribution for Python and Jupyter, plus non-Intel PyTorch, (2) submit a shell script to a cluster node via qsub to start a Jupyter notebook, and (3) access the Jupyter Notebook via port forwarding.

References


Deep Learning on Intel Nervana AI Cluster (aka Colfax HPC Cluster)
