PyTorch is a modern Deep Learning framework that puts Python first. It features strong GPU acceleration (and moderate CPU support) for tensor computations, plus dynamic computation graphs. I like Python and Deep Learning, so I decided to give it a go.
At the time of writing, I have access to a MacBook Pro (which has an Intel i7 CPU and no Nvidia GPU) and a dual-bootable Windows/Linux PC (which has an i7 CPU and an Nvidia GeForce GTX 750 Ti with 2 GB RAM). Though these machines would be fine for running some simple training models for educational purposes, I am keen to explore research and development on a cloud / HPC (High Performance Computing) environment - mainly motivated by scalability, performance, and production-readiness.
It happens that, as part of the Intel Software Innovator Programme, Intel has kindly granted me access to the Intel Nervana AI HPC Cluster (aka Colfax Cluster). Each cluster node is powered by Intel Xeon / Xeon Phi processors, and distributed computing is also possible. Although it lacks the Nvidia GPUs that everybody in the Deep Learning world seems to love at the moment, the Intel Xeon Phi CPU option could be a viable alternative. In addition, as PyTorch is still new, I thought it might be worth trying it on the HPC cluster environment.
Jupyter Notebook has been the key tool for me to learn Python, Data Science, and Deep Learning easily - it's made rapid iteration super easy, and I am very keen to continue using it as a starting point in building production-ready applications. Though at some point of the production pipeline we may move away from Jupyter Notebook to other tools (such as a web server), it is a great starting point.
In this post, we will walk through the steps needed to set up and access a Jupyter notebook that runs on an Intel Xeon Phi cluster node, wrapped inside a Python 3.6 and PyTorch Conda environment. We will see that switching to another Python version is just a matter of tweaking the conda environment setup steps. Though I may write a similar post for other frameworks such as TensorFlow, the steps would be pretty much identical.
I've included the files mentioned in this article in this GitHub Repository. This includes:
- the conda environment file
- the shell script (that we qsub to a cluster node to start up a Jupyter notebook)
- a sample PyTorch Jupyter Notebook
That said, I've also included the content of those files directly within this article.
It is assumed that you already have access to an Intel AI Cluster (aka Colfax Cluster), that you can connect remotely to the cluster login node with the following command, and that you have some familiarity with navigating the file system:
$ ssh colfax
If not, this article might not be for you (yet). To get (short term) access and have a play with running Deep Learning frameworks on Intel Xeon / Xeon Phi processors you can either:
- check out the free Intel AI Nervana Academy Cloud Dev Compute for education and access, or
- check out the free Colfax Deepdive web course for education and access.
We can create the conda environment via either of two methods (whichever you prefer).
Create Conda environment:
[u4443@c001]$ conda create --name pytorch-36 python=3.6 -c intel
Install Jupyter:
[u4443@c001]$ conda install jupyter -c intel
Install Pytorch:
[u4443@c001]$ conda install pytorch torchvision -c soumith
Note that we can keep adding packages with conda install some-cool-package.
Create the Conda environment file pytorch-36.yml (tweak as desired):
name: pytorch-36
channels:
- intel
- soumith # pytorch
dependencies:
- python=3.6 # your choice
- jupyter
- pytorch # pytorch
- torchvision # pytorch
(Note that we can add more packages by appending to the list under dependencies.)
Create Conda environment:
[u4443@c001]$ conda env create -f pytorch-36.yml
Activate the conda environment:
[u4443@c001]$ source activate pytorch-36
Check what packages we have in the conda environment:
(pytorch-36) [u4443@c001]$ conda list
We shall see something like the following. Note that pretty much all packages come from the intel channel; only the PyTorch-related packages are non-Intel, from soumith (a channel by pytorch.org).
# packages in environment at /home/u4443/.conda/envs/pytorch-36:
#
backports 1.0 py36_intel_6 [intel] intel
bleach 1.5.0 py36_intel_0 [intel] intel
cffi 1.10.0 py36_intel_0 [intel] intel
decorator 4.0.11 py36_intel_1 [intel] intel
entrypoints 0.2.2 py36_intel_2 [intel] intel
freetype 2.8 intel_0 [intel] intel
get_terminal_size 1.0.0 py36_intel_5 [intel] intel
html5lib 0.999 py36_intel_0 [intel] intel
icc_rt 16.0.3 intel_14 [intel] intel
intelpython 2018.0.0 3 intel
ipykernel 4.6.1 py36_intel_0 [intel] intel
ipython 6.1.0 py36_intel_0 [intel] intel
ipython_genutils 0.2.0 py36_intel_0 [intel] intel
ipywidgets 6.0.0 py36_intel_0 [intel] intel
jinja2 2.9.6 py36_intel_0 [intel] intel
jpeg 9b intel_0 [intel] intel
jsonschema 2.6.0 py36_intel_0 [intel] intel
jupyter 1.0.0 py36_intel_5 [intel] intel
jupyter_client 5.1.0 py36_intel_0 [intel] intel
jupyter_console 5.1.0 py36_intel_0 [intel] intel
jupyter_core 4.3.0 py36_intel_1 [intel] intel
libffi 3.2.1 intel_4 [intel] intel
libgcc 5.2.0 0
libpng 1.6.30 intel_0 [intel] intel
libsodium 1.0.10 intel_6 [intel] intel
libtiff 4.0.8 intel_1 [intel] intel
markupsafe 0.23 py36_intel_6 [intel] intel
mistune 0.7.4 py36_intel_1 [intel] intel
mkl 2018.0.0 intel_4 intel
nbconvert 5.2.1 py36_intel_0 [intel] intel
nbformat 4.3.0 py36_intel_0 [intel] intel
notebook 5.0.0 py36_intel_0 [intel] intel
numpy 1.13.1 py36_intel_16 [intel] intel
olefile 0.44 py36_intel_0 [intel] intel
openmp 2018.0.0 intel_7 intel
openssl 1.0.2k intel_3 [intel] intel
pandocfilters 1.4.1 py36_intel_0 [intel] intel
path.py 10.3.1 py36_intel_0 [intel] intel
pexpect 4.2.1 py36_intel_1 [intel] intel
pickleshare 0.7.4 py36_intel_1 [intel] intel
pillow 4.2.1 py36_intel_0 [intel] intel
pip 9.0.1 py36_intel_0 [intel] intel
prompt_toolkit 1.0.14 py36_intel_0 [intel] intel
ptyprocess 0.5.1 py36_intel_5 [intel] intel
pycparser 2.17 py36_intel_0 [intel] intel
pygments 2.2.0 py36_intel_1 [intel] intel
python 3.6.2 intel_3 [intel] intel
python-dateutil 2.6.0 py36_intel_3 [intel] intel
pytorch 0.2.0 py36hf0d2509_4cu75 soumith
pyzmq 16.0.2 py36_intel_3 [intel] intel
setuptools 27.2.0 py36_intel_0 [intel] intel
simplegeneric 0.8.1 py36_intel_5 [intel] intel
six 1.10.0 py36_intel_8 [intel] intel
sqlite 3.13.0 intel_15 [intel] intel
tcl 8.6.4 intel_17 [intel] intel
terminado 0.6 py36_intel_6 [intel] intel
testpath 0.3.1 py36_intel_0 [intel] intel
tk 8.6.4 intel_26 [intel] intel
torchvision 0.1.9 py36h7584368_1 soumith
tornado 4.5.1 py36_intel_0 [intel] intel
traitlets 4.3.2 py36_intel_1 [intel] intel
wcwidth 0.1.7 py36_intel_5 [intel] intel
wheel 0.29.0 py36_intel_5 [intel] intel
widgetsnbextension 2.0.0 py36_2 intel
xz 5.2.2 intel_16 [intel] intel
zeromq 4.1.5 intel_0 [intel] intel
zlib 1.2.11 intel_3 [intel] intel
(pytorch-36) [u4443@c001]$
Create a Jupyter notebook config file:
(pytorch-36) [u4443@c001]$ jupyter notebook --generate-config
This will create a Jupyter config file ~/.jupyter/jupyter_notebook_config.py (all lines commented out by default).
Start a Python interactive console:
(pytorch-36) [u4443@c001]$ python
Create a password for your Jupyter Notebook (so only you can access it):
>>> from notebook.auth import passwd; passwd()
Enter password:
Verify password:
'sha1:fe4d...f34:30cb0e...ag36'
The last line is the hashed password. Copy it.
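For the curious, below is a rough sketch of how that password string is built, using only the standard library. The function names (make_notebook_password, verify_notebook_password) are my own for illustration; the real notebook.auth.passwd() implementation may differ between Jupyter Notebook versions.

```python
import hashlib
import random

def make_notebook_password(passphrase, algorithm="sha1", salt_len=12):
    """Build an 'algorithm:salt:hash' string, roughly like notebook.auth.passwd().

    Illustrative sketch only - the real implementation may differ across
    Jupyter Notebook versions.
    """
    # A random hex-encoded salt (12 hex digits here).
    salt = "{:012x}".format(random.getrandbits(4 * salt_len))
    h = hashlib.new(algorithm)
    h.update(passphrase.encode("utf-8") + salt.encode("ascii"))
    return "{}:{}:{}".format(algorithm, salt, h.hexdigest())

def verify_notebook_password(passphrase, hashed):
    """Check a plain-text passphrase against a stored hash string."""
    algorithm, salt, digest = hashed.split(":")
    h = hashlib.new(algorithm)
    h.update(passphrase.encode("utf-8") + salt.encode("ascii"))
    return h.hexdigest() == digest

hashed = make_notebook_password("s3cret")
print(hashed)  # e.g. sha1:1a2b3c...:f00dbead...
```

This also makes it clear why the notebook later asks for the plain-text password: the server stores only the salted hash and recomputes it at login time.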
Add the following block at the top of the newly generated Jupyter config file ~/.jupyter/jupyter_notebook_config.py, pasting in the hashed password. Also pick a port number of your liking (a higher, less common port reduces the chance of collisions):
# The string should be of the form type:salt:hashed-password.
c.NotebookApp.password = 'sha1:fe4d...f34:30cb0e...ag36'
# The port the notebook server will listen on.
c.NotebookApp.port = 9213
Create a shell script jup.sh:
#PBS -l nodes=1:knl:flat
source ~/.conda/envs/pytorch-36/bin/activate pytorch-36 && jupyter notebook --no-browser
What this shell script does:
- Line 1: request one Knights Landing (KNL) node to run our job on, with the (optional) flat memory mode.
- Line 2: activate the conda environment and start a Jupyter notebook on the requested cluster node.
- Line 3: make sure there is an empty line at the bottom of the file; apparently the script won't run otherwise.
Invoke the shell script on a cluster node:
(pytorch-36) [u4443@c001]$ qsub jup.sh
23548.c001
(We should get a job id back, something like 23548.c001.)
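If you want to reuse that job id later (for qstat or qdel) from a script, the numeric part can be split off easily. A tiny sketch, with a hypothetical helper name of my own:

```python
def job_number(qsub_output):
    """Extract the numeric job id from qsub output like '23548.c001'."""
    return qsub_output.strip().split(".")[0]

print(job_number("23548.c001\n"))  # 23548
```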
Check that we now have a job running on a cluster node:
(pytorch-36) [u4443@c001]$ qstat
We should see our job is running:
Job ID Name User Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
23548.c001 jup.sh u4443 00:00:12 R batch
Find which cluster node the job is running on:
$ qstat 23548 -f | grep exec_host
The command returns the cluster node:
exec_host = c001-n009/0
Make a note of the cluster node name. In this example, it is c001-n009.
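If you do this often, the node name can also be pulled out programmatically. The sketch below (with a hypothetical helper name, exec_host_node) parses the full qstat output rather than relying on grep:

```python
def exec_host_node(qstat_full_output):
    """Pull the node name (e.g. 'c001-n009') out of `qstat <id> -f` output."""
    for line in qstat_full_output.splitlines():
        line = line.strip()
        if line.startswith("exec_host"):
            # "exec_host = c001-n009/0" -> "c001-n009"
            return line.split("=", 1)[1].strip().split("/")[0]
    return None

# A trimmed-down sample of `qstat 23548 -f` output, for illustration.
sample = """Job Id: 23548.c001
    exec_host = c001-n009/0
"""
print(exec_host_node(sample))  # c001-n009
```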
Start up a new terminal (on your laptop, not on the remote cluster) and set up port forwarding like this:
ssh -L 9213:localhost:9213 colfax ssh -L 9213:localhost:9213 c001-n009
Notes:
- use the port number you specified in Step 3 (in this example: port 9213)
- use the cluster node name you identified in Step 4 (in this example: c001-n009)
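The command chains two SSH hops: laptop to the cluster login node, then login node to the compute node, forwarding the same port at each hop. A small sketch that assembles the command string (tunnel_command is a hypothetical helper; "colfax" is the SSH alias assumed earlier in the article):

```python
def tunnel_command(port, node, login_host="colfax"):
    """Build the two-hop SSH port-forwarding command.

    First hop: laptop -> cluster login node.
    Second hop: login node -> the compute node running the notebook.
    """
    fwd = "-L {p}:localhost:{p}".format(p=port)
    return "ssh {fwd} {login} ssh {fwd} {node}".format(
        fwd=fwd, login=login_host, node=node)

print(tunnel_command(9213, "c001-n009"))
# ssh -L 9213:localhost:9213 colfax ssh -L 9213:localhost:9213 c001-n009
```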
This is probably the same as the macOS solution above; I have yet to test it.
I have not tested this out personally, but I am copying and pasting (with tweaks) a solution from a blog post here for reference.
- Open PuTTY and load your cluster session, but do not open it yet. Go to Connections > SSH > Tunnels, set the source port to 9213 and the destination to localhost:9213, click Add, and leave the other settings unchanged.
- Start your session and activate your conda environment:
source activate pytorch-36
- All that remains is to start the notebook:
(pytorch-36)$ jupyter notebook --no-browser
What we have done so far:
- Step 1 to 4: setup Conda environment and started Jupyter notebook on a cluster node.
- Step 5: performed port forwarding.
We can now access the notebook via http://localhost:9213/ - change the port number as needed.
When asked for a password, provide the plain-text password you created in Step 3 (not the hashed version).
Snapshot 1 - Jupyter Home Page:
Snapshot 2 - Create a Jupyter Notebook (at any location you like), to prove that:
- Jupyter Notebook is indeed running on the remote cluster node (c001-n009 in this example)
- We can import PyTorch (import torch) and run some code.
Some sample code here for you to try out:
import socket
print("Running on Colfax Cluster Node: {}".format(socket.gethostname()))
# test import torch and run some codes
import torch
a = torch.randn(5, 7)
print(a)
print(a.size())
Good practice: when you are done, save your Jupyter Notebook and delete the job on the cluster node - this frees up resources and avoids "server hoarding". (Though this may require validation, I tend to do it anyway, just in case!)
qdel 23548
(Just replace the job id 23548 with your own job number.)
If you have walked through the steps above, you will see that the steps for setting up (say) TensorFlow are almost identical: simply tweak the conda setup step to install tensorflow instead of pytorch, then activate the TensorFlow conda environment and run from there. Try out some TensorFlow code and it should work. (That said, I might write a similar article for TensorFlow at some point, as it has a big user base and is very production-ready.)
In this article we have illustrated the steps to (1) create a conda environment with Intel Distribution for Python and Jupyter, plus the non-Intel-distribution PyTorch, (2) submit a shell script via qsub to a cluster node to start a Jupyter notebook, and (3) access the Jupyter Notebook via port forwarding.
- Connecting jupyter notebook on compute node: from a Colfax forum.
- How to start with python on Colfax Cluster: from a Kaggle competition forum that uses the Intel AI Cluster. Covers how to start and access jupyter notebook from Colfax cluster via SSH tunnelling.
- Notes on Starting with Deep Learning with Python on HPC Cluster: covers environment setup and jupyter notebook running on Colfax cluster.
Deep Learning on Intel Nervana AI Cluster (aka Colfax HPC Cluster)