Data analysis with Python + ROOT on HPC/Lustre

This is a collection of information on how to set up a working environment using Python (and ROOT) for data analysis on an HPC cluster.

Path conventions

Using the paths effectively will speed up your data analysis and development. As a first rule, please do not use or put anything in the main user home, e.g.:

/u/<USERNAME>

Anaconda or miniconda Python will be installed in the user home; you don't need to change anything there manually:

/u/<USERNAME>/anaconda3
or
/u/<USERNAME>/miniconda3

There are some libraries that are of interest to everyone and can be used globally in conjunction with the anaconda installation. Their git repositories go here:

/u/<USERNAME>/git

For your personal data of any kind you can use:

/u/<USERNAME>/personal_directories/<YOUR_NAME>

For your analysis code on HPC/Lustre you can also use a path on Lustre, something like the following. Here you can also put additional project-specific git repositories.

/lustre/astrum/<YOUR_NAME>

Analysis data can be found on:

/lustre/ap/<USERNAME>/...

Required packages

The steps in the following section need to be done only once, and they have probably been done already, so please check before repeating them.

Python

For the Python installation it is recommended to go straight for Anaconda or its smaller variant miniconda. Just grab the latest version and run the installer in your user's home directory. In our case:

/u/<USERNAME>/anaconda3
or
/u/<USERNAME>/miniconda3

After the installation finishes, you can proceed to install further libraries. Some of these can be installed directly via pip, others via conda. In principle it is better to stick with conda as long as possible and only later use pip to add missing packages; this ensures better compatibility with the conda ecosystem. So a good starting point on a freshly installed anaconda/miniconda is basically the following line, which will also simplify later library installations:

conda install -c conda-forge numpy scipy matplotlib uproot3 pyqt && pip install pytdms

iqtools and barion

For data analysis of Schottky detectors you might need iqtools, iqgui and probably also barion. The best way to install the first two is to follow the instructions provided here. IQGui does not need any special installation, but you can check the info here. If you already installed the main libraries as described in the previous step using conda, then you don't need to repeat the same steps with pip. Barion does not need an installation either; it just needs to be pulled from GitHub.

In general, such libraries of common interest are / can be installed in the user git directory:

/u/<USERNAME>/git

But additional stuff can go into your personal directory, e.g.

/lustre/astrum/<YOUR_NAME>

An example of how to use iqtools and ROOT can be found in this Jupyter notebook, and also on this page. For Barion you can find an example of its usage in this notebook.

ROOT via conda

The easiest way to get the CERN ROOT library is to install it via conda inside a so-called virtual environment, as described in this tutorial. It basically boils down to running this command only once:

conda create -n my_root_env root -c conda-forge

That is basically it. Now you can go in and out of the virtual environment with:

conda activate my_root_env

and

conda deactivate

Remember, whenever you are inside an environment you may need to install the libraries again in that specific environment for your project, since the idea of a virtual environment is to create isolated spaces. The concept of virtual environments is indeed quite cool and very effective when programming across many projects, each with different dependencies.

Getting things done

Lustre

The Lustre file system is the "hard disk" of the HPC cluster: a super nice place to store data, with fast read and write cycles etc. You can also use it for data analysis, so there is no need to copy data around; everything stays nicely in one place.

But Lustre has one single disadvantage: "directory listing" is super slow on Lustre. This means that commands like "ls", "tree" etc. will fail, and so will any code, including your own scripts or GUI programs, that tries to "open" a directory on Lustre, since that issues a directory listing command. This is a problem with data from experiments with a lot of single files, like the E143 experiment.

But there is a simple trick to circumvent this problem. You can get a directory listing into a file by using this command:

find "$PWD" -iname *.tiq -type f  > ~/listing.txt

Alternatively you can use echo and translate spaces to newlines:

echo * | tr ' ' '\n'

or any other variant such as xargs -n 1 etc., as can be found here.

By doing this you can control exactly which kind of files, like TIQ, TDMS etc., you take for analysis, basically by taking the individual lines of that listing file (in this case listing.txt) either for direct insertion into a GUI or by iterating over them in your script, as sketched below.
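
A minimal sketch in Python of the iterating variant might look like this (assuming the listing was written to ~/listing.txt as above; the .tiq filter is just an example):

from pathlib import Path

# read the pre-generated listing instead of listing the Lustre directory itself
listing_file = Path.home() / 'listing.txt'
with open(listing_file) as f:
    filenames = [line.strip() for line in f if line.strip().lower().endswith('.tiq')]

for filename in filenames:
    # hand each file over to your analysis routine here
    print(filename)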

Using the libraries on the HPC

So the data are on Lustre. If you are working on a local LXG computer, then from your LXG Linux machine make a single jump (when you are at GSI, using a GSI computer):

ssh -X <USERNAME>@lustre.hpc.gsi.de

Now you can activate the ROOT environment

conda activate my_root_env

Now you have access to ROOT and also to IQTools and IQGui.

You can use IQTools in your code:

from iqtools import *

also mixed with ROOT

from ROOT import TCanvas, TH2D, ...

or

from iqtools import *
import matplotlib.pyplot as plt
%matplotlib inline

from ROOT import TGraph, TFile, TCanvas, TH2F
%jsroot off
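
As a toy illustration of mixing numpy results with ROOT objects, a minimal sketch could look like the following (this uses random placeholder data rather than the iqtools API; the histogram name, binning and output file name are arbitrary choices):

import numpy as np
from ROOT import TFile, TH2D

# placeholder data standing in for e.g. a spectrogram computed from IQ samples
freq = np.random.normal(0.0, 1.0, 10000)    # hypothetical frequency values
time = np.random.uniform(0.0, 10.0, 10000)  # hypothetical time stamps

# book a 2D histogram and fill it from the numpy arrays
h2 = TH2D('h2', 'toy spectrogram;frequency;time', 100, -5.0, 5.0, 100, 0.0, 10.0)
for x, y in zip(freq, time):
    h2.Fill(x, y)

# write the histogram to a ROOT file, e.g. in your personal Lustre directory
f = TFile('toy_spectrogram.root', 'RECREATE')
h2.Write()
f.Close()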

If you need it, in this environment you can run the TBrowser directly from the command line:

root --web=off -e 'new TBrowser()'

Some more examples:

https://github.com/xaratustrah/iqtools/blob/main/doc/quick_introduction_iqtools.ipynb

You don't need to use iqtools together with ROOT, only if you like; in that case I suggest looking at these examples:

https://gist.github.com/xaratustrah/474404d56b7664ab6ad2f8130eb1331e
https://github.com/xaratustrah/iqtools/blob/main/doc/rootpy_iqtools_example.ipynb

So here you have all you need: the libraries, the data and powerful computers on the HPC. Due to some version conflicts, you might need to run iqgui in a new environment. Just create a new, clean Python 3.9 environment and start from there.

Jupyter-Notebook over SSH-hop

Under Linux

Using Jupyter-Notebook for testing and analysing the data is very convenient. It is mainly suitable for testing procedures, whereas long-term data analysis is probably better done outside of Jupyter-Notebook in dedicated scripts. Nevertheless, if you would like to use Jupyter-Notebook on the data stored on the HPC cluster, you will notice that the remote notebook server is not directly reachable from your local machine. That is where SSH hopping comes into play.

So after activating the ROOT+Conda environment, you can run it by:

jupyter notebook --no-browser --port=8889

Note that when you run this command, a long token string is printed on the screen, which you are going to need later as described below.

The classic way is to create tunnels:

You open a new terminal window on your local LXG machine:

ssh -N -f -L localhost:8888:localhost:8889 <USERNAME>@lustre.hpc.gsi.de

This creates a tunnel between the local computer and the HPC machine. Note that this tunnel stays open on your local machine until you explicitly kill it. You can see that it is running by doing:

ps ax | grep ssh

If you create several tunnels by mistake, you can kill their processes by entering the corresponding process ID, which is printed in the leftmost column:

kill <PID>

At this stage you know that Jupyter-Notebook is running on the HPC computer on port 8889, and you have created an SSH tunnel on your local LXG machine which connects port 8889 of the HPC machine to your local port 8888. Now if you open a browser on your local machine and type:

localhost:8888

you will see Jupyter-Notebook working. You just need to enter the token for authentication, which was printed on the screen before. This means that you now have access to all analysis files, scripts, ROOT and other libraries from your local machine's browser.

The alternative way is to use ProxyJump:

Single jump:

ssh -L 8888:localhost:8889 <USERNAME>@lx-pool.gsi.de jupyter notebook --no-browser --port=8889

Double jump:

ssh -L 8888:localhost:8889 -o ProxyJump=<USERNAME>@lx-pool.gsi.de <USERNAME>@lustre.hpc.gsi.de jupyter notebook --no-browser --port=8889

Then you paste the URL + token into your browser.

By the way, instead of doing the analysis inside the browser, I highly recommend using the free VSCodium text editor (a free fork of VSCode), which has a super nice integrated interface for all programming languages, LaTeX etc., but can also deal with such remote Jupyter servers as mentioned above, as well as with different Python environments at the same time.

If you need to activate an environment first, you may need to include it in the SSH command:

ssh -L 8888:localhost:8889 <USERNAME>@lx-pool.gsi.de "conda activate my_root_env; jupyter notebook --no-browser --port=8889"

or

ssh -L 8888:localhost:8889 -o ProxyJump=<USERNAME>@lx-pool.gsi.de <USERNAME>@lustre.hpc.gsi.de "conda activate my_root_env; jupyter notebook --no-browser --port=8889"

Using Windows / PuTTY

Thanks to the free / open-source program PuTTY you can repeat the steps above on a Windows machine, but this machine needs to have access to the same network as the HPC, i.e. it should be a GSI device.

Like above, after activating the ROOT+Conda environment on the HPC machine, you start the notebook server:

jupyter notebook --no-browser --port=8889

and note the long token string printed on the screen, which you are going to need later.

  • Open PuTTY and enter the server URL or IP address as the hostname
  • Go to SSH at the bottom of the left pane to expand the menu, then click on Tunnels
  • Enter the port number which you want to use to access Jupyter on your local machine, in this case 8888, and set the destination as localhost:8889, where :8889 is the port that Jupyter Notebook is running on
  • Now click the Add button, and the ports should appear in the Forwarded ports list
  • Click the Open button to open the terminal
  • Open a browser, enter localhost:8888 to see Jupyter, and then enter the token

A screenshot of the settings can be found here.

Home office

It is possible to do the analysis from Windows or Mac computers, inside or outside of the institute.

Mac inside institute:

  • you can use ssh -X

Mac or Windows outside of the institute, or Windows inside of the institute:

  • You can use CITRIX

For that you have to apply to the IT department for the activation of your CITRIX account and follow the instructions on the IT page for installing the CITRIX receiver.

After that, on the CITRIX host, you can connect to the HPC cluster using either the X2GO or XWin-32 clients. There is no need to use a remote desktop connection to another local Windows machine, but you do need to hop once over a Linux machine as described in the section above, since the HPC machines do not seem to be reachable from the receiver machines.

So the same applies here: inside the CITRIX receiver you start either the X2GO or XWin-32 client and connect to lx-pool or your own local LXG machine (if you have the permission to). From there you do the rest as above. This also applies to SSH hopping and Jupyter-Notebook.

AI and deep learning

TBD.
