@harshal0815
Last active October 23, 2022 06:48
GSoC_22

Overview

The Toolkit for Multivariate Analysis (TMVA) is a multi-purpose machine learning toolkit integrated into the ROOT scientific software framework. It comes with an automatically generated Python interface that closely follows the C++ interface. The goal of this project is to enhance that interface to make it more “pythonic”, i.e. easier to use. The project aims to simplify complex workflows and greatly reduce the amount of code that users have to write. It includes pythonizations for the TMVA GUI and histogram functions, as well as converters in PyROOT between RTensor objects and NumPy arrays.

To make it easier to use ROOT from Python, or to use a more pythonic syntax, PyROOT provides many pythonizations for ROOT classes. A pythonization is a piece of code that injects some new behavior in a ROOT class, e.g. to add new methods, to make the class iterable from Python, or override arithmetic operators. Pythonizations can be implemented in Python or C++ (via the Python/C API). Automatic binding generation mostly gets the job done, but unless a C++ library was designed with expressiveness and interactivity in mind, using it will feel stilted. Thus, it is beneficial to implement pythonizations. Some of these are already provided by default, e.g. for STL containers.
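
To make the idea of "injecting new behavior" concrete, here is a minimal sketch in plain Python. It does not use ROOT: `FakeVector` is a hypothetical stand-in for an automatically bound C++ container that only exposes `size()` and `at()`, and `pythonize_container` is an illustrative name, not a PyROOT function.

```python
class FakeVector:
    """Hypothetical stand-in for an automatically bound C++ container."""

    def __init__(self, values):
        self._values = list(values)

    def size(self):
        return len(self._values)

    def at(self, index):
        return self._values[index]


def pythonize_container(cls):
    """Inject len() and iteration support into a class exposing size()/at()."""

    def __len__(self):
        return self.size()

    def __iter__(self):
        for i in range(self.size()):
            yield self.at(i)

    cls.__len__ = __len__
    cls.__iter__ = __iter__
    return cls


# Apply the pythonization to the existing class; no subclassing is needed.
pythonize_container(FakeVector)

vec = FakeVector([1.5, 2.5, 3.0])
count = len(vec)   # works because __len__ was injected
total = sum(vec)   # works because __iter__ was injected
```

The class itself is left untouched in its definition; the Python protocols are attached afterwards, which is the general pattern a pythonization follows.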

Since bound C++ entities are fully functional Python ones, pythonization can be done explicitly in an end-user-facing Python module. However, that would prevent the lazy installation of pythonizations, so a callback mechanism is provided instead. PyROOT bindings have the following characteristics:

  • Automatic and dynamic, with no static wrapper generation
  • Dynamic Python proxies for C++ entities
  • Lazy class/variable lookup
  • Access to all ROOT C++ functionality from Python

Most external machine learning libraries accept (or expect) a collection of NumPy arrays as the input dataset, for either training or testing. Data stored in ROOT files (e.g. as a TTree) can be exported seamlessly to NumPy arrays through RDataFrame.

Project Goals

The objectives of this project are:

  • To understand more about ROOT and TMVA, the various classes, functions, objects, and commands that are used for machine learning.
  • To pythonize the major TMVA commands and methods, making the Python interface more “pythonic”, i.e. easier to use. The new Python functions will reduce the amount of code that users have to write and greatly simplify complex workflows that previously required strong expertise in C++ ROOT.
  • To develop a pythonization of the TMVA method configuration, using similar code already developed for the RooFit pythonization project.
  • To improve the interface of the TMVA workflow for both training and inference, with the possibility to pass Python collections such as NumPy arrays directly.
  • To add functions that create histogram objects (TH1, TH2, TH3) and TGraph objects from NumPy arrays, and to add converters in PyROOT between RTensor objects and NumPy arrays.
  • To integrate the new developments in the variable plotter inside TMVA and provide a Python interface to the TMVA GUI.
  • To create tests and tutorial examples, including updating or translating the existing tutorials according to the pythonizations developed during this project.

Sample code illustrating TMVA pythonization:

from ._utils import _kwargs_to_tmva_cmdargs, cpp_signature


class TMVAClass(object):
    @cpp_signature("")
    def __init__(self, *args, **kwargs):
        # Redefinition of the `TMVAClass` constructor to accept keyword arguments.
        args, kwargs = _kwargs_to_tmva_cmdargs(*args, **kwargs)
        self._init(*args, **kwargs)

    def TMVAMethod(self, *args, **kwargs):
        # TMVA::ClassMethod() is pythonized with the keyword CmdArg pythonization.
        args, kwargs = _kwargs_to_tmva_cmdargs(*args, **kwargs)
        return self._TMVAMethod(*args, **kwargs)
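
The body of `_kwargs_to_tmva_cmdargs` is not shown in the sample. TMVA configures methods through colon-separated option strings such as `!V:NTrees=100`, so a conversion from keyword arguments might look roughly like the sketch below. The function name `kwargs_to_option_string` and its exact rules are assumptions made for illustration, not the project's actual helper.

```python
def kwargs_to_option_string(**kwargs):
    """Turn keyword arguments into a TMVA-style option string (illustrative only).

    Booleans become flags ("V" or "!V"); everything else becomes "key=value".
    """
    parts = []
    for key, value in kwargs.items():
        if isinstance(value, bool):
            parts.append(key if value else "!" + key)
        else:
            parts.append("%s=%s" % (key, value))
    return ":".join(parts)


options = kwargs_to_option_string(V=False, NTrees=100, BoostType="Grad")
# options == "!V:NTrees=100:BoostType=Grad"
```

With a helper of this kind, the wrapped constructor or method can accept readable keyword arguments and still hand the underlying C++ call the option string it expects.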

Outcomes

  • Developed pythonizations of the TMVA method configuration for the following methods and constructors:
      - TMVA::Factory constructor
      - TMVA::Factory::BookMethod
      - TMVA::DataLoader::PrepareTrainingAndTestTree
      - TMVA::CrossValidation constructor
      - TMVA::Envelope::BookMethod
  • Added functions to create histogram objects (TH1, TH2, TH3) from NumPy arrays and to retrieve NumPy arrays from histograms, returning the bin contents and, optionally, the per-bin sum of squared weights:
      - FromNumpy()
      - GetAsNumpy()
      - GetErrors()
      - GetBinEdges()
  • Added the ability to create objects of the TGraph classes from NumPy arrays directly through the constructor:
      - TGraph
      - TGraphErrors
      - TGraphAsymErrors
      - TGraph2D
      - TGraph2DErrors
      - TGraph2DAsymErrors
  • Added functions to retrieve the content or errors as NumPy arrays:
      - GetX(), GetY(), Get()
      - GetErrorX(), GetErrorY(), GetErrors()
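
As a reminder of what "bin content" and "sum of squared weights" mean for the histogram functions listed above, here is a small plain-Python sketch. It calls neither ROOT nor NumPy, and `fill_histogram` is a hypothetical helper written only for this illustration. Bins are treated as half-open intervals `[low, up)`, and the per-bin error is taken as the square root of the sum of squared weights.

```python
import bisect
import math


def fill_histogram(values, weights, edges):
    """Accumulate weighted entries into bins defined by sorted `edges`."""
    n_bins = len(edges) - 1
    content = [0.0] * n_bins   # sum of weights per bin
    sumw2 = [0.0] * n_bins     # sum of squared weights per bin
    for x, w in zip(values, weights):
        if x < edges[0] or x >= edges[-1]:
            continue  # ignore underflow and overflow in this sketch
        i = bisect.bisect_right(edges, x) - 1
        content[i] += w
        sumw2[i] += w * w
    errors = [math.sqrt(s) for s in sumw2]
    return content, sumw2, errors


content, sumw2, errors = fill_histogram(
    values=[0.5, 1.5, 1.7, 2.2],
    weights=[1.0, 2.0, 2.0, 3.0],
    edges=[0.0, 1.0, 2.0, 3.0],
)
# content == [1.0, 4.0, 3.0]
# sumw2   == [1.0, 8.0, 9.0]
```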

Pull Requests
