The IRIS-HEP Talk - Memo
Hello everyone, this is Nino. This summer I have been working on a histogramming tool for High Energy Physics, which is part of the Scikit-HEP project in IRIS-HEP. Let me give a brief introduction to my work.
Scikit-HEP is a community-driven project that aims to provide Particle Physics at large with an ecosystem for data analysis in Python. One of the most amazing tools in Scikit-HEP is boost-histogram, the Python binding for the C++14 Boost::Histogram library. It is one of the fastest libraries for histogramming, while still providing the power of a full histogram object and supporting various kinds of histogram manipulation. You can see the new features of boost-histogram on its website.
On top of boost-histogram we implemented an analysis-friendly package, hist, which provides commonly expected histogramming features. This package allows a wider range of dependencies and features that are not admissible in the core package, and it can be installed from pip or conda.
For example, you can use our package like the snippet on the left and visualize the results like the panels on the right, depending on the properties of the histogram. This one is a 2D histogram, this one is a 1D histogram, and these are their SVG reprs.
I am going to show the new features of hist in more detail by showing you a notebook.
Building the demo notebook on Binder takes some time, so I will just run it locally here. You can run the demo notebook on Binder yourself by clicking this button.
I need to restart the kernel before I start. Okay, here we go!
Before showing how the new features work, I need to clarify the relationship and differences between boost-histogram and hist.
The BH library should be viewed like NumPy; a powerful, fundamental library for supporting histograms. However, it has no dependencies and is designed to be exact. It is meant to be usable as a backend for other histogramming libraries.
Just as Pandas provides a nice interface to NumPy that adds columns, plotting adapters, and more, Hist is an analyst-friendly frontend to boost-histogram.
The differences between them are shown below. Generally, if a feature doesn't add dependencies and is useful or popular, it may be upstreamed to boost-histogram.
Hist currently provides 4 things that boost-histogram doesn't have.
Names, UHI+, Plotting, Quick Constructors
Let's take a closer look at them through some code snippets.
Hist assigns meanings to the metadata via names and labels.
An axis in Hist can have a name and a label. The name is special: it needs to be unique within a histogram, and it is completely optional, so you can even mix named and unnamed axes. A name can be used to identify an axis anywhere an index can be used! For example, here we construct a Hist instance with axes named x and y, fill it via those names, and then use matplotlib-hep to plot it. Furthermore, we can use names to access the contents, and we can modify the labels at any time. Note that because the name is the only identifier of an axis, names must be unique and cannot be changed after the axis is created. If you love names, there's even an experimental NamedHist that enforces named access only - you can't use normal positional access on a NamedHist.
The second important feature I want to show is UHI, which stands for … Hist supports an experimental UHI addition that allows for ultra-terse UHI. You can perform all manipulations directly inline without extra imports.
Well, the rules for a single bin or a slice endpoint are: you can use a non-negative integer such as 3 to access the contents at bin index 3, and you can use a complex number such as 1.5j to access the contents at data coordinate 1.5. Similarly, for a StrCategory axis, you can use strings to access bins directly.
There are also some action slots. For example, you can use sum, the Python built-in, to sum over an axis. And you can use a complex number as the step to give the factor by which the axis is rebinned.
Let's see an interesting example here.
First we use a Python icon to create the hit-probability data as X and Y and fill them into a histogram named py. Then we can plot it
and zoom in.
We can also rebin and, as you can see, the plot becomes brighter because more hits fall into each bin.
The same principle applies to the StrCategory.
I assume you are interested in the plots above. Yep, we use our own plotting methods, and those are another amazing new feature of hist.
The .plot method will automatically give you a plot according to the dimension of the histogram. In this case we have a 2D histogram, so we get a color mesh. If the histogram is 1D, you get a line.
If you want to see the projections of a 2D histogram as well, you can use .plot2d_full, and you can also pass your own customized keywords.
Finally, I would like to introduce hist's quick construction system. You don't need to use the hist namespace everywhere like you do in BH. Hist supports a very experimental quick construction system: you can create and transform a histogram in one chain, and even specify its storage type. Like this, umm: Hist, a Reg axis with 10 bins from -3 to 3, in the Int64 storage, filled with numpy.random.
Note that you cannot add a storage or an axis to an existing histogram like this. See?
Okay, I think we can get back to the PDF for now.
To sum up, Hist enhances boost-histogram in all parts of the lifecycle of a histogram.
- Hist allows users to serialize and deserialize BH histograms from ROOT and Pickle formats, and makes sure Scikit-HEP tools communicate well.
- Hist provides more convenient ways to initialize histograms: you can use the Axis, Storage, and Transform proxies for quick initialization. Moreover, two types of histogram are provided for different uses: Hist is the general one and NamedHist is name-oriented.
- Hist gives users more choices for manipulation. Users can use axis names to fill, project, and access. Plus, complex numbers can be used in hist to access bins.
- Some of the functions of hist make it easier to use statistics tools from SciPy and iminuit to analyze data.
- Hist offers amazing visualization tools to show histograms of one or two dimensions. Furthermore, hist supports an SVG repr.
If you want to know more about them, you can refer to our documentation.
We provide some resources on this page to help you better understand this tool, most of which are useful Scikit-HEP tools.
Finally, I want to say that we currently have very few users, but some people have helped us find bugs and given us useful suggestions. We look forward to your contributions and cherish your efforts. I hope you can join us. Though GSoC has ended, I will keep following this project and contributing to it. A new version of hist is around the corner and will hopefully be published this week. Thanks for listening!