@LovelyBuggies
Last active September 25, 2020 01:21
Gist for Google Summer of Code 2020.

Proposal for GSoC 2020

Hist: histogramming for analysis powered by boost-histogram

The Scikit-HEP project is a collection of several dozen packages intended to facilitate the use of Python in High Energy Physics. One of the major fronts of development is histogramming, since most HEP analyses rely heavily on histograms. To this end, a new Python package, boost-histogram, was introduced for histogramming in Scikit-HEP. It is intended to be a core histogramming package with no dependencies. This summer, we plan to implement an analysis-friendly package called “hist”, built on top of boost-histogram and providing commonly expected histogramming features. This package will allow a wider range of dependencies and features not admissible in the core package.

Proposal: https://docs.google.com/document/d/1QfwO6PXVz7bQIw5U4qA5DjkQOPWn-6r4vkZIZskAuvE/edit?usp=sharing

Summary for GSoC 2020

Hist: histogramming for analysis powered by boost-histogram

About Hist

Code | Docs

Brief Description

The Scikit-HEP project is a collection of several dozen packages intended to facilitate the use of Python in High Energy Physics. One of the major fronts of development is histogramming, since most HEP analyses rely heavily on histograms. To this end, a new Python package, boost-histogram (BH), was introduced for histogramming in Scikit-HEP. It is intended to be a core histogramming package with no dependencies. This summer, we started implementing an analysis-friendly package, hist, built on top of BH and providing commonly expected histogramming features. This package allows a wider range of dependencies and features not admissible in the core package.

New Features

Because BH is a core package with no dependencies, hist extends it for more convenient usage. Specifically, across the lifecycle of a histogram:

  1. Hist allows users to serialize and deserialize BH histograms to and from the .root and .pkl formats, and makes sure Scikit-HEP tools communicate well with each other.
  2. Hist provides more convenient ways to initialize histograms: you can use Axis, Storage, and Transform proxies for initialization. Moreover, two types of histogram are provided for different uses: Hist is the general one and NamedHist is name-oriented.
  3. Hist gives users more ways to manipulate histograms. Users can fill, project, and access by axis name. In addition, complex numbers can be used in hist to access bins.
  4. Some hist functions work with statistics tools from SciPy and iminuit to analyze data.
  5. Hist offers visualization tools for histograms of one or two dimensions. Furthermore, hist supports an SVG repr.

All of these features were carefully discussed and thoroughly tested, and we are confident that users will benefit from them.

Basic Usage

This simple demo briefly shows how to use hist.

import hist

# You can create a histogram like this.
h = (
  hist.Hist()
  .Reg(10, 0, 1, name="x", label="x-axis")
  .Variable(range(10), name="y", label="y-axis")
  .Int64()
)

# Filling by names is allowed in hist.
h.fill(y=[1, 4, 6], x=[3, 5, 2])

# New ways to manipulate the histogram.
h.project("x")
h[{"y": 1j + 3, "x": 5j}]
...

# Elegant plotting functions.
h.plot()
h.plot2d_full()
h.plot_pull(Callable)
...
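
The demo passes a callable to plot_pull without saying what a pull is. As a minimal sketch, assuming the conventional definition (observed count minus model prediction, divided by the uncertainty) and a Poisson uncertainty of sqrt(n), the quantity could be computed as below. The function name pulls and the fallback uncertainty of 1.0 for empty bins are illustrative assumptions, not hist's implementation.

```python
import math


def pulls(counts, centers, model):
    """Return (observed - expected) / sigma for each bin.

    counts  -- observed bin contents
    centers -- bin centers where the model is evaluated
    model   -- callable mapping a bin center to an expected count
    """
    result = []
    for n, x in zip(counts, centers):
        # Assumption: Poisson uncertainty sqrt(n); use 1.0 for empty bins
        # to avoid dividing by zero.
        sigma = math.sqrt(n) if n > 0 else 1.0
        result.append((n - model(x)) / sigma)
    return result


# Two bins compared with a flat model that predicts 2 entries per bin.
example = pulls([4, 9], [0.5, 1.5], lambda x: 2.0)
```

A pull close to zero means the bin agrees with the model; larger absolute values flag bins where the two differ by several standard deviations.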

Future Works

Despite the hard work so far, there is still room for improvement:

  • We will include more statistics tools from the Scikit-HEP ecosystem, e.g., interpolation.
  • We will add hist to the Scikit-HEP wheel when it is more mature, so that users can write from skhep.hist import ....
  • We will parametrize the tests to make them more concise and comprehensive.
  • We will improve the existing code and keep looking for bugs.
  • We will grow the hist community by creating a Stack Overflow tag, a Gitter channel, etc.
  • We will use Qt to build a front end for hist.

GSoC Milestones

Note that milestones may become outdated over time (and could be removed in the future), whereas releases are relatively stable.

Project Deliverables

Version 2.0.0 (UPCOMING)

python -m pip install hist

Big Events

Others

Acknowledgements

First, I would like to thank Google Summer of Code for providing me with such an amazing opportunity to work with professional and responsible mentors. Second, I would like to express particular thanks to my mentors, Henry and Jim. We discussed hist's development almost every day to make it a better project. Finally, I would like to thank the National Science Foundation for supporting this work.

Hist: Histogramming for Analysis powered by Boost-histogram

The IRIS-HEP Talk - Memo

Hello everyone, this is Nino. This summer, I have been working on a histogramming tool for High Energy Physics, which is part of the Scikit-HEP project in IRIS-HEP. Let me give a brief introduction to my work.


Scikit-HEP is a community-driven project that aims to provide Particle Physics at large with an ecosystem for data analysis in Python. One of the most amazing tools in Scikit-HEP is boost-histogram, the Python binding for the C++14 Boost::Histogram library. It is one of the fastest libraries for histogramming, while still providing the power of a full histogram object and supporting various kinds of histogram manipulation. You can see the new features of boost-histogram on its website.


Based on boost-histogram, we implemented hist, an analysis-friendly package that provides commonly expected histogramming features. This package allows a wider range of dependencies and features that are not admissible in the core package, and it can be installed from pip or conda.

For example, you can use our package as in the snippet on the left and visualize the results as shown on the right, according to the properties of the histograms. This is a 2D histogram, this is a 1D histogram, and these are the SVG reprs.


I am going to show the new features of hist in more detail by showing you a notebook.

Because it takes some time to build the demo notebook on Binder, I will just run it locally here. You can run the demo notebook on Binder yourself by clicking this button.

I need to restart the kernel before I start. Okay, here we go!

Before showing how the new features work, I need to clarify the relationship and differences between boost-histogram and hist.

The BH library should be viewed like NumPy; a powerful, fundamental library for supporting histograms. However, it has no dependencies and is designed to be exact. It is meant to be usable as a backend for other histogramming libraries.

Just as Pandas provides a nice interface to NumPy that adds columns, plotting adapters, and more, Hist is an analyst-friendly frontend to boost-histogram.

The differences between them are shown below. Generally, if a feature doesn't add dependencies and is useful or popular, it may be upstreamed to boost-histogram.

Hist currently provides 4 things that boost-histogram doesn't have.

Names, UHI+, Plotting, Quick Constructors

Let's take a closer look at them through some code snippets.

1

Hist assigns meanings to the metadata via names and labels.

An axis in Hist can have a name and a label. The name is special: it must be unique within a histogram, and it is completely optional, so you can even mix named and unnamed axes.

A name can be used to identify an axis anywhere an index can be used! For example, here we construct a Hist instance with axes named x and y, fill it via those names, and then use matplotlib-hep to plot it. Furthermore, we can use names to access the contents, and we can modify the labels at any time. Note that because the name is the only identifier of an axis, names must be unique and cannot be changed after creation. If you love names, there's even an experimental NamedHist that enforces named access only: you can't use normal positional access on a NamedHist.
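
The name rules above (optional, unique, usable wherever a position is) can be sketched in plain Python. This is an illustration of the idea only, not hist's code; resolve_axis, check_unique, and order_fill_args are hypothetical helper names, and an unnamed axis is represented by None.

```python
def check_unique(names):
    """Reject duplicate axis names; unnamed axes (None) are ignored."""
    named = [n for n in names if n is not None]
    if len(named) != len(set(named)):
        raise ValueError("axis names must be unique within a histogram")


def resolve_axis(names, key):
    """Turn an axis position or an axis name into an axis position."""
    check_unique(names)
    if isinstance(key, int):
        if not 0 <= key < len(names):
            raise IndexError(f"axis position {key} is out of range")
        return key
    if key in names:
        return list(names).index(key)
    raise KeyError(f"no axis is named {key!r}")


def order_fill_args(names, **data):
    """Reorder fill-by-name keyword arguments into axis order."""
    check_unique(names)
    if set(data) != set(names):
        raise TypeError("fill by name needs exactly one argument per named axis")
    return [data[name] for name in names]


axis_names = ["x", "y"]
# Keyword order does not matter; data comes back in axis order (x, then y).
ordered = order_fill_args(axis_names, y=[1, 4, 6], x=[3, 5, 2])
```

Because names resolve to positions up front, everything after the lookup can keep working with plain positional axes.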

2

The second important feature I want to show is UHI, which stands for … Hist supports an experimental UHI addition that allows for ultra-terse UHI. You can perform all manipulations directly inline without extra imports.

Well, the rules for single bins or slice endpoints are as follows: you can use a non-negative integer such as 3 to access the contents at bin index 3, and you can use a complex number such as 1.5j to access the contents at data coordinate 1.5. Similarly, for a StrCategory axis, you can use strings to access bins directly.

There are also some action slots. For example, you can use sum, just as you would use Python's built-in function. You can also use a complex number to give the factor by which to rebin the axis.
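
To make the two indexing rules concrete, here is a minimal pure-Python sketch for a regular axis. The helper name locate is hypothetical. Treating the real part of a complex index as an extra bin offset is my reading of expressions such as 1j + 3 in the demo snippet and should be taken as an assumption, not as documented behaviour.

```python
import math


def locate(index, bins, start, stop):
    """Map an index to a bin number on a regular axis [start, stop).

    int     -> used directly as the bin number
    complex -> imaginary part is a data coordinate; the real part
               (assumption) is an integer bin offset added afterwards
    """
    if isinstance(index, complex):
        value = index.imag
        if not start <= value < stop:
            raise IndexError(f"data coordinate {value} is outside the axis")
        if index.real % 1 != 0:
            raise ValueError("the real part must be a whole number of bins")
        width = (stop - start) / bins
        return math.floor((value - start) / width) + int(index.real)
    if isinstance(index, int):
        if not 0 <= index < bins:
            raise IndexError(f"bin {index} is out of range")
        return index
    raise TypeError("index must be an int or a complex number")


# Axis with 10 bins covering [0, 10): every bin is one unit wide.
by_bin = locate(3, 10, 0.0, 10.0)        # bin number 3
by_value = locate(1.5j, 10, 0.0, 10.0)   # the bin that contains x = 1.5
```

The appeal of the complex-number form is that it needs no extra import: the j suffix alone marks a value as a data coordinate.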
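
The two actions can be illustrated on a plain list of bin counts. This is a sketch of the arithmetic only; rebin and apply_action are hypothetical names, and reading the imaginary part of the complex number as the rebin factor is an assumption based on the description above.

```python
def rebin(counts, factor):
    """Merge every `factor` adjacent bins by adding their contents."""
    if factor <= 0 or len(counts) % factor != 0:
        raise ValueError("factor must be positive and divide the bin count")
    return [sum(counts[i:i + factor]) for i in range(0, len(counts), factor)]


def apply_action(counts, action):
    """Apply an action: the built-in sum, or a complex rebin factor."""
    if action is sum:
        return sum(counts)  # collapse the axis to a single total
    if isinstance(action, complex):
        return rebin(counts, int(action.imag))  # e.g. 2j merges bins in pairs
    raise TypeError("unsupported action")


bin_counts = [1, 2, 3, 4]
total = apply_action(bin_counts, sum)  # 10
merged = apply_action(bin_counts, 2j)  # [3, 7]
```

Rebinning keeps the total number of entries but puts more of them into each bin, so a plot of the rebinned histogram shows larger values per bin.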

Let's see an interesting example here.

First, we use a Python icon to create the hit probability data as X and Y and fill them into a histogram named py. Then we can plot it

and zoom in.

We can also rebin it and, as you can see, it becomes brighter because more hits fall into each bin.

The same principle applies to the StrCategory.

3

I assume you are interested in the plot above. Yes, we use our own plotting methods, and they are another amazing new feature of hist.

The .plot method automatically gives you a plot according to the dimension of the histogram. In this case, we have a 2D histogram, so we get a color mesh. If the histogram is 1D, you get a line.

If you want to see the projections of a 2D histogram, you can use .plot2d_full, and you can also pass your own customized keywords.

4

Finally, I would like to introduce hist's quick construction system. You don't need to use the hist namespace everywhere like you do in BH. Hist supports a very experimental quick construction system. You can create and transform a histogram like this, and even specify its storage type: a Hist with a Reg axis of 10 bins from -3 to 3, in Int64 storage, filled with numpy.random.

Note that you cannot add a storage or an axis to an existing histogram in this way.
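
One way to picture this behaviour is a small builder pattern: axis calls return a proxy that keeps collecting axes, and the storage call ends the chain by returning a finished histogram that no longer has those methods. The sketch below is an assumption about the design, not hist's code. QuickConstruct and SketchHist are hypothetical names; only Reg and Int64 are taken from the demo.

```python
class SketchHist:
    """A finished histogram: axes and storage are fixed at construction."""

    def __init__(self, axes, storage):
        self.axes = tuple(axes)
        self.storage = storage


class QuickConstruct:
    """Proxy that collects axis specifications until a storage is chosen."""

    def __init__(self, axes=()):
        self._axes = tuple(axes)

    def Reg(self, bins, start, stop, name=None, label=None):
        # Return a new proxy so that chains never share state.
        spec = {"kind": "regular", "bins": bins, "start": start,
                "stop": stop, "name": name, "label": label}
        return QuickConstruct(self._axes + (spec,))

    def Int64(self):
        # Choosing the storage ends the chain and builds the histogram.
        return SketchHist(self._axes, storage="int64")


h = QuickConstruct().Reg(10, -3, 3, name="x").Int64()
```

Since SketchHist defines neither Reg nor Int64, calling them on h raises AttributeError, which mirrors the rule that axes and storage cannot be added once the histogram exists.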

Okay, I think we can get back to the PDF for now.


To sum up, Hist enhances the boost-histogram in all parts of the lifecycle of a histogram.

  1. Hist allows users to serialize and deserialize BH histograms to and from the ROOT and Pickle formats, and makes sure Scikit-HEP tools communicate well with each other.
  2. Hist provides more convenient ways to initialize histograms: you can use Axis, Storage, and Transform proxies for quick initialization. Moreover, two types of histogram are provided for different uses: Hist is the general one and NamedHist is name-oriented.
  3. Hist gives users more ways to manipulate histograms. Users can fill, project, and access by axis name. In addition, complex numbers can be used in hist to access bins.
  4. Some hist functions work with statistics tools from SciPy and iminuit to analyze data.
  5. Hist offers visualization tools for histograms of one or two dimensions. Furthermore, hist supports an SVG repr.

If you want to know more about them, you can refer to our documentation.


We provide some resources on this page to help you better understand this tool. Most of them are useful Scikit-HEP tools.


Finally, I want to say that we currently have very few users, but some people have already helped us find bugs and given us useful suggestions. We look forward to your contributions and will cherish your efforts. I hope you can join us. Although GSoC has ended, I will keep following this project and contributing to it. A new version of hist is around the corner and will hopefully be published this week. Thanks for listening!

