@NickRoz1
Last active August 31, 2020 10:43
Final Blog

Nick Rozinsky

The code can be found here (new binning types) and here (YODA index file functionality). The documentation is accessible via this link.

Abstract

The goal of this project was to create a flexible binning implementation to be used as a backend for usage-specific binning types (histograms, etc.) in the YODA data analysis package. The previous implementation lacked flexible, centralized computing machinery behind the user-facing types, which led to code duplication.

The new implementation clearly separates the concerns of the binning process, providing a flexible backend API that significantly reduces the amount of code (and, consequently, work) needed to build a user-facing binning type on top of it (a histogram, for example).

The new binning machinery is covered by tests and documented.

Main points

The legacy implementation had custom machinery for every user-facing type and was not flexible. Significant code duplication complicated maintenance and code modernization.

The new implementation was developed to satisfy the following requirements:

  • Binning should handle any number of axes, i.e. multidimensional spaces.
  • Binning should be able to work over heterogeneous axes, i.e. labeled and continuous axes in one histogram.
  • Binning should work with any desired contained type, i.e. provide customizable behavior on fills.
  • Binning should provide a clear backend API to build user-facing types on top of it.

The implementation can be separated into three levels.

The base level of any binning is the axes. An axis contains a set of edges, which are used to determine the bin index where a fill landed. There are two types of axes: Discrete Axis, or DAxis, and Continuous Axis, or CAxis. They differ by the underlying indexing machinery. A discrete axis requires every fill (coordinate) to exactly match one of the edges in the DAxis' edge set, otherwise an exception is thrown. A continuous axis covers the intervals between the edges, e.g. if we have a Continuous Axis initialized with the edge set {0.0, 1.0, 3.0} and query the bin index for coordinate 1.4, the result will be 2. This is because this edge set forms 4 bins: the first (index 0) is the underflow bin, from -inf to 0.0; then come [0.0, 1.0) and [1.0, 3.0); the last is the overflow bin, from 3.0 to +inf.

Together multiple axes form a space which is managed by Binning.

The Binning type coordinates access to the underlying axes, providing a clear API for querying bin indices and for performing operations on this space (slicing it, merging it). Binning provides methods to calculate the non-overlapping global indices used to query the bins (the regions enclosed by, or in the discrete case corresponding to, the axes' edges) in the BinnedStorage. However, Binning does not control the contents of the statistical binning, e.g. the distributions, counters, or any other custom type which might be held in it.

BinnedStorage is the type responsible for access to the bins' contents. It contains an array (vector) of bin contents. To instantiate the BinnedStorage template, one must provide the BinContentT type, which is augmented with additional binning-specific data (max, min, etc.) and constructed in every BinnedStorage array slot. Later, when queried with fill coordinates, the Binning calculates a global index for this particular set of coordinates, and BinnedStorage indirectly (through a fill adapter, discussed below) passes the fill coordinates to the augmented BinContentT object.

On top of the BinnedStorage template one can build their own types with usage-specific functionality. For example, the histogram type was reimplemented on the base of BinnedStorage — code.

Features

Most of the flexibility of the new implementation comes from significant use of C++ templates.

Due to the statically typed nature of C++, it is not trivial to implement structures which contain or interact with different types.

This section highlights the main features of the new implementation. The details of the underlying machinery can be found in the documentation.

Axes

CAxis

There are three main ways to initialize a Continuous Axis:

  • Provide complete set of edges.
using EdgeT = double;

std::vector<EdgeT> v1 = { 3.45678, 7.123124, 512.2152515 };

CAxis<EdgeT> axis1(v1);
CAxis<EdgeT> axis2({ 3.45678, 7.123124, 512.2152515 });
  • Provide number of bins, lower and upper limits.
using EdgeT = double;

CAxis<EdgeT> axis1(10, 0.0, 10.0);
  • Provide complete set of edge pairs (bin edges).
using EdgeT = double;

std::vector<std::pair<EdgeT, EdgeT>> edgePairs = {{0.1, 5.0},
                                                  {6.0, 10.0},
                                                  {11.0, 15.0}};

CAxis<EdgeT> axis1(edgePairs);

Notice that in the last example the bin edges are not adjacent, i.e. there are gaps on the axis. These gaps (from 5.0 to 6.0 and from 10.0 to 11.0) are also treated as bins on the axis, but they are hidden. When iterating over the bins later, the hidden bins (bin slices) are omitted by default.

The Continuous Axis' bins can be merged. This is done to amplify the statistical power of the binning.

using EdgeT = double;

CAxis<EdgeT> axis1(10, 0.0, 10.0);

std::pair<size_t, size_t> mergeRange = {1, 7};
axis1.mergeBins(mergeRange);

// 10 - 6 (merged) + 2 (overflow)
assert(axis1.size() == 6);

DAxis

A Discrete Axis can only be initialized with concrete edge values:

using EdgeT1 = std::string;

std::vector<EdgeT1> edges = {"test1", "test2"};

DAxis<EdgeT1> axis1(edges);
DAxis<EdgeT1> axis2({"test1", "test2"});

using EdgeT2 = int;

DAxis<EdgeT2> axis3({1, 2, 3});

If the coordinate sent to get a bin index does not match any of the axis' edges, a std::out_of_range exception is thrown.

Binning

To initialize a Binning, one has to pass all the underlying axes:

std::vector<std::string> v1 = {"test1", "test2"};
std::vector<double> v2 = {1., 2., 3., 4.};
std::vector<int> v3 = {1, 2, 3, 4, 5};

DAxis<std::string> ax1(v1);
CAxis<double>      ax2(v2);
DAxis<int>         ax3(v3);

Binning<DAxis<std::string>, CAxis<double>, DAxis<int>> binning1(ax1, ax2, ax3);

The main function of Binning is globalBinIndex(); it converts fill coordinates into the global non-overlapping bin index used in BinnedStorage:

binning1.globalBinIndex({"test1", 1.5, 2});

It is possible to merge the bins of Binning's axes:

binning1.template mergeBins<1>({1, 3});

// Won't compile. Discrete axes are not mergeable.
// binning1.template mergeBins<0>({1,3});

// If there are multiple CAxis
// binning1.template mergeBins<1, 2>({1, 3}, {2, 3});

One can slice a binning, i.e. request the indices of bins located in a binning slice, using the sliceBinning function:

binning1.sliceBinning(0,0);

/// Convenience overload.
/// {Axis Number, {Bins Numbers...}}
binning1.sliceBinning({{0, {0, 1}}, {1, {2, 3}}});

BinnedStorage

Just like the axis types, BinnedStorage can be initialized in different ways:

using DStrAxis = DAxis<std::string>;
using DIntAxis = DAxis<int>;
using CDblAxis = CAxis<double>;

using BinningT = Binning<DStrAxis, CDblAxis, DIntAxis>;

auto binning = std::make_shared<BinningT>(
  BinningT({"test1", "test2"}, {1., 2., 3., 4.}, {1, 2, 3, 4, 5}));

DStrAxis ax1({"test1", "test2"});
CDblAxis ax2({1., 2., 3., 4.});
DIntAxis ax3({1, 2, 3, 4, 5});

auto binStorage1 = BinnedStorage<double, BinningT>(binning);
auto binStorage2 = BinnedStorage<double, BinningT>({"test1", "test2"},
                                                    {1., 2., 3., 4.},
                                                    {1, 2, 3, 4, 5});
auto binStorage3 = BinnedStorage<double, BinningT>(ax1, ax2, ax3);

The first template argument of BinnedStorage is the BinContentT type. The bins are filled with an augmented version of it.

Since the different user-defined types which might be stored in BinnedStorage and act as statistical counters have different APIs, an indirection is needed.

The fill adapter concept is introduced to implement this indirection. A fill adapter coordinates access to the BinContentT API.

To use a user-defined type with BinnedStorage, an adapter for this type should be provided:

struct TestClass{
  TestClass(size_t k = 0) : counter(k) {}
  void func() {
    counter++;
    return;
  }
  size_t counter = 0;
};
auto binning = std::make_shared<BinningT>(
  BinningT({"test1", "test2"}, {1., 2., 3., 4.}, {1, 2, 3, 4, 5}));


auto adapter2 = [](auto& customObject, auto&& coords, auto weight, auto fraction) {
  customObject.func();
};

auto binStorage1 = StorageT<double>(binning); /// @note default adapter is used

auto binStorage3 = StorageT<TestClass>(binning, adapter2);

binStorage3.fill({"test1", 1.5, 1}); /// 0 + 1*2 + 0*5*2

StorageT<TestClass>::BinType& bin = binStorage3.getBin(2);

assert(bin.counter == 1);

For the commonly used types, such as double or Dbn, the default fill adapters are implemented.

To add such a default adapter, one should add a template specialization of the defaultAdapter structure — example.

To augment stored types with binning-specific information, the Bin type is used. It publicly inherits from the stored type and adds additional information, such as the spatial characteristics of the bin.

auto binning = std::make_shared<BinningT>(
    BinningT({"test1", "test2"}, {1., 2., 3., 4.}, {1, 2, 3, 4, 5}));

auto binStorage1 = StorageT<double>(binning);

/// @note Querying bin corresponding to {"test2", {1., 2.}, 1}
StorageT<double>::BinType& bin = binStorage1.getBin(3);

// bin.width<2>() // Won't compile: not a continuous axis.

assert(bin.width<1>() == 1.  &&
    bin.min<1>()   == 1.  &&
    bin.max<1>()   == 2.  &&
    bin.mid<1>()   == 1.5);

To allow Bin to inherit from fundamental types, such as double, ArithmeticWrapper is implemented. Without such a wrapper, it is not possible to inherit from fundamental types. ArithmeticWrapper exposes an API for arithmetic operations, so the augmented type Bin<double, BinningT> acts like a double, i.e. it is possible to add one Bin to another, divide them, multiply them, etc.

BinnedStorage provides a getBins(bool includeOverflows, bool includeHiddenBins) function which returns a wrapped vector that can be used to iterate over the bins vector inside BinnedStorage. This wrapper automatically omits the overflow or hidden bin slices if required.

If a bin on one axis is hidden, or is an overflow bin, then all stored Bin objects located in its slice are also hidden. For example, on a 2D plane, if some region on the X axis is omitted, then all points above this region, regardless of their Y value, are also omitted.

using BinStorageT = BinnedStorage<double, Binning<DAxis<std::string>, CAxis<double>>>;
BinStorageT binnedStorage1({"test1", "test2"}, {1., 2., 3.});

std::vector<size_t> overflowIndices = {0, 1, 6, 7}; /// {0,1} + {0*2, 3*2}

auto binsVec1 = binnedStorage1.getBins();

for(auto& bin : binsVec1) {
  for(auto& i : overflowIndices) {
    assert(bin.getBinIndex() != i);
  }
}

/// @note Double curly braces construct temporary axes, because there is
/// no constructor defined to work with arguments of std::initializer_list<T>
/// and std::initializer_list<std::initializer_list<T>> type. An intermediate
/// axis is created.
BinStorageT binnedStorage2({{"test1", "test2"}}, {{{1., 2.}, {3., 4.}}});

std::vector<size_t> hiddenBinIndices = {4, 5}; /// {0,1} + 2*2

auto binsVec2 = binnedStorage2.getBins(true);

for(auto& bin : binsVec2) {
  for(auto& i : hiddenBinIndices) {
    assert(bin.getBinIndex() != i);
  }
}

BinnedStorage also provides merging functionality. It actually merges the stored bins, shrinking the storage vector. This is only available when the BinnedStorage template is instantiated with a BinContentT stored type that implements the += operator, since it is assumed that the only reason to merge bins is to amplify statistical power, which is impossible if the merged bins disappear without transferring their collected data to the common bin (composed of the merged bins).

using BinStorageT = BinnedStorage<double, Binning<DAxis<std::string>, CAxis<double>>>;
BinStorageT binnedStorage1({"test1", "test2"}, {1., 2., 3.});
BinStorageT binnedStorage2({"test1", "test2"}, {1., 2., 3., 4., 5., 6.});

// binnedStorage.mergeBins<0>({1,2}); /// Won't compile. Discrete axes are not mergeable.
binnedStorage1.mergeBins<1>({1,2});

binnedStorage2.mergeBins<1>({0,4});

assert(binnedStorage1.numBins() == 6 &&
       binnedStorage2.numBins() == 6);

One can fill or set a bin.

Setting copies a Bin object into the corresponding bin storage vector slot:

auto binning = std::make_shared<BinningT>(
  BinningT({"test1", "test2"}, {1., 2., 3., 4.}, {1, 2, 3, 4, 5}));

auto binStorage = StorageT<TestClass>(binning, [](...){});

binStorage.set({"test1", 1.5, 1}, TestClass(15));

auto& bin = binStorage.getBin(2);

assert(bin.counter == 15);

Filling calls the FillAdapter with the object corresponding to the provided coordinates:

auto binning = std::make_shared<BinningT>(
  BinningT({"test1", "test2"}, {1., 2., 3., 4.}, {1, 2, 3, 4, 5}));

auto binStorage = StorageT<Dbn<3>>(binning);

binStorage.fill({"test1", 1.4, 1}, 2.0, 1.0);
binStorage.fill({"test1", 1.6, 1}, 2.0, 1.0);

auto& bin = binStorage.getBin(2);

assert(bin.numEntries() == 2);

Discussion

The histogram reimplementation on top of this centralized binning backend shows that the approach greatly reduces code duplication and simplifies code maintenance and modernization.

While the code is certainly more intricate than before due to the wide use of templates, the overall design is now intuitive.

Acknowledgements

Big thanks to my mentors Louie Dartmoor Corpe and Andy Buckley for all the help and warm attitude :)

Thanks to CERN-HSF for organizing this event.

Thanks to Google for hosting this program. It is a great help to Open Source initiatives.
