@eguiraud
Last active June 28, 2017 10:38
TDataFrame gists

TDataFrame developer guide, or "why the hell is it implemented like this"

Before delving into this utterly incomplete developer's guide, make sure you have read the user's guide here.

Overview

The important objects that are part of the TDataFrame framework are the following:

  • Helpers (all objects in TDFActionHelpers.{hxx,cxx}): the objects that actually execute the actions. There is roughly one helper per possible action. They are full-blown objects because in general they need to store state (e.g. the current value of the Max of a branch), they must be thread-aware (i.e. they store one partial result per thread/slot) and they must perform finalising operations when the event loop terminates (e.g. merging the partial results).
  • TDataFrameValue (docs): an abstraction over the different kinds of branch values that the nodes in the TDataFrame framework have to deal with. It stores exactly one object among TTreeReaderValue<T>, TTreeReaderArray<T> and T*. The raw pointer handles the case of a temporary branch. Each node stores a tuple of TDataFrameValues and calls their Get method to access the actual branch value, independently of the underlying type.
  • TDataFrameAction (docs): it represents a leaf of the graph: no other nodes can hang from it. It is a generic entry-point for the Operations to be executed. It fetches the values of the different (concrete or temporary) columns and passes them to the actual operation for each one of the events.
  • TActionResultProxy (docs): a wrapper around the result of an action, that behaves as a smart pointer to it. Upon first invocation of its operator->, it triggers the event-loop that generates the result, then it returns it.
  • TDataFrameFilter (docs): it represents a node of the graph that performs a filtering operation. It is templated over the type of the filter functor and inherits from a non-template TDataFrameFilterBase, which allows storing multiple filters in an STL container.
  • TDataFrameBranch (docs): a node of the graph that is responsible for the lazy evaluation of the values of a temporary column, for each event. (note: this name is a bit misleading. TDFTmpColumn or something along these lines might be more appropriate).
  • TDataFrameRange (docs): a node that acts as a filter but instead of checking a condition on the branch values it returns true or false according to the range specification it implements.
  • TDataFrameImpl (docs): the root node of a graph and the manager of the event loop. In addition, this object is responsible for storing all registered filters, temporary columns, ranges, actions and results, so they remain in scope as long as the TDataFrameImpl does. There is only one TDataFrameImpl per graph. TActionResultProxys trigger TDataFrameImpl::Run when they are accessed for the first time. (note: the name does not reflect the object's function anymore. Maybe something along the lines of TDFPlayer or TDFManager would be better).
  • TDataFrameInterface (docs): this object implements all user-accessible methods (Filter, Define, Range and all actions) and stores a shared_ptr to a node of the graph (see ownership below). Users only ever handle TDataFrameInterfaces (templated over the different kinds of graph nodes), which make it easy to store a node, move it, copy it and use it in whatever way is needed. Transformations return a new TDataFrameInterface object, actions return a TActionResultProxy (a wrapper for the function result).
  • TDataFrame (docs): a facade for TDataFrameInterface<TDataFrameImpl>, that exposes the right constructors to users and builds a TDataFrameImpl by passing the appropriate arguments.
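
As a rough illustration of the helper pattern described in the first bullet above, the sketch below keeps one partial maximum per slot and merges the partial results once the loop is over. The names MaxHelperSketch, Exec, Finalize and MaxOfTwoSlotsDemo are chosen for this example only; they are assumptions and not necessarily the names or signatures used in TDFActionHelpers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for an action helper: one partial result per slot.
class MaxHelperSketch {
   std::shared_ptr<double> fResult;  // final result, shared with the caller
   std::vector<double> fPartialMaxs; // one entry per slot, so slots never share state

public:
   MaxHelperSketch(std::shared_ptr<double> result, std::size_t nSlots)
      : fResult(std::move(result)), fPartialMaxs(nSlots, std::numeric_limits<double>::lowest())
   {
   }

   // Called once per event by the thread that owns `slot`.
   void Exec(std::size_t slot, double value)
   {
      assert(slot < fPartialMaxs.size());
      fPartialMaxs[slot] = std::max(fPartialMaxs[slot], value);
   }

   // Called once, after the event loop: merge the partial results.
   void Finalize()
   {
      assert(!fPartialMaxs.empty());
      *fResult = *std::max_element(fPartialMaxs.begin(), fPartialMaxs.end());
   }
};

// Small driver: the two slots are filled sequentially here for simplicity.
double MaxOfTwoSlotsDemo()
{
   auto result = std::make_shared<double>(0.);
   MaxHelperSketch helper(result, 2);
   for (double v : {1., 7., 3.})
      helper.Exec(0, v);
   for (double v : {4., 2.})
      helper.Exec(1, v);
   helper.Finalize();
   return *result;
}
```

Because each slot writes only to its own element, no locking is needed inside Exec; the only merging step is the single-threaded Finalize.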

Ownership

Each functional graph consists of one TDataFrameImpl instance from which all other nodes hang and which registers and has shared ownership of all filters, temporary columns, actions, etc. The ownership is shared with the TDataFrameInterface that encapsulates each particular graph node, so that in a call such as

auto c = df.Filter(...).Filter(...).Count()

the intermediate TDataFrameInterface<TDataFrameFilter> objects can be used to chain transformations and then be safely destroyed: the TDataFrameFilter objects are safely stored inside the TDataFrameImpl.

Each node stores a reference to the previous node and a reference to the TDataFrameImpl. Both could become dangling if the TDataFrameInterface<TDataFrameImpl> goes out of scope. This is why each TDataFrameInterface stores a weak_ptr to the TDataFrameImpl and checks that it has not expired before performing any action, so that any usage of these potentially dangling references is protected by a check on the validity of the TDataFrameImpl. The same goes for TActionResultProxys, which check that the TDataFrameImpl is still in scope before triggering its event loop.
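
The validity check described above can be sketched as follows. RootSketch and HandleSketch are hypothetical stand-ins for TDataFrameImpl and for a TDataFrameInterface or TActionResultProxy; the error message and the use of lock() instead of a separate expired() test are choices made for this example, not a description of the actual implementation.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>

// Hypothetical stand-in for TDataFrameImpl: it only counts how often it ran.
struct RootSketch {
   int fNRuns = 0;
   void Run() { ++fNRuns; }
};

// Hypothetical stand-in for a TDataFrameInterface or a TActionResultProxy.
class HandleSketch {
   std::weak_ptr<RootSketch> fRootWeakPtr; // does not keep the root node alive

public:
   explicit HandleSketch(const std::shared_ptr<RootSketch> &root) : fRootWeakPtr(root) {}

   // Check that the root node is still alive before touching it.
   void TriggerRun()
   {
      auto root = fRootWeakPtr.lock();
      if (!root)
         throw std::runtime_error("the root node of the graph went out of scope");
      root->Run();
   }
};

// Returns true if using the handle after the root is gone is reported as an error.
bool DanglingUseIsCaughtDemo()
{
   auto root = std::make_shared<RootSketch>();
   HandleSketch handle(root);
   handle.TriggerRun(); // fine: the root node is alive
   assert(root->fNRuns == 1);
   root.reset(); // the root node goes out of scope
   try {
      handle.TriggerRun();
   } catch (const std::runtime_error &) {
      return true;
   }
   return false;
}
```

The check happens once, before the event loop starts, which is consistent with keeping shared and weak pointers out of the hot loop.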

Usage of shared and weak pointers inside the event-loop is strongly discouraged as the thread safety guarantee of shared pointers has a cost in terms of performance of the hot loop.

Finally, each function object stores a tuple of TDataFrameValues, which abstract the differences in handling actual TTree with TTreeReaderValues/TTreeReaderArrays or temporary columns. Each TDataFrameValue contains a unique_ptr to a TTreeReaderValue, a unique_ptr to a TTreeReaderArray and a raw pointer (non-owning, read-only) to a temporary column value. Only one of these three will be non-null for each TDataFrameValue, but which one must be decided at runtime.
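
A minimal sketch of this idea, reduced to two of the three cases, is shown below. ReaderValueSketch stands in for TTreeReaderValue and ColumnValueSketch for TDataFrameValue; both classes and their SetReader/SetTmpColumn methods are hypothetical and only meant to show how a single Get can hide where the value comes from.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical stand-in for TTreeReaderValue<T>: yields the value read from a tree.
template <typename T>
struct ReaderValueSketch {
   T fValue;
   T *Get() { return &fValue; }
};

// Hypothetical stand-in for TDataFrameValue<T>.
template <typename T>
class ColumnValueSketch {
   std::unique_ptr<ReaderValueSketch<T>> fReaderValue; // owning: value of a "real" branch
   T *fTmpColumnValue = nullptr;                       // non-owning: value of a temporary column

public:
   // Which of the two setters is called is only known at runtime.
   void SetReader(std::unique_ptr<ReaderValueSketch<T>> reader) { fReaderValue = std::move(reader); }
   void SetTmpColumn(T *value) { fTmpColumnValue = value; }

   // Callers do not need to know where the value comes from.
   const T &Get()
   {
      assert((fReaderValue != nullptr) != (fTmpColumnValue != nullptr)); // exactly one is set
      return fReaderValue ? *fReaderValue->Get() : *fTmpColumnValue;
   }
};

int SumOfBothKindsDemo()
{
   ColumnValueSketch<int> fromTree;
   fromTree.SetReader(std::unique_ptr<ReaderValueSketch<int>>(new ReaderValueSketch<int>{40}));

   int tmpColumn = 2; // owned elsewhere, e.g. by the node that defines the temporary column
   ColumnValueSketch<int> fromTmpColumn;
   fromTmpColumn.SetTmpColumn(&tmpColumn);

   return fromTree.Get() + fromTmpColumn.Get();
}
```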

Action methods

After the execution of an action method in TDataFrameInterface, the following things should have happened:

  • the caller must have received a TActionResultProxy<ResultType> for lazy actions, or the result of the call for instant actions (n.b. at the time of writing the only instant action is Foreach, which returns nothing)
  • the TDataFrameImpl must have registered the corresponding "readiness pointer" (handled by MakeActionResultProxy)
  • the TDataFrameImpl must also have registered the TDataFrameAction object that encapsulates the Operation to be executed by the action
  • the TDataFrameInterface object on which the action method is called should execute fProxiedPtr->IncrChildrenCount to increase the count of children nodes for the graph node it encapsulates

The simplest example of implementation of an action method is probably Min: it calls CreateAction almost immediately, which has two different overloads depending on whether the branch type has been explicitly specified as template parameter to the Min call itself or has been omitted.

No branch type inference

If the branch type has been explicitly specified, CreateAction calls BuildAndBook with the right ActionType parameter, which in turn takes care of building the TDataFrameAction object, booking it with the TDataFrameImpl and returning the TActionResultProxy.

Branch type inference

If the branch type has not been explicitly specified, it defaults to TDataFrameGuessedType, and the relevant overload of CreateAction is called. This overload calls CreateActionGuessed, which is responsible for jitting and executing a call to the actual action method with all types explicitly specified. That call then proceeds as described in the previous section.
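
The dispatch between the two paths can be sketched with a default template argument and a pair of overloads. GuessedTypeSketch, CreateActionSketch and MinSketch are hypothetical names; the functions only report which path was taken, whereas the real code books an action or jits a typed call.

```cpp
#include <cassert>
#include <string>

// Hypothetical tag standing in for TDataFrameGuessedType.
struct GuessedTypeSketch {};

// Overload chosen when the branch type was given explicitly.
template <typename T>
std::string CreateActionSketch(const std::string &branch, T * /*typeTag*/)
{
   return "build and book Min on '" + branch + "' with the given type";
}

// Overload chosen when the type was omitted. The real code would jit a call with the
// inferred type at this point; this sketch only reports the path taken.
inline std::string CreateActionSketch(const std::string &branch, GuessedTypeSketch * /*typeTag*/)
{
   return "infer the type of '" + branch + "', then jit the typed call";
}

// The user-facing method: the template parameter defaults to the tag type.
template <typename T = GuessedTypeSketch>
std::string MinSketch(const std::string &branch)
{
   return CreateActionSketch(branch, static_cast<T *>(nullptr));
}
```

With these definitions, MinSketch<double>("pt") selects the first overload and MinSketch("pt") the second, since a non-template function is preferred over a template when both match equally well.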

The event-loop

The event-loop is started whenever a TActionResultProxy calls TriggerRun, or an instant action is called. In both cases TDataFrameImpl::Run is called (after checking that the TDataFrameImpl object is still in scope), which is the method responsible for running the loop.

If EnableImplicitMT has not been called, a "normal" TTreeReader is built and the loop is wrapped by the usual while(reader.Next()). Otherwise a TTreeProcessorMT takes responsibility for running the loop and distributing entries to several workers. In both cases the logic executed inside the event loop is the same.

What actually happens is something similar to the following pseudo-code:

foreach(event)
   foreach(booked_action)
      booked_action.exec()
   foreach(named_filter)
      named_filter.check_filter()

The first inner loop makes sure that all actions are executed. Each action, in turn, calls CheckFilters on the previous node, which calls CheckFilters on the previous one, etc., in a chain of calls that either ends at the root of the functional graph (i.e. TDataFrameImpl, whose CheckFilters method always returns true) or stops early as soon as one of the filters returns false.

The second inner loop makes sure that all named filters are checked for each event, even if they would not be checked otherwise (e.g. because a downstream filter already returns false), because named filters are required to count the number of events that pass and do not pass that check.
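
The pseudo-code and the chain of CheckFilters calls can be made concrete with the single-threaded sketch below. All class names are hypothetical, entries are plain integers, and two details are assumptions of this sketch rather than statements about the real code: each filter caches its result for the current entry so that it is evaluated only once, and a filter updates its counters only for entries that passed all upstream nodes.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical node interface; the real nodes also take a slot number.
struct NodeSketch {
   virtual ~NodeSketch() = default;
   virtual bool CheckFilters(int entry) = 0;
};

// Root of the graph: nothing upstream, so every entry passes.
struct RootNodeSketch : NodeSketch {
   bool CheckFilters(int) override { return true; }
};

// A filter node that counts accepted and rejected entries, as named filters do.
struct FilterSketch : NodeSketch {
   NodeSketch &fPrev;
   std::function<bool(int)> fPredicate;
   int fLastChecked = -1; // cache: an entry is evaluated and counted only once
   bool fLastResult = false;
   int fAccepted = 0;
   int fRejected = 0;

   FilterSketch(NodeSketch &prev, std::function<bool(int)> pred) : fPrev(prev), fPredicate(std::move(pred)) {}

   bool CheckFilters(int entry) override
   {
      if (entry != fLastChecked) {
         fLastChecked = entry;
         if (!fPrev.CheckFilters(entry)) {
            fLastResult = false; // an upstream node rejected the entry
         } else {
            fLastResult = fPredicate(entry);
            if (fLastResult)
               ++fAccepted;
            else
               ++fRejected;
         }
      }
      return fLastResult;
   }
};

// A leaf of the graph: counts the entries that pass all upstream filters.
struct CountActionSketch {
   NodeSketch &fPrev;
   int fCount = 0;
   explicit CountActionSketch(NodeSketch &prev) : fPrev(prev) {}
   void Exec(int entry)
   {
      if (fPrev.CheckFilters(entry))
         ++fCount;
   }
};

struct LoopResultSketch {
   int fCount, fAccepted, fRejected;
};

LoopResultSketch RunLoopDemo()
{
   RootNodeSketch root;
   FilterSketch even(root, [](int e) { return e % 2 == 0; });
   FilterSketch small(even, [](int e) { return e < 6; }); // no action hangs from this filter
   CountActionSketch count(even);

   std::vector<CountActionSketch *> bookedActions{&count};
   std::vector<FilterSketch *> namedFilters{&even, &small};

   for (int entry = 0; entry < 10; ++entry) {
      for (auto *action : bookedActions)
         action->Exec(entry);
      for (auto *filter : namedFilters)
         filter->CheckFilters(entry); // keeps the counters of `small` up to date
   }
   return {count.fCount, small.fAccepted, small.fRejected};
}
```

In this example the filter called small would never be evaluated by the action loop alone, since no action depends on it; the second inner loop is what fills its counters.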

Clean-up

After the event-loop has finished running, the TDataFrameImpl clears its internal lists of all actions, sets the "readiness" values of all TActionResultProxies to true, and forgets the TActionResultProxies as well, since they will never be needed again (an action is never executed twice, and after the event-loop has run users are guaranteed that TDataFrame will not modify the results again). Filters and temporary columns are not forgotten, as they might be re-used in subsequent runs of the event-loop (they are not leaves in the functional graph).
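
The interplay between lazy results, "readiness" flags and clean-up could look roughly like the sketch below. LoopManagerSketch and ResultProxySketch are hypothetical stand-ins for TDataFrameImpl and TActionResultProxy, and each booked action is reduced to a single callable instead of being executed once per event.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical stand-in for TDataFrameImpl.
class LoopManagerSketch {
   std::vector<std::function<void()>> fBookedActions;
   std::vector<std::shared_ptr<bool>> fReadiness;
   int fNRuns = 0;

public:
   void Book(std::function<void()> action, std::shared_ptr<bool> readiness)
   {
      fBookedActions.push_back(std::move(action));
      fReadiness.push_back(std::move(readiness));
   }

   void Run()
   {
      ++fNRuns;
      for (auto &action : fBookedActions)
         action();
      // Clean-up: mark all results as ready, then forget actions and readiness flags.
      for (auto &ready : fReadiness)
         *ready = true;
      fBookedActions.clear();
      fReadiness.clear();
   }

   int GetNRuns() const { return fNRuns; }
};

// Hypothetical stand-in for TActionResultProxy<T>.
template <typename T>
class ResultProxySketch {
   std::shared_ptr<T> fResult;
   std::shared_ptr<bool> fReady;
   std::weak_ptr<LoopManagerSketch> fManager;

public:
   ResultProxySketch(std::shared_ptr<T> result, std::shared_ptr<bool> ready,
                     const std::shared_ptr<LoopManagerSketch> &manager)
      : fResult(std::move(result)), fReady(std::move(ready)), fManager(manager)
   {
   }

   // The first access triggers the loop; later accesses find the result ready.
   T &operator*()
   {
      if (!*fReady) {
         auto manager = fManager.lock();
         if (!manager)
            throw std::runtime_error("the loop manager went out of scope");
         manager->Run();
      }
      return *fResult;
   }
};

int NumberOfRunsDemo()
{
   auto manager = std::make_shared<LoopManagerSketch>();
   auto a = std::make_shared<int>(0);
   auto b = std::make_shared<int>(0);
   auto readyA = std::make_shared<bool>(false);
   auto readyB = std::make_shared<bool>(false);
   manager->Book([a] { *a = 1; }, readyA);
   manager->Book([b] { *b = 2; }, readyB);

   ResultProxySketch<int> proxyA(a, readyA, manager);
   ResultProxySketch<int> proxyB(b, readyB, manager);
   const int sum = *proxyA + *proxyB; // whichever is accessed first runs the loop for both
   assert(sum == 3);
   return manager->GetNRuns();
}
```

Both results are produced by a single run, and since the readiness flags are shared with the proxies, the manager can forget them right after setting them.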

How does this compare with TTree::Draw?

TDataFrame can do more than TTree::Draw, which only produces plots; in fact, with TDataFrame one can do everything that can be done by running the event loop explicitly using TTreeReader or a TSelector. It also allows performing several actions in the same event loop, while TTree::Draw loops once for every call to it. Finally, TDataFrame has the advantage of using compiled C++ expressions as cuts and other operations, which makes it both safer to use and more powerful.

On the other hand, TTree::Draw allows the use of a domain-specific language to specify queries, which lets users express certain operations (e.g. loops over collections) in a very compact manner.

In terms of raw performance, the first, admittedly preliminary benchmark shows the same speed as TTree::Draw for the filling of one histogram reading a branch containing a complex object and using one core. TDataFrame performs better than TTree::Draw as soon as implicit multi-threading is enabled or multiple histograms are filled (TDataFrame fills them in the same event-loop while TTree::Draw loops once per histogram).

What if I want to be in control of what happens in the event loop?

The Foreach and ForeachSlot actions allow you to completely specify what is done for each event in the tree. As a bonus, executing the event-loop in parallel on multiple threads remains straightforward, provided the functions executed are thread-safe.

Why is there no explicit Run method to start the event loop precisely when I want to?

This (explicit or implicit triggering of the event loop) was not an easy choice. There are pros and cons to both approaches. An explicit Run would allow users to use action results before they are filled/finalised by the TDataFrame, which we felt might be a source of confusion. Moreover, it would require users to remember to call Run, and to keep track of which action results have been filled at a given point of the execution. Implicit triggering solves all the issues above, while still allowing expert users to manage the triggering at will: just dereference an action result, and that is your Run method right there.

Why can't I use ranges with ImplicitMT enabled?

Because it would be convoluted and non-performant to make them behave in a sensible way in a multi-threaded environment:

  • we would need extra synchronization points, possibly very fine grained ones, to keep track of how many entries have been processed so far globally
  • we would need to insert very invasive synchronized checks in all tasks to make sure that all tasks quit early as soon as the requested number of entries has been processed; otherwise we would get the total number of processed entries wrong by the order of magnitude of one cluster (possibly tens of thousands of entries)

Basically, the only request we could respect in a multi-threaded analysis is the step parameter.