Before delving into this utterly incomplete developer's guide, make sure you have read the user's guide here.
The important objects that are part of the TDataFrame framework are the following:
- Helpers (all objects in `TDFActionHelpers.{hxx,cxx}`): the objects that actually execute the actions. There is about one helper per possible action. These are full-blown objects because in general they need to store state (e.g. the current value of `Max` of a branch), they must be thread-aware (i.e. they store one partial result per thread/slot) and they must perform finalising operations when the event-loop terminates (e.g. merging the partial results).
- TDataFrameValue (docs): an abstraction over the different kinds of branch values that the nodes in the TDataFrame framework have to deal with. It stores exactly one object among `TTreeReaderValue<T>`, `TTreeReaderArray<T>` and `T*`. The raw pointer handles the case of a temporary branch. Each node stores a tuple of `TDataFrameValue`s and calls their `Get` method to access the actual branch value, independently of the underlying type.
- TDataFrameAction (docs): it represents a leaf of the graph: no other nodes can hang from it. It is a generic entry-point for the `Operations` to be executed. It fetches the values of the different (concrete or temporary) columns and passes them to the actual operation for each one of the events.
- TActionResultProxy (docs): a wrapper around the result of an action that behaves as a smart pointer to it. Upon first invocation of its `operator->`, it triggers the event-loop that generates the result, then it returns it.
- TDataFrameFilter (docs): it represents a node of the graph that performs a filtering operation. It is templated over the type of the filter functor and inherits from a non-template `TDataFrameFilterBase` that allows storage of multiple filters in an STL container.
- TDataFrameBranch (docs): a node of the graph that is responsible for the lazy evaluation of the values of a temporary column, for each event. (Note: this name is a bit misleading. `TDFTmpColumn` or something along these lines might be more appropriate.)
- TDataFrameRange (docs): a node that acts as a filter, but instead of checking a condition on the branch values it returns `true` or `false` according to the range specification it implements.
- TDataFrameImpl (docs): the root node of a graph and the manager of the event loop. In addition, this object is responsible for storing all registered filters, temporary columns, ranges, actions and results, so they remain in scope as long as the `TDataFrameImpl` does. There is only one `TDataFrameImpl` per graph. `TActionResultProxy`s trigger `TDataFrameImpl::Run` when they are accessed for the first time. (Note: the name does not reflect the object's function anymore. Maybe something along the lines of `TDFPlayer` or `TDFManager` would be better.)
- TDataFrameInterface (docs): this object implements all user-accessible methods (`Filter`, `Define`, `Range` and all actions) and stores a `shared_ptr` to a node of the graph (see ownership below). Users only ever handle `TDataFrameInterface`s (templated over the different kinds of graph nodes), which make it easy to store a node, move it, copy it and use it in whatever way. Transformations return a new `TDataFrameInterface` object, actions return a `TActionResultProxy` (a wrapper for the function result).
- TDataFrame (docs): a facade for `TDataFrameInterface<TDataFrameImpl>` that exposes the right constructors to users and builds a `TDataFrameImpl` by passing the appropriate arguments.
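To make the role of the helpers more concrete, here is a minimal sketch of a `Max`-like helper that keeps one partial result per slot and merges them when the loop is over. It is not the code in `TDFActionHelpers.{hxx,cxx}`: the class name `MaxHelperSketch`, its members and the `Exec`/`Finalize` method names are assumptions made for illustration only.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for an action helper; names are illustrative only.
class MaxHelperSketch {
   std::shared_ptr<double> fResult;  // final result, shared with the caller
   std::vector<double> fPartialMaxs; // one partial result per thread/slot
public:
   MaxHelperSketch(std::shared_ptr<double> result, unsigned int nSlots)
      : fResult(std::move(result)),
        fPartialMaxs(nSlots, std::numeric_limits<double>::lowest())
   {
      assert(nSlots > 0); // the merge below needs at least one slot
   }

   // Called once per event by the worker that owns `slot`. No locking is
   // needed because each slot only touches its own partial result.
   void Exec(unsigned int slot, double value)
   {
      fPartialMaxs[slot] = std::max(fPartialMaxs[slot], value);
   }

   // Called once after the event loop: merge the partial results.
   void Finalize()
   {
      *fResult = *std::max_element(fPartialMaxs.begin(), fPartialMaxs.end());
   }
};

// Two slots are fed sequentially here to keep the sketch single-threaded.
inline double MaxHelperExample()
{
   auto result = std::make_shared<double>(0.);
   MaxHelperSketch helper(result, 2);
   helper.Exec(0, 1.5);
   helper.Exec(1, 7.0);
   helper.Exec(0, 3.0);
   helper.Finalize();
   return *result;
}
```

Keeping a separate partial result per slot is what lets the per-event call stay lock-free; the only point where slots interact is the final merge.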
Each functional graph consists of one `TDataFrameImpl` instance from which all other nodes hang and which registers and has shared ownership of all filters, temporary columns, actions, etc. The ownership is shared with the `TDataFrameInterface` that encapsulates each particular graph node, so that in a call such as
    auto c = df.Filter(...).Filter(...).Count();
the intermediate `TDFInterface<TDFFilter>` can be used to chain transformations and then be safely destroyed: the `TDataFrameFilter` objects are safely stored inside the `TDataFrameImpl`.
Each node stores a reference to the previous node and a reference to the `TDataFrameImpl`. Both could become dangling in case the `TDFInterface<TDataFrameImpl>` goes out of scope. This is why each `TDFInterface` stores a `weak_ptr` to the `TDataFrameImpl` and checks that it has not expired before performing any action, so that any usage of these potentially dangling references is protected by a check on the validity of the `TDataFrameImpl`. The same goes for `TActionResultProxy`s, which check that the `TDataFrameImpl` is still in scope before triggering its event loop.
Usage of shared and weak pointers inside the event-loop is strongly discouraged as the thread safety guarantee of shared pointers has a cost in terms of performance of the hot loop.
Finally, each function object stores a tuple of `TDataFrameValue`s, which abstract the differences in handling an actual TTree with `TTreeReaderValue`s/`TTreeReaderArray`s or temporary columns. Each `TDataFrameValue` contains a `unique_ptr` to a `TTreeReaderValue`, a `unique_ptr` to a `TTreeReaderArray` and a raw pointer (non-owning, read-only) to a temporary column value. Only one of these three will be non-null for each `TDataFrameValue`, but which one must be decided at runtime.
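A reduced sketch of this idea, under several assumptions: `ReaderValueSketch` is a hypothetical stand-in for `TTreeReaderValue<T>`, the array case is left out for brevity, and the `MakeProxy`/`SetTmpColumn` method names are invented for the example. Only the runtime choice between an owned reader and a non-owning pointer is illustrated.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for TTreeReaderValue<T>: dereferencing it yields
// the value of the branch for the current entry.
template <typename T>
struct ReaderValueSketch {
   T fCurrent;
   T &operator*() { return fCurrent; }
};

// Hypothetical stand-in for TDataFrameValue<T> (array case omitted).
template <typename T>
class ColumnValueSketch {
   std::unique_ptr<ReaderValueSketch<T>> fReaderValue; // set for real branches
   T *fTmpColumn = nullptr; // set for temporary columns, non-owning
public:
   void MakeProxy(const T &initial) { fReaderValue.reset(new ReaderValueSketch<T>{initial}); }
   void SetTmpColumn(T *value) { fTmpColumn = value; }

   // Same call for the caller, whatever the underlying storage is.
   T &Get()
   {
      // Exactly one of the two alternatives must be set.
      assert((fReaderValue != nullptr) != (fTmpColumn != nullptr));
      return fReaderValue ? **fReaderValue : *fTmpColumn;
   }
};

inline double ColumnValueExample()
{
   ColumnValueSketch<double> fromTree;
   fromTree.MakeProxy(1.5);

   double tmp = 2.5;
   ColumnValueSketch<double> fromTmp;
   fromTmp.SetTmpColumn(&tmp);

   return fromTree.Get() + fromTmp.Get();
}
```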
After the execution of an action method in `TDataFrameInterface`, the following things should have happened:
- the caller must have received a `TActionResultProxy<ResultType>` for lazy actions, or the result of the call for instant actions (n.b. at the time of writing the only instant action is `Foreach`, which returns nothing)
- the `TDataFrameImpl` must have registered the corresponding "readiness pointer" (handled by `MakeActionResultProxy`)
- the `TDataFrameImpl` must also have registered the `TDataFrameAction` object that encapsulates the `Operation` to be executed by the action
- the `TDataFrameInterface` object on which the action method is called should execute `fProxiedPtr->IncrChildrenCount` to increase the count of children nodes for the graph node it encapsulates
The simplest example of implementation of an action method is probably `Min`: it calls `CreateAction` almost immediately, which has two different overloads depending on whether the branch type has been explicitly specified as template parameter to the `Min` call itself or has been omitted.
If the branch type has been explicitly specified, `CreateAction` calls `BuildAndBook` with the right `ActionType` parameter, which in turn takes care of building the `TDataFrameAction` object, booking it with `TDataFrameImpl` and returning the `TActionResultProxy`.
If the branch type has not been explicitly specified, it defaults to `TDataFrameGuessedType`, and the relevant overload of `CreateAction` is called. This overload calls `CreateActionGuessed`, which is responsible for jitting and executing the actual action method with all types explicitly specified. That call then proceeds as described in the previous paragraph.
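The choice between the two code paths can be sketched with a default template argument and a pair of mutually exclusive overloads. This is only an illustration: `GuessedTypeSketch`, `CreateActionSketch` and `MinSketch` are hypothetical names, the functions merely report which path would be taken, and selecting the overload with `std::enable_if` is an assumption (the actual code may well use a different mechanism, such as tag dispatch).

```cpp
#include <cassert>
#include <string>
#include <type_traits>

struct GuessedTypeSketch {}; // stand-in for TDataFrameGuessedType

// Overload selected when the branch type was specified explicitly.
template <typename BranchType,
          typename std::enable_if<!std::is_same<BranchType, GuessedTypeSketch>::value, int>::type = 0>
std::string CreateActionSketch()
{
   return "BuildAndBook";
}

// Overload selected when the branch type was omitted and must be inferred.
template <typename BranchType,
          typename std::enable_if<std::is_same<BranchType, GuessedTypeSketch>::value, int>::type = 0>
std::string CreateActionSketch()
{
   return "CreateActionGuessed";
}

// The action method: the template parameter defaults to the "guessed" tag.
template <typename BranchType = GuessedTypeSketch>
std::string MinSketch()
{
   return CreateActionSketch<BranchType>();
}

inline void DispatchExample()
{
   assert(MinSketch<double>() == "BuildAndBook");  // type given by the user
   assert(MinSketch<>() == "CreateActionGuessed"); // type left to inference
}
```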
The event-loop is started whenever a `TActionResultProxy` calls `TriggerRun`, or an instant action is called. In both cases `TDataFrameImpl::Run`, the method responsible for running the loop, is called (after checking that the `TDataFrameImpl` object is still in scope).
If `EnableImplicitMT` has not been called, a "normal" `TTreeReader` is built and the loop is wrapped by the usual `while(reader.Next())`. Otherwise a `TTreeProcessorMT` takes responsibility for running the loop and distributing entries to several workers. In both cases the logic executed inside the event loop is the same.
What actually happens is something similar to the following pseudo-code:
    foreach(event)
        foreach(booked_action)
            booked_action.exec()
        foreach(named_filter)
            named_filter.check_filter()
The first two lines of the loop body make sure that all actions are executed. Each action, in turn, calls `CheckFilters` on the previous node, which calls `CheckFilters` on the previous one, etc. This chain of calls either reaches the root of the functional graph (i.e. `TDataFrameImpl`, whose `CheckFilters` method always returns `true`) or stops early if any of the filters returns `false`.
The last two lines make sure that all named filters are checked for each event, even if they would not be checked otherwise (e.g. because a downstream filter already returns false), because named filters are required to count the number of events that pass and that do not pass that check.
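The loop logic described above can be sketched in a few lines of plain C++. All names below (`NodeSketch`, `FilterSketch`, `CountActionSketch`) are hypothetical. Two details are assumptions of this sketch rather than facts taken from the text: each filter caches its result per event so that it is evaluated and counted only once, and a filter updates its counters only when the upstream nodes accepted the event.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

struct NodeSketch {
   virtual ~NodeSketch() {}
   virtual bool CheckFilters(int event) = 0;
};

// Root of the graph: nothing to filter, so every event is accepted.
struct RootSketch : NodeSketch {
   bool CheckFilters(int) override { return true; }
};

struct FilterSketch : NodeSketch {
   NodeSketch &fPrev;
   std::function<bool(int)> fFilter;
   int fLastCheckedEvent = -1;
   bool fLastResult = false;
   unsigned fAccepted = 0, fRejected = 0;

   FilterSketch(NodeSketch &prev, std::function<bool(int)> f) : fPrev(prev), fFilter(std::move(f)) {}

   bool CheckFilters(int event) override
   {
      if (event != fLastCheckedEvent) { // evaluate at most once per event
         if (!fPrev.CheckFilters(event)) {
            fLastResult = false; // an upstream node rejected: stop early
         } else {
            fLastResult = fFilter(event);
            if (fLastResult) ++fAccepted; else ++fRejected;
         }
         fLastCheckedEvent = event;
      }
      return fLastResult;
   }
};

struct CountActionSketch {
   NodeSketch &fPrev;
   unsigned fCount = 0;
   explicit CountActionSketch(NodeSketch &prev) : fPrev(prev) {}
   void Exec(int event) { if (fPrev.CheckFilters(event)) ++fCount; }
};

inline unsigned EventLoopExample()
{
   RootSketch root;
   FilterSketch even(root, [](int e) { return e % 2 == 0; });
   FilterSketch big(even, [](int e) { return e >= 4; });
   CountActionSketch count(big);
   std::vector<CountActionSketch *> bookedActions{&count};
   std::vector<FilterSketch *> namedFilters{&even, &big};

   for (int event = 0; event < 10; ++event) {
      for (auto *action : bookedActions) action->Exec(event);
      for (auto *filter : namedFilters) filter->CheckFilters(event);
   }
   assert(even.fAccepted == 5); // events 0, 2, 4, 6, 8, each counted once
   return count.fCount;         // events 4, 6, 8
}
```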
After the event-loop has finished running, the `TDataFrameImpl` clears its internal lists of all actions, sets the "readiness" values of all `TActionResultProxies` to `true`, and forgets the `TActionResultProxies` as well, since they will never be needed again (an action is never executed twice, and after the event-loop has run users are guaranteed that `TDataFrame` will not modify the results again). Filters and temporary columns are not forgotten, as they might be re-used in subsequent runs of the event-loop (they are not leaves in the functional graph).
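As a rough illustration of this bookkeeping, the sketch below books an action together with a shared "readiness" flag, runs once, and then forgets both. `ManagerSketch` and its methods are hypothetical, and representing a booked action as a `std::function<void()>` is a simplification made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for the bookkeeping part of TDataFrameImpl.
class ManagerSketch {
   std::vector<std::function<void()>> fBookedActions;
   std::vector<std::shared_ptr<bool>> fReadiness;
public:
   // Book an action and hand back the readiness flag shared with its result.
   std::shared_ptr<bool> Book(std::function<void()> action)
   {
      fBookedActions.push_back(std::move(action));
      fReadiness.push_back(std::make_shared<bool>(false));
      return fReadiness.back();
   }

   void Run()
   {
      for (auto &action : fBookedActions) action(); // stand-in for the event loop
      for (auto &ready : fReadiness) *ready = true; // results are now final
      fBookedActions.clear();                       // an action never runs twice
      fReadiness.clear();                           // flags stay alive in the results
   }

   std::size_t NBooked() const { return fBookedActions.size(); }
};

inline int RunTwiceExample()
{
   ManagerSketch manager;
   int executions = 0;
   auto ready = manager.Book([&executions] { ++executions; });
   assert(!*ready);
   manager.Run();
   assert(*ready && manager.NBooked() == 0);
   manager.Run(); // nothing is booked any more, so nothing is re-executed
   return executions;
}
```

Because the flag is held through a `shared_ptr`, the result side can still read it after the manager has dropped its own copy.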