joshuafcole/Aggregation.md

## Aggregation.md

      
    Raw
  

              Aggregation.md
            
          
    Aggregation Statification Station Vacation III: Tokyo Drift

Misc. Notes


Only aggregates require stratification


We want to aim for the fewest strata possible


Aggregates that look the same should always have the same output


Aggregates which depend on their own outputs (directly or otherwise) are unorderable


Aggregates which filter (rather than provide) the rows contributing to their inputs are orderable


Aggregates have four kinds of variables
output = aggregate[input, given: projection, per: grouping]

output - Must not be bound at the time the aggregate runs

We can relax this through post-filtering


input - Must be fully satisfied before the aggregate's value is correct

We can relax this slightly in the case of grouped aggregates


projection - Indicate additional uniqueness over the input rows.

These have identical properties as inputs


grouping - Indicate the output cardinality of the aggregate

These don't have any special satisfaction requirements


To compute a single aggregate:


If an aggregate's output is bound, it may be decomposed into an aggregate with an unbound output and a post-filter (similar to expressions)
Conceptually, an aggregate maps a set of rows indicated by the projection to a single prefix row indicated by its grouping
Since the aggregate runs once per unique grouping prefix, the grouped variables do not need to be fully satisfied prior to execution
Aggregate output will only be correct when the entire set of unique rows for its projection have been stabilized

This means that the cardinality of the terms in the projection must not change after this point
Ungrouped aggregates run over the empty prefix
These rows are the suffix set of unique rows for one of its groupings
If the rest of the evaluator is incremental, you can just keep issuing updates, but that's pretty expensive


Process


Block until every variable it is grouped on has been satisfied for a single prefix
- Other variables may also be in this prefix, in which case the aggregate will be run multiple times for the same input. Feel free to get clever here
Accumulate all rows that are suffixes of this prefix (easily done by evaluating the aggregate on its trailing edge)
Project those down by the aggregate's projection
Evaluate
Satisfy the aggregate output variable

To stratify a query:


Stratification on aggregates can be viewed as a telescoping generic join
The prefix portion of the join has a cardinality of C X #GROUPS
The suffix portion of the join has a cardinality of K X #PROJECTED per group
To ensure that aggregates always end up operating over the same input, we must run them as early as possible


Satisfy the cheapest variable of the row, adding it to the prefix
Find all rows that match the new prefix
For each row, repeat from 1
If satisfying this variable completely fills the grouping of an aggregate:
A. Compute it
B. Satisfy its output variable
C. Repeat from 4
D. If the aggregate output variable was initially bound, post-filter
E. Repeat from 1

Other stuff


I don't think you're going to need this, but I wrote it down anyway

To build a dependency graph of a query:


All nodes depend on any variable they use and do not provide


Any node which may provide a variable only does so if no other node has

If a node that may provide a variable does not do so, it may filter that variable


Any node which may depend on a variable does so if it has been provided by another node


Node types:

A Record may provides all of its variables
An Action provides its entity variable if it is unbound
A Not may depend on any variable utilized by nodes in its body
A Union/Choose may provide any variable marked as an output and may depend on any variable utilized by nodes in its body

Variables marked output which are not provided by the Union/Choose must post-filter


An Expression/Aggregate may provides any variable marked output or optional

Variables marked output which are not provided by the Expression/Aggregate must post-filter


A node contributes to the cardinality of any variable it may expand or filter


Ordering a dependency graph:


Pre-sort nodes based on heuristics of your choice (preferably favoring deterministic orderings)
If no nodes remain, you're done
For each node with all dependencies satisfied:
A. If the node has subqueries, order them
B. If the node is capable of providing any un-provided variables, satisfy them
C. If the node must provide an already provided variable, split it into itself + a post-filter via a temporary variable
D. If a node contributes to a variable, note its contribution
E. If all contributors for a contributed variable have been ordered, satisfy it
If no new nodes have been ordered, the graph is unorderable
Repeat from 2