Skip to content

Instantly share code, notes, and snippets.

@joshuafcole
Last active September 14, 2016 00:03
Show Gist options
  • Save joshuafcole/6d8cdd44613d918a07be9f28ed38fc76 to your computer and use it in GitHub Desktop.
Save joshuafcole/6d8cdd44613d918a07be9f28ed38fc76 to your computer and use it in GitHub Desktop.

Aggregation Statification Station Vacation III: Tokyo Drift

Misc. Notes

  • Only aggregates require stratification

  • We want to aim for the fewest strata possible

  • Aggregates that look the same should always have the same output

  • Aggregates which depend on their own outputs (directly or otherwise) are unorderable

  • Aggregates which filter (rather than provide) the rows contributing to their inputs are orderable

  • Aggregates have four kinds of variables output = aggregate[input, given: projection, per: grouping]

    • output - Must not be bound at the time the aggregate runs
      • We can relax this through post-filtering
    • input - Must be fully satisfied before the aggregate's value is correct
      • We can relax this slightly in the case of grouped aggregates
    • projection - Indicate additional uniqueness over the input rows.
      • These have identical properties as inputs
    • grouping - Indicate the output cardinality of the aggregate
      • These don't have any special satisfaction requirements

To compute a single aggregate:

  • If an aggregate's output is bound, it may be decomposed into an aggregate with an unbound output and a post-filter (similar to expressions)
  • Conceptually, an aggregate maps a set of rows indicated by the projection to a single prefix row indicated by its grouping
  • Since the aggregate runs once per unique grouping prefix, the grouped variables do not need to be fully satisfied prior to execution
  • Aggregate output will only be correct when the entire set of unique rows for its projection have been stabilized
    • This means that the cardinality of the terms in the projection must not change after this point
    • Ungrouped aggregates run over the empty prefix
    • These rows are the suffix set of unique rows for one of its groupings
    • If the rest of the evaluator is incremental, you can just keep issuing updates, but that's pretty expensive

Process

  1. Block until every variable it is grouped on has been satisfied for a single prefix - Other variables may also be in this prefix, in which case the aggregate will be run multiple times for the same input. Feel free to get clever here
  2. Accumulate all rows that are suffixes of this prefix (easily done by evaluating the aggregate on its trailing edge)
  3. Project those down by the aggregate's projection
  4. Evaluate
  5. Satisfy the aggregate output variable

To stratify a query:

  • Stratification on aggregates can be viewed as a telescoping generic join
  • The prefix portion of the join has a cardinality of C X #GROUPS
  • The suffix portion of the join has a cardinality of K X #PROJECTED per group
  • To ensure that aggregates always end up operating over the same input, we must run them as early as possible
  1. Satisfy the cheapest variable of the row, adding it to the prefix
  2. Find all rows that match the new prefix
  3. For each row, repeat from 1
  4. If satisfying this variable completely fills the grouping of an aggregate: A. Compute it B. Satisfy its output variable C. Repeat from 4 D. If the aggregate output variable was initially bound, post-filter E. Repeat from 1

Other stuff

  • I don't think you're going to need this, but I wrote it down anyway

To build a dependency graph of a query:

  • All nodes depend on any variable they use and do not provide

  • Any node which may provide a variable only does so if no other node has

    • If a node that may provide a variable does not do so, it may filter that variable
  • Any node which may depend on a variable does so if it has been provided by another node

  • Node types:

    • A Record may provides all of its variables
    • An Action provides its entity variable if it is unbound
    • A Not may depend on any variable utilized by nodes in its body
    • A Union/Choose may provide any variable marked as an output and may depend on any variable utilized by nodes in its body
      • Variables marked output which are not provided by the Union/Choose must post-filter
    • An Expression/Aggregate may provides any variable marked output or optional
      • Variables marked output which are not provided by the Expression/Aggregate must post-filter
  • A node contributes to the cardinality of any variable it may expand or filter

Ordering a dependency graph:

  1. Pre-sort nodes based on heuristics of your choice (preferably favoring deterministic orderings)
  2. If no nodes remain, you're done
  3. For each node with all dependencies satisfied: A. If the node has subqueries, order them B. If the node is capable of providing any un-provided variables, satisfy them C. If the node must provide an already provided variable, split it into itself + a post-filter via a temporary variable D. If a node contributes to a variable, note its contribution E. If all contributors for a contributed variable have been ordered, satisfy it
  4. If no new nodes have been ordered, the graph is unorderable
  5. Repeat from 2
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment