The architecture of all query evaluators can be described in the following fashion. The types in each case may be more or less precise, representations may vary and be more or less first-class, and constraints may be statically or dynamically represented, but to a first approximation, the architectures all look like this.
At the highest level, we can consider a query evaluator to be a function:

```scala
type Evaluator = Query => F[DataOut]
```

…where `F` is some effect context (such as IO against a filesystem). Moving to a slightly greater level of detail, we can enrich this with a further refinement:
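As a minimal sketch of this top-level shape, the following models `F` as a deferred computation and fills in hypothetical placeholder types (`Query`, `DataOut` as defined here are illustrative, not part of any real evaluator):

```scala
// A toy effect context: a deferred computation standing in for IO.
type F[A] = () => A

// Hypothetical placeholder types, purely for illustration.
case class Query(text: String)
case class DataOut(rows: List[String])

// The evaluator as a plain function from query to effectful output.
type Evaluator = Query => F[DataOut]

// A trivial instance that "evaluates" by echoing the query text.
val echo: Evaluator = q => () => DataOut(List(q.text))
```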
```scala
type Evaluator = Query => (IR1 => F[DataIn]) => F[DataOut]
```

`IR1` is "intermediate representation 1". This may of course be `Unit`, or could be something richer. Broadly speaking, it just means "the information we pass down to some underlying system, which is used to effectfully produce some data". Of course, if we consider `Query` to be a form of intermediate representation, then this deconstruction can be performed recursively:
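A sketch of this enriched shape, under the same assumptions as before (the types and the `/data/` path scheme are invented for illustration): the evaluator is handed a function which, given an `IR1`, effectfully produces the input data.

```scala
type F[A] = () => A

case class Query(table: String)   // hypothetical
case class IR1(path: String)      // what we pass down to the underlying system
case class DataIn(bytes: String)
case class DataOut(rows: List[String])

type Evaluator = Query => (IR1 => F[DataIn]) => F[DataOut]

// Compile the query to an IR1, delegate the read, then post-process the result.
val eval: Evaluator = q => read => () => {
  val raw = read(IR1(s"/data/${q.table}"))()   // effectful fetch via the lower layer
  DataOut(raw.bytes.split(',').toList)          // "engine" work on the raw data
}
```

Note that the caller supplies the reading function, which makes it easy to swap in a fake for testing.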
```scala
type Evaluator[Q] = Q => Evaluator[IR1] => F[DataOut]
```

…for some `IR1`. Thus, all evaluators are composed of nested sub-evaluators, each taking its own further-simplified representation of the query (or, more likely, subquery), and each yielding some fragment of data. Also note that you can scale this horizontally as well when you wish to have some notion of joining between multiple subquery results:

```scala
type Evaluator[Q] = Q => List[Evaluator[IR1]] => F[DataOut]
```
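A recursive type like this needs a proper trait rather than a type alias in Scala. The sketch below simplifies by taking the sub-evaluators as already partially applied (i.e. as `F[DataOut]` thunks), and uses a plain "union all" as the joining semantic; all names here are hypothetical:

```scala
type F[A] = () => A
case class DataOut(rows: List[String])

// A recursive evaluator shape: a query plus sub-results, yielding joined output.
trait Evaluator[Q] {
  def apply(q: Q, subs: List[F[DataOut]]): F[DataOut]
}

// One possible joining semantic: concatenate all sub-results in order.
val unionAll: Evaluator[String] =
  (q, subs) => () => DataOut(subs.flatMap(_().rows))
```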
At some final nesting point, `Evaluator` is simply instantiated with a function of the form `Path => F[Bytes]`, which presumably blindly reads from some storage.
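This base case can be sketched as follows, with the storage faked by an in-memory map (the path and contents are invented for the example):

```scala
type F[A] = () => A
type Path = String
type Bytes = Array[Byte]

// A stand-in for real storage; a production evaluator would hit a filesystem
// or object store here instead.
val fakeStore: Map[Path, Bytes] =
  Map("/data/users" -> "alice,bob".getBytes("UTF-8"))

// The bottom-most "evaluator": a blind, effectful read by path.
val read: Path => F[Bytes] = p => () => fakeStore(p)
```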
The intermediate representations here are significant, because they speak to a pair of elements which must be internal to each evaluator. Specifically:

```scala
type Compiler[Q] = Q => (IR1, IR2)
type Engine = (DataIn, IR2) => F[DataOut]
```

These two functions exist internally, in some form, in each nested evaluator. Also note that the representations here, as in all cases, are conceptual. For example, it is not at all uncommon for `IR1` to be a subtree of `IR2`, thus allowing for static guarantees regarding data flow and a more natural recursive substructure to the evaluator. In such a case, `Compiler[Q]` might be something more like `Q => IR2[IR1]`, which is still a pairing, but in a different representation.
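The internal split can be made concrete with a small sketch: a `Compiler` producing the `(IR1, IR2)` pair, and an `Engine` consuming the fetched data together with `IR2`. Composing the two recovers the evaluator shape from earlier (all types and the path scheme are, again, hypothetical):

```scala
type F[A] = () => A

case class Query(table: String, filter: String => Boolean)
case class IR1(path: String)               // handed down to the lower layer
case class IR2(filter: String => Boolean)  // retained for the engine itself
case class DataIn(lines: List[String])
case class DataOut(rows: List[String])

type Compiler = Query => (IR1, IR2)
type Engine   = (DataIn, IR2) => F[DataOut]

val compiler: Compiler = q => (IR1(s"/data/${q.table}"), IR2(q.filter))
val engine: Engine = (in, ir2) => () => DataOut(in.lines.filter(ir2.filter))

// Composition: compile, fetch via the supplied reader, then run the engine.
def evaluator(q: Query)(read: IR1 => F[DataIn]): F[DataOut] = () => {
  val (ir1, ir2) = compiler(q)
  engine(read(ir1)(), ir2)()
}
```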
In the general case, the performance of each evaluator is proportional to the sizes of `DataIn` and `DataOut`, or at the very least, to the larger of the two. For evaluators which have multiple sub-evaluators, the joining semantic between the multiple `DataIn`s will dictate the order of the proportionality on the whole. For example, an evaluator which performs a full cartesian product between its input datasets would be polynomial in its `DataIn`s, where the (constant) exponent is the number of nested sub-evaluators.
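The cartesian case is easy to see concretely: the output size is the product of the input sizes, so with *k* sub-evaluators of size *n* each, the cost is O(n^k). A two-input sketch:

```scala
// A full cartesian product between two sub-results: every pairing is emitted,
// so the output has size a.size * b.size.
def cartesian(a: List[String], b: List[String]): List[(String, String)] =
  for { x <- a; y <- b } yield (x, y)
```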
When `DataIn` and `DataOut` are particularly small, the performance overhead of the `Compiler` function will dominate. However, in most cases, `Engine` dominates, often by orders of magnitude. For this reason, in nearly all evaluators, the overhead associated with the `DataIn` and `DataOut` representations dictates the performance of the whole.