The architecture of all query evaluators can be described in the following fashion. The types in each case may be more or less precise, representations may vary and be more or less first-class, and constraints may be statically or dynamically represented, but to a first approximation, the architectures all look like this.
At the highest level, we can consider a query evaluator to be a function:

```scala
type Evaluator = Query => F[DataOut]
```

…where `F` is some effect context (such as IO against a filesystem). Moving to a slightly greater level of detail, we can enrich this with a further refinement:
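As a minimal sketch of this top-level shape, the following models `F` as a deferred computation and fills in hypothetical placeholder types (`Query`, `DataOut` as defined here are illustrative, not part of any real evaluator):

```scala
// A toy effect context: a deferred computation standing in for IO.
type F[A] = () => A

// Hypothetical placeholder types, purely for illustration.
case class Query(text: String)
case class DataOut(rows: List[String])

// The evaluator as a plain function from query to effectful output.
type Evaluator = Query => F[DataOut]

// A trivial instance that "evaluates" by echoing the query text.
val echo: Evaluator = q => () => DataOut(List(q.text))
```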
```scala
type Evaluator = Query => (IR1 => F[DataIn]) => F[DataOut]
```

`IR1` is "intermediate representation 1". This may of course be `Unit`, or could be something richer. Broadly speaking, it just means "the information we pass down to some underlying system, which is used to effectfully produce some data". Of course, if we consider `Query` to be a form of intermediate representation, then this deconstruction can be performed recursively:
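A sketch of this enriched shape, under the same assumptions as before (the types and the `/data/` path scheme are invented for illustration): the evaluator is handed a function which, given an `IR1`, effectfully produces the input data.

```scala
type F[A] = () => A

case class Query(table: String)   // hypothetical
case class IR1(path: String)      // what we pass down to the underlying system
case class DataIn(bytes: String)
case class DataOut(rows: List[String])

type Evaluator = Query => (IR1 => F[DataIn]) => F[DataOut]

// Compile the query to an IR1, delegate the read, then post-process the result.
val eval: Evaluator = q => read => () => {
  val raw = read(IR1(s"/data/${q.table}"))()   // effectful fetch via the lower layer
  DataOut(raw.bytes.split(',').toList)          // "engine" work on the raw data
}
```

Note that the caller supplies the reading function, which makes it easy to swap in a fake for testing.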
```scala
type Evaluator[Q] = Q => Evaluator[IR1] => F[DataOut]
```

…for some `IR1`. Thus, all evaluators are composed of nested sub-evaluators, each taking its own further-simplified representation of the query (or, more likely, subquery), and each yielding some fragment of data. Also note that you can scale this horizontally as well when you wish to have some notion of joining between multiple subquery results:

```scala
type Evaluator[Q] = Q => List[Evaluator[IR1]] => F[DataOut]
```
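A recursive type like this needs a proper trait rather than a type alias in Scala. The sketch below simplifies by taking the sub-evaluators as already partially applied (i.e. as `F[DataOut]` thunks), and uses a plain "union all" as the joining semantic; all names here are hypothetical:

```scala
type F[A] = () => A
case class DataOut(rows: List[String])

// A recursive evaluator shape: a query plus sub-results, yielding joined output.
trait Evaluator[Q] {
  def apply(q: Q, subs: List[F[DataOut]]): F[DataOut]
}

// One possible joining semantic: concatenate all sub-results in order.
val unionAll: Evaluator[String] =
  (q, subs) => () => DataOut(subs.flatMap(_().rows))
```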
At some final nesting point, `Evaluator` is simply instantiated with a function of the form `Path => F[Bytes]`, which presumably blindly reads from some storage.
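This base case can be sketched as follows, with the storage faked by an in-memory map (the path and contents are invented for the example):

```scala
type F[A] = () => A
type Path = String
type Bytes = Array[Byte]

// A stand-in for real storage; a production evaluator would hit a filesystem
// or object store here instead.
val fakeStore: Map[Path, Bytes] =
  Map("/data/users" -> "alice,bob".getBytes("UTF-8"))

// The bottom-most "evaluator": a blind, effectful read by path.
val read: Path => F[Bytes] = p => () => fakeStore(p)
```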
The intermediate representations here are significant, because they speak to a pair of elements which must be internal to each evaluator. Specifically:

```scala
type Compiler[Q] = Q => (IR1, IR2)
type Engine = (DataIn, IR2) => F[DataOut]
```

These two functions exist internally, in some form, in each nested evaluator. Also note that the representations here, as in all cases, are conceptual. For example, it is not at all uncommon for `IR1` to be a subtree of `IR2`, thus allowing for static guarantees regarding data flow and a more natural recursive substructure to the evaluator. In such a case, `Compiler[Q]` might be something more like `Q => IR2[IR1]`, which is still a pairing, but in a different representation.
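The internal split can be made concrete with a small sketch: a `Compiler` producing the `(IR1, IR2)` pair, and an `Engine` consuming the fetched data together with `IR2`. Composing the two recovers the evaluator shape from earlier (all types and the path scheme are, again, hypothetical):

```scala
type F[A] = () => A

case class Query(table: String, filter: String => Boolean)
case class IR1(path: String)               // handed down to the lower layer
case class IR2(filter: String => Boolean)  // retained for the engine itself
case class DataIn(lines: List[String])
case class DataOut(rows: List[String])

type Compiler = Query => (IR1, IR2)
type Engine   = (DataIn, IR2) => F[DataOut]

val compiler: Compiler = q => (IR1(s"/data/${q.table}"), IR2(q.filter))
val engine: Engine = (in, ir2) => () => DataOut(in.lines.filter(ir2.filter))

// Composition: compile, fetch via the supplied reader, then run the engine.
def evaluator(q: Query)(read: IR1 => F[DataIn]): F[DataOut] = () => {
  val (ir1, ir2) = compiler(q)
  engine(read(ir1)(), ir2)()
}
```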
In the general case, the performance of each evaluator is proportional to the sizes of `DataIn` and `DataOut`, or at the very least, to the larger of the two. For evaluators which have multiple sub-evaluators, the joining semantic between the multiple `DataIn`s will dictate the order of the proportionality on the whole. For example, an evaluator which performs a full cartesian product between its input datasets would be polynomial in its `DataIn`s, where the (constant) exponent is the number of nested sub-evaluators.
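The cartesian case is easy to see concretely: the output size is the product of the input sizes, so with *k* sub-evaluators of size *n* each, the cost is O(n^k). A two-input sketch:

```scala
// A full cartesian product between two sub-results: every pairing is emitted,
// so the output has size a.size * b.size.
def cartesian(a: List[String], b: List[String]): List[(String, String)] =
  for { x <- a; y <- b } yield (x, y)
```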
When `DataIn` and `DataOut` are particularly small, the performance overhead of the `Compiler` function will dominate. However, in most cases, `Engine` dominates, often by orders of magnitude. For this reason, in nearly all evaluators, the overhead associated with the `DataIn` and `DataOut` representations dictates the performance of the whole.