twitu/how-to-architect-a-query-compiler.md

## how-to-architect-a-query-compiler.md

      
    Raw
  

              how-to-architect-a-query-compiler.md
            
          
    How to architect a query compiler

This paper describes the various techniques that are used to create a query compiler that converts high-level (declarative) logic into low-level optimized (imperative) logic. The paper argues that conventional methods for compiling queries i.e. template expanders have the following weaknesses -

Everything is compiled in a single big stage which makes the logic complex and difficult to modify/expand
Combinations cause an explosion of possible cases because each combination has to be checked with all others making it O(n^2)

The paper argues that having multiple stages that gradually lower the query through multiple DSLs each performing a specific type of transformation is a scalable and effective approach. The paper defines the two types of transformations.

Optimizations - when the transformation occurs at the same level i.e. the construct in a DSL is expressed differently but within the same DSL.
Lowerings - when a transformation converts a higher DSL to a lower DSL

The paper also establishes two principles that help determine what kind of DSL is needed when and where does each type of tranformation fit in. They are -

Extensibility principle - A low level DSL should have greater than or equal to expressiveness of a high level DSL
Transformation cohesion principle - There should be a unique path lowering a high-level DSL to a low-level DSL. This also prevents loops between high and low level DSLs.

Here a higher DSL is one which is more declarative while the lower one is more imperative. The trade-offs between them are as follows.


High level
Low level


smaller search space
more granularity


query level optimizations
memory and data structure like optimizations


simple to parallelize
more predictable performance


This paper also introduces key terminologies with examples and uses that are covered in the below table.


Name
Meaning
Category
Advantage


loop fusion
multiple loops are "fused" into a single loop which performs all of their combined operations
optimization
better cache locality


collection programming
a practice where logic is defined using generic operations on data types e.g. Spark
programming style
declarative APIs best used for query frontend


annotations
Used to "annotate" UDFs with side effects
programming style
allows compiler to reason about them and optimize them


Abstract Sytax Tree (AST)
Tree like representation of query
IR
allows join reordering like optimizations


Common Subexpression Elimiation (CSE)
DAG for sharing leaves between subtrees which represent variables and value assignment
IR
Allows data flow analysis


Administrative Normal Form (ANF)
Canonical representation for a program
Notation
Provides CSE and allows data flow analysis


The paper demonstrates how these different transformations and heuristics come together to create a query compiler for a portion of SQL.
Questions


Q. Phase ordering problem is all about ordering different kinds of optimizations to find the best possible optimization. This is a difficult problem because some orderings of optimizations can get stuck in a local minima and it is very difficult to find the best one. Does the egg: Fast and Extensible Equality Saturation solve this? Where does this technique fit into this argument?

A. As answered by the author, phase ordering problem are much more relevant to low level optimizations. Most of the optimizations in this paper of a high level in nature. However the one of the advantages of having multiple DSLs which progressively become low level, phase ordering and equality saturation can be introduced in a stage wherever appropriate.


Q. Are there any downsides to multi-stage DSL approach? This looks like a universally better solution? What might be the reasons for it not being adopted earlier?

A. As answered by the author, some downsides of this technique is that its very hard to get right. Any mistakes in lowering can result in changing or even breaking the semantics of the program. Further recommended reading on this question is How to architect a query compiler revisited
High level	Low level
smaller search space	more granularity
query level optimizations	memory and data structure like optimizations
simple to parallelize	more predictable performance
Name	Meaning	Category	Advantage
loop fusion	multiple loops are "fused" into a single loop which performs all of their combined operations	optimization	better cache locality
collection programming	a practice where logic is defined using generic operations on data types e.g. Spark	programming style	declarative APIs best used for query frontend
annotations	Used to "annotate" UDFs with side effects	programming style	allows compiler to reason about them and optimize them
Abstract Sytax Tree (AST)	Tree like representation of query	IR	allows join reordering like optimizations
Common Subexpression Elimiation (CSE)	DAG for sharing leaves between subtrees which represent variables and value assignment	IR	Allows data flow analysis
Administrative Normal Form (ANF)	Canonical representation for a program	Notation	Provides CSE and allows data flow analysis