Skip to content

Instantly share code, notes, and snippets.

@twitu
Last active October 27, 2022 14:37
Show Gist options
  • Star 2 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save twitu/221c8349887cec0a83b395e4cbb492a7 to your computer and use it in GitHub Desktop.
Save twitu/221c8349887cec0a83b395e4cbb492a7 to your computer and use it in GitHub Desktop.
Summary of how to architect a query compiler

This paper describes the various techniques that are used to create a query compiler that converts high-level (declarative) logic into low-level optimized (imperative) logic. The paper argues that conventional methods for compiling queries i.e. template expanders have the following weaknesses -

  • Everything is compiled in a single big stage which makes the logic complex and difficult to modify/expand
  • Combinations cause an explosion of possible cases because each combination has to be checked with all others making it O(n^2)

The paper argues that having multiple stages that gradually lower the query through multiple DSLs each performing a specific type of transformation is a scalable and effective approach. The paper defines the two types of transformations.

  • Optimizations - when the transformation occurs at the same level i.e. the construct in a DSL is expressed differently but within the same DSL.
  • Lowerings - when a transformation converts a higher DSL to a lower DSL

The paper also establishes two principles that help determine what kind of DSL is needed when and where does each type of tranformation fit in. They are -

  • Extensibility principle - A low level DSL should have greater than or equal to expressiveness of a high level DSL
  • Transformation cohesion principle - There should be a unique path lowering a high-level DSL to a low-level DSL. This also prevents loops between high and low level DSLs.

Here a higher DSL is one which is more declarative while the lower one is more imperative. The trade-offs between them are as follows.

High level Low level
smaller search space more granularity
query level optimizations memory and data structure like optimizations
simple to parallelize more predictable performance

This paper also introduces key terminologies with examples and uses that are covered in the below table.

Name Meaning Category Advantage
loop fusion multiple loops are "fused" into a single loop which performs all of their combined operations optimization better cache locality
collection programming a practice where logic is defined using generic operations on data types e.g. Spark programming style declarative APIs best used for query frontend
annotations Used to "annotate" UDFs with side effects programming style allows compiler to reason about them and optimize them
Abstract Sytax Tree (AST) Tree like representation of query IR allows join reordering like optimizations
Common Subexpression Elimiation (CSE) DAG for sharing leaves between subtrees which represent variables and value assignment IR Allows data flow analysis
Administrative Normal Form (ANF) Canonical representation for a program Notation Provides CSE and allows data flow analysis

The paper demonstrates how these different transformations and heuristics come together to create a query compiler for a portion of SQL.

Questions

  • Q. Phase ordering problem is all about ordering different kinds of optimizations to find the best possible optimization. This is a difficult problem because some orderings of optimizations can get stuck in a local minima and it is very difficult to find the best one. Does the egg: Fast and Extensible Equality Saturation solve this? Where does this technique fit into this argument?

    • A. As answered by the author, phase ordering problem are much more relevant to low level optimizations. Most of the optimizations in this paper of a high level in nature. However the one of the advantages of having multiple DSLs which progressively become low level, phase ordering and equality saturation can be introduced in a stage wherever appropriate.
  • Q. Are there any downsides to multi-stage DSL approach? This looks like a universally better solution? What might be the reasons for it not being adopted earlier?

    • A. As answered by the author, some downsides of this technique is that its very hard to get right. Any mistakes in lowering can result in changing or even breaking the semantics of the program. Further recommended reading on this question is How to architect a query compiler revisited
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment