NOTE: I'm not necessarily advocating this as the way to make your compiler go fast. Priority 1 is to make
full recompilation as fast as possible, ideally on the order of 100 MB/s per core with everything in memory.
Once you get it that fast, there's probably no need for incremental techniques, even if you want to
recompile entire files in real-time as the programmer types in code. But these incremental algorithms are
interesting computer science, so you should learn about them, and they are certainly applicable elsewhere.

After watching an interview with Anders Hejlsberg last year in which he mentioned the incremental
compilation techniques they used in their new C# and TypeScript compilers, I spent a bit of time studying
their open source code (and it's great that MS now makes this stuff available for people to peruse).
The idea turns out to be conceptually simple. It takes minor variations on two well-known techniques,
function memoization and difference lists, and applies them to a hand-written recursive descent parser.
We will start with the lexical analyzer since it demonstrates all the essential ideas.

A lexer function takes a text and a text position as input and produces a token and an updated text position.
Let's apply memoization directly to the lexer function. Then we can certainly cache tokens and reuse them as
long as the text hasn't changed since it was first analyzed. In terms of speeding things up, this lets us
skip forward through the text by token-length strides, and we can do the memo table lookup without actually
inspecting the contents of the text. We only have to use the cursor positions to do the lookup.
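To make that concrete, here's a minimal sketch of the static-text version (not the TypeScript compiler's
actual code; Token, lexFrom and LexCache are just names made up for illustration):

    interface Token {
        kind: string;
        start: number;  // token's cursor interval [start, end) in the text
        end: number;
    }

    // The ordinary, non-incremental lexer, stubbed out here. In a real compiler this is
    // the hand-written scanner that inspects characters starting at pos.
    function lexFrom(text: string, pos: number): Token {
        // Toy rule for the sketch: a token is one space or a maximal run of non-spaces.
        let end = pos + 1;
        while (end < text.length && text[pos] !== " " && text[end] !== " ") end++;
        return { kind: "word", start: pos, end };
    }

    class LexCache {
        private memo = new Map<number, Token>();  // keyed by start position

        lex(text: string, pos: number): Token {
            const cached = this.memo.get(pos);
            if (cached !== undefined) {
                // For an unchanged text the cached token is always valid, so we hop
                // forward by token-length strides without looking at any characters.
                return cached;
            }
            const token = lexFrom(text, pos);
            this.memo.set(pos, token);
            return token;
        }
    }
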
But letting us memoize things for a static text isn't very useful. If we knew the text hadn't changed,
we wouldn't bother rescanning it. So how do we deal with changes?

The main idea is to use a purely functional representation of a text as a sequence of changes. Each
text change will be represented by a substring replacement which has an associated cursor interval.
A memoized result will be associated with a given text version. Suppose the current version has just
one change interval associated with it, and we're asked to lex from a given position. How do we
consult the memo cache? Assuming the position is before the change interval, we look it up, and if
there is a memoized result, we check whether its token interval overlaps the change interval. If they
don't overlap, we return the memoized result.
Actually, that isn't quite right. With character lookahead in the lexer, it's possible for the
intervals to be non-overlapping but for the result to still depend on text in the interval invalidated
by the change. We could associate with each memoized result another interval that represents the farthest
ahead the lexer looked in generating the result. But that isn't necessary if we know at design time that
the lexer uses at most k character lookahead. In that case, we know that if the token interval is [m, n],
the lexer cannot depend on any characters outside of [m, n + k]. So all we have to do is extend the interval
by k in the overlap check. Usually k is a small number like 1 or 2.
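In code, the reuse check against a single change might look like this, reusing the Token type from the
sketch above and assuming a statically known lookahead bound k (the Change type and the function name
are invented for illustration):

    interface Change {
        start: number;      // interval [start, end) replaced in the old text
        end: number;
        newLength: number;  // length of the replacement string
    }

    // A memoized token covering [m, n) can be reused only if the lexer cannot have seen
    // any changed character, i.e. [m, n + k) does not overlap the changed interval.
    function canReuseToken(token: Token, change: Change, k: number): boolean {
        const dependEnd = token.end + k;
        const overlaps = token.start < change.end && change.start < dependEnd;
        return !overlaps;
    }
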
The other thing to consider is what cursor position to use for the lookup. For a position before the
change interval, we use it as is. For a position after the change interval, we have to apply an offset
equal to the replacement string's length minus the change interval's length.
This extends directly to multiple simultaneous change intervals.
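A sketch of that adjustment, assuming the memo table is keyed by positions in the old text and reusing
the hypothetical Change type from above; for multiple changes you would sum the deltas of all changes
before the position:

    // Map a position in the new text back to the corresponding key in the old text.
    function oldPositionFor(pos: number, change: Change): number {
        // Before the change: positions are unaffected.
        if (pos <= change.start) return pos;
        // After the change: undo the offset (replacement length minus replaced length).
        const delta = change.newLength - (change.end - change.start);
        if (pos >= change.start + change.newLength) return pos - delta;
        // Inside the replacement: there is no old-text equivalent, but any memoized token
        // starting here overlaps the change and fails the reuse check anyway.
        return change.start;
    }
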
That's it for lexers. What about parsers? Same idea, at least for context-free parsers. We take a cursor
position as input and produce a result and an updated cursor position. We can handle lookahead in two ways,
as before. We can either associate the dynamic lookahead interval directly with the memoized result, or
we can pad the checked interval by a statically known lookahead bound. In an LL(k) parser, this means we'd
have to look at the k tokens after the memoized interval and extend the overlap check by their total length.
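Here's a sketch of the parser-level reuse check using the first option, a dynamically recorded lookahead
interval, rather than static padding (again, MemoizedNode and friends are invented names, not TypeScript
compiler internals):

    interface MemoizedNode {
        start: number;         // character interval covered by the parsed node
        end: number;
        lookaheadEnd: number;  // farthest position the parser inspected while producing it
        // ... plus the syntax node itself, the updated cursor position, etc.
    }

    // Reusable only if nothing the parse actually looked at fell inside the change.
    function canReuseNode(node: MemoizedNode, change: Change): boolean {
        return !(node.start < change.end && change.start < node.lookaheadEnd);
    }
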
Here are some links to places in the TypeScript parser code that show what I'm talking about:
https://github.com/Microsoft/TypeScript/blob/e9e7fcecbd346f18f98aba8c8f6def3e0dabfcd6/src/compiler/parser.ts#L1523
https://github.com/Microsoft/TypeScript/blob/e9e7fcecbd346f18f98aba8c8f6def3e0dabfcd6/src/compiler/parser.ts#L7123

To validate your understanding of these ideas, think through what happens when you make a small edit with highly
non-local effects, such as inserting or deleting a { or } in a language with curly brace block structure.

For the implementation of the memo table, use an array ordered by the text position key, not a hash table. Since
the parser traverses the text in linear order, it will attempt lookups in the memo table in linear order as well,
so this is a substantial optimization.
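A sketch of such an array-backed memo table; it assumes lookups arrive in non-decreasing position order,
and in practice you'd read from the previous run's table while appending results to a fresh one (the
names are invented):

    class OrderedMemoTable<T> {
        private keys: number[] = [];   // positions, kept in ascending order
        private values: T[] = [];
        private cursor = 0;            // index of the next candidate entry

        // Lookups come in non-decreasing position order, so instead of hashing we just
        // advance a cursor; a whole left-to-right traversal does O(n) total work here.
        find(pos: number): T | undefined {
            while (this.cursor < this.keys.length && this.keys[this.cursor] < pos) {
                this.cursor++;
            }
            if (this.cursor < this.keys.length && this.keys[this.cursor] === pos) {
                return this.values[this.cursor];
            }
            return undefined;
        }

        // Results are produced during the same left-to-right sweep, so appends also
        // arrive in position order and the array stays sorted for free.
        add(pos: number, value: T): void {
            this.keys.push(pos);
            this.values.push(value);
        }
    }
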
NOTE: This next part is not from Roslyn or the TypeScript compiler, but my own thinking on the matter.
What about type checkers and the rest of the compilation pipeline? That gets a little trickier, but only
a little bit. If you change a variable declaration, that can clearly have a non-local effect
after the change interval, and maybe even before it with something like Haskell's where clauses. The main idea
is to ensure that your memoized results explicitly express their input dependencies. Something like
typecheck_function() should certainly depend on the global symbol table, but to check whether a memoized
result for typecheck_function() can be reused, we need to know what global symbols it depends on, and how.
This could be expressed in different ways. One way is to have incrementing version numbers on every entity,
so that to validate a memoized result, we check every symbol dependency and make sure its current
version matches the version recorded when the memoized result was computed. Entity version numbers
are assigned from a global counter that is incremented when a new entity is created, like a virtual timestamp.
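Here's a rough sketch of what those version-stamped dependency records could look like (SymbolEntry,
MemoizedCheck and the rest are invented names, not from Roslyn or TypeScript):

    type TypeInfo = unknown;  // stand-in for whatever the checker produces

    let globalVersion = 0;    // virtual timestamp, bumped whenever an entity changes

    interface SymbolEntry {
        name: string;
        version: number;      // stamped from globalVersion when the entity is (re)defined
    }

    function touch(sym: SymbolEntry): void {
        sym.version = ++globalVersion;
    }

    interface MemoizedCheck {
        result: TypeInfo;
        // Every global symbol the check read, with the version it saw at the time.
        deps: { symbol: SymbolEntry; versionSeen: number }[];
    }

    // Reuse is valid only if every dependency is still at the version we recorded.
    function canReuseCheck(memo: MemoizedCheck): boolean {
        return memo.deps.every(d => d.symbol.version === d.versionSeen);
    }

When an edit forces a symbol to be redefined, touch() bumps its version, which invalidates any memoized
result that read it the next time that result's dependencies are validated.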
This is the same idea as dependency graphs in spreadsheets, but pull-based with a quick invalidation check
via version numbers rather than push-based as in most spreadsheet applications. Pull-based with memoization
is much easier to implement correctly because everything still goes through the normal recursive descent path
while letting us quickly skip parts of the text with cached results without consulting the text contents.
It should also remind you of incremental build systems where the version numbering is a robust form of
make-style timestamp checks. Memoized results expressing their input dependencies explicitly and using that to
drive the decision of whether to reuse results or rebuild is similar to how you can use gcc to generate .d
files containing make rules that record the dependencies between foo.o and the files the preprocessor
transitively included while compiling foo.c. Thus, between foo.c itself and the transitive dependencies
(usually header files) listed in foo.d, you have the complete set of dynamic dependencies.

Note that you shouldn't bother memoizing every little thing as long as some enclosing entity is memoized.
For example, you might choose to only memoize at the function level, not at the expression level. Not only
will fine-grained memoization almost certainly be a pessimization, not an optimization, but keeping track
of all the fine-grained input dependencies is needlessly tricky and error prone. Keeping things at the
granularity of top-level entities like functions is almost certainly the right trade-off.

Even if all you care about is memoizing later parts in the pipeline, you typically need to memoize the
earlier parts as well, like lexing and parsing, so that you have some stable identities (like the cached syntax
node pointer that's passed as an argument to a type checking function) to use as keys for the memoization.
It also makes the code a lot simpler since you don't have separate incremental and non-incremental code paths.
You just have a single compile_the_whole_thing() entry point that you always run, and that's that.

The alternative to shared-state notions like version numbers and identity-based memoization is to do everything
based on crypto-strength content hashing. (A hash function that isn't cryptographically strong can only be
used with confidence for early-out rejection, not for certifying equality.) This is conceptually very elegant,
and it has advantages in a distributed system where managing shared state like the version numbers and identity
pointers isn't feasible. Indeed, it's the same general idea used by Git and other distributed version control
systems.
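A sketch of that flavor, keying a cache on a SHA-256 hash of the inputs via Node's built-in crypto module
(compileUnit and actuallyCompile are hypothetical stand-ins for some stage of the pipeline):

    import { createHash } from "crypto";

    // Key = SHA-256 over the length-prefixed inputs, so equal keys can be trusted to mean equal inputs.
    function contentKey(inputs: string[]): string {
        const h = createHash("sha256");
        for (const input of inputs) {
            h.update(input.length + ":");  // length prefix so concatenation is unambiguous
            h.update(input);
        }
        return h.digest("hex");
    }

    const resultCache = new Map<string, string>();

    // The real (hypothetical) work for one compilation unit.
    declare function actuallyCompile(source: string, deps: string[]): string;

    function compileUnit(source: string, deps: string[]): string {
        const key = contentKey([source, ...deps]);
        const cached = resultCache.get(key);
        if (cached !== undefined) return cached;  // identical inputs seen before
        const output = actuallyCompile(source, deps);
        resultCache.set(key, output);
        return output;
    }
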