My ideas for compiler source spans

The problem

Information is lost when going from concrete syntax to AST, and information also gets lost when doing operations on that AST.

So I have a novel approach to compiler source spans that’s been on my mind for years. I haven’t exactly implemented it, though code for it does appear in my mess of dhall-purescript, my project to create an interactive structural editor plus Dhall implementation. But it isn’t exactly integrated into producing useful output yet.

Anyways.

The Solution

My opinion on compiler source spans is that:

  • Every AST node should be associated with at least one source span.
  • Every AST node should be able to have more than one source span.
  • Source spans should track not just the actual literal position in source, but information about how it was derived from positions in the source.

Essentially source spans are going to represent traces of how the information moves through the compiler, starting from source that the user enters (either as text or as an AST already), and then going to whatever output the compiler is asked to produce.

I think it might be rather expensive and memory heavy to track, but no matter!
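
As a rough sketch of the first two bullets (Haskell-ish, with names I’m making up here, not the actual dhall-purescript types): every node in the annotated tree carries a non-empty collection of annotations, so it always has at least one and can accumulate more.

```haskell
import Data.List.NonEmpty (NonEmpty)

-- Hypothetical sketch: 'f' is one layer of syntax, 'ann' is whatever we
-- track per node (plain spans for now, derivations later).
data Tagged ann f = Tagged
  { tags :: NonEmpty ann       -- at least one, possibly several
  , node :: f (Tagged ann f)   -- the syntax layer, with tagged children
  }

-- A literal position in the source text; the third bullet generalizes this
-- to "a position plus how the information was derived from it".
data Span = Span { spanStart :: Int, spanEnd :: Int }
  deriving (Eq, Show)
```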

Example of how it works

In dhall-purescript, there are a few key operations that happen on AST nodes:

  • Dealing with context (i.e. substituting and shifting variables into/out of scope … I haven’t quite settled on the best way to represent these operations abstractly yet)
  • Type checking
  • Normalizing
  • Unification, as part of type checking
  • Getting an AST node nested inside another one

(Dhall doesn’t currently have any unification variables; that’s another discussion. But light forms of unification occur to make sure that types line up, of course.)

Executing each of these operations on an AST node type that supports tagging will produce a new AST node tagged with a label that says something like new = operation(old), in addition to whatever tags come from the pipeline. For example, if you typecheck the node that represents the top level of the program, the result will come tagged saying that it is the type of the top level of the program. Does that make sense?

So I essentially have a sum type/variant that has the following things:

  • "within": Position
  • "typecheck"
  • "normalize"
  • "alphaNormalize"
  • context operations (provisionally they are "substitute", which substitutes everything known in context, and "shift", which only targets one variable, but that is subject to change)

Each source span is really a derivation that starts at some original source span and then follows a bunch of the above derivations to arrive at a final AST node, and along the way, other source spans may occur as a result of unification and the redundancy of typed functional programming.
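
In Haskell-ish terms (hypothetical names again, not the real encoding), that variant and the “derivation” idea look roughly like this, continuing the sketch above:

```haskell
-- One derivation step, mirroring the variant described above.
data Step
  = Within ChildPosition   -- descend into a child of an AST node
  | Typecheck              -- "this is the type of what came before"
  | Normalize
  | AlphaNormalize
  | Substitute             -- substitute everything known in context
  | Shift                  -- shift a single variable (provisional)
  deriving (Eq, Show)

-- Which child we descended into (e.g. domain vs. codomain of a function type).
newtype ChildPosition = ChildPosition Int
  deriving (Eq, Show)

-- A "source span" is really a derivation: an original position in the text,
-- plus the trace of operations that led from it to the current node.
data Derivation = Derivation
  { original :: Span
  , steps    :: [Step]
  }
  deriving (Eq, Show)
```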

Unification

You’ll notice that unification isn’t on the list above. I honestly don’t remember why that is? Maybe I just don’t unify things in my current implementation. I think I was working on factoring it out but didn’t get to it.

But conceptually unification would be an operation that takes two ASTs, ensures they have the same shape, and produces an AST that not only tells you that it is the result of unifying those two ASTs, but also combines that with the information that those AST nodes were already tagged with. That means that instead of a list of those source span derivations, we need a tree … yikes.

Maybe I opted for just directly combining the source spans, and not representing unification as its own operation, which is probably smart.
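
Sketching both options in terms of the hypothetical Derivation type above: either unification becomes an explicit node and the derivation turns into a tree, or unifying two tagged nodes just merges their tag sets, which is the “directly combining” route.

```haskell
import Data.List.NonEmpty (NonEmpty)

-- Option 1: record unification explicitly, turning derivations into a tree.
data DerivationTree
  = Leaf Derivation
  | Unified DerivationTree DerivationTree
  deriving (Eq, Show)

-- Option 2: skip the tree and simply concatenate the two nodes' tag sets.
unifyTags :: NonEmpty Derivation -> NonEmpty Derivation -> NonEmpty Derivation
unifyTags = (<>)
```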

Side benefits

You could track desugaring steps in source spans, and use that to reconstruct a more concrete syntax. You’ll have to deal with too much information and not enough information. Basically you’ll need some Monoid to combine the results. You’re guaranteed to have some information available that relates back to some source span, but it might not directly be a source span (and instead be indirectly derived from a source span), so you’ll have to handle the case where it doesn’t exist. And you might get contradictory results, if things from different source spans are unified.
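
One way to phrase that Monoid (a hedged sketch, not anything implemented): the information recovered for a given position is either missing, a single agreed-upon value, or a set of conflicting values, and combining results just upgrades between those cases.

```haskell
-- Hypothetical sketch of recovered concrete-syntax information for one spot:
-- it may be absent, agreed upon, or contradictory after unification.
data Recovered a = Missing | Exactly a | Conflicting [a]
  deriving (Eq, Show)

instance Eq a => Semigroup (Recovered a) where
  Missing        <> y              = y
  x              <> Missing        = x
  Exactly a      <> Exactly b
    | a == b                       = Exactly a
    | otherwise                    = Conflicting [a, b]
  Exactly a      <> Conflicting bs = Conflicting (a : bs)
  Conflicting as <> Exactly b      = Conflicting (as ++ [b])
  Conflicting as <> Conflicting bs = Conflicting (as ++ bs)

instance Eq a => Monoid (Recovered a) where
  mempty = Missing
```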

I’m pretty sure you can track usages of things in the source code and come up with some sort of “topology” of the source code based on how things are unified. But that’s not something I’ve thought about enough to get a sense for how practical it is.

Difficulties

As I mentioned above, memory and performance might not be good at all. Especially because it sort of produces a combinatorial explosion. Maybe I should instead do some Free structure that represents what’s happening better, e.g. if there’s like a cartesian product, you could save {a,b}x{c,d} instead of {ac,ad,bc,bd} … I don’t know. I have not been trying to optimize it yet.
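
One possible shape for that Free-ish structure (purely hypothetical): keep unions and products symbolic, and only expand them if a consumer really needs every combination.

```haskell
-- Hypothetical sketch: keep {a,b} x {c,d} as a product instead of eagerly
-- expanding it to {ac, ad, bc, bd}.
data SpanSet s
  = One s
  | Union [SpanSet s]              -- alternatives, kept un-expanded
  | Times (SpanSet s) (SpanSet s)  -- cartesian product, kept un-expanded
  deriving (Eq, Show)

-- Expand only on demand; each result is a combination of base spans.
expand :: SpanSet s -> [[s]]
expand (One s)     = [[s]]
expand (Union xs)  = concatMap expand xs
expand (Times l r) = [a ++ b | a <- expand l, b <- expand r]
```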

It is very tricky to find the right operations, especially around handling context. (Dhall is specified using very granular operations for shift and substitute, but there needs to be a higher-level description.) It is also tricky because it seems ridiculous to make the source spans reference AST nodes, so you need some other way to describe what values you substituted …

The combinatorial explosion also poses a problem when you’re trying to extract useful information from the deluge of source spans. I think there should be some sense of ordering to the operations, so the most direct description of the position of the node occurs first. But other than that it’s the wild west. I think the user should probably be shown the “default” first source span, and then have the option to scroll through other source spans to see where information is coming from.

Also, because there’s a lot more information floating around, you have a lot more choices to make when designing the compiler. I actually, for this reason, started having typechecking produce less normalized terms, because they can always be cleaned up with normalization later, but source span information is better preserved if there is less normalization happening, imo.

Conclusion

This is just a mess of ideas; nothing’s figured out, but the possibilities are exciting! Let me know if you have responses, either in the comments below or on Slack or Zulip.

I think it could be great for users to have more insight into what the compiler is doing. I envision a conversation like this:

Why is it saying it can’t match Integer with String?

looks at source spans/derivation, which show that Integer is coming from trying to annotate a function with inferred type String -> String as Integer -> String, which itself is from a definition somewhere else

Oh, I remember now, I changed my pretty printer to take an int instead of a string for indentation. I can fix this now!

mission accomplished

Worked example

I’ll use emoji to denote positions. Sorry if it looks horrible; I’m on a Mac:

  • 0️⃣-1️⃣7️⃣: source spans
  • 🔼: get type of what came before
  • ⏭: normalize what came before
  • ⏪⏩: get left/right side of an AST node
  • *️⃣: a particular substitution
  • ➕: combination of two positions that co-occur as a result of unification

Context shifts are omitted.

0️⃣ let Predicate = 1️⃣ \(a : 2️⃣ Type) -> 3️⃣ (4️⃣ a -> 5️⃣ Boolean) in
6️⃣ let f : 7️⃣ (8️⃣ Predicate 9️⃣ Natural) = 1️⃣0️⃣ \(a : 1️⃣1️⃣ Natural) -> 1️⃣2️⃣ (1️⃣3️⃣ Natural/isZero 1️⃣4️⃣ a) in
1️⃣5️⃣(1️⃣6️⃣ f 1️⃣7️⃣ 0)

Let’s assume that Predicate : Type -> Type typechecks and walk through why f typechecks:

  • First we need to infer the type of 1️⃣0️⃣.
  • We look at 1️⃣1️⃣ Natural and see that it has normalized type 1️⃣1️⃣🔼⏭ Type, which looks good since it is a universe.
  • So we push 1️⃣0️⃣ a : 1️⃣1️⃣ Natural onto the context.
  • Now we look at 1️⃣2️⃣, and we see we need the type of 1️⃣3️⃣, which is a builtin, so it has a type 1️⃣3️⃣🔼 Natural -> Boolean with no other source span (or a special “builtin” source span). Normalized it is the same, of course, so this is indeed a function type, in particular its input type is 1️⃣3️⃣🔼⏭⏪ Natural and its output is 1️⃣3️⃣🔼⏭⏩ Boolean.
  • Now we need to apply it to 1️⃣4️⃣ a, which we see has type 1️⃣4️⃣🔼➕1️⃣1️⃣ Natural based on context, so it matches up with the input (technically this could produce a unified node 1️⃣3️⃣🔼⏭⏪➕1️⃣4️⃣🔼⏭➕1️⃣1️⃣⏭ Natural, but this is not used anywhere???).
  • The output is therefore 1️⃣3️⃣🔼⏭⏩ Boolean and the function overall has type 1️⃣0️⃣🔼 (1️⃣1️⃣ Natural -> 1️⃣3️⃣🔼⏭⏩ Boolean).
  • Now we need to see if this unifies with 7️⃣ (8️⃣ Predicate 9️⃣ Natural). So we compute 7️⃣⏭ via some reductions:
    • 8️⃣ Predicate ⟶ 8️⃣⏭➕1️⃣⏭ \(a : 2️⃣⏭ Type) -> 3️⃣⏭ (4️⃣⏭ a -> 5️⃣⏭ Boolean)
    • 7️⃣ (8️⃣ Predicate 9️⃣ Natural) ⟶ 7️⃣⏭➕3️⃣⏭*️⃣ (9️⃣⏭➕4️⃣⏭*️⃣ Natural -> 5️⃣⏭*️⃣ Boolean) (under the substitution *️⃣ meaning a = 9️⃣ Natural).
  • It does unify, so the result is that f has type 7️⃣ (8️⃣ Predicate 9️⃣ Natural). (Again, there’s a potential unification node with span 7️⃣⏭➕3️⃣⏭*️⃣➕1️⃣0️⃣🔼 (9️⃣⏭➕4️⃣⏭*️⃣➕1️⃣1️⃣ Natural -> 5️⃣⏭*️⃣➕1️⃣3️⃣🔼⏭⏩ Boolean), but it isn’t used here, because we return the least normalized thing.)
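
Tying this back to the earlier sketch: each emoji string is just an original position plus a path of derivation steps. For example, 1️⃣3️⃣🔼⏭⏩ Boolean would be represented roughly like this (hypothetical encoding, with made-up offsets for position 1️⃣3️⃣):

```haskell
-- 1️⃣3️⃣🔼⏭⏩: "the right side of the normalized type of what was written at
-- position 13", reusing the Step/Derivation/Span sketch from earlier.
booleanSpan :: Derivation
booleanSpan = Derivation
  { original = Span 120 135                                 -- stand-in for 1️⃣3️⃣
  , steps    = [Typecheck, Normalize, Within (ChildPosition 1)]
  }
```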

Holey Inference

Some ideas for a new style of inference for Dhall, not based on bidirectional typing, but instead on a kind of bottom-up unification algorithm.

First the AST is elaborated into a form where all missing annotations are given some unique ID. This could be a counter or a pointer to the node in the AST.

There are a few main ideas:

  • Holes can be drilled down/refined using standard and nonstandard operations (e.g. “type of this node”, “domain of pi type”, “result of dot access”)
  • Refined data is kept in a map from the id of the original hole to the scaffold of its new structure
  • This process of refining holes (usually) generates substitutions

Essentially the hole refinements are kept as a sort of eta-expansion, based on the hole paths. So if a hole is expected to be a function type, instead of being #0, it becomes #0.domain -> #0.codomain, or something.
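
A minimal Haskell sketch of that refinement (hypothetical names and shapes, not a real implementation): a hole reference is the original hole’s ID plus a path of projections, and refinements live in a map from hole ID to the scaffold that hole has grown so far.

```haskell
import Data.Map (Map)
import qualified Data.Map as Map

-- The unique ID a missing annotation received during elaboration.
newtype HoleId = HoleId Int
  deriving (Eq, Ord, Show)

-- Ways to drill into a hole: "domain of this pi type", "type of this node",
-- "result of dot access", and so on.
data Projection = Domain | Codomain | TypeOfNode | Field String
  deriving (Eq, Ord, Show)

data HoleRef = HoleRef HoleId [Projection]
  deriving (Eq, Ord, Show)

-- What a hole has been refined to so far.
data Scaffold
  = Unknown HoleRef
  | Pi Scaffold Scaffold
  | Record (Map String Scaffold)
  deriving (Eq, Show)

-- Refining a hole that must be a function type: #0 becomes
-- #0.domain -> #0.codomain, recorded in the map of refinements.
refineToPi :: HoleId -> Map HoleId Scaffold -> Map HoleId Scaffold
refineToPi h = Map.insert h
  (Pi (Unknown (HoleRef h [Domain])) (Unknown (HoleRef h [Codomain])))
```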

Unifications are checked for compatibility via the refined hole shapes, so if a hole is requested to be both a function type and a record type, something has gone wrong.
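
Continuing that sketch, the compatibility check could look like a shape merge that fails when two refinements disagree:

```haskell
-- Merge two scaffolds for the same hole; incompatible shapes are an error.
mergeScaffold :: Scaffold -> Scaffold -> Either String Scaffold
mergeScaffold (Unknown _) s           = Right s
mergeScaffold s           (Unknown _) = Right s
mergeScaffold (Pi d1 c1)  (Pi d2 c2)  =
  Pi <$> mergeScaffold d1 d2 <*> mergeScaffold c1 c2
mergeScaffold (Record f1) (Record f2)
  | Map.keys f1 == Map.keys f2 =
      Record <$> sequenceA (Map.intersectionWith mergeScaffold f1 f2)
  | otherwise = Left "records with different fields"
mergeScaffold _ _ = Left "hole is used as both a function type and a record type"
```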

The idea is to keep this all commutative-applicative-style and build it up from bottom to top, with minimal changes to inference, just the process of filling holes in. At the end of the process, either some errors will have occurred, or we will have partial results for holes, meaning that typing is consistent if those holes can be filled in fully, or we will have full results for holes, meaning that typing is consistent and each hole has a unique solution which could then be substituted in. For partial results, the available information arising from all the unifications would be shown to the user so they can fill them in. In most cases only types will be able to be filled in automatically, so value-level holes will need user help.
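
The three outcomes described above could be summarized with a type along these lines (again just a sketch on top of the scaffold types):

```haskell
-- Possible outcomes of running the whole bottom-up pass.
data Outcome err solution
  = Errors [err]                   -- inconsistencies were found
  | Partial (Map HoleId Scaffold)  -- consistent, if these holes can be completed
  | Solved (Map HoleId solution)   -- every hole has a unique solution to substitute
  deriving Show
```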

It might be necessary to track the actual written record of unifications, for later analysis. E.g. a unification like T.a : T would signal that T = < a | ... >. Higher-order unification needs some thought too. Maybe this also means that hints could be given (“Did you forget record access here?”).

It’d be fun if we could automatically generate this from the typing rules!!

Algo Details

The whole process centers around unification, taking place in an appropriate monadic/parallel applicative context with error handling and state.

Instead of operating in the normal state monad, like conventional unification, state is combined quasi-monoidally – that is, unification state is combined where possible, and an error will be thrown if it is not possible to combine state from parallel branches of unification.

This potentially allows for multiple errors to be generated at once, e.g. from different fields of a record, instead of halting at the first error.
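
A minimal sketch of what “combined quasi-monoidally” could mean (hypothetical, not the real monad stack): each branch carries its accumulated errors, its state, and maybe a value; combining two branches merges the states with a partial operation, and a failed merge just becomes one more error rather than halting the other branch.

```haskell
-- Hypothetical: one branch of parallel unification.
data Branch st err a = Branch
  { errs  :: [err]     -- errors accumulated so far
  , state :: st        -- unification state (e.g. hole refinements)
  , value :: Maybe a   -- the result, if this branch succeeded
  }

-- Combine two branches, given a partial merge on states. Errors from both
-- sides are kept, so several fields of a record can fail independently.
combineWith
  :: (st -> st -> Either err st)
  -> Branch st err (a -> b) -> Branch st err a -> Branch st err b
combineWith merge (Branch e1 s1 f) (Branch e2 s2 x) =
  case merge s1 s2 of
    Left e   -> Branch (e1 ++ e2 ++ [e]) s1 Nothing
    Right s' -> Branch (e1 ++ e2) s' (f <*> x)
```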

Filling in structure

Basic cases

Overloaded cases

Substituting

Higher-order unification

We know that functions from Type are parametric, so we can do some fun things.

Eta-expansion???
