Information is lost when going from concrete syntax to AST, and information also gets lost when doing operations on that AST.
So I have a novel approach to compiler source spans that’s been on my mind for years. I haven’t exactly implemented it, though code for it does appear in my mess of dhall-purescript, which is my project to create an interactive structural editor plus dhall implementation. But it isn’t exactly integrated into producing useful output yet.
Anyways.
My opinion on compiler source spans is that:
- Every AST node should be associated with at least one source span.
- Every AST node should be able to have more than one source span.
- Source spans should track not just the actual literal position in source, but information about how it was derived from positions in the source.
Essentially source spans are going to represent traces of how the information moves through the compiler, starting from source that the user enters (either as text or as an AST already), and then going to whatever output the compiler is asked to produce.
I think it might be rather expensive and memory heavy to track, but no matter!
In dhall-purescript, there are a few key operations that happen on AST nodes:
- Dealing with context (i.e. substituting and shifting variables into/out of scope … I haven’t quite settled on the best way to represent these operations abstractly yet)
- Type checking
- Normalizing
- Unification, as part of type checking
- Getting an AST node nested inside another one
(Dhall doesn’t currently have any unification variables, that’s another discussion, but light forms of unification occur to make sure that types line up, of course.)
Executing any of these operations on an AST node type that supports tagging will produce a new AST node tagged with a label that says something like `new = operation(old)`, in addition to whatever tags come from the rest of the pipeline. For example, if you typecheck the node that represents the top level of the program, the result will come tagged saying that it is the type of the top level of the program. Does that make sense?
So I essentially have a sum type/variant that has the following things:
- "within": Position
- "typecheck"
- "normalize"
- "alphaNormalize"
- context operations (provisionally they are "substitute", which substitutes everything known in context, and "shift", which only targets one variable, but that is subject to change)
Each source span is really a derivation that starts at some original source span and then follows a bunch of the above derivations to arrive at a final AST node, and along the way, other source spans may occur as a result of unification and the redundancy of typed functional programming.
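To make the shape concrete, here is a minimal Haskell sketch of that variant plus the derivation chain (dhall-purescript is written in PureScript, but the types translate directly); all of the names here (`Span`, `Step`, `Derivation`, `applyStep`) are illustrative, not taken from the actual code:

```haskell
-- Illustrative sketch only: these names are not from dhall-purescript.

-- A literal position in the original source text.
data Span = Span { file :: String, startPos :: (Int, Int), endPos :: (Int, Int) }
  deriving (Eq, Show)

-- One step in how information moved through the compiler.
data Step
  = Within Int       -- descended into a child node at this position
  | Typecheck        -- this node is the inferred type of the tagged node
  | Normalize
  | AlphaNormalize
  | Substitute       -- substituted everything known in context
  | Shift String Int -- shifted a single named variable (subject to change)
  deriving (Eq, Show)

-- A full derivation: where the information originally came from, plus the
-- trail of operations that produced the current AST node from it.
data Derivation = Derivation { origin :: Span, steps :: [Step] }
  deriving (Eq, Show)

-- Applying an operation extends every derivation attached to a node,
-- which is how a node ends up carrying more than one source span.
applyStep :: Step -> [Derivation] -> [Derivation]
applyStep s = map (\d -> d { steps = steps d ++ [s] })
```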
You’ll notice that unification isn’t on the list above. I honestly don’t remember why that is. Maybe I just don’t unify things in my current implementation; I think I was working on factoring it out but didn’t get to it.
But conceptually unification would be an operation that takes two ASTs, ensures they have the same shape, and produces an AST that not only tells you that it is the result of unifying those two ASTs, but also combines that with the information those AST nodes were already tagged with. That means that instead of a list of those source span derivations, we need a tree … yikes.
Maybe I opted for just directly combining the source spans, and not representing unification as its own operation, which is probably smart.
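One way to see the trade-off is a hypothetical sketch of the two representations (type names mine, not from the actual code): recording unification as a first-class step forces every derivation to become a tree with two inputs, whereas merging the two nodes’ tag sets keeps each derivation flat.

```haskell
-- Hypothetical sketch; names are illustrative.

-- Option 1: unification as its own step makes derivations tree-shaped,
-- since a unified node has two parents.
data DerivTree span
  = Origin span                                -- an actual source span
  | Step String (DerivTree span)               -- "typecheck", "normalize", ...
  | Unified (DerivTree span) (DerivTree span)  -- two inputs, hence a tree
  deriving (Eq, Show)

-- Option 2 (the choice described above): skip the Unified constructor and
-- just merge the flat lists of derivations carried by the two nodes.
unifyTags :: [d] -> [d] -> [d]
unifyTags = (++)
```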
You could track desugaring steps in source spans, and use that to reconstruct a more concrete syntax. You’ll have to deal with both too much information and not enough information. Basically you’ll need some Monoid to combine the results. You’re guaranteed to have some information available that relates back to some source span, but it might not directly be a source span (and instead be indirectly derived from one), so you’ll have to handle the case where it doesn’t exist. And you might get contradictory results, if things from different source spans are unified.
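As a sketch of that Monoid idea (all names here are hypothetical): each derivation yields zero or one candidate surface forms, the list Monoid combines them, and the length of the combined result distinguishes the three cases.

```haskell
-- Hypothetical sketch; not from dhall-purescript.
import Data.List (nub)
import Data.Maybe (catMaybes)

-- Each derivation attached to a node may or may not yield a candidate
-- surface syntax; the list Monoid combines candidates across derivations.
resugarCandidates :: [Maybe String] -> [String]
resugarCandidates = nub . catMaybes

-- Interpreting the combined result:
--   []  -> not enough information: fall back to pretty-printing the AST
--   [x] -> exactly one consistent answer
--   xs  -> contradictory answers, e.g. from unifying different source spans
describe :: [String] -> String
describe []  = "no direct source form; derive one"
describe [x] = x
describe xs  = "conflict: " ++ show xs
```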
I’m pretty sure you can track usages of things in the source code and come up with some sort of “topology” of the source code based on how things are unified. But that’s not something I’ve thought about enough to get a sense for how practical it is.
As I mentioned above, memory and performance might not be good at all, especially because it sort of produces a combinatorial explosion. Maybe I should instead use some Free structure that represents what’s happening more compactly: e.g. if there’s a cartesian product, you could store `{a,b}×{c,d}` instead of `{ac, ad, bc, bd}` … I don’t know. I have not been trying to optimize it yet.
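A minimal sketch of that factored representation (again, hypothetical names): keep unions and products symbolic, and only expand to the full set of combinations on demand, e.g. when actually displaying spans to the user.

```haskell
-- Hypothetical sketch: keep span sets in factored form instead of
-- eagerly enumerating every combination.
data Spans a
  = One a                        -- a single span/derivation
  | Union (Spans a) (Spans a)    -- {a} combined with {b}
  | Product (Spans a) (Spans a)  -- {a,b} x {c,d}, stored unexpanded
  deriving (Eq, Show)

-- Expand only on demand; the Semigroup instance says how two
-- derivations combine when a product is finally enumerated.
expand :: Semigroup a => Spans a -> [a]
expand (One a)       = [a]
expand (Union x y)   = expand x ++ expand y
expand (Product x y) = [a <> b | a <- expand x, b <- expand y]
```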
It is very tricky to find the right operations, especially around handling context. (Dhall is specified using very granular operations for shift and substitute, but there needs to be a higher-level description.) It also seems ridiculous to make the source spans reference AST nodes, so you need some other way to describe what values you substituted …
The combinatorial explosion also poses a problem when you’re trying to extract useful information from the deluge of source spans. I think there should be some sense of ordering to the operations, so the most direct description of the position of the node occurs first. But other than that it’s the wild west. I think the user should probably be shown the “default” first source span, and then have the option to scroll through other source spans to see where information is coming from.
Also, because there’s a lot more information floating around, you have a lot more choices to make when designing the compiler. For this reason I actually started having typechecking produce less-normalized terms: they can always be cleaned up with normalization later, and source span information is better preserved when less normalization happens, imo.
This is just a mess of ideas, nothing’s figured out but the possibilities are exciting! But let me know if you have responses, either in the comments below or on Slack or Zulip.
I think it could be great for users to have more insight into what the compiler is doing. I envision a conversation like this:
> Why is it saying it can’t match `Integer` with `String`?
>
> *looks at the source spans/derivations, which show that `Integer` is coming from trying to annotate a function with inferred type `String -> String` as `Integer -> String`, which itself is from a definition somewhere else*
>
> Oh, I remember now, I changed my pretty printer to take an int instead of a string for indentation. I can fix this now!
>
> *mission accomplished*