Type inference that sticks

What if type inference felt like a real-time conversation, instead of emailing code back and forth?

With type inference, compilers are faced with a difficult task: intuiting what types a user had in mind (but didn't write down) while writing some code. It all falls apart when the types that the algorithm assigns are different from those the user expected, and especially so when the type error reported is far away from the site of mental mismatch.

Amnesiac type inference

Current type inference algorithms perform just as well (and just as poorly) on code entered & modified over time in an IDE as they do on a text file read from a disk. Or, for that matter, on code written with pen & paper and then fed to a compiler character by character via teletype.

That is to say, they have no knowledge of, or interest in, the process by which the code was written. The only feedback loop they recognize is receiving all of the code at once, and responding with the first error they come across, or if you're lucky, several independent errors. Fixing a given error, instead of being a context-sensitive conversation ("here's the resolution to that error, please continue"), is back to a clean-slate data dump of all the code (modulo some caching for performance, but caching does not -- must not -- change the semantics of type inference).

To annotate or not to annotate

When tracking down a hard-to-find type error, one solution is to start nailing things down: add in the types that you had in mind in the first place, and see where the compiler got tripped up. On the other hand, you could have just written in those types all along, and saved yourself the trouble.

So there's a tradeoff between brevity and verbosity in the language syntax. The more information included in the source code, the less guesswork the compiler / type checker has to do, and the less chance there is for a mismatch between the programmer's mental model and the computer's. But verbosity doesn't just mean more typing; it can also disrupt the flow of writing code. In an exploratory mode, the user doesn't necessarily know the types of a function's arguments before they have written its body. And there are readability tradeoffs on both sides of the spectrum: for experienced programmers, types written explicitly all over the place can be noisy and actually inhibit code comprehension, while those same annotations are a boon for someone unfamiliar with the codebase.

There are also times when inference algorithm limitations have negative impacts on other aspects of the language. Flow, Facebook's type checker for JavaScript, originally had global type inference, but in the interest of speed (especially on codebases as large as Facebook's), recent versions require you to annotate all functions exported from a file, even when those functions have trivially inferable types. OCaml's type inference covers most of the language, but type-abstracted modules (known as functors) generally cannot be inferred, and need explicit type annotations.

Roc, on the other hand, follows Elm's admirable commitment to keeping all type annotations optional, such that any correct program can have all of its annotations removed and still have the same semantics. The trade-off is that Roc doesn't support Rank-2 polymorphism (you can't pass a generic function as a function argument), because it wouldn't be fully inferable. Haskell, never one to compromise on power, lets you turn on "Rank-N" types, under which type inference is undecidable in the general case.

A sufficiently interactive compiler

With jerd, I'm trying to build a compiler that's "sufficiently interactive" instead of sufficiently smart. Instead of trying to reconstruct the programmer's intentions from a text-dump of source code, the compiler is involved in the entire process of development, both responding to programmer action and giving context-aware feedback in return. In this model, the source code is enriched with the compiler's inferences along the way, which the programmer can inspect & correct (but which can also be elided from display for a more compact presentation). Type inference is therefore incremental and reified in the program source, meaning that, once a function has been written, the compiler need make no guesses about types or the identities of referenced terms.

As a simple example, consider autocomplete. Isn't it funny that autocomplete can enable you to choose exactly what function you want to call, and what local variable you want to reference, but then when the compiler comes along all it has to go on is the autocompleted text? It has to duplicate all of the work autocomplete did, constructing local scopes and performing top-level name resolution, and if the term has since been renamed then it just falls over.

When typing an identifier in jerd, if you select something from autocomplete, the ID of that term is stuck onto the identifier node (e.g. myFn#4ea8ff). This way, the compilation pass doesn't need to do any name resolution -- it knows exactly what is being referenced. And if you later change the name of a variable, none of the references are broken. (In jerd, references are by a hash of the definition's syntax tree instead of by name, similar to what Unison does.)
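
To make this concrete, here's a minimal sketch in TypeScript of what sticking an ID onto an identifier node could look like. The node shape and helper names here are hypothetical illustrations, not jerd's actual representation:

// Hypothetical syntax-tree node, for illustration only.
type Identifier = {
  kind: "identifier";
  text: string;    // what the user sees, e.g. "myFn"
  termId: string;  // what the compiler resolves, e.g. "4ea8ff"
};

// Accepting an autocomplete suggestion sticks the term's ID onto the node.
function acceptSuggestion(text: string, termId: string): Identifier {
  return { kind: "identifier", text, termId };
}

// Renaming only changes the display text; every existing reference still
// resolves, because resolution goes through termId, never through the name.
function rename(node: Identifier, newText: string): Identifier {
  return { ...node, text: newText };
}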

Sticky type inference

Say you're writing a function with two arguments. Before the body is defined, the arguments have the universal type "𝕌".

The interface looks like this:

(defn movieFromLine [line idx]
    )

But the underlying data has reified ("sticky") type information for all identifiers, something like this:

(defn movieFromLine [line {id: 0, type: 𝕌}
                     idx  {id: 1, type: 𝕌}]
    )

Then, when you use one of the arguments, the type annotation for that argument gets updated, if possible, to the intersection of the current annotation and the type required by the new usage.

What you see:

(defn movieFromLine [line idx]
    (switch (split line "!")
        ))

What the compiler sees (again, this isn't inferred by the compiler, it's actually persisted in the source tree):

(defn movieFromLine [line {id: 0, type: string}
                     idx  {id: 1, type: 𝕌}]
    (switch (split {id: '#ead31a8',
                    type: (fn [string string] (array string))}
                line {id: 0}
                "!")
        ))

If there is no intersection, the new usage is marked as an error, but you're also given the option (via dropdown) to switch the annotation to align with your new usage, marking any other incompatible usages as erroneous.
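
As a rough sketch of the narrowing rule, here's a toy TypeScript model (the type representation and helper names are illustrative assumptions, not jerd's actual algorithm):

// A toy type representation, for illustration; it only models 𝕌,
// named types, and function types.
type Ty =
  | { kind: "universal" }                    // 𝕌: no usages seen yet
  | { kind: "named"; name: string }          // e.g. string
  | { kind: "fn"; args: Ty[]; result: Ty };

// Intersect two types: the broadest type compatible with both.
// This toy version only handles 𝕌 and identical named types.
function intersect(a: Ty, b: Ty): Ty | null {
  if (a.kind === "universal") return b;
  if (b.kind === "universal") return a;
  if (a.kind === "named" && b.kind === "named" && a.name === b.name) return a;
  return null; // no intersection
}

// On each new usage, either narrow the sticky annotation or flag an error --
// the point where the editor would offer the dropdown described above.
function recordUsage(current: Ty, required: Ty): { ty: Ty; error: boolean } {
  const narrowed = intersect(current, required);
  return narrowed !== null
    ? { ty: narrowed, error: false }
    : { ty: current, error: true };
}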

See how it's a conversation?

This functionality is intimately connected to the fact that jerd is a projectional language, and that there is no provision for editing source code as "flat text". Because of this, the compiler can rely on having much more information persisted in the source tree, which would otherwise be prohibitively verbose if it had to be printed & parsed as syntax. Similarly, interactions with the compiler are no longer text-first with optional enhancement by a separate IDE; the compilation process is interactive from the ground up, such that ambiguity doesn't have to be an error -- it can just be a dropdown.

Prior Art

Lamdu's editor has a similar interaction to this, where if you add a new use of a variable that's incompatible with current uses, then the new usage is marked as the error. I don't know if they store the inferred types in the syntax tree, or if they re-compute them on every edit.

Further reading

I'm hoping to write quite a bit more about jerd as I'm fleshing out the language and development environment, but in the meantime you can take a look at the WIP documentation for the type system (which borrows quite a bit from Roc) and the algebraic effects setup, which is inspired by, but fairly different from, that of Unison and Eff.

This whole endeavor is quite experimental, so if you have any feedback or suggestions, I'd love to hear them 😄 leave a comment here, or message me on mastodon.

@travisjungroth

@jaredly I think you'd need to view the property annotations like type hints. You could verify some information statically, and the relationships between them statically. My plan would be to use property-based testing, which is like next-level fuzz testing, to verify it on the function. And you really hit the nail on the head with 🤞lol. I've been meaning to write up this whole idea and there's a reason I want to call it "imperfect algebra". Which, like, freaks people out. But keep in mind this is a superset of normal type systems. And my issue with those is that this perfect world of types works on a very limited set of types in practice. When you get into a lot of business problems, it gives you nothing to work with. I'd rather have high probability than nothing. I haven't looked at Koka, thanks.

@vchakrav commented Feb 8, 2023

@jaredly Have you looked at https://petevilter.me/post/hytradboi-lingo-talk/? There are a lot of interesting ideas in there...

@jaredly (Author) commented Feb 8, 2023

@vchakrav oh yeah! I had a great chat with @vilterp at Strangeloop two years ago :) I've toyed with the idea of implementing the type system in datalog or similar, but at the time I couldn't figure out how to make it accommodate things like informative error reporting & type-directed name resolution. I should really go back and look at it now though, as this new incremental approach might be easier to slot into datalog rules.

@rsimmons

How are type annotations updated when expressions are removed? It seems very problematic if they don't get updated (e.g. you start to use a parameter one way, then realize you made a mistake and change how you use it). And the answer can't be just "roll back to a previous annotation" because you might not delete expressions in the reverse of the order that you added them.

Exploring an example: Say that inside your function definition, you have 3 expressions (call them A, B, C) that each further narrow the type of a parameter P. And say that you added the expressions in the order A, B, C. Now you delete expression A. To get the new type for P given usages B and C, it seems like the IDE/compiler needs to reassess P given the definition in its entirety, there is no way to do it incrementally. And then if you already need the ability to infer P's type from scratch given a function definition (without edit history), is there any advantage in the notion of incremental narrowing? (other than perhaps as a performance optimization)

I might be missing something, but my intuition is that "stickiness" creates some new challenges.

(Looking over HN comments now, I think my concern is similar to brundolf's. And I agree with them on "I think the desire to show the inferred types is orthogonal to the desire to persist them on disk, in the source-of-truth code files.")

@jaredly (Author) commented Feb 10, 2023

@rsimmons so, deleting expression A doesn't invalidate the type of P, right? If it was valid before for all 3 expressions, it'll be valid for 2. The only "problem" is that it won't be "optimally broad", which is something you can definitely go in and fix (change the type annotation by hand) if you really did want to re-broaden it.
It'll be interesting to see how often this comes up in practice 🤷 Like I said elsewhere, I'm experimenting with going all in on reifying stuff in the source tree that's normally invisible/intermediate compiler state. It's possible I'll decide to back off a bit once I see how it feels.
