
@jaredly
Last active February 10, 2023 18:58
Type inference that sticks

What if type inference felt like a real-time conversation, instead of emailing code back and forth?

With type inference, compilers are faced with a difficult task: intuiting what types a user had in mind (but didn't write down) while writing some code. It all falls apart when the types that the algorithm assigns are different from those the user expected, and especially so when the type error reported is far away from the site of mental mismatch.

Amnesiac type inference

Current type inference algorithms perform just as well (and just as poorly) on code entered & modified over time in an IDE as they do on a text file read from a disk. Or, for that matter, on code written with pen & paper and then fed to a compiler character by character via teletype.

That is to say, they have no knowledge of, or interest in, the process by which the code was written. The only feedback loop they recognize is receiving all of the code at once, and responding with the first error they come across, or if you're lucky, several independent errors. Fixing a given error, instead of being a context-sensitive conversation ("here's the resolution to that error, please continue"), is back to a clean-slate data dump of all the code (modulo some caching for performance, but caching does not -- must not -- change the semantics of type inference).

To annotate or not to annotate

When tracking down a hard-to-find type error, one solution is to start nailing things down: add in the types that you had in mind in the first place, and see where the compiler got tripped up. On the other hand, you could have just written in those types all along, and saved yourself the trouble.

So there's a tradeoff between brevity and verbosity in the language syntax. The more information included in the source code, the less guesswork the compiler / type checker has to do, and the less chance there is for a mismatch between the programmer's mental model and the computer's. But verbosity means not only more typing; it can also disrupt the flow of writing code. In an exploratory mode, the user doesn't necessarily know the types of a function's arguments before they have written its body. And there are readability tradeoffs on both sides of the spectrum: for experienced programmers, types written explicitly all over the place can be noisy and actually inhibit code comprehension, while those same annotations are a boon for someone unfamiliar with the codebase.

There are also times when inference-algorithm limitations have negative impacts on other aspects of the language. Flow, Facebook's type checker for JavaScript, originally had global type inference, but in the interest of speed (especially on codebases as large as Facebook's) recent versions require you to annotate all functions exported from a file, even when those functions have trivially inferable types. OCaml's type inference covers most of the language, but type-abstracted modules (known as functors) generally cannot be inferred, and need explicit type annotations.

Roc, on the other hand, follows Elm's admirable commitment to having all type annotations be optional, such that any correct program can have all of its annotations removed and still have the same semantics. The trade-off is that Roc doesn't support Rank-2 polymorphism (you can't pass a generic function as a function argument), because it wouldn't be fully inferable. Haskell, never one to compromise on power, allows you to turn on "Rank-N" types, under which type inference is undecidable in the general case.

A sufficiently interactive compiler

With jerd, I'm trying to build a compiler that's "sufficiently interactive" instead of sufficiently smart. Instead of trying to reconstruct the programmer's intentions from a text-dump of source code, the compiler is involved in the entire process of development, both responding to programmer action and giving context-aware feedback in return. In this model, the source code is enriched with the compiler's inferences along the way, which the programmer can inspect & correct (but which can also be elided from display for a more compact presentation). Type inference is therefore incremental and reified in the program source, meaning that, once a function has been written, the compiler need make no guesses about types or the identities of referenced terms.

As a simple example, consider autocomplete. Isn't it funny that autocomplete can enable you to choose exactly what function you want to call, and what local variable you want to reference, but then when the compiler comes along all it has to go on is the autocompleted text? It has to duplicate all of the work autocomplete did, constructing local scopes and performing top-level name resolution, and if the term has since been renamed then it just falls over.

When typing an identifier in jerd, if you select something from autocomplete, the ID of that term is stuck onto the identifier node (e.g. myFn#4ea8ff). This way, the compilation pass doesn't need to do any name resolution -- it knows exactly what is being referenced. And, if you later change the name of a variable, none of the references are broken. (In jerd, references are by a hash of the definition's syntax tree instead of by name, similar to what unison does.)
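As a rough illustration (in Python, with made-up representations; jerd's actual data structures aren't shown here), resolving references through a stored term ID means a rename never breaks a call site:

```python
# Hypothetical sketch: identifier nodes store the referenced term's ID
# (like myFn#4ea8ff), so name resolution happened once, at autocomplete
# time, and renames never break call sites.
definitions = {"4ea8ff": {"name": "myFn"}}

# An identifier node produced by autocomplete carries the term's ID.
call_site = {"kind": "identifier", "ref": "4ea8ff"}

def display_name(node):
    # The editor projects the current name at render time.
    return definitions[node["ref"]]["name"]

assert display_name(call_site) == "myFn"

# Renaming the definition changes what is displayed, but the reference
# (the ID) is untouched, so nothing dangles.
definitions["4ea8ff"]["name"] = "parseMovie"
assert display_name(call_site) == "parseMovie"
```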

Sticky type inference

Say you're writing a function with two arguments. Before the body is defined, the arguments have the universal type "𝕌".

The interface looks like this:

(defn movieFromLine [line idx]
    )

But the underlying data has reified ("sticky") type information for all identifiers, something like this:

(defn movieFromLine [line {id: 0, type: 𝕌}
                     idx  {id: 1, type: 𝕌}]
    )

Then, when you use one of the arguments, the type annotation for that argument gets updated, if possible, to the intersection of the current annotation and the type required by the new usage.
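A minimal sketch of that narrowing rule, in Python with hypothetical type representations (`"U"` standing in for the universal type 𝕌):

```python
# Hypothetical sketch of "sticky" narrowing: each new usage intersects
# the argument's stored annotation with the type that usage requires.
def intersect(current, required):
    if current == "U":
        return required           # first concrete usage narrows 𝕌
    if current == required:
        return current            # compatible usage: annotation unchanged
    return None                   # empty intersection: mark an error

line = {"id": 0, "type": "U"}

# Passing `line` to a function that wants a string narrows its
# annotation, and the narrowed type is persisted in the source tree.
line["type"] = intersect(line["type"], "string")
assert line["type"] == "string"

# A later string usage is compatible; an int usage has no intersection.
assert intersect(line["type"], "string") == "string"
assert intersect(line["type"], "int") is None
```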

What you see:

(defn movieFromLine [line idx]
    (switch (split line "!")
        ))

What the compiler sees (again, this isn't inferred by the compiler, it's actually persisted in the source tree):

(defn movieFromLine [line {id: 0, type: string}
                     idx  {id: 1, type: 𝕌}]
    (switch (split {id: '#ead31a8',
                    type: (fn [string string] (array string))}
                line {id: 0}
                "!")
        ))

If there is no intersection, the new usage is marked as an error, but you're also given the option (via dropdown) to switch the annotation to match your new usage, marking any other incompatible usages as erroneous.
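Sketching that dropdown interaction (again with hypothetical structures): re-anchoring the annotation to the new usage re-checks every other usage against it:

```python
# Hypothetical sketch: when the user picks the new (conflicting) type
# from the dropdown, the annotation is re-anchored and every other usage
# is re-checked against it, flagging the now-incompatible ones.
def recheck(required_types, chosen):
    return [{"required": t, "error": t != chosen} for t in required_types]

usages = ["string", "string", "int"]   # the int usage conflicted
results = recheck(usages, "int")       # user switches the annotation to int
assert [r["error"] for r in results] == [True, True, False]
```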

See how it's a conversation?

This functionality is intimately connected to the fact that jerd is a projectional language, and that there is no provision for editing source code as "flat text". Because of this, the compiler can rely on having much more information persisted in the source tree, which would otherwise be prohibitively verbose if it had to be printed & parsed as syntax. Similarly, interactions with the compiler are no longer text-first with optional enhancement by a separate IDE; the compilation process is interactive from the ground up, such that ambiguity doesn't have to be an error -- it can just be a dropdown.

Prior Art

Lamdu's editor has a similar interaction to this, where if you add a new use of a variable that's incompatible with current uses, then the new usage is marked as the error. I don't know if they store the inferred types in the syntax tree, or if they re-compute them on every edit.

Further reading

I'm hoping to write quite a bit more about jerd as I'm fleshing out the language and development environment, but in the meantime you can take a look at the WIP documentation for the type system (which borrows quite a bit from Roc) and the algebraic effects setup, which is inspired by, but fairly different from, that of unison and eff.

This whole endeavor is quite experimental, so if you have any feedback or suggestions, I'd love to hear them 😄 leave a comment here, or message me on mastodon.

@abramclark

Can the compiler conversation be fit into LSP so it's editor agnostic?

@jaredly

jaredly commented Feb 5, 2023

@abramclark interesting question! The Language Server Protocol is built for text-first languages, with character offsets & line/column indices. The conversation I'm imagining relies on features of a projectional language, where the source tree can contain much more information than would be practical to encode in raw text.

Additionally, the LSP is for regulating the communication between two separate parts -- the editor and the compiler -- whereas with my language (and I believe Lamdu and Hazel as well), the compiler and the editor are the same system, intimately coupled.

If projectional languages became widespread, I could imagine there being some standardization of source tree format & editor operations, such that we could develop a standard protocol, but I imagine that's pretty far off 😄.

Thanks for asking!

@travisjungroth

Wow, it's wild how many ideas are in here that I'd put in a language if I was making it. Functional typed lisp, first-class support for code folding, unison style resolution. The idea I don't see in there that might be interesting is abstract algebra inspired types. Every type is a set, which is defined as a predicate and optionally can give its values. The type annotations on functions are what properties they have on what sets. You can shortcut by naming algebraic structures, like this is a monoid on positive integers. You can have as many annotations as you need.
Where things get weird, and maybe this wouldn't make sense for a language level construct, is allowing just totally arbitrary predicates. So in addition to traditional type theory, you have property based tests running all the time. I think this would be an amazing way to manage business logic inside of code. It's really surprising to me how poorly type systems support stuff like "this is an 8 character alphanum string". And when you use these functions, you have these strong guarantees you can reason about very quickly. Oh, this function is associative. Oh, this function is the inverse of that one.
I've been meaning to make a version in Python that reads docstrings and generates hypothesis tests and some design-by-contract stuff.
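The "predicate as type" idea can be sketched in plain Python (hypothetical example; not using hypothesis itself): a business rule like "an 8 character alphanum string" is just a checkable predicate that a tool could enforce at function boundaries.

```python
import re

# Hypothetical sketch: a business-rule "type" expressed as a predicate,
# which generated property-based tests or contract checks could enforce.
def is_order_code(value):
    # exactly 8 alphanumeric characters
    return isinstance(value, str) and re.fullmatch(r"[A-Za-z0-9]{8}", value) is not None

assert is_order_code("AB12cd34")
assert not is_order_code("AB12cd3")    # too short
assert not is_order_code("AB-2cd34")   # non-alphanumeric character
```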

@jaredly

jaredly commented Feb 6, 2023

@travisjungroth that sounds a lot like where clojure landed with core.spec, or am I misunderstanding? If you have any concrete examples of "properties on sets" that might help clarify, I don't have the strongest type theory chops 😅. I would like to avoid either (a) having type signatures that can't actually be statically validated, or (b) implementing a theorem prover 😂.

@travisjungroth

@jaredly Yes, it's nearly the same. The idea goes one step further to talk in terms of algebraic structures. I'm sure someone could implement this on top of core.spec or already has.
Concrete example: addition is a binary operator. It has the property of totality (or closure) in regards to integers. Add two integers and you always get an integer back. A set combined with a binary operator that is closed forms a magma. So, addition and integers form a magma. Addition is also associative for integers, so that means they make a semigroup. In the case of addition and integers we could keep going a while because they have tons of properties.
For a counter example, division isn't closed over integers and subtraction isn't associative.
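Those properties can be spot-checked by random sampling, in the spirit of property-based testing (a hypothetical sketch, not tied to any of the languages discussed):

```python
import random

# Hypothetical sketch: checking an algebraic property by sampling.
# Addition is associative on integers (a semigroup); subtraction is not.
def is_associative(op, samples):
    return all(op(op(a, b), c) == op(a, op(b, c)) for a, b, c in samples)

random.seed(0)  # deterministic sampling for the example
triples = [tuple(random.randint(-100, 100) for _ in range(3)) for _ in range(200)]

assert is_associative(lambda a, b: a + b, triples)
assert not is_associative(lambda a, b: a - b, triples)
```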
For a business case: imagine you have some data structure that represents a transaction, and a function that combines two transactions. If this function is closed, associative, and has an identity element for transactions, then it's a monoid and you can run it in a distributed reduce.
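A sketch of why that matters (hypothetical Python, with a toy transaction type): a monoidal merge gives the same answer no matter how the input is chunked across workers.

```python
from functools import reduce

# Hypothetical sketch: a closed, associative merge with an identity
# element (a monoid), so the fold can be split and recombined freely.
def merge(a, b):
    return {"amount": a["amount"] + b["amount"]}

identity = {"amount": 0}
txs = [{"amount": n} for n in (5, -2, 7, 1)]

sequential = reduce(merge, txs, identity)

# "Distributed" reduce: each worker folds its chunk, then we merge partials.
left = reduce(merge, txs[:2], identity)
right = reduce(merge, txs[2:], identity)
assert merge(left, right) == sequential == {"amount": 11}
```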
These are algebraic structures and every structure I've described so far is group-like, which I find to be the most approachable algebra structures. There are ones involving multiple sets and multiple operators.
I find knowing the mathematical properties of a function (when possible) makes it very easy to reason about. But, I don't know of any language that makes this inherently supported.

@jaredly

jaredly commented Feb 6, 2023

@travisjungroth would these mathematical properties be statically verified by the language? I believe Coq can do such things, but it's a theorem prover, and extremely cumbersome to write 😬. Alternatively, fuzz testing could give you a "this is probably associative 🤞".
Or I guess I can imagine a system that can guarantee such properties as long as the function implementation isn't too complex, and if it is then it just refuses to validate the properties; you'd have to rewrite the function to be simpler.
Have you looked at Koka? Some of their builtin effects are mathematical in nature, and might be a starting place for what you're looking for.

@travisjungroth

@jaredly I think you'd need to view the property annotations like type hints. You could verify some information statically, and the relationship between them statically. My plan would be to use property based testing, which is like next-level fuzz testing, to verify it on the function. And you really hit the nail on the head with 🤞lol. I've been meaning to write up this whole idea and there's a reason I want to call it "imperfect algebra". Which, like freaks people out. But keep in mind this is a superset of normal type systems. And my issue with those is that this perfect world of types works on a very limited set of types in practice. When you get into a lot of business problems, it gives you nothing to work with. I'd rather have high probability than nothing. I haven't looked at Koka, thanks.

@vchakrav

vchakrav commented Feb 8, 2023

@jaredly Have you looked at https://petevilter.me/post/hytradboi-lingo-talk/? There are a lot of interesting ideas in there...

@jaredly

jaredly commented Feb 8, 2023

@vchakrav oh yeah! I had a great chat with @vilterp at Strangeloop two years ago :) I've toyed with the idea of implementing the type system in datalog or similar, but at the time I couldn't figure out how to make it accommodate things like informative error reporting & type-directed name resolution. I should really go back and look at it now though, as this new incremental approach might be easier to slot into datalog rules.

@rsimmons

How are type annotations updated when expressions are removed? It seems very problematic if they don't get updated (e.g. you start to use a parameter one way, then realize you made a mistake and change how you use it). And the answer can't be just "roll back to a previous annotation" because you might not delete expressions in the reverse of the order that you added them.

Exploring an example: Say that inside your function definition, you have 3 expressions (call them A, B, C) that each further narrow the type of a parameter P. And say that you added the expressions in the order A, B, C. Now you delete expression A. To get the new type for P given usages B and C, it seems like the IDE/compiler needs to reassess P given the definition in its entirety, there is no way to do it incrementally. And then if you already need the ability to infer P's type from scratch given a function definition (without edit history), is there any advantage in the notion of incremental narrowing? (other than perhaps as a performance optimization)
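One way to picture the from-scratch recomputation (a hypothetical sketch, reusing the intersection idea from the post): the "optimally broad" type for P is a fold of the intersection over whatever usages remain, with no edit history needed.

```python
from functools import reduce

# Hypothetical sketch: after deleting usage A, recompute P's type by
# folding the intersection over the remaining usages, starting from the
# universal type "U" -- i.e. full re-inference over the function body.
def intersect(current, required):
    return required if current == "U" else (current if current == required else None)

usage_types = {"A": "string", "B": "string", "C": "string"}
del usage_types["A"]                      # the user deletes expression A

p_type = reduce(intersect, usage_types.values(), "U")
assert p_type == "string"                 # still valid for the remaining usages
```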

I might be missing something, but my intuition is that "stickiness" creates some new challenges.

(Looking over HN comments now, I think my concern is similar to brundolf's. And I agree with them on "I think the desire to show the inferred types is orthogonal to the desire to persist them on disk, in the source-of-truth code files.")

@jaredly

jaredly commented Feb 10, 2023

@rsimmons so, deleting expression A doesn't invalidate the type of P, right? if it was valid before for all 3 expressions, it'll be valid for 2? The only "problem" is that it won't be "optimally broad", which is something you can definitely go in and fix (change the type annotation by hand) if you really did want to re-broaden it.
It'll be interesting to see how often this comes up in practice 🤷 like I said elsewhere, I'm experimenting with going all in on reifying stuff in the source tree that's normally invisible/intermediate compiler state. It's possible I'll decide to back off a bit once I see how it feels.
