Skip to content

Instantly share code, notes, and snippets.

@MultisampledNight
Created March 23, 2024 21:08
Show Gist options
  • Save MultisampledNight/4a60171be1d8a1a1bbc0f5674179a479 to your computer and use it in GitHub Desktop.
Save MultisampledNight/4a60171be1d8a1a1bbc0f5674179a479 to your computer and use it in GitHub Desktop.

Codified https://www.youtube.com/watch?v=N6b44kMS6OM from 2019

Traditionally

A compiler has a pipeline of passes:

  • ==lexing== the ==source== to get ==tokens==
  • ==parsing== the ==tokens== to get the ==AST==
  • semantically analyzing the ==AST==, type checking the ==AST==
  • applying optimizations
  • ...

$\implies$ was how rustc was written (and still kinda is)

The past vs. today

  • Previously: Compiler runs and we see output
  • Today: Much more dynamically and integrated into the IDE
    • LSP made that possible "easily"

$\implies$ compiler as an actor required rather than a batch mode processor $\implies$ need to respond as quickly as possible

Why would one care about IDEs?

  • Code needs to be written somewhere, it won't magically come out of the void
  • Dependencies matter, it might help language design
    • For example, in Rust it's possible to have a struct definition and impls inside a function, which are invisible to the outside
    • Which also means that impls of outer structs are possible
    • And those impls propagate to the outside
    • $\implies$ All function bodies might contain impls for outer types
  • Strictly separating phases can never work anyway
    • See const expressions for array lengths, for example
      • Which involve type checks and eval again
    • In the case of the JVM, lazy class file loading
    • Macros
    • Inferred types across function boundaries
    • Specialization
      • Some traits need to be solved, some not

Error handling

  • Errors should not just be throw new TypeError()
  • They should also not be if (error) { return semi_fake_but_legal_value }
  • Instead: Sentinel value that signifies "bad user code is here"
    • If this sentinel value is seen, its error has already been reported
    • No further reporting necessary, just propagate the sentinel up the chain
  • There is no "fallible" compiler operation
    • It always succeeds
    • But it might produce an "error" value
    • E.g. type enums or the like also need to have an Error variant
  • Actual error reporting is then done through side channels

Cycles

Optimizations: Inlining

  • MIR
    • First compiled to unoptimized version
    • Walk over all call sites
    • Get the optimized version of every function
    • Inline it
  • Works fine as long as not panicking
    • $\implies$ Instead of naively trying to get the optimized MIR, just ignore it if it's the same

Non-deterministic results

  • Two functions that call each other, foo and bar
    • Might error in one case while trying to inline
    • Might error in the other
  • How the heck does one solve it?
    • "Master" query that resolves the whole graph
    • Which outputs the list of [[Clique|strongly connected components]] to process

Other cycle cases

  • Name resolution
  • Trait resolution

Tracking of location information

  • "Spans"
  • Editing a comment does affect something, namely where errors need to be displayed
  • Different approaches
    • Concatenating all input files into one string, indexed by 32-bit indices
      • Compact but not nice for incremental compilation
    • Solve offsets from previous siblings and length in the AST node. Track only the AST node ID when referenced, and only recompute the starting offset when needed

Contents of the AST

  • Also whitespace and comments
  • Compiler should just ignore it
  • Useful for formatting
  • If not provided, can be worked around if spans and input files are present (which is what rustfmt does)

Incremental re-parsing

  • ASTs don't need to be parsed fully
  • If a compiler receives a diff of what's changed, it can reparse only those

Example: Swift ==red== and ==black== tree layers

  • Not related to [[red-black tree]]
  • ==Black== tree layer
    • Keeps all the information and details
    • Doesn't have lengths of AST points, so is only relative to the current point
    • $\implies$ Can be reused
  • ==Red== tree layer
    • Works ontop of the ==black== tree layer to actually form the whole program

Conclusions

  • Start with an on-demand style
  • Best practices are still unsure and need to be codified
    • How to represent ASTs?
    • How to represent location information?
    • How to handle cycles?
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment