Options for Non-Nullable Locals in Wasm GC

This document attempts to summarize the current state of discussions around non-nullable locals in Wasm GC, as of late June 2022. To write it I re-read the main thread on this topic in its entirety, as well as adjacent ones, and it compares the main options being discussed and their tradeoffs, with links to the relevant parts of those discussions. This is far from a perfect document, in particular because I do not completely grasp some of the more subtle points being debated. But hopefully it is useful as a high-level summary of where we are. Please let me know if I missed or misunderstood anything!

Background (that might be mostly useful for people not already involved in the discussions): Why do we want non-nullable locals in wasm at all? In theory we could imagine that Wasm GC would do the same as linear memory, where we mostly assume that bounds checks are "free" thanks to the signal handler trick. That has served wasm well so far, even if not all VMs can use signal handlers (like on 32-bit machines). But that approach is less applicable to Wasm GC. The key difference is that if we use the address of a trapping location (like 0) for the null object then VM GC code would need to be aware of that when scanning (we don't want traps during GC!), and doing so is complex in a modern GC that is generational, incremental, etc., and has support for multiple compiler tiers and so forth. Therefore, unlike for linear memory, we would like to find a way to optimize away safety checks in Wasm GC [1]. Adding non-nullable locals is a natural way to achieve that goal: a local declared as non-nullable does not need to be checked for null.
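
As a tiny illustration of why this matters (the $Point type and function names below are hypothetical, just for this sketch), a field access through a possibly-null reference needs a guard, while one through a non-nullable reference does not:

```wasm
;; Hypothetical struct type used only for this illustration.
(type $Point (struct (field $x i32) (field $y i32)))

;; The parameter may be null, so the engine must guard the struct.get
;; with a null check (or an equivalent trapping mechanism).
(func $get_x_nullable (param $p (ref null $Point)) (result i32)
  (struct.get $Point $x (local.get $p))
)

;; The parameter is non-nullable, so no null check is needed at all.
(func $get_x_nonnull (param $p (ref $Point)) (result i32)
  (struct.get $Point $x (local.get $p))
)
```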

Note that non-nullable types are straightforward to add to Wasm GC, while non-nullable locals are harder, because wasm locals are initialized to a default value, but non-nullable locals have no default [2]. (Worse, there might not be a natural default for them at all!) Therefore if we want non-nullable locals then we need a way to ensure that they are initialized before they are used.
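
To make the initialization problem concrete, here is a minimal sketch (assuming some struct type $T is declared elsewhere in the module): a nullable local silently defaults to null, while a non-nullable local has no value it could default to, so a get before any set must somehow be ruled out.

```wasm
;; A nullable local is implicitly initialized to null, so this function
;; validates today - and any later dereference of the result would trap.
(func $nullable_default (result (ref null $T))
  (local $tmp (ref null $T))
  (local.get $tmp)  ;; returns null
)

;; (local $tmp (ref $T)) has no such default, which is exactly why some
;; scheme is needed to guarantee the local is set before it is read.
```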

Originally we considered a let instruction as a solution for that problem. let would define a block and set a local, and within that block all local.get operations on that local would be known to not return a null value. However, feedback from multiple VMs and others about the complexity of implementing let led us to consider other options. After lengthy debate, some of the main options remaining are the following four:
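
For concreteness, here is a rough sketch of how let was envisioned; the exact syntax was never finalized, so this is approximate, and $T is assumed to be declared elsewhere.

```wasm
(func $use_let (param $p (ref null $T))
  ;; Push a non-null value; let pops it to initialize a fresh local $x
  ;; whose scope is only the let block.
  (ref.as_non_null (local.get $p))
  (let (local $x (ref $T))
    ;; Inside the block $x is known to be non-null, so this get needs
    ;; no null check.
    (drop (local.get $x))
  )
)
```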

  • defer: We don't ship non-nullable locals in the GC MVP. For now, non-nullable types can appear in function parameters, struct fields, etc., but not in locals. We can look into adding non-nullable locals later.
  • local.get_or_trap+NNL [3]:
    • A new instruction local.get_or_trap is added that does the equivalent of a local.get followed by an immediate ref.as_non_null, i.e., it traps if the local contains null.
    • Non-nullable locals are allowed. A non-nullable local can only be read using local.get_or_trap - plain local.get will not validate on such locals (but it could be allowed on parameters). (Also, local.set/local.tee of such a local must write a non-nullable value, of course.) See the sketch after this list.
  • 1b: Non-nullable locals are allowed. Block annotations are added to indicate which locals are set, allowing validation that every get is only reached after a set.
  • 1c: Non-nullable locals are allowed. No annotation is added; validation includes an inference to verify that every get is only reached after a set.
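
Here is a minimal sketch of how the last three options look on a simple function. The local.get_or_trap name/syntax and 1b's annotation format are not finalized, and $T is assumed to be declared elsewhere.

```wasm
;; Under local.get_or_trap+NNL: reads of the non-nullable local $x use the
;; new instruction (a plain local.get on $x would not validate).
(func $with_get_or_trap (param $p (ref null $T))
  (local $x (ref $T))  ;; non-nullable local
  (local.set $x (ref.as_non_null (local.get $p)))
  (drop (local.get_or_trap $x))  ;; traps on null; here it never can
)

;; Under 1b or 1c: the same function uses a plain local.get. 1c infers that
;; the get is only reached after the set; 1b would additionally carry a
;; block annotation (format not yet defined) saying that $x has been set.
;; Either way, validation guarantees that no null check is needed.
(func $with_plain_get (param $p (ref null $T))
  (local $x (ref $T))
  (local.set $x (ref.as_non_null (local.get $p)))
  (drop (local.get $x))
)
```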

Comparison Table

  • defer is the baseline for others to compare to unless stated otherwise.
| | defer | local.get_or_trap+NNL | 1b | 1c |
| --- | --- | --- | --- | --- |
| Optimized tier speed | ✔️ No null checks [4, 5] | ✔️ No null checks [4, 5] | ✔️ No null checks [4, 5] | ✔️ No null checks [4, 5] |
| Baseline tier speed [6] | ✔️ VM can remove 90% [5] of null checks by performing inference [7]. | ✔️ VM can remove 100% [5] of null checks by performing inference [7]. | ✔️ VM can avoid 100% [5] of null checks. | ✔️ VM can avoid 100% [5] of null checks. |
| Rewriting interpreter [8] speed | (✔️ as above) | (✔️ as above) | ✔️ No null checks [5]. | ✔️ No null checks [5]. |
| In-place interpreter speed [9] | ❌ Null checks. | ❌ Null checks. | ✔️ No null checks [5, 10]. | ✔️ No null checks [5, 10]. |
| Binary size | - | ✔️ Fewer ref.as_non_nulls. ❌ Explicit NN locals increase local counts [11]. | ✔️ Fewer ref.as_non_nulls. ❌ Annotations add bytes [12]. ❌ Avoiding non-validating patterns adds bytes. ❌ Explicit NN locals increase local counts [11]. | (✔️❌ as 1b, but ✔️ without annotations; measured as 1.5-1.7% smaller than defer.) |
| Validation speed | - | ✔️ Smaller binary => faster (likely the fastest option without the inference from 1c as an optimization; with it, close to 1c). | ✔️ Slower validation but smaller binary => faster. Might be faster than 1c if annotations save work, but measurement 📝 TBD. | ✔️ Slower validation but smaller binary => faster (1.7%). |
| Spec complexity | ✔️ Nonexistent. | ✔️ Minor. | ❌ Define block annotations and consider future interactions (see subsection below). | ❌ Define inference algorithm and consider future interactions (see subsection below). |
| Non-nullable locals | ❌ | ✔️ | ✔️ | ✔️ |
| Complex producers | ❌ Workarounds required for non-nullable locals [13]. | Replace local.get with local.get_or_trap in proper places. | ❌ Must work around situations that do not fit the annotation pattern [14]. ❌ Finding the optimal set of annotations is not simple. | ❌ Must work around situations that do not fit the inference pattern [14]. |
| Simple producers | ✔️ Avoid non-nullability [15] or ❌ (as above). | ✔️ Avoid non-nullability [15] or (as above). | ✔️ Avoid non-nullability [15] or ❌ (as above). | ✔️ Avoid non-nullability [15] or ❌ (as above). |
| Behavior on "bad" gets that don't fit 1b/1c's model [16] | ❌ Possible null check [17] and possible trap [18]. | (❌ as defer) | ✔️ Validation error. | ✔️ Validation error. |

1b and 1c Spec Discussion

Spec factors regarding 1b vs 1c are still being debated, and I admit I do not follow all the points there; see e.g. here and the surrounding discussion. This subsection attempts to summarize some of the main points.

Some people oppose 1c because they see it as inconsistent with past design choices in wasm about avoiding inference, such as avoiding inference in block types, and they express a concern that allowing any inference into the spec is risky (in particular, for performance reasons, but not only). Others disagree, however, and suggest we might allow inference for certain limited things where that makes sense (in particular, where performance is shown to be sufficient).

Regarding the performance of inference, it was pointed out that 1c requires nonlinear work - the worst case is the number of locals times the number of blocks [19]. Looking at 1b, the number of annotations there can also be nonlinear - again, the worst case is the number of locals times the number of blocks (though maybe we can fix that). Perhaps the main difference between 1b and 1c on this aspect is that ✔️ 1b may avoid nonlinear work in common cases, but ❌ where it is nonlinear, that overhead is not just in work but also in binary size. Both have a risk of not scaling well.
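
To illustrate where blocks and annotations enter the picture, here is a sketch of a simple control-flow join (the annotation mentioned in the comment is purely hypothetical, since 1b's format is not defined; $T is assumed to be declared elsewhere).

```wasm
(func $join (param $c i32) (param $p (ref null $T)) (param $q (ref null $T))
  (local $x (ref $T))
  (if (local.get $c)
    (then (local.set $x (ref.as_non_null (local.get $p))))
    (else (local.set $x (ref.as_non_null (local.get $q))))
  )
  ;; $x is set on both paths reaching this point. 1c infers that fact,
  ;; while 1b would record it in an annotation on the join. In the worst
  ;; case such annotations can mention many locals on many blocks, which
  ;; is the nonlinear case discussed above.
  (drop (local.get $x))
)
```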

Some of the debate here has been around the potential future "multiloop" construct. Multiloop lacks structured control flow, which can make inferring properties like the liveness of locals require a lengthy fixpoint analysis. Since 1b and 1c require us to validate that non-nullable locals are not used before they are set, it is natural to assume that we'll need to annotate such locals in multiloop somehow in order to avoid such overhead. If so, then it would be more consistent to also use annotations more generally, as 1b does. However, the annotations in multiloop may annotate entries and not exits. They may differ in other ways as well: we do not have a full spec for multiloop yet, but I would not be surprised if multiloop requires more annotations for other things. In particular, if we annotated all locals (not just non-nullable ones) and we did so on both entries and exits, then VMs could verify liveness in linear time - without that, I worry multiloop might not be competitive in startup times if producers start to emit massive functions that way. To summarize the multiloop situation: with 1b we may end up with similar (but not identical) types of annotations between the general case and multiloop, while with 1c we may end up with one type of annotation and one type of inference [20].

Overall this discussion is very much not resolved. However I think it is fair to say that most of the spec-focused people favor 1b over 1c, and that 1b may have more supporters overall (though that is only counting people active in the discussion).

Final Notes

This document's comparison of defer, local.get_or_trap+NNL, 1b, and 1c does not lead to clear conclusions:

  • Speed and size differences are quite small, mostly around 1-2%. Since "the purpose of non-nullable types is to improve runtime performance" (see also this discussion), having a clearly fastest option might have helped us decide, but we don't have such an option.
  • The impact on producers is mixed: every approach has downsides, unfortunately, though they are all likely workable. In fact, we have working toolchains for at least two languages at the moment (Java and Dart) using defer, with optional support for some of the others, and no huge roadblocks that we are aware of to supporting the rest.
  • Spec complexity and consistent design are still being debated, and I think it's fair to say there isn't an obvious best option.

It is perhaps surprising that all 4 options are so close when one of them does not even have non-nullable locals, since there were worries that "Non-nullable types without locals are mostly useless" (or that they might be "half-baked"). Those concerns have a point, since there is something asymmetric about not allowing all types in all locations - it is a limitation. However, that limitation does make sense, in the following way. We know that removing null checks can give us significant speedups, so there is potential here, and as shown above we can unlock that potential with or without non-nullable locals. The key point is that inferring non-nullable types in heap data and in function signatures is a difficult global problem, and so we want to leave that to toolchains (indeed, Binaryen has to work very hard there), while inferring non-nullable types locally in a single function is very practical for VMs to do alongside their other optimizations like register allocation and so forth. And so the lack of non-nullable locals has its own form of consistency, since it allows non-nullable types in global locations (struct fields, function signatures, etc.), which is exactly where we need them in order to keep consumers simple. That leaves hard work on the producer side, as wasm always has.

Note that some of the options could be combined in the future. defer obviously just leaves a decision for later, so any option could be added post-MVP. And local.get_or_trap+NNL is a minor change to defer that is still fully extensible: If we decide on 1b or 1c later then they would open up the possibility to use local.get on non-nullable locals (and not just local.get_or_trap). That combined result would be useful since local.get_or_trap would still be needed in the rare cases where 1b/1c's model does not fit. The opposite order could also be done if we started with 1b or 1c and added the local.get_or_trap instruction later.

Overall, we are in a very good place! Non-nullable types in Wasm GC give us significant speedups, and we have (at least) 4 viable options to pick between to get those benefits.

Appendix: Previously-considered options

  • let (described in the introduction)
  • locals_initialized_barrier: Before the barrier, non-nullable locals are accessed with a null check; after it, they must all have been set and can then be accessed without a check.
  • scope: Entering a scope does null checks on the mentioned locals, which can then be used without a null check inside that scope. A variation on this is more similar to let, where the scope contains initialization for the specified locals (with the same result, that they can then be used in the scope without checks).
  • local-initializers: Similar to global initializers.
  • refined-types: A local has a declared type, but also a "refined type" which can be more specific (and non-nullable). If tracked during validation that information can be used. A new instruction for explicit refining is another variation here. (See also here).
  • 1a: Like 1b or 1c, but only allows a local.get of a non-nullable local from the set until the end of the current block (see the sketch after this list). This does not support joins, but data suggests that is enough for 90% of cases, and it can be done in linear time.
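
Here is a sketch of 1a's restriction ($T assumed to be declared elsewhere): a set makes the local readable only until the enclosing block ends.

```wasm
(func $scope_1a (param $p (ref null $T))
  (local $x (ref $T))
  (block
    (local.set $x (ref.as_non_null (local.get $p)))
    (drop (local.get $x))  ;; ok under 1a: same block as the set
  )
  ;; A (local.get $x) here would not validate under 1a - the block
  ;; containing the set has ended - even though it would under 1b/1c.
)
```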

Footnotes

  1. Note that such optimizations are not without a cost, as any null check that is removed by mistake is a potential security vulnerability. That risk exists in all approaches to removing null checks, in principle.

  2. Non-nullable reference types are therefore nondefaultable types. Future proposals may add more nondefaultable types, which may not be references, which may have implications for some of the options discussed in this document (as there may not be a natural "null" value for them).

  3. Previous versions of this doc mentioned only local.get_or_trap, without the second bullet point that allows non-nullable locals. The change was made following feedback on the importance of declaring such locals. This option arose during discussion of option 6, and there is a simple incremental progression from defer to local.get_or_trap to local.get_or_trap+NNL.

  4. Optimizing tier speed is the same in all options because it is practical to infer non-nullability inside functions using SSA or even simpler methods. Early V8 data did show differences, but VM improvements since then have made that go away (however see point "C" there: any VM bug or limitation that causes it to fall out of the optimizing tier will cause performance to be affected by whether the lower tier avoids null checks).

  5. "100% of null checks" means all the ones feasible to remove. Not all null checks are guaranteed to be removed by any of the above approaches (e.g., control flow paths through a try), but data suggests such cases are rare. Note also that we are only talking about null checks on locals - other null checks are unaffected by this discussion, like (struct.get (struct.get)) (where there is a null check on a value loaded from the heap).

  6. The baseline tier speed row focuses on null checks. An additional form of overhead exists, however, which is function entry overhead: the cost of initializing the stack space that contains locals. This is not a problem for optimizing tiers because they can see which locals are always written to before they are read, and so this overhead has been present in principle since the wasm MVP, but has not been an issue there. However, on baseline compilers some overhead has been noticed. In principle non-nullability can help here, because a baseline compiler could use that information when emitting the function prologue: if it knows from its type that a local will not be used without being set first, then it could skip zeroing out the stack space for that local. However, for security reasons engines do want to initialize the memory anyhow, so the real overhead is from the complexity of mixing different types of initialization - reference locals may need to be written a different value, so if reference and MVP locals are interleaved the VM needs to emit lots of single writes instead of simple loops. But we can expect such interleaving to not happen in optimized binaries, since the producer can sort MVP types and reference types separately. As a result, function entry overhead should be identical among all the options considered here (again, at least on optimized binaries). (Note that the above might change in the future. For example, perhaps other security mitigations might be found that remove the need for zeroing out memory (hardware memory tagging perhaps?), and if so, not needing to zero out memory might be faster, which would be an advantage of 1b and 1c. However, in such a future we'd also want to remove that overhead for MVP types as well, and 1b/1c could work equally well on them. That is, if this overhead ends up mattering later then we'll want a solution that is not specific to reference types anyhow.)

  7. local.get_or_trap+NNL can literally do the same inference as 1c as an internal optimization, with the only difference that 1c fails to validate if a null check cannot be removed, while local.get_or_trap+NNL simply does not remove the null check at that specific location. In comparison, defer has a harder problem to solve because locals are not defined as non-nullable - which means we need to perform inference on more locals - and nullable values may be written to them - which means we can't assume that null checks are not needed once we see a set, even a set of a non-nullable value (a nullable one may be set later, and in a loop, that can reach us; see the loop sketch at the end of this document). However, even defer can perform something like 1a, tracking a flag per local and unsetting it when a nullable value is written, which should handle 90% of cases. A second pass, or patching checks backwards, would also allow full handling of loops (since we just need to rule out the case of a nullable value ever being written later in the function - which would not happen in code that would otherwise have been compiled to a non-nullable local).

  8. An interpreter that constructs an internal bytecode from the wasm and then interprets that (like Wasm3), or otherwise rewrites the wasm to a more convenient form. In contrast to an in-place interpreter that literally interprets the wasm in-place (like Wizard).

  9. It is worth noting that the main example we have of an in-place interpreter, i.e., one that operates on wasm in-place, is Wizard, which handles null checks using CPU hardware. That means that many null checks are essentially free.

  10. 1b and 1c allow us to avoid null checks in in-place interpreters because the wasm type system directly encodes that fact. That is, if the module validates then all the null checks on non-nullable locals can be removed. In contrast, local.get_or_trap+NNL can find all the places where null checks can be removed (identically to 1c [7]), but null checks might not be removable in other places. That means that for local.get_or_trap+NNL to use that information we must apply it immediately, which is possible in a baseline compiler or rewriting interpreter by emitting code with or without null checks as we validate/infer (as production engines do), or we must store the information on the side somewhere. 1b and 1c actually embed/preserve that information in the wasm type system itself, which guarantees that all tiers can simply assume there are no null checks on non-nullable locals after validation. That can be seen as them enforcing the wasm type system in a stronger manner.

  11. If two locals could otherwise have been coalesced into one (no overlapping live ranges), different nullability would prevent that. This is likely rare, but may be worth measuring. (See the coalescing example at the end of this document.)

  12. Data suggests we'll need few annotations on average, but the corner cases are nonlinear.

  13. Toolchains can replace non-nullable locals with nullable ones + ref.as_non_null after each get (or use local.get_or_trap). Implementing that in Binaryen has been straightforward. On the Dart side there are some annoyances, but they are manageable. However, there is some concern that this might be more difficult in other toolchains.

  14. E.g., Binaryen ran into a problem with moving a local.set into a try. The workaround is to disable that specific optimization, which is probably not that bad because try instructions are rare, at least in J2Wasm output (however, more such issues may exist). A related issue is that, in principle, 1b and 1c's annotations/inference could be extended to handle try-catch somehow, but which control flow edges into catch blocks exist would then depend on which instructions can throw. As instructions are not annotated as throwing or not, that means that any call can throw, and so replacing a numeric instruction with a call to the exact same code might affect validation.

  15. All simple producers also have the option to simply not use non-nullable types, or to use non-nullable types but avoid non-nullable locals, and to leave the rest for an optimizer. That may still work well since Wasm GC has enough information for an optimizer to do quite a lot (as one datapoint, a J2Wasm binary without non-nullable locals that is run through Binaryen ends up with 62% of the locals being non-nullable).

  16. That is, a get of a non-nullable local that 1b and 1c cannot validate with. That means it is not simply dominated by sets in the way that they define. Such things appear rare, luckily - if they were common that would be a problem for 1b/1c! - but examples exist, like a set in a try.

  17. VMs are free to implement something more sophisticated than 1c's inference algorithm as an internal optimization. For example they could track which instructions might throw and handle a set in a try.

  18. A trap should only happen in the case of a producer bug.

  19. Interestingly (or rather unfortunately), it turns out that the wasm spec already contains such overhead in br_if, so this would not be a precedent (but in practice, at least, br_if does not appear to cause problems).

  20. Fwiw, the latter situation is not without precedent given that wasm already has both annotated and unannotated select instructions, that is, we only annotate there when we actually have a strong need (which is the case with references, since finding the proper type is not trivial given subtyping). Similarly, 1c may lack a strong need for annotations given the performance data so far (in part because 1c only computes booleans and not something with subtypes), while multiloop will have such a need. However, perhaps one might argue that the two selects are (like the quadratic br_if issue mentioned before) a blemish on wasm that we should not repeat.

```wasm
;; Coalescing of locals. This is possible because the types are identical.
;; If one type was non-nullable, we couldn't do it.
(func $foo
  (local $a (ref null $X))
  (local $b (ref null $X))
  (local.set $a ..)
  (.. (local.get $a) (local.get $a))
  (local.set $b ..)
  (.. (local.get $b) (local.get $b))
)
;; =>
(func $foo
  (local $ab (ref null $X))
  (local.set $ab ..)
  (.. (local.get $ab) (local.get $ab))
  (local.set $ab ..)
  (.. (local.get $ab) (local.get $ab))
)
```
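
Relatedly, here is a sketch of the loop subtlety mentioned in footnote 7 ($T assumed to be declared elsewhere): once a nullable value can be written to the local later in the function, a check on an earlier get inside a loop cannot be dropped, because that value can reach it via the back edge.

```wasm
(func $loop_case (param $p (ref null $T)) (param $q (ref null $T))
  (local $x (ref null $T))                         ;; nullable, as under defer
  (local.set $x (ref.as_non_null (local.get $p)))  ;; $x is non-null here...
  (loop $l
    ;; ...but the nullable value set below can reach this point via the
    ;; loop back edge, so this check must stay.
    (drop (ref.as_non_null (local.get $x)))
    (local.set $x (local.get $q))                  ;; nullable value, set later
    (br_if $l (ref.is_null (local.get $x)))
  )
)
```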