MoarVM's optimizer can perform speculative optimization. It does this by gathering statistics as the program is interpreted, and then analysing them to find out what types and callees typically show up at given points in the program. If it spots there is at least a 99% chance of a particular type showing up at a particular program point, then it will optimize the code ahead of that point as if that type would always show up.
Of course, statistics aren't proofs. What about the 1% case? To handle this, a guard instruction is inserted. This cheaply checks if the type is the expected one, and if it isn't, falls back to the interpreter. This process is known as deoptimization.
I just stated that a guard cheaply checks if the type is the expected one, but just how cheap is it really? There are both direct and indirect costs.
The direct cost is that of the check. Here's a (simplified) version of the JIT compiler code that produces the machine code for a type guard.
/* Load object that we should guard */
| mov TMP1, WORK[obj];
/* Get type table we expect and compare it with the object's one */
MVMint16 spesh_idx = guard->ins->operands[2].lit_i16;
| get_spesh_slot TMP2, spesh_idx;
| cmp TMP2, OBJECT:TMP1->st;
| jne >1;
/* We're good, no need to deopt */
| jmp >2;
|1:
/* Call deoptimization handler */
| mov ARG1, TC;
| mov ARG2, guard->deopt_offset;
| mov ARG3, guard->deopt_target;
| callp &MVM_spesh_deopt_one_direct;
/* Jump out to the interpreter */
| jmp ->exit;
|2:
Where get_spesh_slot is a macro like this:
|.macro get_spesh_slot, reg, idx;
| mov reg, TC->cur_frame;
| mov reg, FRAME:reg->effective_spesh_slots;
| mov reg, OBJECTPTR:reg[idx];
|.endmacro
So, in the case that the guard matches, it's 7 machine instructions (note: it's actually a couple more because of something I omitted for simplicity). Further, one is a conditional jump. We'd expect it to be taken only rarely, and so the CPU's branch predictor should get a good hit rate - but branch predictor usage isn't entirely free of charge either. All in all, it's not that bad, but it's nice to save the cost if we can.
The indirect costs are much harder to quantify. In order to deoptimize, we need to have enough state to recreate the world as the interpreter expects it to be. I wrote on this topic not so long ago, for those who want to dive into the detail, but the essence of the problem is that we may have to retain some instructions and/or forgo some optimizations so that we are able to successfully deoptimize if needed. Thus, the presence of a guard constrains what optimizations we can perform in the code around it.
A guard instruction in MoarVM originally looked like:
sp_guard r(obj) sslot uint32
Where r(obj) is an object register to read containing the object to guard, the sslot is a spesh slot (an entry in a per-block constant table) containing the type we expect to see, and the uint32 indicates the target address after we deoptimize. Guards would be inserted after instructions for which we had gathered statistics and determined there was a stable type. Things guarded include return values after a call, reads of object attributes, and reads of lexical variables.
This design carried us a long way; however, it has a major shortcoming. The program is represented in SSA form. Thus, an invoke followed by a guard might look something like:
invoke r6(5), r4(2)
sp_guard r6(5), sslot(42), litui32(64)
Where r6(5) has the return value written into it (and thus is a new SSA version of r6). We hold facts about a value (if it has a known type, if it has a known value, etc.) per SSA version. So the facts about r6(5) would be that it has a known type - the one that is asserted by the guard.
The invoke itself, however, might be optimized by performing inlining of the callee. In some cases, we might then know the type of result that the inlinee produces - either because there was a guard there, or because we can actually prove it! However, since the facts about r6(5) were those produced by the guard, there was no way to talk about what we know of r6(5) before the guard and after the guard.
More awkwardly, while in the early days of the specializer we only ever put guards immediately after the instructions that read values, more recent additions might insert them at a distance (for example, in speculative call optimizations and around spesh plugins). In this case, however, we could not safely set facts on the guarded register, because those might lead to wrong optimizations being performed prior to the guard.
Now a guard instruction looks like this:
sp_guard w(obj) r(obj) sslot uint32
Or, concretely:
invoke r6(5), r4(2)
sp_guard r6(6), r6(5), sslot(42), litui32(64)
That is to say, it introduces a new SSA version. This means that we get a way to talk about the value both before and after the guard instruction. Thus, if we perform an inlining and we know exactly what type it will return, then that type information will flow into the input of the guard instruction - in our example, r6(5). We can then notice that the property the guard wants to assert is already upheld, and replace the guard with a simple set (which may itself be eliminated by later optimizations).
This also solves the problem with guards inserted away from the original write of the value: we get a new SSA version beyond the guard point. This in turn leads to more opportunities to avoid repeated guards beyond that point.
It turns out that return value guards simply go away after this change. For example, in $a + $b, where $a and $b are Int, we will be able to inline the + operator, and we can statically see from its code that it will produce an Int. Thus, the guard on the return type in the caller of the operator can go away. This saves the instructions associated with the guard, and potentially allows further optimizations to take place, since we know we'll never deoptimize at that point.
MoarVM does lots of speculative optimization. This enables us to optimize in cases where we can't prove a property of the program, but statistics tell us that it mostly behaves in a certain way. We make this safe by adding guards, and falling back to the general version of the code in cases where they fail.
However, guards have a cost. By changing our representation of them, so that we model the data coming into the guard and after the guard as two different SSA versions, we are able to eliminate many guard instructions. This not only reduces duplicate guards, but also allows for elimination of guards when the broader view afforded by inlining lets us prove properties that we weren't previously able to.
In fact, upcoming work on escape analysis and scalar replacement will allow us to start seeing into currently opaque structures like Scalar containers. When we are able to do that, we'll be able to prove further program properties, leading to the elimination of yet more guards. Thus, this work is not only immediately useful, but will also help us prove more program properties in the future.