MoarVM's optimizer can perform speculative optimization. It does this by gathering statistics as the program is interpreted, and then analysing them to find out what types and callees typically show up at given points in the program. If it spots there is at least a 99% chance of a particular type showing up at a particular program point, then it will optimize the code ahead of that point as if that type would always show up.
Of course, statistics aren't proofs. What about the 1% case? To handle this, a guard instruction is inserted. This cheaply checks if the type is the expected one, and if it isn't, falls back to the interpreter. This process is known as deoptimization.
I just stated that a guard cheaply checks if the type is the expected one, but just how cheap is it really? There are both direct and indirect costs.
The direct cost is that of the check. Here's a (simplified) version of the JIT compiler code that produces the machine code for a type guard.
/* Load object that we should guard */
| mov TMP1, WORK[obj];
/* Get type table we expect and compare it with the object's one */
MVMint16 spesh_idx = guard->ins->operands[2].lit_i16;
| get_spesh_slot TMP2, spesh_idx;
| cmp TMP2, OBJECT:TMP1->st;
| jne >1;
/* We're good, no need to deopt */
| jmp >2;
|1:
/* Call deoptimization handler */
| mov ARG1, TC;
| mov ARG2, guard->deopt_offset;
| mov ARG3, guard->deopt_target;
| callp &MVM_spesh_deopt_one_direct;
/* Jump out to the interpreter */
| jmp ->exit;
|2:
Where get_spesh_slot is a macro like this:
|.macro get_spesh_slot, reg, idx;
| mov reg, TC->cur_frame;
| mov reg, FRAME:reg->effective_spesh_slots;
| mov reg, OBJECTPTR:reg[idx];
|.endmacro
So, in the case that the guard matches, it's 7 machine instructions (note: it's actually a couple more because of something I omitted for simplicity). Further, one is a conditional jump. We'd expect it to be taken only rarely, and so the CPU's branch predictor should get a good hit rate - but branch predictor usage isn't entirely free of charge either. All in all, it's not that bad, but it's nice to save the cost if we can.
The indirect costs are much harder to quantify. In order to deoptimize, we need to have enough state to recreate the world as the interpreter expects it to be. I wrote on this topic not so long ago, for those who want to dive into the detail, but the essence of the problem is that we may have to retain some instructions and/or forgo some optimizations so that we are able to successfully deoptimize if needed. Thus, the presence of a guard constrains what optimizations we can perform in the code around it.
A guard instruction in MoarVM originally looked like:
sp_guard r(obj) sslot uint32
Where r(obj) is an object register to read containing the object to guard, the sslot is a spesh slot (an entry in a per-block constant table) containing the type we expect to see, and the uint32 indicates the target address after we deoptimize. Guards would be inserted after instructions for which we had gathered statistics and determined there was a stable type. Things guarded include return values after a call, reads of object attributes, and reads of lexical variables.
This design carried us a long way; however, it has a major shortcoming. The program is represented in SSA form. Thus, an invoke followed by a guard might look something like:
invoke r6(5), r4(2)
sp_guard r6(5), sslot(42), litui32(64)
Where r6(5) has the return value written into it (and thus is a new SSA version of r6). We hold facts about a value (if it has a known type, if it has a known value, etc.) per SSA version. So the facts about r6(5) would be that it has a known type - the one that is asserted by the guard.
The invoke itself, however, might be optimized by performing inlining of the callee. In some cases, we might then know the type of result that the inlinee produces - either because there was a guard there, or because we can actually prove it! However, since the facts about r6(5) were those produced by the guard, there was no way to talk about what we know of r6(5) before the guard and after the guard.
More awkwardly, while in the early days of the specializer we only ever put guards immediately after the instructions that read values, more recent additions might insert them at a distance (for example, in speculative call optimizations and around spesh plugins). In this case, however, we could not safely set facts on the guarded register, because those might lead to wrong optimizations being performed prior to the guard.
Now a guard instruction looks like this:
sp_guard w(obj) r(obj) sslot uint32
Or, concretely:
invoke r6(5), r4(2)
sp_guard r6(6), r6(5), sslot(42), litui32(64)
That is to say, it introduces a new SSA version. This means that we get a way to talk about the value both before and after the guard instruction. Thus, if we perform an inlining and we know exactly what type it will return, then that type information will flow into the input of the guard instruction - in our example, r6(5). We can then notice that the property the guard wants to assert is already upheld, and replace the guard with a simple set (which may itself be eliminated by later optimizations).
This also solves the problem with guards inserted away from the original write of the value: we get a new SSA version beyond the guard point. This in turn leads to more opportunities to avoid repeated guards beyond that point.
It turns out that return value guards simply go away after this change. For example, in $a + $b, where $a and $b are Int, we will be able to inline the + operator, and we can statically see from its code that it will produce an Int. Thus, the guard on the return type in the caller of the operator can go away. This saves the instructions associated with the guard, and potentially allows further optimizations to take place, since we know we'll never deoptimize at that point.
MoarVM does lots of speculative optimization. This enables us to optimize in cases where we can't prove a property of the program, but statistics tell us that it mostly behaves in a certain way. We make this safe by adding guards, and falling back to the general version of the code in cases where they fail.
However, guards have a cost. By changing our representation of them, so that we model the data coming into the guard and after the guard as two different SSA versions, we are able to eliminate many guard instructions. This not only reduces duplicate guards, but also allows for elimination of guards when the broader view afforded by inlining lets us prove properties that we weren't previously able to.
In fact, upcoming work on escape analysis and scalar replacement will allow us to start seeing into currently opaque structures like Scalar containers. When we are able to do that, we'll be able to prove further program properties, leading to the elimination of yet more guards. Thus, this work is not only immediately useful, but will also help us prove more program properties in the future.