Finally, after considering how data is gathered and how optimizations are planned, it's time to look at how MoarVM's dynamic optimizer, "spesh", transforms code into something that should run faster.
The name "spesh" is short for "specializer", and this nicely captures the
overall approach taken. The input bytecode the optimizer considers is full of
genericity, and an interpreter of that bytecode needs to handle all of the
cases. For example, a method call depends on the type of the invocant, and
in the general case that may vary from call to call. Worse, while most of the
time the method resolution might just involve looking in a cache, in other cases
we might have to call into find_method
to do a more dynamic dispatch. That
in turn means method lookup is potentially an invocation point, which makes it
harder to reason about the state of things after the method lookup.
Of course, in the common case, it's a boring method lookup that will be found in the cache, and in many cases there's only one invocant type that shows up at a given callsite at runtime. Therefore, we can specialize the code for this case. This means we can cache the result of the method lookup, and since that is now a constant, we may be able to optimize the call itself using that knowledge.
This specialization of method resolution is just one of many kinds of specialization that we can perform. I'll go through a bunch more of them in this post - but before that, let's take a look at how the code to optimize is represented, so that these optimizations can be performed conveniently.
The input to the optimization process is bytecode, which is just a bunch of bytes with instruction codes, register indexes, and so forth. It goes without saying that doing optimizations by manipulating that directly would be pretty horrible. So, the first thing we do is make a pass through the bytecode and turn every instruction in the bytecode into an instruction node: a C structure that points to information about the instruction, to the previous and next instructions, and contains information about the operands.
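To make that concrete, here is a minimal sketch in Python of the idea (MoarVM's real implementation is C structures; the names `InsNode`, `link`, and `remove` are illustrative, not MoarVM's actual API): each instruction becomes a node in a doubly linked list, so deleting or inserting an instruction is a pointer update rather than a byte-level splice of the bytecode stream.

```python
# Hypothetical sketch of an instruction-node list (not MoarVM's real structures).
class InsNode:
    def __init__(self, opcode, operands):
        self.opcode = opcode        # name of the instruction
        self.operands = operands    # decoded operand information
        self.prev = None
        self.next = None

def link(instructions):
    """Turn a flat list of (opcode, operands) pairs into a linked list."""
    nodes = [InsNode(op, args) for op, args in instructions]
    for a, b in zip(nodes, nodes[1:]):
        a.next, b.prev = b, a
    return nodes[0] if nodes else None

def remove(node):
    """Delete an instruction: just re-point the neighbours."""
    if node.prev:
        node.prev.next = node.next
    if node.next:
        node.next.prev = node.prev
```

Removing the middle instruction of a three-instruction list, for example, leaves the first and last linked directly together.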
This is an improvement from the point of view of being able to transform the
code, but it doesn't help the optimizer to "understand" the code. The step
that transforms the bytecode instructions into instruction nodes also performs
a second task: it marks the start and end of each basic block. A basic block is
a sequence of instructions that will run one after the other, without any flow
control. Actually, this definition is something we've gradually tightened up
over time, and was particularly refined during my work on spesh over the summer.
By now, anything that might invoke terminates a basic block. This means that things like `findmethod` (which may sometimes invoke `find_method` on the meta-object) and `decont` (which normally takes a value out of a `Scalar` container, but may have to invoke a `FETCH` routine on a `Proxy`) terminate basic blocks too. In short, we end up with a lot of basic blocks, which makes life a bit harder for those of us who need to read the spesh debug output, but experience tells that the more honest we are about just how many things might cause an invocation, the more useful it is in terms of reasoning about the program and optimizing it.
The basic blocks themselves are only so useful. The next step is to arrange them into a Control Flow Graph (CFG). Each basic block is represented by a C structure, and has a list of its successors and predecessors. The successors of a basic block are those basic blocks that we might end up in after this one. So, if it's a condition that may be taken or not, then there will be two successors. If it's a jump table, there may be a lot more than two.
Once we have a control flow graph, it's easy to pick out basic blocks that are not reachable from the root of the graph. These could never be reached in any way, and are therefore dead code.
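A minimal sketch of that dead-block detection, in Python (the representation here - block ids and a successor map - is illustrative, not MoarVM's actual data structures): walk the graph from the entry block, and anything never visited is dead.

```python
# Sketch: find basic blocks unreachable from the CFG root.
def reachable(entry, successors):
    """successors maps block id -> list of successor block ids."""
    seen, todo = set(), [entry]
    while todo:
        b = todo.pop()
        if b not in seen:
            seen.add(b)
            todo.extend(successors.get(b, []))
    return seen

def dead_blocks(blocks, entry, successors):
    """Blocks that no path from the entry can ever reach: dead code."""
    return set(blocks) - reachable(entry, successors)
```

For a diamond-shaped graph with one orphaned block, only the orphan comes back as dead.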
One thing that we need to take special care of is exception handlers. Since they work using dynamic scope, they may be reached only through a callee, but we are only assembling the CFG for a single routine or block. We need to make sure they are considered reachable, so they aren't deleted as dead code.
One safe and conservative way to do this is to add a dummy entry node and link
them all from it. This is what spesh did for all exception handlers, until this
summer. Now, it only does this for catch handlers. Control exception handlers
are instead made reachable by linking them into the graph as a successor to all
of the basic blocks that are in the region covered by the handler. Since these
are used for `redo`, `next`, and `last` handling in loops, it means we get much more precise flow control information for code with loops in. It also means that control exception handlers can now be eliminated as dead code too, if all the code they cover goes away. Making this work out meant doing some work to remove assumptions in various areas of spesh that exception handlers were never eliminated as dead code.
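The scheme for control exception handlers can be sketched like this (again illustrative Python, not MoarVM's implementation): the handler block is made a successor of every block in the region it covers, so ordinary reachability analysis keeps it alive exactly as long as some covered code is alive.

```python
# Sketch: link a control exception handler as a successor of each covered block.
def link_control_handler(successors, covered_blocks, handler_block):
    for b in covered_blocks:
        successors.setdefault(b, []).append(handler_block)

def is_reachable(entry, successors, target):
    """Plain CFG reachability from the entry block."""
    seen, todo = set(), [entry]
    while todo:
        b = todo.pop()
        if b in seen:
            continue
        seen.add(b)
        todo.extend(successors.get(b, []))
    return target in seen
```

If the covered region itself is unreachable, the handler stays unreachable too, and dead code elimination can remove it; no dummy entry node is needed.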
The control flow graph is a step forward, but it still falls short of letting
us do the kind of analysis we'd really like. We'd also like to be able to
propagate type information through the graph. So, if we know a register holds
an `Int`, and then it is assigned into a second register, we'd like to mark that register as also containing an `Int`. The challenge here is that register allocation during code generation wants to minimize the number of registers that are needed - a good thing for memory use - and so a given register will have many different - and unrelated - usages over time.
To resolve this, the instructions are transformed into Static Single
Assignment form. Each time a register is assigned to in the bytecode (this is
where the "static" comes from), it is given a new version number. Thus, the
first assignment to `r2` would be named `r2(1)`, the second `r2(2)`, and so forth. This helps enormously, because now information can be stored about these renamed variables, and it's easy to access the correct information upon each usage. Register re-use is something we no longer have to worry about in this model.
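For straight-line code, the renaming is simple enough to sketch in a few lines of Python (hypothetical representation: instructions as tuples of opcode, destination register, and source registers):

```python
# Sketch of SSA renaming for straight-line code: every write to a register
# gets a fresh version number; every read refers to the latest version.
def to_ssa(instructions):
    """instructions: list of (op, dest_reg, src_regs).
    Returns the list with registers renamed to (name, version) pairs."""
    version = {}
    out = []
    for op, dest, srcs in instructions:
        renamed_srcs = [(s, version.get(s, 0)) for s in srcs]
        version[dest] = version.get(dest, 0) + 1
        out.append((op, (dest, version[dest]), renamed_srcs))
    return out
```

A write to `r2` followed by a read-and-write of `r2` produces `r2(1)` and `r2(2)`, with the read correctly referring to version 1.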
So far so straightforward, but of course there's a catch: what happens when we have conditionals and loops?
```
if_i r1(1) goto LABEL1
set r2(1), r3(1)
goto LABEL2
LABEL1:
set r2(2), r4(1)
LABEL2:
# Which version of r2 reaches here???
```
This problem is resolved by the use of phi nodes, which "merge" the various versions at "join points":
```
if_i r1(1) goto LABEL1
set r2(1), r3(1)
goto LABEL2
LABEL1:
set r2(2), r4(1)
LABEL2:
PHI r2(3), r2(1), r2(2)
```
The placement of these phi nodes is an interesting problem, and that's where dominance comes in. Basic block B1 dominates basic block B2 if every possible path to B2 in the flow control graph must pass through B1. There's also a notion of immediate dominance, which you can think of as the "closest" dominator of a node. The immediate dominators form a tree, not a graph, and this tree turns out to be a pretty good order to explore the basic blocks in during analysis and optimization. Last but not least, there's also the concept of dominance frontiers - that is, where a node's dominance ends - and this in turn tells us where the phi nodes should be placed.
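The classic iterative formulation of dominance can be sketched as follows (this is the textbook set-based data-flow version, shown for illustration; I'm not claiming it is literally the variant MoarVM implements): a block's dominator set is itself plus the intersection of its predecessors' dominator sets, iterated to a fixed point.

```python
# Sketch: iterative dominator computation over a CFG given as a predecessor map.
def dominators(blocks, entry, predecessors):
    dom = {b: set(blocks) for b in blocks}   # start with "everything dominates"
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for b in blocks:
            if b == entry:
                continue
            preds = [dom[p] for p in predecessors.get(b, [])]
            new = (set.intersection(*preds) | {b}) if preds else {b}
            if new != dom[b]:
                dom[b] = new
                changed = True
    return dom
```

On the diamond CFG from the example above, the join block is dominated only by itself and the entry, which is exactly why it needs a phi node for `r2`.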
Thankfully, there's a good amount of literature on how to implement all of this efficiently and, somewhat interestingly, the dominance algorithm MoarVM picked is a great example of how a cubic algorithm with a really low constant factor can beat an `n*log(n)` algorithm on all relevant problem sizes (20,000 basic blocks or so - which is a lot).
Every SSA register has a set of facts associated with it. These include known
type, known value, known to be concrete, known to be a type object, and so
forth. These provide information that is used to know what optimizations we
can do. For example, knowing the type of something can allow an `istype` operation to be turned into a constant. This means that the facts of the result of that `istype` now include a known value, and that might mean a conditional can be eliminated entirely.
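A sketch of that folding, with facts modeled as a simple dict per SSA register (the real facts track far more, and a real `istype` must also consider subtyping, which this exact-match sketch ignores):

```python
# Sketch: fold an istype check into a constant when the facts decide it.
def try_fold_istype(facts, src_reg, checked_type):
    """Return 1/0 if the facts determine the istype result, else None."""
    known = facts.get(src_reg, {}).get("type")
    if known is None:
        return None          # no known type: leave the check in place
    return 1 if known == checked_type else 0
```

When the result is a constant, the fact "known value" can be set on the destination register, which may in turn let a conditional branch be removed.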
Facts are, in fact, something of a relative concept in spesh. In the previous parts of this series, I discussed how statistics about the types that show up in the program are collected and aggregated. Of course, these are just statistics, not proofs. So can these statistics be used to introduce facts? It turns out they can, provided we are willing to add a guard. A guard is an instruction that cheaply checks that reality really does match up with expectations. If it does not, then deoptimization is triggered, meaning that we fall back to the slow-path code that can handle all of the cases. Beyond the guard, the facts can be assumed to be true. Guards are another area that I simplified and made more efficient during my work in the summer.
Phi nodes and facts have an interesting relationship. At a phi node, we need to
merge the facts from all of the joined values. We can only merge those things that all of the joined values agree on. For example, if we have a phi with two inputs, and they both have a fact indicating a known type `Int`, then the result of the phi can have that fact set on it. By contrast, if one input thinks it's an `Int` and the other a `Rat`, then we don't know what it is.
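The merge is essentially an intersection, sketched here with facts as plain dicts (illustrative names, not spesh's actual representation):

```python
# Sketch: merge facts at a phi node - only facts every input agrees on survive.
def merge_phi_facts(input_facts):
    if not input_facts:
        return {}
    merged = dict(input_facts[0])
    for facts in input_facts[1:]:
        for key in list(merged):
            if facts.get(key) != merged[key]:
                del merged[key]       # inputs disagree: the fact is lost
    return merged
```

Two inputs that agree on a known type keep that fact; two that disagree keep nothing.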
During my work over the summer, I also introduced the notion of a "dead writer". This is a fact that is set when the writer of an SSA register is eliminated as dead code. This means that the merge can completely ignore what that input thinks, as it will never be run. A practical example where this shows up is with optional parameters:
```
sub process($data, :$normalize-first) {
    if $normalize-first {
        normalize($data);
    }
    do-stuff-with($data);
}
```
Specializations are produced for a particular callsite, meaning that we know if the `normalize-first` argument is passed or not. Inside of the generated code, there is a branch that sticks an `Any` into `$normalize-first` if no argument is received. Before my change, the phi node would not have any facts set on it, as the two sides disagreed. Now, it can see that one side of the branch is gone, and the merge can keep whichever facts survive. In the case that the named argument was not passed, the `if` can also be eliminated entirely, as a type object is falsey.
So, time to look at some of the specializations! The first transformations that are applied relate to the instructions that bind the incoming arguments into the parameter variables that are to receive them. Specializations are keyed on callsite, and a callsite object indicates the number of arguments, whether any of them are natively typed, and - for named arguments - what names the arguments have. The argument processing code can be specialized for the callsite, by:
- Throwing out the `checkarity` instruction, if we know that the number of arguments being passed is in range.
- Directly reading arguments from the args buffer when no boxing/unboxing is required, instead of calling an op that checks if a box/unbox is needed. Alternatively, if a box/unbox operation is needed (passing a native argument to an object parameter, or vice versa), then this can be inserted directly, and may be subject to further optimization and/or JIT compilation later.
- Directly reading named arguments by index into the args buffer, rather than having to do any kind of hash lookup or string comparisons (this makes named arguments as cheap to receive as positionals).
- Eliminating the "check all named arguments are used" instruction, if it is present and can be proven to be unrequired.
- Eliminating branches related to optional arguments.
- Generating code to build a hash directly from the args buffer contents for slurpy named arguments.
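The named-argument case is worth a tiny sketch (hypothetical Python, with the callsite's names as an ordered tuple; function names are illustrative): since the callsite fixes which names are passed and in what order, the index can be computed once at specialization time, leaving only an array read at run time.

```python
# Sketch: replace a named-argument hash lookup with a precomputed index.
def specialize_named_access(callsite_names, wanted_name):
    """At specialization time: find the argument's index, or None if absent."""
    return callsite_names.index(wanted_name) if wanted_name in callsite_names else None

def read_named(args_buffer, index):
    """At run time: a plain indexed read, no hashing or string comparison."""
    return args_buffer[index]
```

This is what makes named arguments as cheap to receive as positional ones in specialized code.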
Attribute access instructions pass the name of the attribute together with the type the attribute was declared in. This is used to resolve the location of the attribute in the object. There is a static optimization that adds a hint, which has to be verified to be in range at runtime and that does not apply when multiple inheritance is used.
When the spesh facts contain the type of the object that the attribute is being looked up in, then - after some sanity checks - this can be turned into something much cheaper: a read of the memory location at an offset from the pointer to the object. Both reads and writes of object attributes can receive this treatment.
Many values in Perl 6 live inside of a `Scalar` container. The `decont` instruction is used to take a value out of a container (which may be a `Scalar` or a `Proxy`). Often, `decont` is emitted when there actually is no container (because it's either impossible or too expensive to reason about that at compile time). When we have the facts to show it's not being used on a container type, the `decont` can be rewritten into a `set` instruction. And when the facts show that it's a `Scalar`, then it can be optimized into a cheap memory offset read.
When both the name of a method and the type it's being resolved on are known from the facts, then spesh looks into the method cache, which is a hash. If it finds the method, it turns the method lookup into a constant, setting the known value fact along the way. This, in the common case, saves a hash lookup per method call, which is already something, but knowing the exact method that will be run opens the door to some very powerful optimizations on the invocation of the resolved method.
Specialization of invocations - that is, calling of subs, methods, blocks, and so forth - is one of the most valuable things spesh does. There are quite a few combinations of things that can happen here, and my work over the summer made this area of spesh a good deal more powerful (and, as a result, took some time off various benchmarks).
The most important thing for spesh to know is what, exactly, is the target of
the invocation. In the best case, it will be a known value already. This is
the case with subs resolved from the setting or from `UNIT`, as well as in the case that a `findmethod` was optimized into a constant. These used to be the
only cases, until I also added recording statistics on the static frame that
was the target of an invocation. These are analyzed and, if pretty stable, then
a guard can be inserted. This greatly increases the range of invocations that
can be optimized. For example, a library that can be configured with a
callback may, if the program uses a consistent callback, and the callback is
small, have the callback's code inlined into the place that calls it in the
library code.
The next step involves seeing if the target of the invocation is a `multi` sub or method. If so, then the argument types are used to do a lookup into the multi dispatch cache. This means that spesh can potentially resolve a multi-dispatch once at optimization time, and just save the result.
Argument types are worth dwelling on a little longer. Up until the summer, the facts were consulted to see if the types were known. In the case there was no known type from the facts, then the invocation would not be optimized. Now, using the far richer set of statistics, when a fact is missing but the statistics suggest there is a very consistent argument type, then a guard can be inserted. This again increases the range of invocations that can be optimized.
So, by this point a multi dispatch may have been resolved, and we have something to invoke. The argument types are now used to see if there is a specialization of the target of the invocation for those arguments. Typically, if it's on a hot path, there will be. In that case, then the invocation can be rewritten to use an instruction that pre-selects the specialization. This saves a lot of time re-checking argument types, and is a powerful optimization that can reduce the cost of parameter type checks to zero.
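Conceptually, the pre-selection looks like this (a hedged sketch, with specializations keyed by a tuple of argument type names; the real keying is richer than this):

```python
# Sketch: pick a pre-existing specialization by the argument types at the call.
def pick_specialization(specializations, arg_types):
    """specializations: dict mapping a tuple of type names to compiled code.
    Returns the matching specialization, or None to fall back to slow-path
    selection at call time."""
    return specializations.get(tuple(arg_types))
```

When the lookup succeeds at optimization time, the call instruction can be rewritten to jump straight to that specialization, skipping the per-call argument type checks.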
Last, but very much not least, it may turn out that the target specialization being invoked is very small. In this case, a few more checks are performed to see if the call really does need its own invocation record (for example, it may do so if it has lexicals that will be closed over, or if it might be the root of a continuation, or it has an exit handler). Provided the conditions are met, the target may be inlined - that is, have its code copied - into the caller. This saves the cost of creating and tearing down an invocation record (also known as call frame), and also allows some further optimizations that consider the caller and callee together.
Prior to my work in the summer, it wasn't possible to inline anything that looked up lexical variables in an outer scope (that is to say, anything that was a closure). Now that restriction has been lifted, and closures can be inlined too.
Finally, it's worth noting that MoarVM also supports inlining specializations that themselves contain code that was inlined. This isn't too bad, except that in order to support deoptimization it has to also be able to do multiple levels of uninlining too. That was a slight headache to implement.
This was mentioned in passing earlier, but worth calling out again: when the value that is being evaluated in a conditional jump is known, then the branch can be eliminated. This in turn makes dead code of one side of the branch or the other. This had been in place for a while, but in my work during the summer I made it handle some further cases, and also made it more immediately remove the dead code path. That led to many SSA registers being marked with the dead writer fact, and thus led to more optimization possibilities in code that followed the branch.
Control exceptions are used for things like `next` and `last`. In the general case, their target may be some stack frames away, and so they are handled by the exception system. This walks down the stack frames to find a handler. In some cases, however - and perhaps thanks to an inlining - the handler and the throw may be in the same frame. In this case, the throw can be replaced with a simple - and far cheaper - `goto` instruction. We had this optimization for a while, but it was only during the summer that I made it work when the throwing part was in code that had been inlined, making it far more useful for Perl 6 code.
Here we've seen how spesh takes pieces of a program, which may be full of generalities, and specializes them for the actual situations that are encountered when the program is run. Since it operates at runtime, it is able to do this across compilation units, meaning that a module's code may be optimized in quite different ways according to the way a particular consumer uses it. We've also discussed the importance of guards and deoptimization in making good use of the gathered statistics.
You'll have noticed I mentioned many changes that took place "in the summer", and this work was done under a grant from The Perl Foundation. So, thanks go to them for that! Also, I'm still looking for sponsors.
In the next, and final, part of this series, we'll take a look at argument guards, which are used to efficiently pick which specialization to use when crossing from unspecialized to specialized code.