Finally, after considering how data is gathered and how optimizations are planned, it's time to look at how MoarVM's dynamic optimizer, "spesh", transforms code into something that should run faster.
The name "spesh" is short for "specializer", and this nicely captures the
overall approach taken. The input bytecode the optimizer considers is full of
genericity, and an interpreter of that bytecode needs to handle all of the
cases. For example, a method call depends on the type of the invocant, and
in the general case that may vary from call to call. Worse, while most of the
time the method resolution might just involve looking in a cache, in other cases
we might have to call into find_method
to do a more dynamic dispatch. That
in turn means method lookup is potentially an invocation point, which makes it
harder to reason about the state of things after the method lookup.
Of course, in the common case, it's a boring method lookup that will be found in the cache, and in many cases there's only one invocant type that shows up at a given callsite at runtime. Therefore, we can specialize the code for this case. This means we can cache the result of the method lookup, and since that is now a constant, we may be able to optimize the call itself using that knowledge.
This specialization of method resolution is just one of many kinds of specialization that we can perform. I'll go through a bunch more of them in this post - but before that, let's take a look at how the code to optimize is represented, so that these optimizations can be performed conveniently.
The input to the optimization process is bytecode, which is just a bunch of bytes with instruction codes, register indexes, and so forth. It goes without saying that doing optimizations by manipulating that directly would be pretty horrible. So, the first thing we do is make a pass through the bytecode and turn every instruction in the bytecode into an instruction node: a C structure that points to information about the instruction, to the previous and next instructions, and contains information about the operands.
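To make that concrete, here is a minimal sketch in Python of the idea (MoarVM's real implementation is C structures; the names `InsNode`, `link`, and `remove` are illustrative, not MoarVM's actual API): each instruction becomes a node in a doubly linked list, so deleting or inserting an instruction is a pointer update rather than a byte-level splice of the bytecode stream.

```python
# Hypothetical sketch of an instruction-node list (not MoarVM's real structures).
class InsNode:
    def __init__(self, opcode, operands):
        self.opcode = opcode        # name of the instruction
        self.operands = operands    # decoded operand information
        self.prev = None
        self.next = None

def link(instructions):
    """Turn a flat list of (opcode, operands) pairs into a linked list."""
    nodes = [InsNode(op, args) for op, args in instructions]
    for a, b in zip(nodes, nodes[1:]):
        a.next, b.prev = b, a
    return nodes[0] if nodes else None

def remove(node):
    """Delete an instruction: just re-point the neighbours."""
    if node.prev:
        node.prev.next = node.next
    if node.next:
        node.next.prev = node.prev
```

Removing the middle instruction of a three-instruction list, for example, leaves the first and last linked directly together.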
This is an improvement from the point of view of being able to transform the
code, but it doesn't help the optimizer to "understand" the code. The step
that transforms the bytecode instructions into instruction nodes also performs
a second task: it marks the start and end of each basic block. A basic block is
a sequence of instructions that will run one after the other, without any flow
control. Actually, this definition is something we've gradually tightened up
over time, and was particularly refined during my work on spesh over the summer.
By now, anything that might invoke terminates a basic block. This means that things like `findmethod` (which may sometimes invoke `find_method` on the meta-object) and `decont` (which normally takes a value out of a `Scalar` container, but may have to invoke a `FETCH` routine on a `Proxy`) terminate basic blocks too. In short, we end up with a lot of basic blocks, which makes life a bit harder for those of us who need to read the spesh debug output, but experience tells that the more honest we are about just how many things might cause an invocation, the more useful it is in terms of reasoning about the program and optimizing it.
The basic blocks themselves are only so useful. The next step is to arrange them into a Control Flow Graph (CFG). Each basic block is represented by a C structure, and has a list of its successors and predecessors. The successors of a basic block are those basic blocks that we might end up in after this one. So, if it's a condition that may be taken or not, then there will be two successors. If it's a jump table, there may be a lot more than two.
Once we have a control flow graph, it's easy to pick out basic blocks that are not reachable from the root of the graph. These could never be reached in any way, and are therefore dead code.
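A minimal sketch of that dead-block detection, in Python (the representation here - block ids and a successor map - is illustrative, not MoarVM's actual data structures): walk the graph from the entry block, and anything never visited is dead.

```python
# Sketch: find basic blocks unreachable from the CFG root.
def reachable(entry, successors):
    """successors maps block id -> list of successor block ids."""
    seen, todo = set(), [entry]
    while todo:
        b = todo.pop()
        if b not in seen:
            seen.add(b)
            todo.extend(successors.get(b, []))
    return seen

def dead_blocks(blocks, entry, successors):
    """Blocks that no path from the entry can ever reach: dead code."""
    return set(blocks) - reachable(entry, successors)
```

For a diamond-shaped graph with one orphaned block, only the orphan comes back as dead.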
One thing that we need to take special care of is exception handlers. Since they work using dynamic scope, they may be reached only through a callee, but we are only assembling the CFG for a single routine or block. We need to make sure they are considered reachable, so they aren't deleted as dead code.
One safe and conservative way to do this is to add a dummy entry node and link
them all from it. This is what spesh did for all exception handlers, until this
summer. Now, it only does this for catch handlers. Control exception handlers
are instead made reachable by linking them into the graph as a successor to all
of the basic blocks that are in the region covered by the handler. Since these
are used for `redo`, `next`, and `last` handling in loops, it means we get much more precise flow control information for code with loops in. It also means that control exception handlers can now be eliminated as dead code too, if all the code they cover goes away. Making this work out meant doing some work to remove assumptions in various areas of spesh that exception handlers were never eliminated as dead code.
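The scheme for control exception handlers can be sketched like this (again illustrative Python, not MoarVM's implementation): the handler block is made a successor of every block in the region it covers, so ordinary reachability analysis keeps it alive exactly as long as some covered code is alive.

```python
# Sketch: link a control exception handler as a successor of each covered block.
def link_control_handler(successors, covered_blocks, handler_block):
    for b in covered_blocks:
        successors.setdefault(b, []).append(handler_block)

def is_reachable(entry, successors, target):
    """Plain CFG reachability from the entry block."""
    seen, todo = set(), [entry]
    while todo:
        b = todo.pop()
        if b in seen:
            continue
        seen.add(b)
        todo.extend(successors.get(b, []))
    return target in seen
```

If the covered region itself is unreachable, the handler stays unreachable too, and dead code elimination can remove it; no dummy entry node is needed.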
The control flow graph is a step forward, but it still falls short of letting
us do the kind of analysis we'd really like. We'd also like to be able to
propagate type information through the graph. So, if we know a register holds
an `Int`, and then it is assigned into a second register, we'd like to mark that register as also containing an `Int`. The challenge here is that register allocation during code generation wants to minimize the number of registers that are needed - a good thing for memory use - and so a given register will have many different - and unrelated - usages over time.
To resolve this, the instructions are transformed into Static Single
Assignment form. Each time a register is assigned to in the bytecode (this is
where the "static" comes from), it is given a new version number. Thus, the
first assignment to `r2` would be named `r2(1)`, the second `r2(2)`, and so forth. This helps enormously, because now information can be stored about these renamed variables, and it's easy to access the correct information upon each usage. Register re-use is something we no longer have to worry about in this model.
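For straight-line code, the renaming is simple enough to sketch in a few lines of Python (hypothetical representation: instructions as tuples of opcode, destination register, and source registers):

```python
# Sketch of SSA renaming for straight-line code: every write to a register
# gets a fresh version number; every read refers to the latest version.
def to_ssa(instructions):
    """instructions: list of (op, dest_reg, src_regs).
    Returns the list with registers renamed to (name, version) pairs."""
    version = {}
    out = []
    for op, dest, srcs in instructions:
        renamed_srcs = [(s, version.get(s, 0)) for s in srcs]
        version[dest] = version.get(dest, 0) + 1
        out.append((op, (dest, version[dest]), renamed_srcs))
    return out
```

A write to `r2` followed by a read-and-write of `r2` produces `r2(1)` and `r2(2)`, with the read correctly referring to version 1.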
So far so straightforward, but of course there's a catch: what happens when we have conditionals and loops?
```
if_i r1(1) goto LABEL1
set r2(1), r3(1)
goto LABEL2
LABEL1:
set r2(2), r4(1)
LABEL2:
# Which version of r2 reaches here???
```
This problem is resolved by the use of phi nodes, which "merge" the various versions at "join points":
```
if_i r1(1) goto LABEL1
set r2(1), r3(1)
goto LABEL2
LABEL1:
set r2(2), r4(1)
LABEL2:
PHI r2(3), r2(1), r2(2)
```
The placement of these phi nodes is an interesting problem, and that's where dominance comes in. Basic block B1 dominates basic block B2 if every possible path to B2 in the flow control graph must pass through B1. There's also a notion of immediate dominance, which you can think of as the "closest" dominator of a node. The immediate dominators form a tree, not a graph, and this tree turns out to be a pretty good order to explore the basic blocks in during analysis and optimization. Last but not least, there's also the concept of dominance frontiers - that is, where a node's dominance ends - and this in turn tells us where the phi nodes should be placed.
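The classic iterative formulation of dominance can be sketched as follows (this is the textbook set-based data-flow version, shown for illustration; I'm not claiming it is literally the variant MoarVM implements): a block's dominator set is itself plus the intersection of its predecessors' dominator sets, iterated to a fixed point.

```python
# Sketch: iterative dominator computation over a CFG given as a predecessor map.
def dominators(blocks, entry, predecessors):
    dom = {b: set(blocks) for b in blocks}   # start with "everything dominates"
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for b in blocks:
            if b == entry:
                continue
            preds = [dom[p] for p in predecessors.get(b, [])]
            new = (set.intersection(*preds) | {b}) if preds else {b}
            if new != dom[b]:
                dom[b] = new
                changed = True
    return dom
```

On the diamond CFG from the example above, the join block is dominated only by itself and the entry, which is exactly why it needs a phi node for `r2`.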
Thankfully, there's a good amount of literature on how to implement all of this efficiently and, somewhat interestingly, the dominance algorithm MoarVM picked is a great example of how a cubic algorithm with a really low constant factor can beat an `n*log(n)` algorithm on all relevant problem sizes (20,000 basic blocks or so - which is a lot).
Every SSA register has a set of facts associated with it. These include known
type, known value, known to be concrete, known to be a type object, and so
forth. These provide information that is used to know what optimizations we
can do. For example, knowing the type of something can allow an `istype` operation to be turned into a constant. This means that the facts of the result of that `istype` now include a known value, and that might mean a conditional can be eliminated entirely.
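A sketch of that folding, with facts modeled as a simple dict per SSA register (the real facts track far more, and a real `istype` must also consider subtyping, which this exact-match sketch ignores):

```python
# Sketch: fold an istype check into a constant when the facts decide it.
def try_fold_istype(facts, src_reg, checked_type):
    """Return 1/0 if the facts determine the istype result, else None."""
    known = facts.get(src_reg, {}).get("type")
    if known is None:
        return None          # no known type: leave the check in place
    return 1 if known == checked_type else 0
```

When the result is a constant, the fact "known value" can be set on the destination register, which may in turn let a conditional branch be removed.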
Facts are, in fact, something of a relative concept in spesh. In the previous parts of this series, I discussed how statistics about the types that show up in the program are collected and aggregated. Of course, these are just statistics, not proofs. So can these statistics be used to introduce facts? It turns out they can, provided we are willing to add a guard. A guard is an instruction that cheaply checks that reality really does match up with expectations. If it does not, then deoptimization is triggered, meaning that we fall back to the slow-path code that can handle all of the cases. Beyond the guard, the facts can be assumed to be true. Guards are another area that I simplified and made more efficient during my work in the summer.
Phi nodes and facts have an interesting relationship. At a phi node, we need to
merge the facts from all of the joined values. We can only merge those things that all of the joined values agree on. For example, if we have a phi with two inputs, and they both have a fact indicating a known type `Int`, then the result of the phi can have that fact set on it. By contrast, if one input thinks it's an `Int` and the other a `Rat`, then we don't know what it is.
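The merge is essentially an intersection, sketched here with facts as plain dicts (illustrative names, not spesh's actual representation):

```python
# Sketch: merge facts at a phi node - only facts every input agrees on survive.
def merge_phi_facts(input_facts):
    if not input_facts:
        return {}
    merged = dict(input_facts[0])
    for facts in input_facts[1:]:
        for key in list(merged):
            if facts.get(key) != merged[key]:
                del merged[key]       # inputs disagree: the fact is lost
    return merged
```

Two inputs that agree on a known type keep that fact; two that disagree keep nothing.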
During my work over the summer, I also introduced the notion of a "dead writer". This is a fact that is set when the writer of an SSA register is eliminated as dead code. This means that the merge can completely ignore what that input thinks, as it will never be run. A practical example where this shows up is with optional parameters:
```
sub process($data, :$normalize-first) {
    if $normalize-first {
        normalize($data);
    }
    do-stuff-with($data);
}
```
Specializations are produced for a particular callsite, meaning that we know if the `normalize-first` argument is passed or not. Inside of the generated code, there is a branch that sticks an `Any` into `$normalize-first` if no argument is received. Before my change, the phi node would not have any facts set on it, as the two sides disagreed. Now, it can see that one side of the branch is gone, and the merge can keep whichever facts survive. In the case that the named argument was not passed, the `if` can also be eliminated entirely, as a type object is falsey.
So, time to look at some of the specializations! The first transformations that are applied relate to the instructions that bind the incoming arguments into the parameter variables that are to receive them. Specializations are keyed on callsite, and a callsite object indicates the number of arguments, whether any of them are natively typed, and - for named arguments - what names the arguments have. The argument processing code can be specialized for the callsite, by:
- Throwing out the `checkarity` instruction, if we know that the number of arguments being passed is in range.
- Directly reading arguments from the args buffer when no boxing/unboxing is required, instead of calling an op that checks if a box/unbox is needed. Alternatively, if a box/unbox operation is needed (passing a native argument to an object parameter, or vice versa), then this can be inserted directly, and may be subject to further optimization and/or JIT compilation later.
- Directly reading named arguments by index into the args buffer, rather than having to do any kind of hash lookup or string comparisons (this makes named arguments as cheap to receive as positionals).
- Eliminating the "check all named arguments are used" instruction, if it is present and can be proven to be unrequired.
- Eliminating branches related to optional arguments.
- Generating code to build a hash directly from the args buffer contents for slurpy named arguments.
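The named-argument case is worth a tiny sketch (hypothetical Python, with the callsite's names as an ordered tuple; function names are illustrative): since the callsite fixes which names are passed and in what order, the index can be computed once at specialization time, leaving only an array read at run time.

```python
# Sketch: replace a named-argument hash lookup with a precomputed index.
def specialize_named_access(callsite_names, wanted_name):
    """At specialization time: find the argument's index, or None if absent."""
    return callsite_names.index(wanted_name) if wanted_name in callsite_names else None

def read_named(args_buffer, index):
    """At run time: a plain indexed read, no hashing or string comparison."""
    return args_buffer[index]
```

This is what makes named arguments as cheap to receive as positional ones in specialized code.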
Attribute access instructions pass the name of the attribute together with the type the attribute was declared in. This is used to resolve the location of the attribute in the object. There is a static optimization that adds a hint, which has to be verified to be in range at runtime and that does not apply when multiple inheritance is used.
When the spesh facts contain the type of the object that the attribute is being looked up in, then - after some sanity checks - this can be turned into something much cheaper: a read of the memory location at an offset from the pointer to the object. Both reads and writes of object attributes can receive this treatment.
Many values in Perl 6 live inside of a `Scalar` container. The `decont` instruction is used to take a value out of a container (which may be a `Scalar` or a `Proxy`). Often, `decont` is emitted when there actually is no container (because it's either impossible or too expensive to reason about that at compile time). When we have the facts to show it's not being used on a container type, the `decont` can be rewritten into a `set` instruction. And when the facts show that it's a `Scalar`, then it can be optimized into a cheap memory offset read.
When both the name of a method and the type it's being resolved on are known from the facts, then spesh looks into the method cache, which is a hash. If it finds the method, it turns the method lookup into a constant, setting the known value fact along the way. This, in the common case, saves a hash lookup per method call, which is already something, but knowing the exact method that will be run opens the door to some very powerful optimizations on the invocation of the resolved method.
Specialization of invocations - that is, calling of subs, methods, blocks, and so forth - is one of the most valuable things spesh does. There are quite a few combinations of things that can happen here, and my work over the summer made this area of spesh a good deal more powerful (and, as a result, took some time off various benchmarks).
The most important thing for spesh to know is what, exactly, is the target of
the invocation. In the best case, it will be a known value already. This is
the case with subs resolved from the setting or from `UNIT`, as well as in the case that a `findmethod` was optimized into a constant. These used to be the
only cases, until I also added recording statistics on the static frame that
was the target of an invocation. These are analyzed and, if pretty stable, then
a guard can be inserted. This greatly increases the range of invocations that
can be optimized. For example, a library that can be configured with a
callback may, if the program uses a consistent callback, and the callback is
small, have the callback's code inlined into the place that calls it in the
library code.
The next step involves seeing if the target of the invocation is a `multi` sub or method. If so, then the argument types are used to do a lookup into the multi dispatch cache. This means that spesh can potentially resolve a multi-dispatch once at optimization time, and just save the result.
Argument types are worth dwelling on a little longer. Up until the summer, the facts were consulted to see if the types were known. In the case there was no known type from the facts, then the invocation would not be optimized. Now, using the far richer set of statistics, when a fact is missing but the statistics suggest there is a very consistent argument type, then a guard can be inserted. This again increases the range of invocations that can be optimized.
So, by this point a multi dispatch may have been resolved, and we have something to invoke. The argument types are now used to see if there is a specialization of the target of the invocation for those arguments. Typically, if it's on a hot path, there will be. In that case, then the invocation can be rewritten to use an instruction that pre-selects the specialization. This saves a lot of time re-checking argument types, and is a powerful optimization that can reduce the cost of parameter type checks to zero.
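Conceptually, the pre-selection looks like this (a hedged sketch, with specializations keyed by a tuple of argument type names; the real keying is richer than this):

```python
# Sketch: pick a pre-existing specialization by the argument types at the call.
def pick_specialization(specializations, arg_types):
    """specializations: dict mapping a tuple of type names to compiled code.
    Returns the matching specialization, or None to fall back to slow-path
    selection at call time."""
    return specializations.get(tuple(arg_types))
```

When the lookup succeeds at optimization time, the call instruction can be rewritten to jump straight to that specialization, skipping the per-call argument type checks.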
Last, but very much not least, it may turn out that the target specialization being invoked is very small. In this case, a few more checks are performed to see if the call really does need its own invocation record (for example, it may do so if it has lexicals that will be closed over, or if it might be the root of a continuation, or it has an exit handler). Provided the conditions are met, the target may be inlined - that is, have its code copied - into the caller. This saves the cost of creating and tearing down an invocation record (also known as call frame), and also allows some further optimizations that consider the caller and callee together.
Prior to my work in the summer, it wasn't possible to inline anything that looked up lexical variables in an outer scope (that is to say, anything that was a closure). Now that restriction has been lifted, and closures can be inlined too.
Finally, it's worth noting that MoarVM also supports inlining specializations that themselves contain code that was inlined. This isn't too bad, except that in order to support deoptimization it has to also be able to do multiple levels of uninlining too. That was a slight headache to implement.
This was mentioned in passing earlier, but worth calling out again: when the value that is being evaluated in a conditional jump is known, then the branch can be eliminated. This in turn makes dead code of one side of the branch or the other. This had been in place for a while, but in my work during the summer I made it handle some further cases, and also made it more immediately remove the dead code path. That led to many SSA registers being marked with the dead writer fact, and thus led to more optimization possibilities in code that followed the branch.
Control exceptions are used for things like `next` and `last`. In the general case, their target may be some stack frames away, and so they are handled by the exception system. This walks down the stack frames to find a handler. In some cases, however - and perhaps thanks to an inlining - the handler and the throw may be in the same frame. In this case, the throw can be replaced with a simple - and far cheaper - `goto` instruction. We had this optimization for a while, but it was only during the summer that I made it work when the throwing part was in code that had been inlined, making it far more useful for Perl 6 code.
Here we've seen how spesh takes pieces of a program, which may be full of generalities, and specializes them for the actual situations that are encountered when the program is run. Since it operates at runtime, it is able to do this across compilation units, meaning that a module's code may be optimized in quite different ways according to the way a particular consumer uses it. We've also discussed the importance of guards and deoptimization in making good use of the gathered statistics.
You'll have noticed I mentioned many changes that took place "in the summer", and this work was done under a grant from The Perl Foundation. So, thanks go to them for that! Also, I'm still looking for sponsors.
In the next, and final, part of this series, we'll take a look at argument guards, which are used to efficiently pick which specialization to use when crossing from unspecialized to specialized code.