For a few years already, MoarVM "spesh" (short for specializer) has been helping Perl 6 code run faster. By observing which types show up at runtime, it can produce versions of the code specialized for those types, with much of the late binding stripped out. This includes stripping out type checks (in general, but especially in argument binding). It also performs a number of other optimizations on the code. On x64, this specialized bytecode can then be turned into machine code ("JITted").
This can make a big difference to program performance. However, nice as that is, it's only doing a fraction of what it could, and there are various ways it could do better at the things it already does. This document collects together the things we're keen to improve in an upcoming round of work.
Even light optimization followed by JIT compilation can give a decent speed improvement over interpreted, fully unoptimized code. Today, however, it's all or nothing: we either produce a type-specialized version (and we can only have a limited number of those), or we just interpret.
This means that code with a lot of type variance in the incoming arguments tends to end up being interpreted all the time, having exhausted the limited number of specializations we're willing to produce. The lack of matching specialized versions also means no inlining happens. This is very much the case for methods on types like Mu in Perl 6, such as ACCEPTS, defined, and sink. Since they are called on a huge range of types, we quickly run out of specialization space for these methods. This means many calls to them are forced to run in the interpreter. Worse, they can't be inlined, despite being otherwise incredibly attractive inlining targets.
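The exhaustion effect can be modelled with a toy Python sketch. This is not MoarVM's actual code; the cache structure and the cap of 4 are assumptions purely for illustration:

```python
# Toy model of a per-frame specialization cache with a fixed cap.
MAX_SPECIALIZATIONS = 4          # assumed cap, for illustration only

class HotFrame:
    def __init__(self, generic):
        self.generic = generic   # stands in for the interpreted path
        self.specialized = {}    # arg-type tuple -> specialized fast path

    def invoke(self, *args):
        key = tuple(type(a) for a in args)
        fast = self.specialized.get(key)
        if fast is not None:
            return fast(*args)   # specialized (and inline-candidate) path
        if len(self.specialized) < MAX_SPECIALIZATIONS:
            # Eagerly specialize for exactly the types just seen; here we
            # reuse the generic code as a stand-in for a specialized version.
            self.specialized[key] = self.generic
            return self.specialized[key](*args)
        # Cache full: a megamorphic method like Mu.defined lands here,
        # interpreted forever and never inlined.
        return self.generic(*args)

frame = HotFrame(lambda x: x is not None)    # stand-in for "defined"
for value in (1, "x", 1.5, [], (), {}):      # six distinct argument types
    frame.invoke(value)
print(len(frame.specialized))                # -> 4: two types missed out
```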
Multi-core hardware is common today. However, the current spesh and JIT do not take advantage of it, instead interrupting the running program to do their work. Were the work done on a separate thread, there would be no need to block execution of user code, allowing greater throughput. It would also remove the pauses introduced into user code while producing specializations. Further, when many start blocks are launched running the same code, we will often end up with many threads competing to do the specializations needed, which leads to throw-away work. A separate thread dedicated to the optimization work would eliminate this problem too.
Some opportunities are missed because the inlined code is not optimized using knowledge of the calling code that it was inlined into. For example, things guarded in the caller may be re-guarded in the inlinee. Once we get to the point of having a baseline optimized version, this will become even more important to solve. Consider inlining of Perl 6's Mu.defined. In many cases the caller will already have guards in place on the concreteness of the argument. Were this knowledge propagated into the inlinee, the test would turn into a constant. Were that result further propagated out of the inlinee, it could in turn lead to elimination of a conditional and potentially deletion of entire basic blocks. Those in turn could lead to PHI nodes dropping to a merge of one register, meaning precise information propagation, which would in turn allow further optimization downstream. But all of this is missed when we just treat inlined code as a blob of stuff we splice in.
This makes various things easier, but it also forces a pair of branches at either end of the inlinee. They're predictable, but it's a waste of branch prediction logic. We should see what happens if the inlined code is instead stitched in at the callsite. This would also reduce instruction count, maybe making deeper multi-level inlining fall within the inline size limit.
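The guard-propagation idea can be sketched as a toy pass over a made-up instruction list. The instruction names here are invented for illustration; real spesh operates on MoarVM bytecode and its own fact database:

```python
# Toy constant-folding pass: if the caller has already guarded that the
# argument is concrete, the inlined "defined" test folds to a constant
# and its guard can be dropped entirely.
def optimize_inlinee(ops, caller_facts):
    out = []
    for op in ops:
        if op == ("guard_concrete", "arg0") and "arg0_concrete" in caller_facts:
            continue                        # guard already proven: drop it
        if op == ("isconcrete", "arg0") and "arg0_concrete" in caller_facts:
            out.append(("const", True))     # test becomes a constant...
            continue
        out.append(op)
    # ...which would let a later pass delete the branch on that constant
    return out

inlinee = [("guard_concrete", "arg0"), ("isconcrete", "arg0"), ("branch", "r0")]
print(optimize_inlinee(inlinee, caller_facts={"arg0_concrete"}))
```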
Often, values will end up being boxed to cross a call boundary and then soon afterwards unboxed by some other operation. For example, consider:
$foo.substr(0, 5).uc
Here, substr produces a boxed Str to feed into uc, which will unbox it, use it, and then box the result back into a Str. That is both wasted work and a wasted memory allocation. Clearly, for maximum benefit, the box/unbox pair elimination analysis would have to be performed post-inline.
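A minimal sketch of such a peephole pass, over an invented mini-IR. A real pass would also need to prove the boxed value has no other uses before deleting the box:

```python
# Toy peephole pass: forward a raw value past a box that is immediately
# unboxed again. Instruction names are made up; real spesh works on
# MoarVM bytecode, and must check the box result is otherwise unused.
def eliminate_box_unbox(ops):
    out = []
    for op in ops:
        if (op[0] == "unbox_s" and out and out[-1][0] == "box_s"
                and out[-1][1] == op[2]):       # unbox of the value just boxed
            boxed_from = out.pop()[2]           # drop the box...
            out.append(("set", op[1], boxed_from))  # ...forward the raw value
        else:
            out.append(op)
    return out

ops = [("substr", "r1"), ("box_s", "r2", "r1"),  # substr result boxed to Str
       ("unbox_s", "r3", "r2"), ("uc", "r3")]    # uc unboxes it straight away
print(eliminate_box_unbox(ops))
```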
In the code:
my int $x = 42;
$something.foo($x);
We don't know up front whether foo's parameter will be declared is rw or not. It usually won't be, but since it may be, we can't just pass the 42; we must instead take a lexical reference and pass that. Those references are costly, and can quite easily swallow up the benefits of native types. Fixing this will require an inter-procedural analysis (or restricting the optimization to those cases where we can inline).
Currently, once some code gets hot enough, we immediately look into the args buffer, see the incoming types, and make a specialization for that. If we get a future call where there's no specialization for the incoming types, we then go and make the specialization for those types. Nowhere is there anything trying to get an overview of the situation to make more strategic choices.
Currently, the whole set of machinery behind storing into a Scalar container is opaque to spesh. It can't strip out any of the type checks there, even when it has all the information required to do so. The WHENCE check is also unoptimizable. This isn't just a spesh issue: it will need some further design work before spesh is in a place to do better with it. Also, Proxy could do with better optimization and, where applicable, inlining of its fetch/store functions.
We could do a better job of slurpy arguments, by spitting out code that, based on the callsite shape, will fill up an array or hash. This would eliminate some of the late-bound calls by making array/hash operations more open to optimization later on.
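One way to picture this is a toy binder generated once per callsite shape. This is a Python sketch under assumed names; real spesh would emit specialized bytecode rather than closures:

```python
# Toy sketch: instead of a late-bound call into a generic slurpy binder,
# generate per-callsite binding code shaped by the (fixed, known)
# callsite: three positional args become three straight-line appends.
def compile_slurpy_binding(callsite_arity):
    def bind(*args):
        slurpy = []
        for i in range(callsite_arity):   # arity is a constant, so this
            slurpy.append(args[i])        # loop is trivially unrollable
        return slurpy
    return bind

bind3 = compile_slurpy_binding(3)   # generated once per callsite shape
print(bind3(1, 2, 3))               # -> [1, 2, 3]
```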
Spesh has a bunch of guards that look into containers; from that point on, we then assert things about the container content. There are various bits of code that try to invalidate said guards if there's a chance the container content may change. However, this is fragile and unformalized, and there are surely issues with it. We need a better scheme: potentially eliminating the guards that look inside containers altogether, guarding what comes out of a decont instead, and then doing the analysis to prove that the container can never be assigned to between deconts (an analysis that is useful in other ways too).
Since we assume that whenever we pass a container off to a callee it may be written to, we cannot assume across calls that the content will not change, meaning more guards are needed. With smarter analysis we could avoid this.
When we have two loops in a given call frame, and the first one triggers OSR, then after the loop we are now in optimized code. However, there was never a chance to gather statistics about the second loop, so the code generated for the second loop will most likely be awful. Since OSR isn't ever triggered again because we're already in optimized code, that second loop will never be optimized.
For example, the iterator we're doing a pull-one on will often have been assigned outside of the loop. We currently don't have that information to hand at the point of OSR, so we can never inline that pull-one call, or the loop body.
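The blind spot can be modelled in a few lines of Python; the threshold and structure here are invented for illustration and not MoarVM's actual mechanism:

```python
# Toy model of the OSR blind spot: statistics are only gathered while
# interpreting, and OSR fires inside the first hot loop, so the second
# loop's profile stays empty and its optimized code is generated blind.
OSR_THRESHOLD = 100
stats = {"loop1": 0, "loop2": 0}
optimized = False

for _ in range(1000):                 # first loop triggers OSR partway in
    if not optimized:
        stats["loop1"] += 1
        if stats["loop1"] >= OSR_THRESHOLD:
            optimized = True          # enter optimized code via OSR

for _ in range(1000):                 # second loop: already optimized,
    if not optimized:                 # so no profiling ever happens here
        stats["loop2"] += 1

print(stats)                          # -> {'loop1': 100, 'loop2': 0}
```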
Since spesh is triggered on hot code, it's often not triggered at all in the spectest. This is going to become more of an issue when we move specialization work off to a background thread. We also have no testing of spesh to spot when we regress on an optimization. We should consider ways to do better in this regard.
Right now we can look at the before/after logs of the code. However, it would be nice to get more granular output on what spesh did and why it did it. Relatedly, it would be nice to track reasons for deoptimizations (like the expected type and the actual type). Both sets of information could usefully be annotated onto the code to give an optimization report.
sub foo () {$ = 42}; for ^2000000 { $ = foo }; say now - INIT now
gives the error:
Cannot assign to an immutable value
Oddities with unsigned types.
Unknown issue, needs investigation.
SEGV resulting from running optimized code.
For example, the assertion that a decoder really is a decoder, or an OS handle really is an OS handle. In some cases we can prove these once and re-use that proof over multiple calls.
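A toy sketch of proving a property once and reusing the proof on later calls. All names here are hypothetical; in MoarVM such a proof would live as a spesh-level fact rather than a flag on the object:

```python
# Toy sketch: hoist a repeated "really is an OS handle" check out of the
# per-call path by proving it once and caching the proof.
class Handle:
    def __init__(self, fd):
        self.fd = fd
        self._proven_handle = False     # proof cached after first check

def read_bytes(h, n):
    if not h._proven_handle:
        assert isinstance(h, Handle)    # expensive check, done only once
        h._proven_handle = True
    return min(n, 8)                    # stand-in for the actual read

h = Handle(3)
print(read_bytes(h, 16) + read_bytes(h, 16))   # second call skips the check
```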
Perl 6 does a huge number of allocations, especially of Scalar containers, many of which could be escape analyzed and then stack-allocated. Doing this well will depend on improving alias analysis, at the very least. Even if this won't happen in the near future, the spesh improvements should certainly keep it in mind.
At the moment, spesh has fairly little clue about closures. It would be very helpful if it were at least able to handle the case where the closure is taken and then invoked by the same block, since those cases could even be inlined, the scoping being obvious. This would let us inline simple for loop bodies, for example.
No box/unbox pair elimination
This looks like a common thing in escape analysis: set an attribute on an object and read it again. Here is a point-class example: http://wiki.luajit.org/Allocation-Sinking-Optimization#examples_point-class_point-class-in-c-and-java And this is the reason I would like to steal the allocation sinking optimization algorithm from LuaJIT.