For a few years already, MoarVM "spesh" (short for specializer) has been helping Perl 6 code run faster. By observing which types show up at runtime, it can produce versions of the code specialized for those types, with much of the late binding stripped out. This includes stripping out type checks (in general, but especially in argument binding). It also performs a number of other optimizations on the code. On x64, this specialized bytecode can then be turned into machine code ("JITted").
This can make a big difference to program performance. However, nice as that is, it's only doing a fraction of what it could, and there are various ways it could do better at the things it already does. This document collects together the things we're keen to improve in an upcoming round of work.
Even light optimization followed by JIT compilation can give a decent speed improvement over interpreted, fully unoptimized code. Today, however, it's all or nothing: we either produce a type-specialized version (and we can only have a limited number of those), or we just interpret.
This means that code with a lot of type variance in the incoming arguments tends to end up being interpreted all the time, having exhausted the limited number of specializations we're willing to produce. The lack of matching specialized versions also means no inlining happens. This is very much the case for methods on types like Mu in Perl 6, such as ACCEPTS, defined, and sink. Since they are called on a huge range of types, we quickly run out of specialization space for these methods. This means many calls to them are forced to run in the interpreter. Worse, they can't be inlined, despite being otherwise incredibly attractive inlining targets.
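The exhaustion effect can be modelled with a toy Python sketch. This is not MoarVM's actual code; the cache structure and the cap of 4 are assumptions purely for illustration:

```python
# Toy model of a per-frame specialization cache with a fixed cap.
MAX_SPECIALIZATIONS = 4          # assumed cap, for illustration only

class HotFrame:
    def __init__(self, generic):
        self.generic = generic   # stands in for the interpreted path
        self.specialized = {}    # arg-type tuple -> specialized fast path

    def invoke(self, *args):
        key = tuple(type(a) for a in args)
        fast = self.specialized.get(key)
        if fast is not None:
            return fast(*args)   # specialized (and inline-candidate) path
        if len(self.specialized) < MAX_SPECIALIZATIONS:
            # Eagerly specialize for exactly the types just seen; here we
            # reuse the generic code as a stand-in for a specialized version.
            self.specialized[key] = self.generic
            return self.specialized[key](*args)
        # Cache full: a megamorphic method like Mu.defined lands here,
        # interpreted forever and never inlined.
        return self.generic(*args)

frame = HotFrame(lambda x: x is not None)    # stand-in for "defined"
for value in (1, "x", 1.5, [], (), {}):      # six distinct argument types
    frame.invoke(value)
print(len(frame.specialized))                # -> 4: two types missed out
```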
Multi-core hardware is common today. However, the current spesh and JIT do not take advantage of it, instead interrupting the running program to do their work. Were the work done on a separate thread, there would be no need to block execution of user code, allowing greater throughput. It would also remove the pauses introduced into user code while producing specializations. Further, when many start blocks are launched running the same code, we will often end up with many threads competing to do the specializations needed, which leads to throw-away work. A separate thread dedicated to the optimization work would eliminate this problem too.
Some opportunities are missed because the inlined code is not optimized using knowledge of the calling code that it was inlined into. For example, things guarded in the caller may be re-guarded in the inlinee. Once we get to the point of having a baseline optimized version, this will become even more important to solve. Consider inlining of Perl 6's Mu.defined. In many cases the caller will already have guards in place on the concreteness of the argument. Were this knowledge propagated into the inlinee, the test would turn into a constant. Were that result further propagated out of the inlinee, it could in turn lead to elimination of a conditional and potentially deletion of entire basic blocks. Those in turn could lead to PHI nodes dropping to a merge of one register, meaning precise information propagation, which would in turn allow further optimization downstream. But all of this is missed when we just treat inlined code as a blob of stuff we splice in.
This makes various things easier, but it also forces a pair of branches at either end of the inlinee. They're predictable, but it's a waste of branch prediction logic. We should see what happens if the inlined code is instead stitched in at the callsite. This would also reduce instruction count, maybe making deeper multi-level inlining fall within the inline size limit.
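The guard-propagation idea can be sketched as a toy pass over a made-up instruction list. The instruction names here are invented for illustration; real spesh operates on MoarVM bytecode and its own fact database:

```python
# Toy constant-folding pass: if the caller has already guarded that the
# argument is concrete, the inlined "defined" test folds to a constant
# and its guard can be dropped entirely.
def optimize_inlinee(ops, caller_facts):
    out = []
    for op in ops:
        if op == ("guard_concrete", "arg0") and "arg0_concrete" in caller_facts:
            continue                        # guard already proven: drop it
        if op == ("isconcrete", "arg0") and "arg0_concrete" in caller_facts:
            out.append(("const", True))     # test becomes a constant...
            continue
        out.append(op)
    # ...which would let a later pass delete the branch on that constant
    return out

inlinee = [("guard_concrete", "arg0"), ("isconcrete", "arg0"), ("branch", "r0")]
print(optimize_inlinee(inlinee, caller_facts={"arg0_concrete"}))
```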
Often, values will end up being boxed to cross a call boundary and then soon afterwards unboxed by some other operation. For example, consider:
$foo.substr(0, 5).uc
Here, substr produces a boxed Str to feed into uc, which will unbox it, use it, and then box the result back into a Str. That is both wasted work and a wasted memory allocation. Clearly, for maximum benefit, the box/unbox pair elimination analysis would have to be performed post-inline.
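A minimal sketch of such a peephole pass, over an invented mini-IR. A real pass would also need to prove the boxed value has no other uses before deleting the box:

```python
# Toy peephole pass: forward a raw value past a box that is immediately
# unboxed again. Instruction names are made up; real spesh works on
# MoarVM bytecode, and must check the box result is otherwise unused.
def eliminate_box_unbox(ops):
    out = []
    for op in ops:
        if (op[0] == "unbox_s" and out and out[-1][0] == "box_s"
                and out[-1][1] == op[2]):       # unbox of the value just boxed
            boxed_from = out.pop()[2]           # drop the box...
            out.append(("set", op[1], boxed_from))  # ...forward the raw value
        else:
            out.append(op)
    return out

ops = [("substr", "r1"), ("box_s", "r2", "r1"),  # substr result boxed to Str
       ("unbox_s", "r3", "r2"), ("uc", "r3")]    # uc unboxes it straight away
print(eliminate_box_unbox(ops))
```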
In the code:
my int $x = 42;
$something.foo($x);
We don't know up front whether foo's parameter will be declared is rw or not. It usually won't be, but since it may be, we can't just pass the 42; we must instead take a lexical reference and pass that. Those references are costly, and can quite easily swallow up the benefits of native types. Fixing this will require an inter-procedural analysis (or restricting the optimization to those cases where we can inline).
Currently, once some code gets hot enough, we immediately look into the args buffer, see the incoming types, and make a specialization for that. If we get a future call where there's no specialization for the incoming types, we then go and make the specialization for those types. Nowhere is there anything trying to get an overview of the situation to make more strategic choices.
Currently, the whole set of machinery behind storing into a Scalar container is opaque to spesh. It can't strip out any of the type checks there, even when it has all the information required to do so. The WHENCE check is also unoptimizable. This isn't just a spesh issue: it will need some further design work before spesh is in a place to do better with it. Also, Proxy could do with better optimization and, where applicable, inlining of its fetch/store functions.
We could do a better job of slurpy arguments, by spitting out code that, based on the callsite shape, will fill up an array or hash. This would eliminate some of the late-bound calls by making array/hash operations more open to optimization later on.
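One way to picture this is a toy binder generated once per callsite shape. This is a Python sketch under assumed names; real spesh would emit specialized bytecode rather than closures:

```python
# Toy sketch: instead of a late-bound call into a generic slurpy binder,
# generate per-callsite binding code shaped by the (fixed, known)
# callsite: three positional args become three straight-line appends.
def compile_slurpy_binding(callsite_arity):
    def bind(*args):
        slurpy = []
        for i in range(callsite_arity):   # arity is a constant, so this
            slurpy.append(args[i])        # loop is trivially unrollable
        return slurpy
    return bind

bind3 = compile_slurpy_binding(3)   # generated once per callsite shape
print(bind3(1, 2, 3))               # -> [1, 2, 3]
```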
Spesh has a bunch of guards that look into containers; from that point on, we then assert things about the container content. There are various bits of code that try to invalidate said guards if there's a chance the container content may change. However, this is fragile and unformalized, and there are surely issues with it. We need a better scheme: potentially eliminating the guards that look inside containers altogether, guarding what comes out of a decont instead, and then doing the analysis to prove that the container can never be assigned to between deconts (an analysis that is useful in other ways too).
Since we assume that whenever we pass a container off to a callee it may be written to, we cannot assume across calls that the content will not change, meaning more guards are needed. With smarter analysis we could avoid this.
When we have two loops in a given call frame, and the first one triggers OSR, then after the loop we are now in optimized code. However, there was never a chance to gather statistics about the second loop, so the code generated for the second loop will most likely be awful. Since OSR isn't ever triggered again because we're already in optimized code, that second loop will never be optimized.
For example, the iterator we're doing a pull-one on will often have been assigned outside of the loop. We currently don't have that information to hand at the point of OSR, so we can never inline that pull-one call, or the loop body.
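The blind spot can be modelled in a few lines of Python; the threshold and structure here are invented for illustration and not MoarVM's actual mechanism:

```python
# Toy model of the OSR blind spot: statistics are only gathered while
# interpreting, and OSR fires inside the first hot loop, so the second
# loop's profile stays empty and its optimized code is generated blind.
OSR_THRESHOLD = 100
stats = {"loop1": 0, "loop2": 0}
optimized = False

for _ in range(1000):                 # first loop triggers OSR partway in
    if not optimized:
        stats["loop1"] += 1
        if stats["loop1"] >= OSR_THRESHOLD:
            optimized = True          # enter optimized code via OSR

for _ in range(1000):                 # second loop: already optimized,
    if not optimized:                 # so no profiling ever happens here
        stats["loop2"] += 1

print(stats)                          # -> {'loop1': 100, 'loop2': 0}
```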
Since spesh is triggered on hot code, it's often not triggered at all in the spectest. This is going to become more of an issue when we move specialization work off to a background thread. We also have no testing of spesh to spot when we regress on an optimization. We should consider ways to do better in this regard.
Right now we can look at the before/after logs of the code. However, it would be nice to get more granular output on what spesh did and why it did it. Relatedly, it would be nice to track reasons for deoptimizations (like the expected type and the actual type). Both sets of information could usefully be annotated onto the code to give an optimization report.
sub foo () {$ = 42}; for ^2000000 { $ = foo }; say now - INIT now
gives the error:
Cannot assign to an immutable value
Oddities with unsigned types.
Unknown issue, needs investigation.
SEGV resulting from running optimized code.
For example, the assertion that a decoder really is a decoder, or an OS handle really is an OS handle. In some cases we can prove these once and re-use that proof over multiple calls.
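A toy sketch of proving a property once and reusing the proof on later calls. All names here are hypothetical; in MoarVM such a proof would live as a spesh-level fact rather than a flag on the object:

```python
# Toy sketch: hoist a repeated "really is an OS handle" check out of the
# per-call path by proving it once and caching the proof.
class Handle:
    def __init__(self, fd):
        self.fd = fd
        self._proven_handle = False     # proof cached after first check

def read_bytes(h, n):
    if not h._proven_handle:
        assert isinstance(h, Handle)    # expensive check, done only once
        h._proven_handle = True
    return min(n, 8)                    # stand-in for the actual read

h = Handle(3)
print(read_bytes(h, 16) + read_bytes(h, 16))   # second call skips the check
```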
Perl 6 does a huge number of allocations, especially of Scalar containers, many of which could be escape analyzed and then stack-allocated. Doing this well will depend on improving alias analysis, at the very least. Even if this won't happen in the near future, the spesh improvements should certainly keep it in mind.
At the moment, spesh has fairly little clue about closures. It would be very helpful if it were at least able to handle the case where the closure is taken and then invoked by the same block, since those cases could even be inlined, the scoping being obvious. This would let us inline simple for loop bodies, for example.
No box/unbox pair elimination
This looks like a common thing in escape analysis: set an attribute on an object and read it again. Here is a point-class example: http://wiki.luajit.org/Allocation-Sinking-Optimization#examples_point-class_point-class-in-c-and-java And this is the reason I would like to steal the allocation sinking optimization algorithm from LuaJIT.