
Spesh shortcomings

For a few years already, MoarVM "spesh" (short for "specializer") has been helping Perl 6 code run faster. By observing which types show up at runtime, it produces versions of the code specialized for those types, with much of the late binding stripped out. This includes stripping out type checks (in general, but especially in argument binding). It also performs a number of other optimizations on the code. On x64, the specialized bytecode can then be turned into machine code ("JITted").
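
For example, consider this hedged sketch (add is a hypothetical routine):

sub add($a, $b) { $a + $b }
# After observing many calls like this, spesh can produce an (Int, Int)
# specialization of add, with the argument type checks stripped out and
# the late-bound infix:<+> dispatch resolved for the Int case, ready
# for the JIT on x64.
add(1, 2) for ^100_000;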

This can make a big difference to program performance. However, nice as that is, it's only doing a fraction of what it could, and there are various ways it could do the things it already does better. This document tries to collect together the things we're keen to improve in an upcoming round of work.

No baseline optimization; it's all or nothing

Even light optimization followed by JIT compilation can give a decent speed improvement over fully unoptimized, interpreted code. Today, however, it's all or nothing: we either produce a type-specialized version, of which we can only have a limited number, or we just interpret.

This means that code with a lot of type variance in the incoming arguments tends to end up being interpreted all the time, having exhausted the limited number of specializations we're willing to produce. The lack of a matching specialized version also means no inlining happens. This is very much the case for methods on types like Mu in Perl 6, such as ACCEPTS, defined, and sink. Since they are called on a huge range of types, we quickly run out of specialization space for these methods, which means many calls to them are forced to run in the interpreter. Worse, they can't be inlined, despite being otherwise incredibly attractive inlining targets.
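
A hedged sketch of the kind of megamorphic call site in question (with-default is a hypothetical routine):

sub with-default($value, $default) {
    $value.defined ?? $value !! $default   # Mu.defined, called on anything
}
with-default(42, 0);       # Int
with-default("x", "?");    # Str
with-default(Int, 0);      # a type object; .defined is False
with-default(3.14, 0.0);   # Rat, and so on for a huge range of types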

Optimization is done on threads running code

Multi-core hardware is common today. However, the current spesh and JIT do not take advantage of it, instead interrupting the running program to do their work. Were the optimization run on a separate thread, there would be no need to block execution of user code, allowing greater throughput, and the pauses introduced into user code while producing specializations would go away. Further, when many start blocks running the same code are launched, we often end up with many threads competing to produce the specializations needed, which leads to throw-away work. A separate thread dedicated to the optimization work would eliminate this problem too.

Inlined code is not optimized together with the calling code

Some opportunities are missed because the inlined code is not optimized using knowledge of the calling code that it was inlined into. For example, things guarded in the caller may be re-guarded in the inlinee. Once we get to the point of having a baseline optimized version, this will become even more important to solve. Consider inlining of Perl 6's Mu.defined. In many cases the caller will already have guards in place on the concreteness of the argument. Were this knowledge propagated into the inlinee, the test would turn into a constant. Were that result further propagated out of the inlinee, it could in turn lead to elimination of a conditional and potentially deletion of entire basic blocks. Those in turn could lead to PHI nodes dropping to a merge of one register, meaning more precise information propagation, which would in turn allow further optimization downstream. But all of this is missed when we just treat inlined code as a blob of stuff we splice in.
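
A hedged sketch of the Mu.defined case (greet is a hypothetical routine):

sub greet(Str:D $name) {
    # Spesh will have guarded on entry that $name is a concrete Str,
    # so an inlined $name.defined could fold to the constant True,
    # deleting the conditional, and possibly whole basic blocks.
    if $name.defined {
        say "Hello, $name!";
    }
}
greet("world");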

Inlined code is always placed at the end

This makes various things easier, but it also forces a pair of branches, one at each end of the inlinee. They're predictable, but it's a waste of branch predictor resources. We should see what happens if the inlined code is instead stitched in at the callsite. This would also reduce instruction count, maybe making deeper multi-level inlining fall within the inline size limit.

No box/unbox pair elimination

Often, values will end up being boxed to cross a call boundary and then soon afterwards unboxed by some other operation. For example, consider:

$foo.substr(0, 5).uc

Here, substr produces a boxed Str to feed into uc, which in turn will unbox it, use it, and then box its result back into a Str. That is both wasted work and a wasted memory allocation. Clearly, for maximum benefit, the box/unbox pair elimination analysis would have to be performed post-inline.

No native reference elimination

In the code:

my int $x = 42;
$something.foo($x);

We don't know up front whether foo's parameter will be declared is rw or not. It usually will not be, but since it may be, we can't just pass the 42; we must instead take a lexical reference and pass that. Those references are costly, and can quite easily swallow up the benefits of native types. Fixing this will require inter-procedural analysis (or restricting the optimization to those cases where we can inline).
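
To make the ambiguity concrete, a hedged sketch (the class and method are hypothetical):

class Something {
    # With a non-rw parameter, the native value itself would suffice:
    method foo(int $n) { say $n }
    # But had it been declared 'method foo(int $n is rw)', the callee
    # could assign to $n, so a reference to the caller's $x must be
    # passed just in case.
}
my int $x = 42;
Something.new.foo($x);   # prints 42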

Bad statistics

Currently, once some code gets hot enough, we immediately look into the args buffer, see the incoming types, and make a specialization for that. If we get a future call where there's no specialization for the incoming types, we then go and make the specialization for those types. Nowhere is there anything trying to get an overview of the situation to make more strategic choices.

No optimization of assignments

Currently, the whole set of machinery behind storing into a Scalar container is opaque to spesh. It can't strip out any type checks there, even in situations where it has all the information required to do so. The WHENCE check is also unoptimizable. This isn't just a spesh issue; it will need some further design work to put spesh in a place to do better with it. Proxy could also do with better optimization and, where applicable, inlining of its FETCH/STORE functions.
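
A minimal sketch of the machinery in question:

my Int $a;
# This assignment runs the Scalar container's store path: a type check
# of 42 against the Int constraint, plus a WHENCE check (used to lazily
# vivify containers such as array and hash elements). All of it is
# currently opaque to spesh, even though here it could know $a's
# container type and that 42 is an Int.
$a = 42;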

Poor slurpy args specialization

We could do a better job with slurpy arguments, by spitting out code that, based on the callsite shape, will fill up an array or hash. This would eliminate some of the late-bound calls and make the array/hash operations more open to optimization later on.
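
A hedged sketch of such a callsite:

sub log-all(*@messages) { .say for @messages }
# The callsite shape here is statically known: three positional
# arguments. Spesh could emit code that builds a three-element array
# for @messages directly, rather than going through the generic
# late-bound slurpy binding.
log-all("a", "b", "c");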

Playing fast and loose with aliasing

Spesh has a bunch of guards that look inside containers. We then assert things about the container content from that point on. There are various bits of code that try to invalidate said guards if there's a chance the container content may change. However, this is fragile and unformalized, and there are surely issues with it. We need a better scheme, potentially eliminating the guards that look inside containers altogether, instead guarding what comes out of a decont, and then doing the analysis to prove that the container can never be assigned to between deconts (which is useful in other ways too).

Not doing enough alias analysis

Since we assume that whenever we pass a container off to a callee it may be written to, we cannot assume across calls that the content will not change, meaning more guards are needed. With smarter analysis we could avoid this.
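
For example (the callee is hypothetical):

# Because $a is rw, the callee really can assign to the caller's
# container.
sub some-call($a is rw) { $a = 42 if rand > 0.5 }

my $x = "hello";
some-call($x);   # $x's Scalar container is passed to the callee
say $x.WHAT;     # so spesh cannot assume the content is still a Str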

OSR unhelpful when two hot loops come one after the other

When we have two loops in a given call frame, and the first one triggers OSR, then after that loop we are running optimized code. However, there was never a chance to gather statistics about the second loop, so the code generated for it will most likely be awful. And since OSR is never triggered again once we're already in optimized code, that second loop will never be optimized.
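
A minimal sketch of the shape of code that hits this:

sub two-loops() {
    my $a = 0;
    $a += $_ for ^1_000_000;   # gets hot, triggers OSR; the frame is
                               # now running optimized code
    my $b = 0;
    $b += $_ for ^1_000_000;   # this loop was never profiled, so its
                               # code is poor, and OSR won't fire again
    $a + $b
}
say two-loops();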

OSR doesn't sample values written outside of the loop

For example, often the iterator we're doing a pull-one on will have been assigned outside of the loop. We currently don't have that information to hand at the point of OSR, so we can never inline that pull-one call, or the loop body.
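
A hedged sketch of such a loop:

my @values = 1 .. 1000;
my $iter = @values.iterator;      # assigned outside the loop; at OSR
loop {                            # time we have no statistics on $iter
    my $v := $iter.pull-one;      # ...so this call can never be inlined
    last if $v =:= IterationEnd;
    say $v;
}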

Insufficient testing

Since spesh is triggered on hot code, it's often not triggered at all in the spectest. This is going to become more of an issue when we move specialization work off to a background thread. We also have no testing of spesh to spot when we regress on an optimization. We should consider ways to do better in this regard.

What did spesh do and why?

Right now we can look at the before/after logs of the code. However, it would be nice to get some more granular output on what spesh did and why it did it. Relatedly, it would be nice to track the reasons for deoptimizations (like the expected type versus the actual type). Both sets of information could usefully be provided as annotations on the code, so as to give an optimization report.

Bugs

RT #130855

sub foo () {$ = 42}; for ^2000000 { $ = foo }; say now - INIT now 

Gives

Cannot assign to an immutable value

RT #131306

Oddities with unsigned types.

RT #126364

Unknown issue, needs investigation.

MoarVM issue #552

SEGV resulting from running optimized code.

Cannot strip out type checks on various internal calls

For example, the assertion that a decoder really is a decoder, or an OS handle really is an OS handle. In some cases we can prove these once and re-use that proof over multiple calls.

No escape analysis yet

Perl 6 does a huge number of allocations, especially of Scalar containers, many of which could be escape analyzed and then stack allocated. Doing this well will depend on improving alias analysis at the very least. Even if it may not happen in the near future, work on improving spesh should certainly keep it in mind.
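
A hedged sketch of the kind of allocation meant here:

sub distance($a, $b) {
    my $d = $a - $b;   # the Scalar container for $d never escapes this
    $d.abs             # frame, so it could be stack allocated rather
}                      # than heap allocated
say distance(3, 7);    # prints 4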

Closures, even directly invoked, are opaque to spesh

At the moment, spesh has fairly little clue about closures. It would be very helpful if it were at least able to handle the case where the closure is taken and then invoked by the same block, since those cases could even be inlined, the scoping being obvious. This would let us inline simple for loop bodies, for example.
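
A hedged sketch of the directly-invoked-closure case:

# The block -> $i { ... } is a closure taken and then invoked by the
# very frame that created it (the for loop), so its outer scope is
# statically obvious and it could, in principle, be inlined.
for ^10 -> $i {
    say $i * $i;
}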

@zhuomingliang commented Jun 28, 2017

Regarding "No box/unbox pair elimination": this looks like a common thing in escape analysis: set an attribute on an object and read it again. Here is a point-class example: http://wiki.luajit.org/Allocation-Sinking-Optimization#examples_point-class_point-class-in-c-and-java
This is why I would like to steal the allocation sinking optimization algorithm of LuaJIT.
