Faster dispatches with MoarVM specializer plugins
One of the goals for the current round
of my Perl Foundation Performance and Reliability grant is to speed up private
method calls in roles and Scalar
container assignments. What I didn't expect
at the time I wrote the grant application is that these two would lead to a
single new mechanism in MoarVM to make them possible.
The Scalar
container assignment improvements are still to come; currently I
have a plan and hope to make good progress on it next week. I do, however, have
a range of dispatch-related performance improvements to show, including the
private method case.
Background
MoarVM runs programs faster by analyzing how they run and producing specialized versions of parts of the program based on that information. It takes note of which code is run often (frequently called methods and hot loops), which types a block of code is called with, what types are returned from calls, what code a closure points to, and more. Note that it observes the runtime behavior, and so is not dependent on whether the program has type annotations or not.
Calls are one of the most important things that the optimizer considers, be they
method calls, subroutine calls or invoking a received closure. Method calls are
especially interesting, because with a call like $obj.meth($arg)
, the method
to be called depends on the exact type of $obj
. Often, we end up producing a
version of the code that is specialized for a particular type of $obj
. We can
therefore resolve the method once in this specialization, saving the method
lookup overhead.
But there's more. Once we know exactly what method we'll be calling, and if the method is fairly small, we can inline it into the caller, this eliminating the call overhead too. Further, since we are inlining a specialized version of the code and have already proved that we meet the conditions for using that specialization, we can eliminate type checks on parameters. Inlining is even more powerful than that: it opens the door to a wider range of analyses that would not be possible without it, which lead to futher program optimizations.
The problem
We can do this kind of optimization with method calls because MoarVM understands about method calls. It knows that if it is holding the type of the invocant constant, then the result of the dispatch can also be held constant.
Unfortunately, there's more than one case of method calling in Perl 6. While
the majority of calls take the familiar $obj.foo
from, we also have:
- Private method calls, of the form
$obj!foo
or$obj!TrustsMe::foo
- Qualified method calls, of the form
$obj.Some::Role.meth()
(it may be a base class too, though this form is at its most useful in doing conflict resolution for roles) - Duck method calls, of the form
$obj.?foo
, where if the method doesn't exist then we just accept that and move along
In the first case, if the class is in a class, then we can resolve it at
compilation time, since private methods aren't virtual. Such calls are thus
pretty fast. But what if the private method call was in a role? Well, then it
was far slower. It took a method call on the meta-object, which then did a
hash lookup to find the method, followed by invoking that method. This work
was done by a call to a dispatch:<!>
utility method. It was the same case
story for qualified calls and duck calls.
So, let's extend MoarVM to understand these kinds of calls?
So if normal method calls are faster because MoarVM understands them, surely we can do better by teaching it to understand these other forms of calling too? Perhaps we could add some new ops to the VM to represent these kinds of calls?
Maybe, but all of them come with their own rules. And those rules are already
implemented in the metamodel. We make normal method calls fast by precomputing
a method cache, which is just a hash table, and have the specializer do its
lookups in that. While such an approach might work for private methods, it
gets decidedly trickier in the other two cases. Plus those precomputed hashes
take up a lot of space. There are hundreds of exception types in CORE.setting
and every one of them as a precomputed hash table of all of its methods, with
those from base classes denormalized in to it, with all the methods that are
inherited from Mu
, Any
, and Exception
. We do lazily deserialize these,
which helps, but it's still fairly costly. Introducing more such things, when
I already want rid of that one, didn't feel like a good direction.
No, let's make MoarVM teachable
Earlier in the post, I wrote this:
It [the optimizer] knows that if it is holding the type of the invocant [of a method call] constant, then the result of the dispatch can also be held constant.
And this is the key. The important thing isn't that the specializer knows the precise semantics of method dispatch. The important thing is that it knows the relationship between the arguments to a dispatch (e.g. the type that we're calling the method on) and the result of the dispatch.
This, along with considering the challenges of optimizing Scalar
assignments,
led me to the idea of introducing a mechanism in MoarVM where we can tell it
about these relationships. This enables the specializer to insert guards as
needed and then simply use the calculated result of the dispatch.
Specializer plugins
The new mechanism is known as "spesh plugins", and I merged it into MoarVM's master branch today. It works a few steps. The first is that one registers a spesh plugin. Here's the one for helping optimize private method calls:
nqp::speshreg('perl6', 'privmeth', -> $obj, str $name {
nqp::speshguardtype($obj, $obj.WHAT);
$obj.HOW.find_private_method($obj, $name)
});
The registration provides the langauge the plugin is for, the name of the plugin,
and a callback. The callback takes an object and a method name. The second line
is the key here. It indicates that the result that will be returned from this
plugin will be valid provided the type of $obj
precisely matches (that is, with
no regard to subtyping relationships) the type of the $obj
we are currently
considering. Therefore, it establishes a relationship between the invocant type
and the private method call result.
Then, we just need to compile a private method call like:
self!foo($bar, $baz)
Into:
nqp::speshresolve('privmeth', self, 'foo')(self, $bar, $baz)
Taking care to only evaluate self
once (obviously not a problem for self
,
but in general it can be any expression, and may have side-effects).
And that's in. So what happens at runtime?
When the interpreter encounters this call for the first time, it calls the plugin. It then stores the result along the the conditions. On later calls made in the interpreter, it uses this mapping table to fairly quickly map the invocant type into the appropriate result. It's a little cache. (Aside: this is a little more involved because we want lookups without locking, but also need to cope with multiple threads creating resolution races. Thanks to a generalized free-at-safepoint mechanism in MoarVM, this isn't so hard.)
So that's nice, and on its own would already probably be an improvement over what it replaced. But we haven't even got to the exciting part yet! Each time we use this mapping, it records which mapping was used for the benefit of the optimizer. This information is stored in such a way that the specializer can work out which mappings are used with a particular set of parameter types to the method. So, in:
role R {
method foo() {
self!bar
}
}
class C1 does R {
method !bar() { 1 }
}
class C2 does R {
method !bar() { 2 }
}
The method foo
might be invoked with invocants of type C1
and C2
. Thus
the mapping table for the call self!bar
will have two entries. We may (if the
code is hot) produce two specializations of method foo
, and if we do, then we
will also be able to see that there is only ever one target of the private
method call. Thus, we can inline !bar
into foo
.
Results
Writing a module PM.pm6
that contains:
role R {
method m() { self!p }
method !p() { 42 }
}
class C does R {
}
for ^10_000_000 {
C.m;
}
And then running it with perl6 -I. -e 'use PM6'
used to run in 5.5s on my
development machine at the office. That's only 1.8 million iterations of the
loop per second, which may sound like quite a lot, but it means each is eating
a whopping 1,650 CPU cycles on a 3GHz CPU.
With the new spesh plugin mechanism, it runs in 0.83s, over 6.5x faster. It's over 12 million iterations of the loop per second, or around 250 CPU cycles per iteration. That's still a good bit higher than would be good, but it's a heck of a lot better.
Note that due to the way roles are handled in non-precompiled code, the use of the spesh plugin will not happen at present in a role in a script, thus why in this case I put the code into a module. This restriction can be lifted later.
But wait, there's more
I also wrote a spesh plugin for qualified dispatches, like $obj.Foo::meth()
.
This one guards on two of its inputs, and has an error case to handle. Notice
how we can avoid replicating this logic inside of MoarVM itself and just write
it in NQP code.
nqp::speshreg('perl6', 'qualmeth', -> $obj, str $name, $type {
nqp::speshguardtype($obj, $obj.WHAT);
if nqp::istype($obj, $type) {
# Resolve to the correct qualified method.
nqp::speshguardtype($type, $type.WHAT);
$obj.HOW.find_method_qualified($obj, $type, $name)
}
else {
# We'll throw an exception; return a thunk that will delegate to the
# slow path implementation to do the throwing.
-> $inv, *@pos, *%named {
$inv.'dispatch:<::>'($name, $type, |@pos, |%named)
}
}
});
This gave an even more dramatic speedup. The program:
role R1 {
method m() { 1 }
}
role R2 {
method m() { 2 }
}
class C does R1 does R2 {
method m() {
self.R2::m()
}
}
for ^10_000_000 {
C.m
}
Used to take 13.3s. With the spesh plugin in effect, it now takes 1.07s, a factor of more than 12x improvement.
And even a little more...
I also wondered if I could get $obj.?foo
duck dispatches to do better. The
answer turend out to be yes. First of all, here's the plugin:
sub discard-and-nil(*@pos, *%named) { Nil }
nqp::speshreg('perl6', 'maybemeth', -> $obj, str $name {
nqp::speshguardtype($obj, $obj.WHAT);
my $meth := $obj.HOW.find_method($obj, $name);
nqp::isconcrete($meth)
?? $meth
!! &discard-and-nil
});
There's a couple of cases I decided to measure here. The first is the one where
we wrote code with a .?
call in to handle the general case (for example, in a
module), but then the program using the module always (or > 99% of the time) gives
an object where we can call the method.
class C {
}
class D {
method m() { 42 }
}
for ^10_000_000 {
(rand > 0.999 ?? C !! D).?m()
}
The rand
call, compare, and conditional are all costs in this code besides
the thing I wanted to measure, so it's not such a direct measurement of the real
speedup of .?
. Stil, this program went from taking 10.9s before to 4.29s with
the spesh plugin in place - an improvement of 2.5x. It achieves this by doing a
speculative inline of the method m
anyway, and then using deoptimization to
fall back to the interpreter to handle the 0.1% of cases where we get C and not
D. (It then, at the end of the loop body, falls back into the hot code again.)
But had this come at the cost of making really polymorphic cases slower? Here's another benchmark:
class C {
}
class D {
method m() { 42 }
}
for ^10_000_000 {
(rand > 0.5 ?? C !! D).?m()
}
This one goes from 7.60s to 4.92s, a 1.5x speedup. Spesh can't just punt this to doing a deopt for the uncommon case, because there is no uncommon case. Still, the guard table scan comes out ahead.
(By the way, I think a lot of the slowness in this code - though I didn't think
of it when I wrote the benchmark - is that rand
returns a Num
, but 0.5
and 0.999
are Rat
s, so it's doing a type coercion before comparing.)
And what next?
Next I'll be taking on Scalar
containers and assignment, seeing what I can do
with spesh plugins there, and hoping my ideas lead to as positive results as has
been seen here.
Also, this isn't the final word on the various benchmarks in this post either. I know full well that the current spesh plugin implementation is inserting some redundant guards, and a bit of effort on that front will get us another win.