Around 18 months ago, I set about working on the largest set of architectural changes that the Raku runtime MoarVM has seen since its inception. The work was most directly triggered by the realization that we had no good way to fix a certain semantic bug in dispatch without either causing huge performance impacts across the board or increasing even further the complexity of optimizations that were already riding their luck. However, the need for something like this had been apparent for a while: a persistent struggle to optimize certain Raku language features, the pain of a bunch of performance mechanisms that were all solving the same kind of problem but each for a specific situation, and a sense that, with everything learned since I founded MoarVM, it was possible to do better.
The result is the development of a new generalized dispatch mechanism. An overview can be found in my Raku Conference talk about it (slides, video); in short, it gives us a far more uniform architecture for all kinds of dispatch, allowing us to deliver better performance on a range of language features that have thus far been glacial, as well as opening up opportunities for new optimizations.
Today, this work has been merged, along with the matching changes in NQP (the Raku subset we use for bootstrapping and to implement the compiler) and Rakudo (the full Raku compiler and standard library implementation). This means that it will ship in the October 2021 releases.
In this post, I'll give an overview of what you can expect to observe right away, and what you might expect in the future as we continue to build upon the possibilities that the new dispatch architecture has to offer.
The biggest improvements involve language features that we'd really not had the architecture to do better on before. They involved dispatch - that is, getting a call linked to a destination efficiently - but the runtime didn't provide us with a way to "explain" to it that it was looking at a dispatch, let alone with the information needed to have a shot at optimizing it.
The following graph captures a number of these cases, and shows the level of improvement, ranging from a factor of 3.3 to 13.3 times faster.
Let's take a quick look at each of these. The first, new-buf, asks how quickly we can allocate Bufs.
for ^10_000_000 {
Buf.new
}
Why is this a dispatch benchmark? Because Buf is not a class, but rather a role. When we try to make an instance of a role, it is "punned" into a class.
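Here's a small, purely illustrative check of that; it assumes nothing about the new dispatcher, and just shows that Buf is a role and that instances come from its automatically generated pun:
say Buf.HOW.^name;     # a parametric role meta-object, not Metamodel::ClassHOW
my $buf = Buf.new(1, 2, 3);
say $buf ~~ Buf;       # True - the instance is of the punned class, which still does Buf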
Up until now, instantiating a role worked as follows:
- We look up the new method
- The find_method method would, if needed, create a pun of the role and cache it
- It would return a forwarding closure that takes the arguments and gives them to the same method called on the punned class, or spelt in Raku code, -> $role-discarded, |args { $pun."$name"(|args) }
- This closure would be invoked with the arguments
This had a number of undesirable consequences:
- While the pun was cached, we still had a bit of overhead to check if we'd made it already
- The arguments got slurped and flattened, which costs something, and...
- ...the loss of callsite shape meant we couldn't look up a type specialization of the method, and thus lost a chance to inline it too
With the new dispatch mechanism, we have a means to cache constants at a given program location and to replace arguments. So the first time we encounter the call, we:
- Get the role pun produced if needed
- Resolve the new method on the class punned from the role
- Produce a dispatch program that caches this resolved method and also replaces the role argument with the pun
For the next several thousand calls, we interpret this dispatch program. It's
still some cost, but the method we're calling is already resolved, and the
argument list rewriting is fairly cheap. Meanwhile, after we get into some
hundreds of iterations, on a background thread, the optimizer gets to
work. The argument re-ordering cost goes away completely at this point, and
new
is so small it gets inlined - at which point the buffer allocation is
determined dead and so goes away too. Some remaining missed opportunities
mean we still are left with a loop that's not quite empty: it busies itself
making sure it's really OK to do nothing, rather than just doing nothing.
Next up, multiple dispatch with where clauses.
multi fac($n where $n <= 1) { 1 }
multi fac($n) { $n * fac($n - 1) }
for ^1_000_000 {
fac(5)
}
These were really slow before, since:
- We couldn't apply the multi-dispatch caching mechanism at all as soon as we had a where clause involved
- We would run where clauses twice in the event the candidate was chosen: once to see if we should choose that multi candidate, and once again when we entered it (a small way to observe this is sketched below)
With the new mechanism, we:
- On the first call, calculate a multiple dispatch plan: a linked list of candidates to work through
- Invoke the one with the where clause, in a mode whereby if the signature fails to bind, it triggers a dispatch resumption. (If it does bind, it runs to completion.)
- In the event of a bind failure, the dispatch resumption triggers, and we attempt the next candidate
Once again, after the setup phase, we interpret the dispatch programs. In fact, that's as far as we get with running this faster for now, because the specializer doesn't yet know how to translate and further optimize this kind of dispatch program. (That's how I know it currently stands no chance of turning this whole thing into another empty loop!) So there's more to be had here also; in the meantime, I'm afraid you'll just have to settle for a factor of ten speedup.
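One observable consequence of that run-to-completion behaviour is that a where block on the chosen candidate is now evaluated just once per call. Here's a small, illustrative way to see it; the counter variable is purely for demonstration:
my $where-runs = 0;
multi fac($n where { $where-runs++; $_ <= 1 }) { 1 }
multi fac($n) { $n * fac($n - 1) }
fac(1);
say $where-runs;   # the old scheme evaluated the block twice for the chosen
                   # candidate (trial match, then bind on entry); now just once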
Here's the next one:
proto with-proto(Int $n) { 2 * {*} }
multi with-proto(Int $n) { $n + 1 }
sub invoking-nontrivial-proto() {
for ^10_000_000 {
with-proto(20)
}
}
Again, on top form, we'd turn this into an empty loop too, but we don't quite get there yet. This case wasn't so terrible before: we did get to use the multiple dispatch cache; however, to do that we also ended up having to allocate an argument capture. The need for this also blocked any chance of inlining the proto into the caller. Now that is possible. Since we cannot yet translate dispatch programs that resume an in-progress dispatch, we don't yet get to further inline the called multi candidate into the proto. However, we now have a design that will let us implement that.
This whole notion of a dispatch resumption - where we start doing a dispatch,
and later need to access arguments or other pre-calculated data in order to
do a next step of it - has turned out to be a great unification. The initial
idea for it came from considering things like callsame:
class Parent {
method m() { 1 }
}
class Child is Parent {
method m() { 1 + callsame }
}
for ^10_000_000 {
Child.m;
}
Once I started looking at this, and then considering that a complex proto also wants to continue with a dispatch at the {*}, and that when a where clause fails in a multi it also wants to continue with a dispatch, I realized this was going to be useful for quite a lot of things.
It will be a bit of a headache to teach the optimizer and JIT to do nice
things with resumes - but a great relief that doing that once will benefit
multiple language features!
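Routine wrapping is another member of that family: the wrapper defers to the wrapped code, and that deferral is a dispatch resumption too. A tiny illustration (the sub and wrapper here are invented for the example):
sub greet($name) { "Hello, $name" }
&greet.wrap(sub (|c) { callsame() ~ '!' });
say greet('world');   # Hello, world!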
Anyway, back to the benchmark. This is another "if we were smart, it'd be an empty loop" one. Previously, callsame was very costly, because each time we invoked it, it would have to calculate what kind of dispatch we were resuming and the set of methods to call. We also had to be able to locate the arguments. Dynamic variables were involved, which cost a bit to look up too, and - despite being an implementation detail - these also leaked out in introspection, which wasn't ideal. The new dispatch mechanism makes this all rather more efficient: we can cache the calculated set of methods (or wrappers and multi candidates, depending on the context) and then walk through it, and there are no dynamic variables involved (and thus no leakage of them). This sees the biggest speedup of the lot - and since we cannot yet inline away the callsame, it's (for now) measuring the speedup one might expect on using this language feature. In the future, it's destined to optimize away to an empty loop.
A module that makes use of callsame on a relatively hot path is OO::Monitors, so I figured it would be interesting to see if there is a speedup there also.
use OO::Monitors;
monitor TestMonitor {
method m() { 1 }
}
my $mon = TestMonitor.new;
for ^1_000_000 {
$mon.m();
}
A monitor is a class that acquires a lock around each method call. The module provides a custom meta-class that adds a lock attribute to the class and then wraps each method such that it acquires the lock. There are certainly costly things in there besides the involvement of callsame, but the improvement to callsame is already enough to see a 3.3x speedup in this benchmark. Since OO::Monitors is used in quite a few applications and modules (for example, Cro uses it), this is welcome (and yes, a larger improvement will be possible here too).
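To make the cost model a little more concrete, here is a rough hand-rolled sketch of what a single monitor method amounts to; the class and attribute names are illustrative and not the module's actual internals:
class HandRolledMonitor {
    has Lock $!lock .= new;
    method m() {
        # every method body effectively runs while holding the instance's lock
        $!lock.protect: { 1 }
    }
}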
I've seen some less impressive, but still welcome, improvements across a good number of other microbenchmarks. Even a basic multi dispatch on the + operator:
my $i = 0;
for ^10_000_000 {
$i = $i + $_;
}
Comes out with a factor of 1.6x speedup, thanks primarily to us producing far tighter code with fewer guards. Previously, we ended up with duplicate guards in this seemingly straightforward case. The infix:<+> multi candidate would be specialized for the case of its first argument being an Int in a Scalar container and its second argument being an immutable Int. Since a Scalar is mutable, the specialization would need to read it and then guard the value read before proceeding, otherwise it may change, and we'd risk memory safety. When we wanted to inline this candidate, we'd also want to do a check that the candidate really applies, and so would also dereference the Scalar and guard its content to do that. We can and do eliminate duplicate guards - but these guards are on two distinct reads of the value, so that wouldn't help.
Since in the new dispatch mechanism we can rewrite arguments, we can now quite easily do caller-side removal of Scalar containers around values. So easily, in fact, that the change to do it took me just a couple of hours. This gives a lot of benefits. Since dispatch programs automatically eliminate duplicate reads and guards, the read and guard by the multi-dispatcher and the read in order to pass the decontainerized value are coalesced. This means less repeated work prior to specialization and JIT compilation, and also only a single read and guard in the specialized code after it. With the value to be passed already guarded, we can trivially select a candidate taking two bare Int values, which means there are no further reads and guards needed in the callee either.
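As a reminder of what a Scalar container is at the language level (purely illustrative; the caller-side stripping described above happens inside the VM, not in user code):
my $answer = 42;
say $answer.VAR.^name;    # Scalar - the mutable container the value sits in
say $answer<>.VAR.^name;  # Int - the decontainerized value itself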
A less obvious benefit, but one that will become important with planned future
work, is that this means Scalar
containers escape to callees far less often.
This creates further opportunities for escape analysis. While the MoarVM escape
analyzer and scalar replacer is currently quite limited, I hope to return to
working on it in the near future, and expect it will be able to give us even
more value now than it would have been able to before.
The benchmarks shown earlier are mostly of the "how close are we to realizing that we've got an empty loop" nature, which is interesting for assessing how well the optimizer can "see through" dispatches. Here are a few further results on more "traditional" microbenchmarks:
The complex number benchmark is as follows:
my $total-re = 0e0;
for ^2_000_000 {
my $x = 5 + 2i;
my $y = 10 + 3i;
my $z = $x * $x + $y;
$total-re = $total-re + $z.re
}
say $total-re;
That is, just a bunch of operators (multi dispatch) and method calls, where we really do use the result. For now, we're tied with Python and a little behind Ruby on this benchmark (and a surprising 48 times faster than the same thing done with Perl's Math::Complex), but this is also a case that stands to see a huge benefit from escape analysis and scalar replacement in the future.
The hash read benchmark is:
my %h = a => 10, b => 12;
my $total = 0;
for ^10_000_000 {
$total = $total + %h<a> + %h<b>;
}
And the hash store one is:
my @keys = 'a'..'z';
for ^500_000 {
my %h;
for @keys {
%h{$_} = 42;
}
}
The improvements are nothing whatsoever to do with hashing itself, but instead look to be mostly thanks to much tighter code all around due to caller-side decontainerization. That can have a secondary effect of bringing things under the size limit for inlining, which is also a big help. Speedup factors of 2x and 1.85x are welcome, although we could really do with the same level of improvement again for me to be reasonably happy with our results.
The line-reading benchmark is:
my $fh = open "longfile";
my $chars = 0;
for $fh.lines { $chars = $chars + .chars };
$fh.close;
say $chars
Again, nothing specific to I/O got faster, but when dispatch - the glue that puts together all the pieces - gets a boost, it helps all over the place. (We are also decently competitive on this benchmark, although we tend to be slower the moment the UTF-8 decoder can't take its "NFG can't possibly apply" fast path.)
I've also started looking at larger programs, and hearing results from others about theirs. It's mostly encouraging:
- The long-standing Text::CSV benchmark test-t has seen roughly 20% improvement (thanks to lizmat for measuring)
- A simple Cro::HTTP test application gets through about 10% more requests per second
- MoarVM contributor dogbert did comparative timings of a number of scripts; the most significant improvement saw a drop from 25s to 7s, most are 10%-30% faster, some are unchanged, and only one slowed down
- There's around 2.5% improvement on compilation of CORE.setting, the standard library. However, a big pinch of salt is needed here: the compiler itself has changed in a number of places as part of the work, and there were a couple of things tweaked based on looking at profiles that aren't really related to dispatch.
- Agrammon, an application calculating farming emissions, has seen a slowdown of around 9%. I didn't get to look at it closely yet, although glancing at profiling output the number of deoptimizations is relatively high, which suggests we're making some poor optimization decisions somewhere.
One unpredicted (by me), but also welcome, improvement is that profiler output has become significantly smaller. Likely reasons for this include:
- The dispatch mechanism supports producing value results (either from constants, input arguments, or attributes read from input arguments). It entirely replaces an earlier mechanism, "specializer plugins", which could map guards to a target to invoke, but always required a call to something - even if that something was the identity function. The logic was that this didn't matter for any really hot code, since the identity function will trivially be inlined away. However, since profile size of the instrumenting profiler is a function of the number of paths through the call tree, trimming loads of calls to the identity function out of the tree makes it much smaller.
- We used to make lots of calls to the sink method when a value was in sink context, that is, when its result was being discarded (see the short example after this list). Now, if we see that the type simply inherits that method from Mu, we elide the call entirely (again, it would inline away, but a smaller call graph is a smaller profile).
- Multiple dispatch caching would previously always call the proto when the cache was missed, but would then not call an onlystar proto again when it got cache hits in the future. This meant the call tree under many multiple dispatches was duplicated in the profile. This wasn't just a size issue; it was a bit annoying to have this effect show up in the profile reports too.
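For anyone unfamiliar with the term, here is a minimal illustration of sink context; the array and push are arbitrary choices for the example:
my @values = 1, 2, 3;
@values.push(4);   # the returned array is unused, so the statement is sunk;
                   # if .sink comes straight from Mu, the call is now elided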
To give an example of the difference, I took profiles from Agrammon to study why it might have become slower. The one from before the dispatcher work weighed in at 87MB; the one with the new dispatch mechanism is under 30MB. That means less memory used while profiling, less time to write the profile out to disk afterwards, and less time for tools to load the profiler output. So now it's faster to work out how to make things faster.
There is a catch, I'm afraid: startup time has suffered. While the new dispatch mechanism is more powerful, pushes more complexity out of the VM into high-level code, and is more conducive to reaching higher peak performance, it also has a higher warmup time. At the time of writing, the impact on startup time seems to be around 25%. I expect we can claw some of that back ahead of the October release.
Changes of this scale always come with an amount of risk. We're merging this some weeks ahead of the next scheduled monthly release in order to have time for more testing, and to address any regressions that get reported. However, even before reaching the point of merging it, we have:
- Ensured it passes the specification test suite, both in normal circumstances, but also under optimizer stressing (where we force it to prematurely optimize everything, so that we tease out optimizer bugs and - given how many poor decisions we force it to make - deoptimization bugs too)
- Used blin to run the tests of ecosystem modules. This is a standard step when preparing Rakudo releases, but in this case we've aimed it at the new-disp branches. This found a number of regressions caused by the switch to the new dispatch mechanism, which have been addressed.
- Patched or sent pull requests to a number of modules that were relying on unsupported internal APIs that have now gone away or changed, or on other implementation details. There were relatively few of these, and happily, many of them were fixed up by migrating to supported APIs (which likely didn't exist at the time the modules were written).
As I've alluded to in a number of places in this post, while there are improvements to be enjoyed right away, there are also new opportunities for further improvement. Some things that are on my mind include:
- Reworking callframe entry and exit. These are still decidedly too costly. Various changes that have taken place while working on the new dispatch mechanism have opened up new opportunities for improvement in this area.
- Avoiding megamorphic pile-ups. Micro-benchmarks are great at hiding these. In fact, the callsame one here is a perfect example! The point where we do the resumption of a dispatch is inside callsame, so all the inline cache entries of resumptions throughout the program stack up in one place. What we'd like is to have them attached a level down the callstack instead. Otherwise, the level of callsame improvement seen in micro-benchmarks will not be enjoyed in larger applications. This applies in a number of other situations too.
- Applying the new dispatch mechanism to optimize further constructs. For example, a method call that results in invoking the special FALLBACK method (illustrated after this list) could have its callsite easily rewritten to do that, opening the way to inlining.
- Further tuning the code we produce after optimization. There is an amount of waste that should be relatively straightforward to eliminate, and some opportunities to tweak deoptimization such that we're able to delete more instructions and still retain the ability to deoptimize.
- Continuing with the escape analysis work I was doing before, which should now be rather more valuable. The more flexible callstack/frame handling now in place should also unblock my work on scalar replacement of Ints (which needs a great deal of care in memory management, as they may box a big integer, not just a native integer).
- Implementing specialization, JIT, and inlining of dispatch resumptions.
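For reference, FALLBACK is the method invoked when an ordinary method lookup fails; here is a minimal, hypothetical example of the kind of call that stands to benefit:
class Settings {
    has %!data = timeout => 10;
    # FALLBACK receives the name of the missing method plus its arguments
    method FALLBACK($name, |args) {
        %!data{$name}
    }
}
say Settings.new.timeout;   # 10, resolved via FALLBACK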
I would like to thank TPF and their donors for providing the funding that has made it possible for me to spend a good amount of my working time on this effort.
While I'm to blame for the overall design and much of the implementation of the new dispatch mechanism, plenty of work has also been put in by other MoarVM and Rakudo contributors - especially over the last few months as the final pieces fell into place, and we turned our attention to getting it production ready. I'm thankful to them not only for the code and debugging contributions, but also much support and encouragement along the way. It feels good to have this merged, and I look forward to building upon it in the months and years to come.