Recently, a Perl 6 object creation benchmark did the rounds on social media. This Perl 6 code:
class Point {
has $.x;
has $.y;
}
my $total = 0;
for ^1_000_000 {
my $p = Point.new(x => 2, y => 3);
$total = $total + $p.x + $p.y;
}
say $total;
Now (on HEAD builds of Rakudo and MoarVM) runs faster than this roughly equivalent Perl 5 code:
use v5.10;
package Point;
sub new {
my ($class, %args) = @_;
bless \%args, $class;
}
sub x {
my $self = shift;
$self->{x}
}
sub y {
my $self = shift;
$self->{y}
}
package main;
my $total = 0;
for (1..1_000_000) {
my $p = Point->new(x => 2, y => 3);
$total = $total + $p->x + $p->y;
}
say $total;
(Aside: yes, I know there's no shortage of libraries in Perl 5 that make OO nicer than this, though those I tried also made it slower.)
This is a fairly encouraging result: object creation, method calls, and attribute access are the operational bread and butter of OO programming, so it's a pleasing sign that Perl 6 is getting increasingly speedy at them. In fact, it's probably a bit better at them than this benchmark's raw numbers show, since:
- The measurements include startup time. Probably 10%-15% of the time Rakudo spends executing the program is startup, parsing, compilation, etc. (noting the compiler is itself written in Perl 6). By contrast, Perl 5 has really fast startup and a parser written in C.
- The math in Perl 6 is arbitrary precision (that is, will upgrade to a big integer if needed); it costs something to handle that
- Every math operation allocates a new
Int
object; there's two uses of+
here, so that means two newInt
allocations per iteration of the loop, which then have to be garbage collected
While dealing with Int values got faster recently,
it's still really making two Int
objects every time around that loop and
having to GC them later. An upcoming new set of analyses and optimizations
will let us get rid of that cost too. And yes, startup will get faster with
time also. In summary, while Rakudo/MoarVM is now winning that benchmark
against Perl 5, there's still lots more to be had. Which is a good job, since
the equivalent Python and Ruby versions of that benchmark still run faster
than the Perl 6 one.
In this post, I'll look at the changes that allowed this benchmark to end up faster. None of the new work was particularly ground-breaking; in fact, it was mostly a case of doing small things to make better use of optimizations we already had.
Theoretically, the default new
method in turn calls bless
, passing the
named arguments along. The bless
method then creates an object instance,
followed by calling BUILDALL
. The BUILDALL
method goes through the set
of steps needed for constructing the object. In the case of a simple object
like ours, that involves checking if an x
and y
named argument were
passed, and if so assigning those values into the Scalar
containers of the
object attributes.
For those keeping count, that's at least 3 method calls (new
, bless
,
and BUILDALL
).
However, there's a cheat. If bless
wasn't overridden (which would be an
odd thing to do anyway), then the default new
could just call BUILDALL
directly anyway. Therefore, new
looks like this:
multi method new(*%attrinit) {
nqp::if(
nqp::eqaddr(
(my $bless := nqp::findmethod(self,'bless')),
nqp::findmethod(Mu,'bless')
),
nqp::create(self).BUILDALL(Empty, %attrinit),
$bless(|%attrinit)
)
}
The BUILDALL
method was originally a little "interpreter" that went through
a per-object build plan stating what needs to be done. However, for quite some
time now we've instead compiled a per-class BUILDPLAN
method.
To figure out how to speed this up, I took a look at how the specializer was handling the code. The answer: not so well. There were certainly some positive moments in there. Of note:
- The
x
andy
accessor methods were being inlined - The
+
operators were being inlined - Everything was JIT-compiled into machine code
However, the new
method was getting only a "certain" specialization, which
is usually only done for very heavily polymorphic code. That wasn't the case
here; this program clearly construts overwhelmingly one type of object. So
what was going on?
In order to produce an "observed type" specialization - the more powerful kind - it needs to have data on the types of all of the passed arguments. And for the named arguments, that was missing. But why?
Logging of passed argument types is done on the callee side, since:
- If the callee is already specialized, then we don't want to waste time in the caller doing throwaway logging
- If the callee is not specialized, but the caller is, then we don't want to miss out on logging types (and we only log when running unspecialized code)
The argument logging was done as the interpreter executed each parameter processing instruction. However, when there's a slurpy, then it would just swallow up all the remaining arguments without logging type information. Thus the information about the argument types was missing and we ended up with a less powerful form of specialization.
Teaching the slurpy handling code about argument logging felt a bit involved, but then I realized there was a far simpler solution: log all of the things in the argument buffer at the point an unspecialized frame is being invoked. We're already logging the entry to the call frame at that point anyway. This meant that all of the parameter handling instructions got a bit simpler too, since they no longer had logging to do.
Having new
being specialized in a more significant way was an immediate win.
Of note, this part:
nqp::eqaddr(
(my $bless := nqp::findmethod(self,'bless')),
nqp::findmethod(Mu,'bless')
),
Was quite interesting. Since we were now specializing on the type of self
,
then the findmethod
could be resolved into a constant. The resolution of a method
on the constant Mu
was also a constant. Therefore, the result of the eqaddr
(same memory address) comparison of two constants should also have been turned
into a constant...except that wasn't happening! This time, it was simple: we
just hadn't implemented folding of that yet. So, that was an easy win, and once
done meant the optimizer could see that the if
would always go a certain way
and thus optimize the whole thing into the chosen branch. Thus the new
method
was specialized into something like:
multi method new(*%attrinit) {
nqp::create(self).BUILDALL(Empty, %attrinit),
}
Further, the create
could be optimized into a "fast create" op, and the
relatively small BUILDALL
method could be inlined into new
. Not bad.
At this point, things were much better, but still not quite there. I took
a look at the BUILDALL
method compilation, and realized that it could emit
faster code.
The %attrinit
is a Perl 6 Hash
object, which is for the most part a
wrapper around the lower-level VM hash object, which is the actual hash
storage. We were, curiously, already pulling this lower-level hash out of
the Hash
object and using the nqp::existskey
primitive to check if there
was a value, but were not doing that for actually accessing the value.
Instead, an .AT-KEY('x')
method call was being done. While that was being
handled fairly well - inlined and so forth - it also does its own bit of
existence checking. I realized we could just use the nqp::atkey
primitive
here instead.
Later on, I also realized that we could do away with nqp::existskey
and
just use nqp::atkey
. Since a VM-level null is something that never exists in
Perl 6, we can safely use it as a sentinel to mean that there is no value.
That got us down to a single hash lookup.
By this point, we were just about winning the benchmark, but only by a few percent. Were there any more easy pickings?
My next surprise was that the new
method didn't get inlined. It was within
the size limit. There was nothing preventing it. What was going on?
Looking closer, it was even worse than that. Normally, when something is too big to be inlined, but we can work out what specialization will be called on the target, we do "specialization linking", indicating which specialization of the code to call. That wasn't happening. But why?
Some debugging later, I sheepishly fixed an off-by-one in the code that looks
through a multi-dispatch cache to see if a particular candidate matches the
set of argument types we have during optimization of a call instruction. This
probably increased the performance of quite a few calls involving passing
named arguments to multi-methods - and meant new
was also inlined.
The next round of performance improvements - both to this code and plenty more besides - will come from escape analysis, scalar replacement, and related optimizations. I've already started on that work, though expect it will keep me busy for quite a while. I will, however, be able to deliver it in stages, and am optimistic I'll have the first stage of it ready to talk about in a week or so.