jnthn/blog.md Secret

## blog.md

      
    Raw
  

              blog.md
            
          
    Speeding up object creation

Recently, a Perl 6 object creation benchmark did the rounds
on social media. This Perl 6 code:
class Point {
    has $.x;
    has $.y;
}
my $total = 0;
for ^1_000_000 {
    my $p = Point.new(x => 2, y => 3);
    $total = $total + $p.x + $p.y;
}
say $total;

Now (on HEAD builds of Rakudo and MoarVM) runs faster than this roughly
equivalent Perl 5 code:
use v5.10;

package Point;

sub new {
    my ($class, %args) = @_;
    bless \%args, $class;
}

sub x {
    my $self = shift;
    $self->{x}
}

sub y {
    my $self = shift;
    $self->{y}
}

package main;

my $total = 0;
for (1..1_000_000) {
    my $p = Point->new(x => 2, y => 3);
    $total = $total + $p->x + $p->y;
}
say $total;

(Aside: yes, I know there's no shortage of libraries in Perl 5 that make OO
nicer than this, though those I tried also made it slower.)
This is a fairly encouraging result: object creation, method calls, and
attribute access are the operational bread and butter of OO programming, so
it's a pleasing sign that Perl 6 is getting increasingly speedy at them. In
fact, it's probably a bit better at them than this benchmark's raw numbers
show, since:

The measurements include startup time. Probably 10%-15% of the time Rakudo
spends executing the program is startup, parsing, compilation, etc. (noting
the compiler is itself written in Perl 6). By contrast, Perl 5 has really
fast startup and a parser written in C.
The math in Perl 6 is arbitrary precision (that is, will upgrade to a big
integer if needed); it costs something to handle that
Every math operation allocates a new Int object; there's two uses of +
here, so that means two new Int allocations per iteration of the loop,
which then have to be garbage collected

While dealing with Int values got faster recently,
it's still really making two Int objects every time around that loop and
having to GC them later. An upcoming new set of analyses and optimizations
will let us get rid of that cost too. And yes, startup will get faster with
time also. In summary, while Rakudo/MoarVM is now winning that benchmark
against Perl 5, there's still lots more to be had. Which is a good job, since
the equivalent Python and Ruby versions of that benchmark still run faster
than the Perl 6 one.
In this post, I'll look at the changes that allowed this benchmark to end up
faster. None of the new work was particularly ground-breaking; in fact, it
was mostly a case of doing small things to make better use of optimizations
we already had.
What happens during construction?

Theoretically, the default new method in turn calls bless, passing the
named arguments along. The bless method then creates an object instance,
followed by calling BUILDALL. The BUILDALL method goes through the set
of steps needed for constructing the object. In the case of a simple object
like ours, that involves checking if an x and y named argument were
passed, and if so assigning those values into the Scalar containers of the
object attributes.
For those keeping count, that's at least 3 method calls (new, bless,
and BUILDALL).
However, there's a cheat. If bless wasn't overridden (which would be an
odd thing to do anyway), then the default new could just call BUILDALL
directly anyway. Therefore, new looks like this:
multi method new(*%attrinit) {
    nqp::if(
      nqp::eqaddr(
        (my $bless := nqp::findmethod(self,'bless')),
        nqp::findmethod(Mu,'bless')
      ),
      nqp::create(self).BUILDALL(Empty, %attrinit),
      $bless(|%attrinit)
    )
}

The BUILDALL method was originally a little "interpreter" that went through
a per-object build plan stating what needs to be done. However, for quite some
time now we've instead compiled a per-class BUILDPLAN method.
Slurpy sadness

To figure out how to speed this up, I took a look at how the specializer was
handling the code. The answer: not so well. There were certainly some positive
moments in there. Of note:

The x and y accessor methods were being inlined
The + operators were being inlined
Everything was JIT-compiled into machine code

However, the new method was getting only a "certain" specialization, which
is usually only done for very heavily polymorphic code. That wasn't the case
here; this program clearly construts overwhelmingly one type of object. So
what was going on?
In order to produce an "observed type" specialization - the more powerful kind -
it needs to have data on the types of all of the passed arguments. And for the
named arguments, that was missing. But why?
Logging of passed argument types is done on the callee side, since:

If the callee is already specialized, then we don't want to waste time in
the caller doing throwaway logging
If the callee is not specialized, but the caller is, then we don't want to
miss out on logging types (and we only log when running unspecialized code)

The argument logging was done as the interpreter executed each parameter
processing instruction. However, when there's a slurpy, then it would just
swallow up all the remaining arguments without logging type information. Thus
the information about the argument types was missing and we ended up with a
less powerful form of specialization.
Teaching the slurpy handling code about argument logging felt a bit involved,
but then I realized there was a far simpler solution: log all of the things in
the argument buffer at the point an unspecialized frame is being invoked.
We're already logging the entry to the call frame at that point anyway. This
meant that all of the parameter handling instructions got a bit simpler too,
since they no longer had logging to do.
Conditional elimination

Having new being specialized in a more significant way was an immediate win.
Of note, this part:
      nqp::eqaddr(
        (my $bless := nqp::findmethod(self,'bless')),
        nqp::findmethod(Mu,'bless')
      ),

Was quite interesting. Since we were now specializing on the type of self,
then the findmethod could be resolved into a constant. The resolution of a method
on the constant Mu was also a constant. Therefore, the result of the eqaddr
(same memory address) comparison of two constants should also have been turned
into a constant...except that wasn't happening! This time, it was simple: we
just hadn't implemented folding of that yet. So, that was an easy win, and once
done meant the optimizer could see that the if would always go a certain way
and thus optimize the whole thing into the chosen branch. Thus the new method
was specialized into something like:
multi method new(*%attrinit) {
    nqp::create(self).BUILDALL(Empty, %attrinit),
}

Further, the create could be optimized into a "fast create" op, and the
relatively small BUILDALL method could be inlined into new. Not bad.
Generating simpler code

At this point, things were much better, but still not quite there. I took
a look at the BUILDALL method compilation, and realized that it could emit
faster code.
The %attrinit is a Perl 6 Hash object, which is for the most part a
wrapper around the lower-level VM hash object, which is the actual hash
storage. We were, curiously, already pulling this lower-level hash out of
the Hash object and using the nqp::existskey primitive to check if there
was a value, but were not doing that for actually accessing the value.
Instead, an .AT-KEY('x') method call was being done. While that was being
handled fairly well - inlined and so forth - it also does its own bit of
existence checking. I realized we could just use the nqp::atkey primitive
here instead.
Later on, I also realized that we could do away with nqp::existskey and
just use nqp::atkey. Since a VM-level null is something that never exists in
Perl 6, we can safely use it as a sentinel to mean that there is no value.
That got us down to a single hash lookup.
By this point, we were just about winning the benchmark, but only by a few
percent. Were there any more easy pickings?
An off-by-one

My next surprise was that the new method didn't get inlined. It was within
the size limit. There was nothing preventing it. What was going on?
Looking closer, it was even worse than that. Normally, when something is too
big to be inlined, but we can work out what specialization will be called on
the target, we do "specialization linking", indicating which specialization of
the code to call. That wasn't happening. But why?
Some debugging later, I sheepishly fixed an off-by-one in the code that looks
through a multi-dispatch cache to see if a particular candidate matches the
set of argument types we have during optimization of a call instruction. This
probably increased the performance of quite a few calls involving passing
named arguments to multi-methods - and meant new was also inlined.
What next?

The next round of performance improvements - both to this code and plenty more
besides - will come from escape analysis, scalar replacement, and related
optimizations. I've already started on that work, though expect it will keep
me busy for quite a while. I will, however, be able to deliver it in stages,
and am optimistic I'll have the first stage of it ready to talk about in a
week or so.