I've been doing lots of work on the MoarVM specializer of late, and will be writing a few posts here to explain it. This work has been covered by my grant from The Perl Foundation.
This post covers the recent addition of DU (Define-Use) chains. I'll explain what they are, what kind of optimizations they have helped with so far, and how they can help us ensure the specializer is free of certain kinds of bug.
The MoarVM specializer helps programs run faster by ripping out as much unrequired generality as it knows how. This involves a bunch of different analyses, and those are aided by the program being turned into SSA (Static Single Assignment) form. This happens not at the Perl 6 program level, but rather at the bytecode level. MoarVM's interpreter is a register machine, so we compile something like:
($a + $b) * $c
Assuming these are all native integer variables could compile into something like:
getlex r0, '$a'
getlex r1, '$b'
add_i r0, r0, r1
getlex r1, '$c'
mul_i r0, r0, r1
Notice how r0
and r1
are re-used for multiple things. Now imagine that
these registers hold objects and we are trying to track types and other such
information. Since a register may be used for completely different things over
its lifetime, we can't just associate information with the register.
Transforming the bytecode into SSA form helps. We give each use of the register a version number:
getlex r0(1), '$a'
getlex r1(1), '$b'
add_i r0(2), r0(1), r1(1)
getlex r1(2), '$c'
mul_i r0(3), r0(2), r1(2)
Now we can associate information with each version of the register, greatly easing analysis of the program.
When a program is in SSA form, every versioned register has one define: the instruction that writes it. Since it can never be written again anywhere in the SSA form of the bytecode, the writer is a single instruction. There are no MoarVM instructions that write more than one register, and so an instruction defines at most one value, and every versioned register has precisely one instruction that defines it.
So, defines are easy. For as long back as I can remember, we've stored a reference to the writing instruction of each versioned register, so whenever we see a read of it then we can always quickly find it.
Until recently, we stored a counter of how many times each versioned register was used. We made an initial pass through the graph representing the bytecode to be optimized, bumping the usage count each time we saw a versioned register being used. Then, as we optimized, we could update those counts.
Usage information is especially useful taken together with knowledge of which instructions are pure - that is to say, they produce a result, but don't have any effects besides that. If the usage count of such an instruction drops to zero, then we can simply delete it.
For example, if we have an attribute has str $!value
in a class, it would be
compiled into something like this:
wval r4(2), liti16(1), liti16(36) (P6opaque: Str)
getattr_s r5(1), r0(2), r4(2), lits($!value)
The wval
instruction grabs the type object of the class that declares the
attribute. This is used together with the attribute name to do a lookup (since
parent and child classes may have attributes of the same name, but they are
different attributes). Provided we know the type of r0(2)
- which is holding
self
- then we might optimzie it into:
wval r4(2), liti16(1), liti16(36) (P6opaque: Str)
sp_p6oget_s r5(1), r8(3), liti16(8)
Where the 8
is an offset in bytes indicating where the attribute lives in the
memory of the object. We've turned a lookup by name into pointer chasing, which
will later JIT into some pretty simple machine code (not quite as simple as we
want yet, but still vastly faster than the normal lookup path).
But wait! What about that wval
there? We don't need it now. And so, after the
various optimizations have taken place, we do Dead Instruction Elimination. So
long as the usage count has dropped to zero - that is, nothing else is using the
value - we can delete the instruction, meaning the end result is just:
sp_p6oget_s r5(1), r8(3), liti16(8)
So far, so relatively simple. Alas, there's complications. Some values might become unused after we optimize the code, but we still can't delete them. We use statistics to drive optimization, and do a great deal of speculation. For example, if we see 99% of the time a particular type shows up in the program, we optimize it assuming that type. But what if the remaining case shows up? In that case, we drop back to the interpreter to handle it. For that to work out, however, we must make sure that the values the interpreter needs are still available for this deoptimization process.
I've done a lot of work on that too, and I'll discuss it in a separate post. But for now, there's only one thing you need to know: whenever we detected that a value might be needed if deoptimization happens, we simply gave its usage count an extra bump. This meant that even if we deleted all of its uses in the graph, we'd still not delete it.
The usage count was a fine enough place to start out, but it gradually came to be insufficient.
One bug that it's quite possible to make is to forget to increment or decrement the usage count. The cases where it ended up too high could prevent us from deleting an instruction we didn't need, leading to worse code. This wasn't very serious, though a bit sub-optimal. The other way round - failing to decrement the count - is of course more dangerous, since it may lead to an instruction being deleted that we really need. This didn't happen often - we're relatively careful - but it'd be nice if we had a way to verify it wasn't happening at all. However, the +1 for the sake of deoptimization would frustrate doing such an analysis.
A further issue is that while finding the place that a given versioned register was defined, there was no cheap way to find its usages. Having such information would make some optimizations we already did easier and more effective, as well as make it easier to do some that we're keen to add in the near future.
Beyond that, a single number was uninformative for those of us working on the optimizer. We could see the number, but what was it telling us? Why was the register still in use?
So, instead of storing a count, we've started storing a linked list that points to each instruction that uses the versioned register. Often we only care about used, used precisely once, or unused; in fact, the only place we need the exact count is for debug output. Therefore, we can answer all the common questions we could with the usage count without having to traverse the list. Building the chain is easy: everywhere we used to bump the counter, we add an entry into the chain.
This chain works well for instructions that use the value, but what about the deoptimization usage? This was handled by storing that piece of information as a separate flag. It could then be displayed alongside the real usage count in the debug output, so we could quickly understand which registers were in use only for the purpose of deoptimization.
Along with this, I implemented a chain checker. It goes through the instruction graph and the use chains, and makes sure:
- Every versioned register that is used has an entry in the chain
- There are no entries in the chain that don't appear in the graph
- Every used value has its writing instruction set correctly
This isn't done by default - it costs something - but is available as a flag MoarVM developers can turn on when implementing new optimizations to aid with verifying they are, at least in this regard, correct.
Often in Perl 6, code has to deal with both values and values held in Scalar
containers. In the case that we have a Scalar
container, we have to remove
the value from it. This is an incredibly common operation in Perl 6, and there
is a single op - decont
- that checks if we have a container and takes the
value out of it if so. Code generation conservatively inserts quite a lot of
these.
Often, the answer is that we simply have a value, and so there's nothing to do. And often, the specializer can tell there will be something to do. Thus, something like this:
decont r5(2), r0(2)
findmeth r4(2), r5(2), lits(chars)
Is turned into this:
set r5(2), r0(2)
findmeth r4(2), r5(2), lits(chars)
Where set
simply sets the value of one register into another. For this and
various other reasons, it's quite common that - after optimizations - we end
up with a bunch of code chock full of set
instructions. They're cheap, but
they certainly aren't free - on two counts. Firstly, there's the execution
cost of them. Secondly, they make the optimized code larger than it needs to
be. This both makes less efficient use of the CPU's code cache once we JIT the
optimized result, but also can push the code over the inlining size limit, and
thus it might miss out on further powerful optimizations.
We did have some code to try and get rid of set
instructions. It was less
than awesome on multiple counts. Firstly, it still left quite a few behind
that we could see by inspection of the code could go away. Secondly, it could
make a mess of the SSA form. Since it was one of the very last optimizations
we did, that wasn't a big deal, but it did make the debug output confusing,
plus we will be adding more optimizations to this second pass in the future.
Thirdly, it was somewhat adhoc, mostly written to handle peephole patterns
that commonly showed up.
The usage chains provide a way to do better. The new set
elimination
algorithm covers the previous cases and new ones, and does exactly two
things.
Firstly, it looks if the writer of the set
's second operand has only one
usage, which is that set instruction, and no deopt usages. If so, and if
there are no interfering uses of different versions of the register that
the set
writes, then it can have the writing instruction changed to
write to the register that the set
would, and the set
instruction can
then be deleted.
Failing that, it uses the use chain to check if there is a single user of
the versioned register that the set
instruction writes to. Again, given
no conflicts, it can eliminate the set instruction by arranging for the
user of the set
instruction to instead use the value that the set
would
read. So in our case:
set r5(2), r0(2)
findmeth r4(2), r5(2), lits(chars)
We'd end up with:
findmeth r4(2), r0(2), lits(chars)
To give a practical example of this, here is the code that the chars
method
called on a Scalar
holding a Str
looks after optimization without the set
elimination:
sp_getarg_o r1(2), liti16(0)
set r8(2), r1(2)
set r1(3), r8(2)
[Annotation: Logged (bytecode offset 24)]
sp_p6oget_o r8(3), r1(3), liti16(16)
[Annotation: INS Deopt One (idx 0 -> pc 30; line 2838)]
sp_guardconc r8(3), sslot(1), litui32(30)
set r11(2), r8(3)
set r0(2), r11(2)
[Annotation: Line Number: SETTING::src/core/Str.pm6:2838]
takedispatcher r3(2)
sp_p6oget_s r5(1), r0(2), liti16(8)
chars r6(1), r5(1)
hllboxtype_i r4(3)
[Annotation: INS Deopt One (idx 1 -> pc 134; line 2839)]
box_i r4(4), r6(1), r4(3)
return_o r4(4)
Notice the four set
instructions in there. With the new set
elimination
algorithm, we end up with:
sp_getarg_o r1(3), liti16(0)
[Annotation: Logged (bytecode offset 24)]
sp_p6oget_o r8(3), r1(3), liti16(16)
[Annotation: INS Deopt One (idx 0 -> pc 30; line 2838)]
sp_guardconc r8(3), sslot(1), litui32(30)
[Annotation: Line Number: SETTING::src/core/Str.pm6:2838]
takedispatcher r3(2)
sp_p6oget_s r5(1), r8(3), liti16(8)
chars r6(1), r5(1)
hllboxtype_i r4(3)
[Annotation: INS Deopt One (idx 1 -> pc 134; line 2839)]
box_i r4(4), r6(1), r4(3)
return_o r4(4)
Another interesting use of DU chains is to eliminate boxing of native values into objects only to unbox them again a short time later. This can happen due to the compiler not being smart enough, but if it happens across two subs or methods, and especially when we have multiple dispatch and polymorphic method dispatch happening, there's not so much we could do better at that phase.
However, MoarVM does inlining. We can therefore see between boundaries that
we cannot at compile time. Recall how chars
produced this boxing code, as
it is declared to return an Int
:
chars r6(1), r5(1)
hllboxtype_i r4(3)
box_i r4(4), r6(1), r4(3)
What if we were to write:
my int $chars = $str.chars;
Then the boxing happens just over the boundary. It turns out that there's quite
a lot to do in order to get rid of the boxing instruction, but with use chains
we can already make a start. When we encounter a box
, we take a look at if
it has any users that are an unbox
. After inlining, we'd see that there is.
Therefore, that unbox
instruction can be rewritten to use r6(1)
- the
unboxed value.
That much works now. For reasons I'll dig into in my next post, that's not yet
quite enough to eliminate the box_i
instruction. So in this case, the saving
is minor. Once we can get rid of that, however, it will be a notable saving in
such cases.
One current challenge we have is that if we call a method and pass it a variable declared with a native type:
my int $foo = $a + $b;
$obj.meth($foo);
Then we don't know if that method is declared as taking an rw
parameter or
not. Therefore, we must not pass a native integer value, but instead form a
reference that points to where $foo
lives, so we can update it. Of course,
in most cases is rw
is not used.
After inlining, we'll be able to see this, and so will be able to use the use chain to discover when a formed reference is used for nothing more than to do a dereference. Then we can eliminate that cost.
Adding use chains has allowed us to detect and fix a small number of usage handling bugs, given us a way to prevent such bugs happening in the future, allowed us to improve an existing optimization, provided for efficiently implementing a new one, and will be an important part of improving the performance of code using native types in the future. Furthermore, it means those of us working on MoarVM have more detailed information about why an operation has to take place in the optimized code, so we can better understand if we have missed opportunities.
However, that's not the end of the usage story. It turned out that a single flag for deopt usage would not suffice. Next time, I'll look at why, and what I've done to address that.