jnthn/blog.md Secret

## blog.md

      
    Raw
  

              blog.md
            
          
    DU chains

I've been doing lots of work on the MoarVM specializer of late, and will be
writing a few posts here to explain it. This work has been covered by my
grant from The Perl Foundation.
This post covers the recent addition of DU (Define-Use) chains. I'll explain
what they are, what kind of optimizations they have helped with so far, and
how they can help us ensure the specializer is free of certain kinds of bug.
A little background

The MoarVM specializer helps programs run faster by ripping out as much
unrequired generality as it knows how. This involves a bunch of different
analyses, and those are aided by the program being turned into SSA (Static
Single Assignment) form. This happens not at the Perl 6 program level, but
rather at the bytecode level. MoarVM's interpreter is a register machine, so
we compile something like:
($a + $b) * $c

Assuming these are all native integer variables could compile into something
like:
getlex r0, '$a'
getlex r1, '$b'
add_i r0, r0, r1
getlex r1, '$c'
mul_i r0, r0, r1

Notice how r0 and r1 are re-used for multiple things. Now imagine that
these registers hold objects and we are trying to track types and other such
information. Since a register may be used for completely different things over
its lifetime, we can't just associate information with the register.
Transforming the bytecode into SSA form helps. We give each use of the register
a version number:
getlex r0(1), '$a'
getlex r1(1), '$b'
add_i r0(2), r0(1), r1(1)
getlex r1(2), '$c'
mul_i r0(3), r0(2), r1(2)

Now we can associate information with each version of the register, greatly
easing analysis of the program.
Defines

When a program is in SSA form, every versioned register has one define:
the instruction that writes it. Since it can never be written again anywhere
in the SSA form of the bytecode, the writer is a single instruction. There are
no MoarVM instructions that write more than one register, and so an instruction
defines at most one value, and every versioned register has precisely one
instruction that defines it.
So, defines are easy. For as long back as I can remember, we've stored a
reference to the writing instruction of each versioned register, so whenever
we see a read of it then we can always quickly find it.
Uses

Until recently, we stored a counter of how many times each versioned register
was used. We made an initial pass through the graph representing the bytecode
to be optimized, bumping the usage count each time we saw a versioned register
being used. Then, as we optimized, we could update those counts.
Usage information is especially useful taken together with knowledge of which
instructions are pure - that is to say, they produce a result, but don't
have any effects besides that. If the usage count of such an instruction drops
to zero, then we can simply delete it.
For example, if we have an attribute has str $!value in a class, it would be
compiled into something like this:
  wval              r4(2), liti16(1), liti16(36) (P6opaque: Str)
  getattr_s         r5(1),   r0(2),   r4(2), lits($!value)

The wval instruction grabs the type object of the class that declares the
attribute. This is used together with the attribute name to do a lookup (since
parent and child classes may have attributes of the same name, but they are
different attributes). Provided we know the type of r0(2) - which is holding
self - then we might optimzie it into:
  wval              r4(2), liti16(1), liti16(36) (P6opaque: Str)
  sp_p6oget_s       r5(1),   r8(3), liti16(8)

Where the 8 is an offset in bytes indicating where the attribute lives in the
memory of the object. We've turned a lookup by name into pointer chasing, which
will later JIT into some pretty simple machine code (not quite as simple as we
want yet, but still vastly faster than the normal lookup path).
But wait! What about that wval there? We don't need it now. And so, after the
various optimizations have taken place, we do Dead Instruction Elimination. So
long as the usage count has dropped to zero - that is, nothing else is using the
value - we can delete the instruction, meaning the end result is just:
  sp_p6oget_s       r5(1),   r8(3), liti16(8)

Deoptimization complication

So far, so relatively simple. Alas, there's complications. Some values might
become unused after we optimize the code, but we still can't delete them. We
use statistics to drive optimization, and do a great deal of speculation. For
example, if we see 99% of the time a particular type shows up in the program,
we optimize it assuming that type. But what if the remaining case shows up? In
that case, we drop back to the interpreter to handle it. For that to work out,
however, we must make sure that the values the interpreter needs are still
available for this deoptimization process.
I've done a lot of work on that too, and I'll discuss it in a separate post.
But for now, there's only one thing you need to know: whenever we detected
that a value might be needed if deoptimization happens, we simply gave its
usage count an extra bump. This meant that even if we deleted all of its uses
in the graph, we'd still not delete it.
You can't count on this

The usage count was a fine enough place to start out, but it gradually came to
be insufficient.
One bug that it's quite possible to make is to forget to increment or decrement
the usage count. The cases where it ended up too high could prevent us from
deleting an instruction we didn't need, leading to worse code. This wasn't very
serious, though a bit sub-optimal. The other way round - failing to decrement
the count - is of course more dangerous, since it may lead to an instruction
being deleted that we really need. This didn't happen often - we're relatively
careful - but it'd be nice if we had a way to verify it wasn't happening at
all. However, the +1 for the sake of deoptimization would frustrate doing such
an analysis.
A further issue is that while finding the place that a given versioned register
was defined, there was no cheap way to find its usages. Having such information
would make some optimizations we already did easier and more effective, as well
as make it easier to do some that we're keen to add in the near future.
Beyond that, a single number was uninformative for those of us working on the
optimizer. We could see the number, but what was it telling us? Why was the
register still in use?
Adding use chains

So, instead of storing a count, we've started storing a linked list that points
to each instruction that uses the versioned register. Often we only care about
used, used precisely once, or unused; in fact, the only place we need the exact
count is for debug output. Therefore, we can answer all the common questions we
could with the usage count without having to traverse the list. Building the
chain is easy: everywhere we used to bump the counter, we add an entry into the
chain.
This chain works well for instructions that use the value, but what about the
deoptimization usage? This was handled by storing that piece of information as
a separate flag. It could then be displayed alongside the real usage count in
the debug output, so we could quickly understand which registers were in use
only for the purpose of deoptimization.
Checking the chains

Along with this, I implemented a chain checker. It goes through the instruction
graph and the use chains, and makes sure:

Every versioned register that is used has an entry in the chain
There are no entries in the chain that don't appear in the graph
Every used value has its writing instruction set correctly

This isn't done by default - it costs something - but is available as a flag
MoarVM developers can turn on when implementing new optimizations to aid with
verifying they are, at least in this regard, correct.
Improving elimination of set instructions

Often in Perl 6, code has to deal with both values and values held in Scalar
containers. In the case that we have a Scalar container, we have to remove
the value from it. This is an incredibly common operation in Perl 6, and there
is a single op - decont - that checks if we have a container and takes the
value out of it if so. Code generation conservatively inserts quite a lot of
these.
Often, the answer is that we simply have a value, and so there's nothing to
do. And often, the specializer can tell there will be something to do. Thus,
something like this:
  decont            r5(2),   r0(2)
  findmeth          r4(2),   r5(2), lits(chars)

Is turned into this:
  set               r5(2),   r0(2)
  findmeth          r4(2),   r5(2), lits(chars)

Where set simply sets the value of one register into another. For this and
various other reasons, it's quite common that - after optimizations - we end
up with a bunch of code chock full of set instructions. They're cheap, but
they certainly aren't free - on two counts. Firstly, there's the execution
cost of them. Secondly, they make the optimized code larger than it needs to
be. This both makes less efficient use of the CPU's code cache once we JIT the
optimized result, but also can push the code over the inlining size limit, and
thus it might miss out on further powerful optimizations.
We did have some code to try and get rid of set instructions. It was less
than awesome on multiple counts. Firstly, it still left quite a few behind
that we could see by inspection of the code could go away. Secondly, it could
make a mess of the SSA form. Since it was one of the very last optimizations
we did, that wasn't a big deal, but it did make the debug output confusing,
plus we will be adding more optimizations to this second pass in the future.
Thirdly, it was somewhat adhoc, mostly written to handle peephole patterns
that commonly showed up.
The usage chains provide a way to do better. The new set elimination
algorithm covers the previous cases and new ones, and does exactly two
things.
Firstly, it looks if the writer of the set's second operand has only one
usage, which is that set instruction, and no deopt usages. If so, and if
there are no interfering uses of different versions of the register that
the set writes, then it can have the writing instruction changed to
write to the register that the set would, and the set instruction can
then be deleted.
Failing that, it uses the use chain to check if there is a single user of
the versioned register that the set instruction writes to. Again, given
no conflicts, it can eliminate the set instruction by arranging for the
user of the set instruction to instead use the value that the set would
read. So in our case:
  set               r5(2),   r0(2)
  findmeth          r4(2),   r5(2), lits(chars)

We'd end up with:
  findmeth          r4(2),   r0(2), lits(chars)

To give a practical example of this, here is the code that the chars method
called on a Scalar holding a Str looks after optimization without the set
elimination:
  sp_getarg_o       r1(2), liti16(0)
  set               r8(2),   r1(2)
  set               r1(3),   r8(2)
  [Annotation: Logged (bytecode offset 24)]
  sp_p6oget_o       r8(3),   r1(3), liti16(16)
  [Annotation: INS Deopt One (idx 0 -> pc 30; line 2838)]
  sp_guardconc      r8(3), sslot(1), litui32(30)
  set              r11(2),   r8(3)
  set               r0(2),  r11(2)
  [Annotation: Line Number: SETTING::src/core/Str.pm6:2838]
  takedispatcher    r3(2)
  sp_p6oget_s       r5(1),   r0(2), liti16(8)
  chars             r6(1),   r5(1)
  hllboxtype_i      r4(3)
  [Annotation: INS Deopt One (idx 1 -> pc 134; line 2839)]
  box_i             r4(4),   r6(1),   r4(3)
  return_o          r4(4)

Notice the four set instructions in there. With the new set elimination
algorithm, we end up with:
  sp_getarg_o       r1(3), liti16(0)
  [Annotation: Logged (bytecode offset 24)]
  sp_p6oget_o       r8(3),   r1(3), liti16(16)
  [Annotation: INS Deopt One (idx 0 -> pc 30; line 2838)]
  sp_guardconc      r8(3), sslot(1), litui32(30)
  [Annotation: Line Number: SETTING::src/core/Str.pm6:2838]
  takedispatcher    r3(2)
  sp_p6oget_s       r5(1),   r8(3), liti16(8)
  chars             r6(1),   r5(1)
  hllboxtype_i      r4(3)
  [Annotation: INS Deopt One (idx 1 -> pc 134; line 2839)]
  box_i             r4(4),   r6(1),   r4(3)
  return_o          r4(4)

Elimination of box/unbox pairs

Another interesting use of DU chains is to eliminate boxing of native values
into objects only to unbox them again a short time later. This can happen due
to the compiler not being smart enough, but if it happens across two subs or
methods, and especially when we have multiple dispatch and polymorphic method
dispatch happening, there's not so much we could do better at that phase.
However, MoarVM does inlining. We can therefore see between boundaries that
we cannot at compile time. Recall how chars produced this boxing code, as
it is declared to return an Int:
  chars             r6(1),   r5(1)
  hllboxtype_i      r4(3)
  box_i             r4(4),   r6(1),   r4(3)

What if we were to write:
my int $chars = $str.chars;

Then the boxing happens just over the boundary. It turns out that there's quite
a lot to do in order to get rid of the boxing instruction, but with use chains
we can already make a start. When we encounter a box, we take a look at if
it has any users that are an unbox. After inlining, we'd see that there is.
Therefore, that unbox instruction can be rewritten to use r6(1) - the
unboxed value.
That much works now. For reasons I'll dig into in my next post, that's not yet
quite enough to eliminate the box_i instruction. So in this case, the saving
is minor. Once we can get rid of that, however, it will be a notable saving in
such cases.
Coming in the future: native ref/deref pairs

One current challenge we have is that if we call a method and pass it a variable
declared with a native type:
my int $foo = $a + $b;
$obj.meth($foo);

Then we don't know if that method is declared as taking an rw parameter or
not. Therefore, we must not pass a native integer value, but instead form a
reference that points to where $foo lives, so we can update it. Of course,
in most cases is rw is not used.
After inlining, we'll be able to see this, and so will be able to use the use
chain to discover when a formed reference is used for nothing more than to do
a dereference. Then we can eliminate that cost.
In summary

Adding use chains has allowed us to detect and fix a small number of usage
handling bugs, given us a way to prevent such bugs happening in the future,
allowed us to improve an existing optimization, provided for efficiently
implementing a new one, and will be an important part of improving the
performance of code using native types in the future. Furthermore, it means
those of us working on MoarVM have more detailed information about why an
operation has to take place in the optimized code, so we can better understand
if we have missed opportunities.
However, that's not the end of the usage story. It turned out that a single
flag for deopt usage would not suffice. Next time, I'll look at why, and what
I've done to address that.