Dynamic lookups and context introspection with inlining
Inlining is one of the most important optimizations that MoarVM performs.
Inlining lets us replace a call to some
Method with the
code that is inside of it. The most immediate benefit is to eliminate the
overhead of calling, but that's just the start. Inlined code has often already
been specialized for a certain set of argument types. If we already have
proven those argument types in the caller, then there's no need to re-check
them. Inlining can also expose pairs of operations that can cancel, such as
box/unbox, and bring the point a control exception is thrown into the scope
where it is caught, which may allow the exception throw to be rewritten to a
In a language like Perl 6, where every operator is a call to a multiple dispatch subroutine, inlining can be a significant win. In the best cases, inlining can lead to smaller code, because the thing that is inlined ends up being smaller than the bytecode for the call sequence. Of course, often it leads to bigger code, and so there's limits to how much of it we really want to do. But still, we've been gradually pushing on with increasing the range of things that we're able to inline.
The problem with inlining is that the very call boundaries it does away with may carry semantic significance. In this post, I'll talk about a couple of operations that became problematic as we ramped up our inlining capabilities, and discuss a new abstraction I recently added to MoarVM - the frame walker - which provides a common foundation for solving them.
A little inlining history
Inlining first showed up in MoarVM back in 2014, not too many months after the type-specializing optimizer was added. MoarVM has done speculative optimizations from the start, performing deoptimization (falling back to the interpreter) in the case that an unexpected situation shows up. But what if such a situation showed up in code that had been inlined? Then we'd have to pretend we never did the inlines! Therefore, MoarVM can uninline too - that is, untangle the results of inlining and produce a call stack as if we'd been running the unoptimized code.
We also from the start supported nested inlines - that is, inlining things that
themselves contained inlines. However, this early inlining was restricted in
what it could handle. The first implementation could not inline anything with
exception handlers, though that was supported within a couple of months. But
still, it could not inline closures. Only
subs in the outermost scope or
CORE.setting and methods were possible to inline, because those
were the only cases where we had enough information about what was being
called, which is a decided prerequisite for inlining it.
Aside from bug fixes, things stayed the same until 2017. The focus in that time largely switched away from performance and towards the Perl 6.c release, which was in itself exhausting. Summer of 2017 brought some very large changes to how spesh worked, moving optimization to a background thread, along with changing and extending the statistics that were collected. A new kind of call optimization became possible, whereby if we could not prove what we were going to call, but the statistics showed a pattern, then we could insert a guard and speculatively optimize the call. Speculative inlining fell neatly out of that. Suddenly, a bunch more things could be considered for inlining.
Further work lifted some of the inlining restrictions. Deoptimization learned
how to cope if we deoptimized in the middle of processing named arguments. It
became possible to inline various kinds of closure by rewriting the lexical
lookup operations into an indirection through the code object of the code that
we had inlined. It also became possible to inline code involving lexical throws
of exceptions and their handlers. Since that is how
return works in Perl 6,
that again made more things possible to inline. A more fine-grained analysis
allowed us to do some amount of cross-language inlining, meaning bits of the
Rakudo internals written in NQP could be inlined into the Perl 6 code calling
them. I'll add at this point that while it's easy to write a list of these,
realizing various of them was quite challenging.
Now it's summer 2018, and there's some more advances. Previously, we would only do an inlining if we already had produced a specialized version of the callee. This usually worked out, and we sorted by maximum call stack depth and specialized deepest first to help with that. However, sometimes that was not enough, and we missed inlining opportunities. So, during the last month, I added support for producing code to inline on-demand. I also observed that we were only properly doing speculative (that is, based on statistics) inlines of calls made that were expected to return an object, but not those in void context. (If that sounds like an odd oversight, it's because void calls were previously rare. It was only during the last month, when I improved code-gen to spot a lot more opportunities to spit out void context calls, that we got a lot more of them and I spotted the problem.)
More is better, no?
Being able to inline a wider range of calls is a good thing. However, it also
made it far more likely that we would run into constructs that don't cope well
with inlining. We've got a long way by marking ops that we know won't cope well
with it as
:noinline (and then gradually liberalizing that over time). The
improvements over the previous month created a more difficult problem, however.
We have a number of ops that allow for introspection of the call stack. These
are used to implement Perl 6 features such as the
However, they are also the way that
$/ can be set by things like
ctx op as
:noinline got us a long way, but we ran into trouble
because once you have a context handle you could then start traversing from
it to callers, and then starting a lookup relative to that point. But what if
the caller was an inline? Then we don't have a callframe to reference in the
context object that we return.
A further problem was that a non-introspection form of dynamic lookup, which traverses the lexical chain hanging off each step of the dynamic chain, also was not aware of inlines. In theory, this would have become a problem when we started doing inlining of closures last year. However, since it is used in a tiny number of places, and those places didn't permit inlining, we didn't notice until this month, when inlining started to cover more cases.
Normal dynamic lookup, used for
$*foo style variables, has been inline-aware
for about as long as we've had inlining. However, this greatly complicated the
lookup code. Replicating such inline compensation code in a bunch of places was
clearly a bad idea. It's a tricky problem since we're effectively trying to
model a callstack that doesn't really exist by using information telling us
what it would look like if it did. It's the same problem that deoptimization
has to solve, except this time we're just imagining what the call stack would
look like unoptimized, not actually trying to recreate it. It's certainly not
a problem we want solved repeatedly around the VM's implementation.
A new frame walker
To help tackle all of these problems, I introduced a new abstraction: the specialization-aware frame walker. It provides an iterator over the call stack as if no inlining had taken place, figuring out as much as it needs to in order to recreate the information that a particular operation wants.
First, I used it to make caller-dynamic lookup inline-aware. That went pretty well, and immediately fixed one of the module regressions that was "caused" by the recent inlining improvements.
Next, I used it to refactor the normal dynamic lookup. That needed careful work unteasing the details of dynamic lookup and caching of dynamic variable lookups from the inlining traversal. However, the end result was far simpler code, with much less duplication, since the JITted inline, interpreted inline, and non-inline paths largely collapsed and were handled by the frame walker.
Next up, contexts. Alas, this would be trickier.
Previously, context traversal had worked eagerly. When we asked for the an object representing the caller or outer of a particular context, we immediately walked one frame in the appropriate direction and produced a result. Of course, this did not go well if there were inlines, since it always walked one real frame, but that may have multiple inlined frames within it.
One possibility was to make the
ctx op immediately deoptimize the whole call
stack, so that we then had a chain of real call frames to traverse. However,
when I looked at some of the places using
ctx, it became clear this would
have very made consequences: every regex match causing a global deopt was not
going to work!
Another option, that would largely preserve the existing design, was to store extra information about the inline we were inside of. This, however, had the weakness that we might do a real deoptimization between the two points, thus invalidating the information. That was probably also possible to fix up, but the complexity of doing so put me off that approach.
Instead, I switched to a model where "move to caller" and "move to outer" would be stored as displacements to apply when the context object was used in order to obtain information. The frame walker could make these movements, before doing whatever lookup was then required. Thus, even if a deopt takes places between obtaining the context and using it, we can still do a correct traversal.
Too much laziness
This helped, but wasn't quite enough either. If a context handle was taken and used immediately, things worked out fine. However, if the situation was like this:
+-------------------------+ | Frame we used ctx op on | (Frame 1) +-------------------------+ | Caller | v +-------------------------+ | Frame with inlines | (Frame 2) +-------------------------+
And then we used the handle some time later, things worked out less well. The problem was that I used the current return address of Frame 2 in order to understand which inline(s) we were inside of. However, if we'd executed more code in Frame 2, then it could have made another call. The current return address could thus point to the wrong inline. Oops.
However, since the
ctx operation at the start of the lookup is never inlined,
and a given call frame can only ever be called from one location in the caller,
there's a solution. If the
ctx op is used to get a first-class reference to
a frame on the call stack, we walk down the call stack and make sure that each
frame called from inlined code preserves enough location information that we
can later reconstruct what we need. It only needs to walk down the call stack
until it sees a point where another
ctx operation already preserved that
information, so in programs with lots of use of the
ctx op, we should avoid
doing full stack walks each time, and just walk over recently created frames.
With those changes, the various Perl 6 modules exhibiting lookup problems since this month's introduction of more aggressive inlining were fixed. Along the way, a common mechanism was introduced allowing us to walk a call stack as if no inlines had taken place. There's at least one more place that we can use this: in order to make stack trace output not be sensitive to inlining. I'll get to that in the coming weeks, or it might make a nice task for somebody looking to get themselves (more) involved with MoarVM development. Last but not least, I'd like to once again thank The Perl Foundation for organizing the funding that made this work possible.