How MoarVM became the Rakudo backend of choice

The first day of the advent calendar remarked that during 2014, MoarVM became the de facto standard virtual machine to run Perl 6 on. And, with Perl 6 on course to hit 6.0.0 in 2015, MoarVM is what most people adopting Perl 6 at this point will be working with. In this post, we'll take a look at what MoarVM brings to the table, and where it's heading from here.

But first - why a VM?

Many modern languages depend on a good deal of runtime support. Perl 6 is very much in this camp, and the same goes for Perl 5. A VM is in the business of providing that support. With Perl 5, the word "interpreter" tends to be used over VM, but the goal is the same: provide the things needed to execute programs in the language across disparate hardware and operating system environments. The difference is mostly that we tend to use VM when there's a clean separation of the language compiler and the runtime support - meaning that the VM might also be used to execute other languages. This has happened with the JVM, for example. MoarVM is squarely aimed at being a good host for the Rakudo Perl 6 compiler, and the NQP subset of Perl 6 that Rakudo is mostly implemented in. If any other language finds it useful, that's fine, but it's not a primary design consideration.

So why care so much about a clean interface between compiler and runtime? Because, in a project the size of "implement Perl 6", managing implementation complexity is crucial. Establishing a clean boundary between the compiler and the runtime, and knowing what each side is responsible for, is one strategy for doing that. (Rakudo's compilation process is also broken up into a pipeline of stages that only communicate by passing well-defined data structures from one to the next, which is another form of architectural boundary.)

But why interpreters and VMs at all? Why not go straight down to the metal? For any language where you can reasonably figure out a lot about the program statically, that is a fine strategy. The information needed to eliminate many forms of late binding, checks, and duplicate work is to hand at compile time, and a decent optimizer will be able to use that information. Of course, this approach restricts you to what you can see at compile time. Some compilers as a result support profile-guided optimization, where you run the program on typical workloads, collect information about where time is spent and what code-paths are common and uncommon, and then compile the program again taking that data into account.

Profile-guided optimization hints at a strategy for efficiently executing languages where you can know rather less at compile time: put off optimization until you observe the program running. This isn't a new idea, and there has been a massive amount of work done in this area. At its heart is the general principle that most programs are, if I may pun on a term from the database field, "eventually static". Even though a Perl 6 program may have a huge number of points of potential dynamism and genericity, in most programs just a handful of them are ever actually exploited. Of course, different runs of the program on different inputs may tickle different bits of that dynamism, not to mention that as a program grows and evolves, these flexible points will be exploited here and there to (hopefully cleanly) add new functionality.

A modern VM aiming to support a language like Perl 6 is not, therefore, just there to run the code and provide services like garbage collection, OS abstraction, and so on. Its job is to take a program in all its dynamic glory, watch what code paths it actually takes, see what types really show up - and then produce optimized versions of carefully selected parts of the program, with the unused dynamism ruthlessly torn out, and much of the remaining dynamism massaged into more restricted but faster forms.

Bytecode specialization

So what exactly does MoarVM do with a program? First, it needs to find code worth considering optimizing. This is done by seeking the hot parts of a program: routines that are called again and again, and loops that are doing a lot of iterations. Since time to optimize a routine is proportional to the size of the routine, the threshold for "it's worth considering" goes with code size. A small routine - such as a simple operator or an accessor method - has rather little code and will become hot quickly. A larger routine heats up more slowly. It's just like our kettle: a little water boils quickly, but fill it up and you're going to be waiting a bit for that cuppa. This is in part about risk management: we would prefer to avoid investing time optimizing code that's never going to give a positive return on investment. We can't predict the future, but some cheap, simple heuristics are at least a good start.
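
To make the idea concrete, here's a rough C sketch of a size-scaled hotness check. The structure, field names, and numbers are all invented for illustration; MoarVM's real bookkeeping is organized differently.

```c
#include <stdint.h>

/* Hypothetical sketch: decide when a routine is "hot enough" to be worth
 * specializing. The threshold scales with code size, so a tiny accessor
 * qualifies after a handful of calls while a big routine takes longer.
 * The names and numbers here are invented, not MoarVM's real internals. */
typedef struct {
    uint32_t bytecode_size;   /* size of the routine's bytecode, in bytes */
    uint32_t invocations;     /* how many times we have seen it called    */
} RoutineStats;

static int worth_specializing(const RoutineStats *r) {
    uint32_t threshold = 100 + r->bytecode_size / 8;
    return r->invocations >= threshold;
}
```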

So, we've got some hot code. What now? Well, if it's a call, we start by looking at the callsites that are invoking it. The callsite tells us how many arguments are being passed, whether they are native values or objects, and - for named arguments - what their names are. We can then start to produce bytecode specialized by callsite. Know that the object parameter is passed an object argument? Then replace the op that checks if we need to box the incoming argument with a simpler op that just copies the pointer. Know that an optional parameter actually doesn't get passed? Then strip out the code to look for it, toss the branch, and just always run the code that sets the default value. Know that the named parameter "config" is always the first passed named argument? Then retrieve it by indexed access (so a cheap pointer copy), rather than messing about with strings.
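
Here's a hypothetical sketch of the kind of information a callsite hands the specializer, and of a question that can be answered once at specialization time rather than on every call. The names and layouts are invented for illustration, not MoarVM's actual data structures.

```c
#include <stdint.h>
#include <string.h>

/* Invented callsite descriptor: how many positional args are passed,
 * whether each is an object or a native value, and the names of any
 * named args - all of it known per call site. */
#define ARG_OBJ    1
#define ARG_NATIVE 2

typedef struct {
    uint8_t      num_pos;       /* positional argument count          */
    uint8_t      arg_flags[8];  /* ARG_OBJ or ARG_NATIVE per position */
    const char  *names[4];      /* named args, in the order passed    */
    uint8_t      num_names;
} Callsite;

/* At specialization time we can ask once: at which index does the named
 * argument "config" arrive? The specialized bytecode then fetches it
 * with a cheap indexed load instead of a string search on every call. */
static int named_arg_index(const Callsite *cs, const char *name) {
    for (uint8_t i = 0; i < cs->num_names; i++)
        if (strcmp(cs->names[i], name) == 0)
            return (int)i;
    return -1;  /* not passed: specialized code can just run the default */
}
```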

Next up: types. If we are passed arguments, what types are they? If we look up constants, what types do those have? Knowing these is relatively easy: we just peek at the args buffer to see what's being passed, or look up the constant and see what type it is. However, knowing them also presents us with a huge range of opportunities. See that method call? It doesn't need to look up the method in a table each call any more, since we know where it's going. And that attribute access? Since we know the object layout, it can just become a pointer offset from the object's memory address. That type check? We often know the answer statically now - and can eliminate entire branches that won't be taken. Oh, and that multiple dispatch? Let's just resolve it once now we know the types we're dealing with, not every single call. The list goes on.
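
As a rough illustration of the attribute case, here's what "we know the object layout" buys, sketched in C with invented names: the generic name-based lookup collapses into a read at a fixed offset.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical object layout (names invented): once the type, and so
 * the layout, is known, an attribute access is just a load at a fixed
 * offset from the object pointer. */
typedef struct {
    void    *type;     /* pointer identifying the object's type     */
    int64_t  attr_x;   /* attribute storage lives at a fixed offset */
    int64_t  attr_y;
} ExampleObject;

/* The generic code would search the type's attribute table by name on
 * every access; the specialized code bakes in the offset it computed
 * once, e.g. offsetof(ExampleObject, attr_x). */
static int64_t read_attr_at_offset(void *obj, size_t offset) {
    return *(int64_t *)((char *)obj + offset);
}
```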

Specialization works at the bytecode level, because that's what the VM gets. And a specialization of a given piece of bytecode comes with a set of guards: conditions that must be met for it to be applicable. Those can constrain it by callsite and argument types. And we can produce multiple specializations for a given piece of original bytecode.

Of course, choosing the appropriate specialization is a cost too. It would eat into our winnings if we had to do that on every call. Thankfully, there are ways to avoid that. If we are optimizing a call to something and know the argument types being passed, we can pick the specialization of the callee and statically compile the caller to always call that specialization, rather than having to hunt for the right one. And, if the callee is rather small, we can often do away with the call entirely and simply inline the callee into the caller - eliminating the need to create a callframe and keeping the code tighter, which CPU caches like.
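
To pull the last couple of paragraphs together, here's a hand-wavy C sketch of specializations carrying guards, and of selecting one by checking those guards. The names and fixed-size arrays are invented for brevity, not MoarVM's real representation. The key point is that in an optimized caller this matching runs once, at specialization time, and the winner is called directly (or inlined), rather than searched for on every call.

```c
#include <stdint.h>
#include <stddef.h>

/* Invented structures: a specialization is rewritten bytecode plus the
 * guards (callsite and argument types) under which it is valid. One
 * piece of original bytecode can carry several such variants. */
typedef struct {
    uint16_t arg_index;       /* which argument this guard constrains */
    void    *expected_type;   /* the type the specialization assumed  */
} TypeGuard;

typedef struct {
    void      *callsite;        /* the exact callsite it was made for  */
    TypeGuard  guards[8];
    size_t     num_guards;
    uint8_t   *bytecode;        /* the rewritten, specialized bytecode */
} Specialization;

typedef struct {
    uint8_t        *original_bytecode;  /* the safe, generic version   */
    Specialization  specs[4];
    size_t          num_specs;
} CodeObject;

/* Pick a specialization by checking its guards against the actual
 * arguments. An optimized caller that already knows its argument types
 * does this once and links the call straight to the chosen variant. */
static int pick_specialization(const CodeObject *code, void *callsite,
                               void **arg_types, size_t num_args) {
    for (size_t i = 0; i < code->num_specs; i++) {
        const Specialization *s = &code->specs[i];
        if (s->callsite != callsite)
            continue;
        int ok = 1;
        for (size_t g = 0; g < s->num_guards; g++) {
            uint16_t idx = s->guards[g].arg_index;
            if (idx >= num_args || arg_types[idx] != s->guards[g].expected_type) {
                ok = 0;
                break;
            }
        }
        if (ok)
            return (int)i;     /* run this specialized variant      */
    }
    return -1;                 /* fall back to the generic bytecode */
}
```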

Optimization speculation

So, all this is nice, but we want MOAR MOAR MOAR. Because there are lots of other things that we can't statically figure out the types of, but that may end up having very stable types come runtime. Some examples are lexical variables that we close over, object attributes, and return values of our callees. So how do we get hold of these types?

When code becomes hot, we don't actually optimize it right away. Well, we do for the simple callsite-related transformations. But before going and doing the type-driven stuff, we take a closer look at what's actually going on in the program. The specializer quickly produces an instrumented version of the program that records what types show up where. This runs for the next bunch of invocations, and logs what it sees. After a threshold is hit, we have a look at the types that showed up and do a little analysis. If a given logging site consistently logs the same type, we're onto something. If we see more than one type show up there, we consider that part of the program as being typically dynamic and leave it be.
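
Here's a sketch of what a logging site might look like, with an invented structure and a made-up slot limit. The real instrumentation is more involved, but the "did exactly one type show up here?" question is the heart of it.

```c
#include <stddef.h>

/* Invented type-logging site: the instrumented code records the distinct
 * types observed at this point; the specializer later asks whether the
 * site was monomorphic. */
#define LOG_SLOTS 8

typedef struct {
    void  *seen_types[LOG_SLOTS];  /* distinct types observed here */
    size_t num_seen;
    size_t hits;                   /* how many times the site ran  */
} TypeLogSite;

static void log_type(TypeLogSite *site, void *type) {
    site->hits++;
    for (size_t i = 0; i < site->num_seen; i++)
        if (site->seen_types[i] == type)
            return;                        /* already recorded */
    if (site->num_seen < LOG_SLOTS)
        site->seen_types[site->num_seen++] = type;
}

/* Worth speculating on only if exactly one type ever showed up. */
static void *stable_type(const TypeLogSite *site) {
    return site->num_seen == 1 ? site->seen_types[0] : NULL;
}
```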

The thing is, we didn't actually prove a property of the program here, just profiled it and figured out what tends to happen. So how can we use this information? The trick is to insert a guard into the specialized program that cheaply asserts that we got the expected type. Beyond that, we can optimize assuming we really did. (We also make sure that we actually do an optimization based upon what the guard checks, and don't bother to insert it otherwise.)
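
In code, the guard itself can be very cheap - roughly a pointer compare and a conditional branch. Here's an illustrative sketch with invented names; deoptimization itself is just a stub here.

```c
/* Invented sketch of a speculative type guard. Every object carries a
 * pointer to its type; the guard compares it against the type the
 * profile predicted and bails out if the prediction was wrong. */
typedef struct {
    void *type;                /* the object's type, checked by guards */
    /* ... the rest of the object ... */
} ObjectHeader;

/* Placeholder: the real thing restores interpreter state and resumes
 * in the unspecialized bytecode at the matching point. */
static void deoptimize(void) { }

static inline void guard_type(ObjectHeader *obj, void *expected_type) {
    if (obj->type != expected_type)
        deoptimize();          /* assumption violated: take the slow path */
}
```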

Which is all well and good when the guard is met, but what about when the assertion fails? In fact, this is a more general problem. Even the far simpler type-based optimizations on incoming parameters assume that an object's type will not change - and of course, thanks to mixins, they can. All of these situations trigger deoptimization: replacing the optimized code we're in the midst of running with the unoptimized, naive, but safe version that is dynamic enough to handle all the cases.

Deoptimization is quite a pain to get right. For one, if we were inside of inlined code, we have to go back and create all the callframes we skipped. Since inlining can be many levels deep, that can be a bunch of callframes to wire up. Then we have to make sure that our optimizations don't throw away code that produces a value the deoptimized code relies on being there, but the optimized code would otherwise not need. Of course, we still need to do dead code elimination in general; it's just that it has to be aware of deoptimization.
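
One way to picture the bookkeeping this needs: a table mapping each potential deoptimization point in the optimized code to the chain of original frames that must be recreated. This is an invented sketch of the shape of that data, not MoarVM's actual deopt tables.

```c
#include <stdint.h>
#include <stddef.h>

/* Invented deopt bookkeeping in the presence of inlining: for each point
 * where optimized code may deoptimize, record which original routines
 * were inlined there and where execution resumes in each one's
 * unoptimized bytecode, so the skipped callframes can be recreated. */
typedef struct {
    void     *routine;        /* routine to build a callframe for         */
    uint32_t  resume_offset;  /* where to resume in its original bytecode */
} FrameToRestore;

typedef struct {
    uint32_t        optimized_offset;  /* the deopt point in optimized code */
    FrameToRestore  frames[8];         /* outermost first                   */
    size_t          num_frames;        /* 1 when nothing was inlined        */
} DeoptEntry;
```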

Therefore, it's not only important that MoarVM can optimize code. It's also critical to its ability to do so that it can deoptimize, falling back to slow paths when needed.

Down to machine code

Interpreting specialized bytecode can yield, in some programs, a significant improvement. The simple program:

```perl6
my $i = 0; while ++$i <= 100000000 { }
```

Got a factor of 2.5 improvement with what we had several months ago, and there are still a lot of opportunities left. However, interpreting has its overhead. Before specialization strips out a lot of the dynamism, that overhead is only a modest part of the cost, since most ops have plenty of checks to do anyway. But specialized code tends to consist of lots of very cheap operations that just play with pointers - and then the overhead of interpreting can quickly come to account for the majority of the time.

Therefore, on x64, we can often take the further step of producing machine code to run, instead of specialized bytecode to interpret. Of course, we have to be ready to deoptimize from that too. I'll save talking about this too much, since the guy who worked on it is, I believe, plotting an advent post about it in the near future. Needless to say, it makes a dramatic difference; for the benchmark above it more than halved the time again relative to the specialized bytecode.

We often referred to the "produce machine code" part of the work as "the JIT". But really, the entire set of optimizations described in this article are part of the typical work a JIT compiler does. Producing machine code is just one small part, and it's interesting to note that the win from eliminating dynamism was a bigger one than eliminating interpretation overhead.

What next?

The work so far has had a dramatic impact on performance. However, there are still plenty of areas for improvement.

For one, the number of specializations produced and how we choose them can certainly use some tweaking. It's possible for a routine to saturate its number of allowed specializations with ones that, in the long run, are not entirely strategic. Or we may be able to reasonably argue that it should be allowed more. At the same time, specialized bytecode is a notable memory overhead at this point. So, we need to produce both more and fewer specializations. Of course, what that really means is we need a better strategy for picking what is worth doing.

Another clear issue is that some speculative optimizations turn out to be bad bets, and we have no way of recovering from that. We can deoptimize pretty fast (in fact, some trade-offs have been made to have it be quite affordable). But it's still a cost, and it's possible a gamble that looked good based on available data may lead to code that has to be deoptimized a majority of the time. Again, it's about being smarter and keeping on learning as we run the program.

Then there's a bunch of things that we don't specialize or compile to machine code in a good way yet. Array access is an obvious one, and big integer operations - which are at the heart of Perl 6's Int type - are another. The specializer is also not as aware of closures as it really should be. These are all areas where we could do a whole lot better - and they are nice, incremental advances on what we already have.

A further interesting area for exploration is doing escape analysis, and being able to eliminate a bunch of GC allocations in favor of allocating the memory as part of the call frame, because we know it can never end up being used beyond it. This is a fair amount of work, but also highly relevant to Perl 6: many scalar containers for lexical variables would be great candidates.
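
Here's a toy C illustration of the payoff, with malloc standing in for a GC allocation and all names made up: when analysis can prove a container never escapes the frame, its storage can live in the frame itself and the allocation disappears entirely.

```c
#include <stdint.h>
#include <stdlib.h>

/* Toy stand-in for a scalar container. */
typedef struct { int64_t value; } ScalarContainer;

/* Without the analysis: each iteration heap-allocates a container that
 * the GC must eventually deal with. */
static int64_t sum_with_allocation(int64_t n) {
    int64_t total = 0;
    for (int64_t i = 0; i < n; i++) {
        ScalarContainer *s = malloc(sizeof(ScalarContainer));
        s->value = i;
        total += s->value;
        free(s);   /* only to keep the toy leak-free; a GC would do this */
    }
    return total;
}

/* With the analysis: the container provably never leaves the frame, so
 * its storage can live in the callframe (here, the C stack) instead. */
static int64_t sum_without_allocation(int64_t n) {
    int64_t total = 0;
    for (int64_t i = 0; i < n; i++) {
        ScalarContainer s = { i };   /* same data, no heap allocation */
        total += s.value;
    }
    return total;
}
```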

Then there's the whole idea of trace JIT. MoarVM so far has a routine-level JIT, often called a method JIT. There, we optimize at routine level first, and may do some inlining to flatten away the call tree. This is a pretty decent strategy for Perl 6, in so far as we're often very interested in inlining primitive operators that, naively interpreted in the original dynamic program, would be multiple dispatch calls. Trace JITs, by contrast, just follow the flow of instructions, over however many callframes, and every single conditional branch becomes a guard. Their strength is loops, which today we cope with using On Stack Replacement (detect we're in a hot loop, compile an optimized version of it, and hot-swap it). While it's easy to talk about trace JIT vs. method JIT as a "pick one", I'm much more interested in the "pick both" track. They both, after all, share a lot of infrastructural needs - and they have their strengths and weaknesses. There's more than one way to dynamically optimize it, and, this being Perl, we might as well just do all the ways - spotting all the nice unifications and re-use opportunities along the way.
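
On the OSR side, the detection part can be as simple as counting backward branches; here's an invented sketch of that trigger. The interesting work - compiling the optimized loop and resuming inside it at the equivalent point - is of course where the real effort goes.

```c
#include <stdint.h>

/* Invented sketch of hot-loop detection for On Stack Replacement. The
 * interpreter bumps a counter on each taken backward branch; once the
 * loop proves hot, the VM compiles an optimized version of the enclosing
 * code and hot-swaps into it at a safe point. Threshold is made up. */
#define OSR_THRESHOLD 10000

typedef struct {
    uint32_t backedge_count;   /* taken backward branches in this frame */
    int      osr_requested;    /* set once; the swap happens at a safe
                                  point chosen by the VM                */
} LoopProfile;

static void on_backward_branch(LoopProfile *lp) {
    if (!lp->osr_requested && ++lp->backedge_count >= OSR_THRESHOLD)
        lp->osr_requested = 1;   /* trigger optimization + hot-swap */
}
```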
