jnthn/x.md Secret

Framing the problem

In this post I'll be talking a lot about call frames, also known as invocation
records. Just to be clear about what they are, consider a sub:
sub mean(@values) {
    @values.sum / @values
}

Whenever we call mean, we create a call frame. This holds the storage
for the incoming @values parameter. It also holds some temporary storage we
use in executing the sub, holding, for example, the sum method object we get
back when looking up the method, and the result of calling @values.sum, which
we then pass to infix:</>. Call frames also record outer and caller
references (so we can resolve lexical and dynamic variables), the place to store
the return value and go to on return, and other bits. It's important to note
that call frames are not 1:1 with subs/methods/blocks. Perhaps the best
way to understand why is to consider a recursive sub:
sub fac($n) {
    $n <= 1
        ?? 1
        !! $n * fac($n - 1)
}

There's one fac sub but we need a call frame for each invocation of (that is,
call to) fac, since the $n parameter will vary in each call. (Threads are
another example where you're "in" a sub multiple times at the same time.)
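
To make the one-frame-per-invocation point concrete, here is a minimal C sketch. It is not MoarVM code: `FacFrame`, `fac_depth`, and `fac_with_frames` are illustrative names, and the "frame" holds only a caller pointer and the parameter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified call frame: one per invocation, not one per sub. */
typedef struct FacFrame {
    struct FacFrame *caller; /* dynamic chain: the frame that called us */
    long n;                  /* storage for this invocation's $n parameter */
} FacFrame;

/* Count the frames on the caller chain, including the given one. */
size_t fac_depth(const FacFrame *frame) {
    size_t depth = 0;
    for (; frame != NULL; frame = frame->caller)
        depth++;
    return depth;
}

/* fac with explicit frames; *max_depth records the longest chain seen. */
long fac_with_frames(FacFrame *caller, long n, size_t *max_depth) {
    assert(n >= 0 && max_depth != NULL);
    FacFrame frame = { caller, n };  /* a fresh frame for this call */
    size_t depth = fac_depth(&frame);
    if (depth > *max_depth)
        *max_depth = depth;
    return frame.n <= 1
        ? 1
        : frame.n * fac_with_frames(&frame, frame.n - 1, max_depth);
}
```

Calling `fac_with_frames(NULL, 5, &deepest)` has five `FacFrame` records alive at the deepest point of the recursion, all belonging to the one function.
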
All complex software systems evolve from simple systems. MoarVM is no exception.
Back when MoarVM started out, I knew I wanted to have invocation be cheap,
and call frames be fairly lightweight. I also didn't want them to be GC-allocated.
I figured that code sat in a loop, only using native types and only calling
things involving native types, should not create garbage that needed collecting.
All good goals.
Fast forward a few years, and where are we? Let's start out with the easy one
to assess: frames aren't GC-allocated. So that's good, right? Well, sure, in
that I got the natives property that I was after. However, since things like
closures and continuations exist, not to mention that you can get a first-class
reference to a call frame and traverse the outer/caller chain, the
lifetime of frames is interesting. They most certainly don't always
just go away at the point of return. Therefore, they need to have their
memory managed in some way. I went with reference counts, figuring that since
we'd only need to twiddle them fairly occasionally, it'd be fairly OK. Trouble
is, thanks to MoarVM supporting concurrent threads of execution, those counts
need to be incremented and decremented using atomic operations. Those are CPU
native, but they're still a bit costly (more so on some CPUs than others).
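
As a rough illustration of why every count change costs an atomic operation, here is a sketch using C11 `<stdatomic.h>`. MoarVM uses its own `MVM_incr`/`MVM_decr` wrappers rather than these calls, and `RcFrame` and the `rc_frame_*` functions are made-up names. The decrement has the same shape as the `MVM_decr(&frame->ref_count) == 1` test in the `MVM_frame_dec_ref` excerpt later in this post: the operation reports the value before the decrement, so seeing 1 means the count just reached zero.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical reference-counted frame; only the count matters here. */
typedef struct RcFrame {
    atomic_uint ref_count;
} RcFrame;

/* A new frame starts with one reference (its place on the call chain). */
void rc_frame_init(RcFrame *frame) {
    atomic_init(&frame->ref_count, 1);
}

/* Take a reference: an atomic read-modify-write, even when uncontended. */
void rc_frame_inc(RcFrame *frame) {
    atomic_fetch_add_explicit(&frame->ref_count, 1u, memory_order_relaxed);
}

/* Drop a reference. Returns true when this call released the last one,
 * in which case the caller is responsible for destroying the frame. */
bool rc_frame_dec(RcFrame *frame) {
    unsigned before = atomic_fetch_sub_explicit(&frame->ref_count, 1u,
                                                memory_order_acq_rel);
    assert(before > 0);  /* a decrement below zero would be a bug */
    return before == 1;
}
```
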
There's another, more hidden, cost, however - one I didn't really see coming.
MoarVM has a generational garbage collector, as discussed in my previous post.
But frames are not garbage collectable objects. They're managed by reference
counts. So what happens when a reference counted frame is referenced by a
second generation object? Well, there's no risk of the frames going away too
early; the reference count won't be decremented until the gen2 object itself is
collected. The problem is about the objects the frame references. Frames, not
being garbage collectable, don't have write barriers applied on binds
into them. This means that they can come at any time to point to nursery
objects. We solved this by keeping all objects referencing frames in the
inter-generational root set. This is perfectly safe. Unfortunately, it also
greatly increases the cost of garbage collection for programs that build up
large numbers of closures in memory and keep them around. Of course, since
write barriers are cheap but not free, we get a performance win on all programs
by not having to apply them to writes to working registers or lexicals.
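
For readers who have not met write barriers before, the following C sketch shows what an ordinary barrier does on a bind, under simplifying assumptions: objects have a single reference slot, and the inter-generational root set is a small fixed array. All names (`BarrierObject`, `BarrierGC`, `bind_with_barrier`) are hypothetical. Frames skip this check entirely, which is why anything referencing a frame has to sit in the root set permanently instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define BARRIER_GEN2        1u  /* object lives in the old generation */
#define BARRIER_IN_ROOT_SET 2u  /* object is already in the root set */
#define BARRIER_MAX_ROOTS   16

typedef struct BarrierObject {
    unsigned flags;
    struct BarrierObject *slot;  /* a single reference field, for brevity */
} BarrierObject;

typedef struct BarrierGC {
    BarrierObject *gen2_roots[BARRIER_MAX_ROOTS]; /* inter-generational roots */
    size_t num_gen2_roots;
} BarrierGC;

/* Bind value into target->slot. If an old object now points at a young one,
 * remember the old object so the next nursery collection will scan it.
 * Returns false if this toy root set is full. */
bool bind_with_barrier(BarrierGC *gc, BarrierObject *target,
                       BarrierObject *value) {
    assert(gc != NULL && target != NULL);
    bool old_to_young = (target->flags & BARRIER_GEN2)
                     && value != NULL
                     && !(value->flags & BARRIER_GEN2);
    if (old_to_young && !(target->flags & BARRIER_IN_ROOT_SET)) {
        if (gc->num_gen2_roots == BARRIER_MAX_ROOTS)
            return false;
        gc->gen2_roots[gc->num_gen2_roots++] = target;
        target->flags |= BARRIER_IN_ROOT_SET;
    }
    target->slot = value;
    return true;
}
```
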
So, how about invocation cost? Is invocation cheap? Well, first of all, let's
turn off inlining:
SET MVM_SPESH_INLINE_DISABLE=1

And measure 10 million invocations passing/receiving one argument using Perl 5,
NQP, and Rakudo. Perl 5 does them in 2.85s. NQP comes out a little ahead, at
2.45s. Rakudo strolls through them in an altogether too leisurely 6.14s. (Turn
inlining back on, and Rakudo manages it in 3.39s.) So, if NQP is already ahead,
is MoarVM really so bad? Well, it could certainly be better. On an idealized
3GHz CPU, each invocation is costing around 735 CPU cycles. That's pricey. The
other issue here is that just matching Perl 5 on invocation speed isn't really
enough, because tons of things that aren't invocations in Perl 5 actually are
in Perl 6 (like, every array and hash index). In a "Perl 6 is implemented in
Perl 6" world, we need to squeeze a good bit more out of invocation performance.
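
The 735-cycle figure matches the NQP timing quoted above. As a worked check, assuming exactly 10 million invocations and a clock of 3 cycles per nanosecond (and charging all loop overhead to the invocation):

```latex
\[
\frac{2.45\ \text{s}}{10^{7}\ \text{invocations}} = 245\ \text{ns per invocation},
\qquad
245\ \text{ns} \times 3\ \tfrac{\text{cycles}}{\text{ns}} = 735\ \text{cycles per invocation}
\]
```
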
And finally, what about size? An MVMFrame comes at a cost of 296 bytes. It
points to a chunk of working space together with a lexical environment (both
arrays). Every single closure we take also pays that fixed 296 byte cost (and,
of course, the cost of the lexical environment storage, since that's what we
actually take closures for). Again, not staggeringly huge, but it adds up
very quickly.
These are all areas that need improvement. In fact, they make up two of the
entries in the performance section of the proposal
for the grant I'm doing this work under. So, I decided it was time to start
thinking about how I'll address them.
Some measurements

I was curious how many frames end up referenced by garbage collectable objects
against how many never end up in this situation. So, I quickly patched MoarVM to
keep track of if a frame ever came to be referenced by a GC-able object:
diff --git a/src/core/frame.c b/src/core/frame.c
index ca1a4d2..f392aca 100644
--- a/src/core/frame.c
+++ b/src/core/frame.c
@@ -114,7 +114,10 @@ MVMFrame * MVM_frame_dec_ref(MVMThreadContext *tc, MVMFrame *frame) {
      * to zero, so we look for 1 here. */
     while (MVM_decr(&frame->ref_count) == 1) {
         MVMFrame *outer_to_decr = frame->outer;
-
+if (frame->refd_by_object)
+    tc->instance->refd_frames++;
+else
+    tc->instance->non_refd_frames++;
         /* If there's a caller pointer, decrement that. */
         if (frame->caller)
             frame->caller = MVM_frame_dec_ref(tc, frame->caller);
diff --git a/src/core/instance.h b/src/core/instance.h
index b14f11d..4f61000 100644
--- a/src/core/instance.h
+++ b/src/core/instance.h
@@ -365,6 +365,9 @@ struct MVMInstance {

     /* Cached backend config hash. */
     MVMObject *cached_backend_config;
+
+MVMuint64 refd_frames;
+MVMuint64 non_refd_frames;
 };

 /* Returns a true value if we have created user threads (and so are running a
diff --git a/src/main.c b/src/main.c
index 5458912..1df4fe3 100644
--- a/src/main.c
+++ b/src/main.c
@@ -189,7 +189,9 @@ int main(int argc, char *argv[])

     if (dump) MVM_vm_dump_file(instance, input_file);
     else MVM_vm_run_file(instance, input_file);
-
+printf("Ref'd frames: %d\nNon-ref'd frames: %d\n",
+    instance->refd_frames,
+    instance->non_refd_frames);
     if (full_cleanup) {
         MVM_vm_destroy_instance(instance);
         return EXIT_SUCCESS;

And measured a few things (the later entries are benchmark names from
perl6-bench):
Measured                    Ref'd       Non-ref'd       % Ref'd
========                    =====       =========       =======
NQP startup                 0           5259            0.0%
NQP regex tests             28065       1682655         1.6%
Compile Perl 6 actions      115092      6100770         1.7%
Compile Perl 6 grammar      130716      5451120         2.3%
Compile CORE.setting        2065214     55771097        3.8%
Perl 6 startup              35          12822           0.3%
Compiling Test.pm6          39639       860474          4.4%
Compiling NativeCall.pm6    145426      1887682         7.2%
while_array_set             993701      6024920         14.1%
while_hash_set              1804        2024016         0.1%
for_assign                  1654        1020831         0.2%
for_concat_2                1743        2023589         0.1%
split_string_regex          8992750     19089026        32.0%
create_and_iterate_hash_kv  14990870    40027814        27.2%
parse_json                  10660068    42364909        20.1%
rc-forest-fire              3740096     16202368        18.8%
rc-mandelbrot               89989       5523439         1.6%
rc-man-or-boy-test          791961      7091381         10.0%
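
The last column can be recomputed from the two counters. The helper below is a sketch that assumes "% Ref'd" means ref'd frames as a share of all frames, a reading that reproduces, for example, the NQP regex tests and split_string_regex rows. `refd_percent` and `print_refd_row` are illustrative names, not part of the patch. Since the counters are 64-bit, the sketch prints them with `PRIu64` from `<inttypes.h>`.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Share of frames that became referenced by a GC-able object, in percent.
 * Assumes the column is ref'd / (ref'd + non-ref'd); 0 if there are none. */
double refd_percent(uint64_t refd, uint64_t non_refd) {
    assert(refd <= UINT64_MAX - non_refd);  /* guard against overflow */
    uint64_t total = refd + non_refd;
    return total == 0 ? 0.0 : 100.0 * (double)refd / (double)total;
}

/* Print one row in roughly the layout of the table above. */
void print_refd_row(const char *name, uint64_t refd, uint64_t non_refd) {
    printf("%-27s %-11" PRIu64 " %-15" PRIu64 " %.1f%%\n",
           name, refd, non_refd, refd_percent(refd, non_refd));
}
```
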

What can we infer from this? First of all, most NQP programs have at most just
a few percent of their frames referenced by GC-able objects. With the Perl 6
benchmarks, it's all over the map, with split_string_regex being the "worst"
case. NQP's optimizer is much better at doing lexical-to-local lowering, and
flattening away scopes that we don't really need. In Rakudo, we're pretty weak
at that. Clearly, some more work on this area could benefit Rakudo (and yes,
it's also on the list of things to do under my grant).
Secondly, since - even in the worst cases - the majority of frames never get
themselves tied up with any "interesting" situations that cause them to become
GC-referenced, a strategy that handles them differently - and hopefully far
more efficiently - would give us a win.
What GC-able things reference frames?

It was fairly easy to grep through the MoarVM source and make a list. I did so to
help me think through the cases:

- A frame being closed over (closures)
- A frame being captured in a continuation
- A frame being referenced by an exception
- A local/lexical being referenced by a native reference
- A frame becoming the default outer thanks to "auto-close" (rare)
- A frame getting wrapped in a context object, to use it as a first-class object
  (also, hopefully, rare in any hot-path code)

It's also interesting to note that a frame only ever "escapes" such that it can
be touched by another thread if it becomes referenced by a GC-able object.
What makes frames take up space?

Next, I decided to go through the MVMFrame data structure and see where the
space is going, and what options might exist for saving that space. What follows
is an analysis of all the fields in an MVMFrame.
/* The thread that is executing, or executed, this frame. */
MVMThreadContext *tc;

Interestingly, this one gets cleared after a certain point in the frame's life,
except if it's captured in a continuation. Exception handling uses it to know
if the frame is still on the call stack, which is interesting in various cases.
GC marking uses it to know if it should mark ->work (see below).
Nothing seems to care overly much at the moment that it points to a particular
thread context; every user wants it only as a flag. So, it's certainly
a candidate for removal. It's also worth noting that in every case where
a frame is not referenced by an object, it is alive solely by
being in a thread's "call stack" - that is, the call chain from following the
->caller pointer from the currently executing frame of a thread. So, the flag
will only matter for frames that are GC-referenced.
/* The environment for this frame, which lives beyond its execution.
* Has space for, for instance, lexicals. */
MVMRegister *env;

Relevant for frames in whatever state.
/* The temporary work space for this frame. After a call is over, this
* can be freed up. Must be NULLed out when this happens. */
MVMRegister *work;

Relevant for frames that are still executing, or that are captured by a
continuation. Cross-cuts whether they are GC-referenced.
/* The args buffer. Actually a pointer into an area inside of *work, to
* decrease number of allocations. */
MVMRegister *args;

Possibly could go away through a level of indirection, but it's performance
sensitive. Used together with...
/* Callsite that indicates how the current args buffer is being used, if
* it is. */
MVMCallsite *cur_args_callsite;

...this one.
/* The outer frame, thus forming the static chain. */
MVMFrame *outer;

Pretty much everything has an outer.
/* The caller frame, thus forming the dynamic chain. */
MVMFrame *caller;

Pretty much everything has a caller too.
/* The static frame information. Holds all we statically know about
* this kind of frame, including information needed to GC-trace it. */
MVMStaticFrame *static_info;

As you might guess, this is pretty important and useful. However, it's also
possible to obtain it - at the cost of a level of indirection - through the
->code_ref below. Would need to measure carefully, since it'd increase the
cost of things like lexical lookups from outer frames (and, once we get better
at optimizing, that will be "most of them").
/* The code ref object for this frame. */
MVMObject *code_ref;

The particular closure we were invoked as. Not something we can obviously lose,
and needed for the lifetime of the frame in general.
/* Parameters received by this frame. */
MVMArgProcContext params;

Argument processing context. Every frame uses it to process its arguments. It's
only useful while ->work is active, however, and so could be allocated as a
part of that instead, which would reduce the cost of closures.
/* Reference count for the frame. */
AO_t ref_count;

Can go away provided we stop reference counting frames.
/* Is the frame referenced by a garbage-collectable object? */
MVMint32 refd_by_object;

Could also go away provided we stop reference counting frames and have some
scheme for optimizing the common, non-referenced case.
/* Address of the next op to execute if we return to this frame. */
MVMuint8 *return_address;

/* The register we should store the return value in, if any. */
MVMRegister *return_value;

/* The type of return value that is expected. */
MVMReturnType return_type;

/* The 'entry label' is a sort of indirect return address
* for the JIT */
void * jit_entry_label;

These four are only used when the frame is currently on the call stack, or may
be re-instated onto the call stack by a continuation being invoked. Could also
live with ->work, thus making closures cheaper.
/* If we want to invoke a special handler upon a return to this
* frame, this function pointer is set. */
MVMSpecialReturn special_return;

/* If we want to invoke a special handler upon unwinding past a
* frame, this function pointer is set. */
MVMSpecialReturn special_unwind;

/* Data slot for the special return handler function. */
void *special_return_data;

/* Flag for if special_return_data need to be GC marked. */
MVMSpecialReturnDataMark mark_special_return_data;

Used relatively occasionally (and the more common uses are candidates for
spesh, the dynamic optimizer, to optimize out anyway). A candidate for hanging
off an "extra stuff" pointer in a frame. Also, only used when a frame is on the
call stack, with the usual continuation caveat.
/* Linked list of any continuation tags we have. */
MVMContinuationTag *continuation_tags;

Used if this frame has been tagged as a possible continuation "base" frame.
Only relevant if that actually happens (which is quite rare in the scheme of
things), and can only happen when a frame is on the call stack. A candidate for
similar treatment to the special return stuff.
/* Linked MVMContext object, so we can track the
* serialization context and such. */
/* note: used atomically */
MVMObject *context_object;

This is used when a context goes first-class. Thus, it implies the frame is
referenced by at least one GC-able object (in fact, this points to said
object). That's fairly rare. It can happen independently of whether the
frame is currently executing (so, unrelated to ->work lifetime).
/* Effective bytecode for the frame (either the original bytecode or a
* specialization of it). */
MVMuint8 *effective_bytecode;

/* Effective set of frame handlers (to go with the effective bytecode). */
MVMFrameHandler *effective_handlers;

/* Effective set of spesh slots, if any. */
MVMCollectable **effective_spesh_slots;

/* The spesh candidate information, if we're in one. */
MVMSpeshCandidate *spesh_cand;

These are all related to running optimized/specialized code. Only interesting
for frames currently on the call stack or captured in a continuation (so,
->work lifetime once again).
/* Effective set of spesh logging slots, if any. */
MVMCollectable **spesh_log_slots;

/* If we're in a logging spesh run, the index to log at in this
* invocation. -1 if we're not in a logging spesh run, junk if no
* spesh_cand is set in this frame at all. */
MVMint8 spesh_log_idx;

/* On Stack Replacement iteration counter; incremented in loops, and will
* trigger if the limit is hit. */
MVMuint8 osr_counter;

These three play a part in dynamic optimization too, though more in the stage
where we're gathering information. Again, they have ->work lifetime. The top
may well go away in future optimizer changes, so not worth worrying over too
much now.
/* GC run sequence number that we last saw this frame during. */
AO_t gc_seq_number;

This one is certainly a candidate for going away, post-refactoring. It serves
as the equivalent of a "mark bit" when doing GC.
/* Address of the last op executed that threw an exception; used just
* for error reporting. */
MVMuint8 *throw_address;

May be something we can move inside of exception objects, and have them pay for
it, not every frame. Worth looking into.
/* Cache for dynlex lookup; if the name is non-null, the cache is valid
* and the register can be accessed directly to find the contextual. */
MVMString   *dynlex_cache_name;
MVMRegister *dynlex_cache_reg;
MVMuint16    dynlex_cache_type;

These also have ->work lifetime. Give a huge speed-up on dynlex access, so
(aside from re-designing that) they can stay.
/* The allocated work/env sizes. */
MVMuint16 allocd_work;
MVMuint16 allocd_env;

These exist primarily because we allocate work and env using the fixed
size allocator, and so we need the sizes to free the memory.
/* Flags that the caller chain should be kept in place after return or
* unwind; used to make sure we can get a backtrace after an exception. */
MVMuint8 keep_caller;

/* Flags that the frame has been captured in a continuation, and as
* such we should keep everything in place for multiple invocations. */
MVMuint8 in_continuation;

/* Assorted frame flags. */
MVMuint8 flags;

It appears the top two could be nicely folded into flags. Also, the flags may
only be relevant for currently executing frames, or those captured in a
continuation, so this lot is a candidate to move to something with ->work
lifetime.
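
Folding the two single-purpose bytes into the flags field could look like the sketch below. The bit positions are purely hypothetical (the existing flags field already has uses of its own), as are the `FoldedFlags` type and the helper names.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit assignments for the two folded-in fields. */
#define FOLDED_KEEP_CALLER     (1u << 0)
#define FOLDED_IN_CONTINUATION (1u << 1)

typedef struct FoldedFlags {
    uint8_t flags;  /* one byte instead of three separate MVMuint8 fields */
} FoldedFlags;

void folded_set(FoldedFlags *f, uint8_t flag) {
    f->flags = (uint8_t)(f->flags | flag);
}

void folded_clear(FoldedFlags *f, uint8_t flag) {
    f->flags = (uint8_t)(f->flags & ~flag);
}

bool folded_has(const FoldedFlags *f, uint8_t flag) {
    return (f->flags & flag) != 0;
}
```
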
Observations

Here are some things that stand out to me, and that point the way to an
alternate design.

- An MVMFrame presently carries a bunch of things in it that aren't relevant
  unless the frame is either currently on a thread's call stack or captured in
  a continuation.
- This is an orthogonal axis to whether the frame is referenced by something
  that is garbage-collectable.
- It's further orthogonal to one of a number of relatively rare things that
  can happen and need storage in the frame.
- Frames that are never referenced by a garbage collectable object will only
  ever have a reference count of 1, because they will only be alive by virtue
  of being either the currently executing frame of a thread, or in its caller
  chain.
- Frames only become referenced by something garbage collectable in cases where
  we'd end up with some other garbage-collectable allocation anyway. For
  example, in the closure case, we allocate the code-ref that points to the
  referenced outer frame.
- Let's assume we were to allocate all frames using the GC, and consider the
  analysis that would let us know when we are able to avoid those allocations.
  The analysis needed would be escape analysis.
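
One way to picture the first three observations is to split the structure along those axes. The sketch below is only an illustration of that idea, not a layout MoarVM uses or is committed to: the type names are invented, field types are simplified to plain pointers, and only a handful of the fields discussed above are shown.

```c
#include <stddef.h>
#include <stdint.h>

typedef union SplitRegister { int64_t i64; double n64; void *o; } SplitRegister;

/* Rarely needed state, allocated only when one of the rare cases occurs. */
typedef struct SplitFrameExtra {
    void (*special_return)(void *data);
    void (*special_unwind)(void *data);
    void *special_return_data;
    void *continuation_tags;
} SplitFrameExtra;

/* State that only matters while the frame is on a call stack or captured
 * in a continuation, i.e. state with ->work lifetime. */
typedef struct SplitFrameActive {
    SplitRegister   *work;
    SplitRegister   *args;
    uint8_t         *return_address;
    SplitRegister   *return_value;
    void            *jit_entry_label;
    SplitFrameExtra *extra;   /* NULL in the common case */
} SplitFrameActive;

/* What a closure has to keep alive after the call has returned. */
typedef struct SplitFrameCore {
    SplitRegister         *env;
    struct SplitFrameCore *outer;
    struct SplitFrameCore *caller;
    void                  *code_ref;
    SplitFrameActive      *active;  /* NULL once the frame has returned */
} SplitFrameCore;

/* Fixed bytes a returned, closed-over frame would retain under this split. */
size_t split_closure_frame_bytes(void) {
    return sizeof(SplitFrameCore);
}
```

On a typical 64-bit platform the core part comes to 40 bytes, against the 296 bytes quoted earlier for a full MVMFrame, though a real layout would need more fields than this sketch carries.
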

A new approach: the big picture

Taking these into account, I arrived at a way forward that should, I hope,
address most of the issues at hand.
Every thread will have a chunk of memory that we'll refer to as its "call stack".
Every new frame created during normal program execution will be allocated by
making space for it, including its ->work and ->env, on this stack. This
will need:

- No reference count, because we know it's 1
- No gc_seq_number, because we can use the stack topology to make sure we only
  mark each frame once
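
A minimal sketch of such a per-thread call stack, assuming a single fixed-size region and simple bump allocation, might look like this. The names (`StackFrame`, `CallStack`, `call_stack_push`, and so on) are hypothetical, and a real implementation would need to grow or chain regions rather than fail when the region is full.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef union StackRegister { int64_t i64; double n64; void *o; } StackRegister;

typedef struct StackFrame {
    struct StackFrame *caller;
    StackRegister *work;  /* temporaries, placed right after the header */
    StackRegister *env;   /* lexicals, placed right after work */
} StackFrame;

typedef struct CallStack {
    unsigned char *mem;   /* the thread's region */
    size_t size, used;
    StackFrame *cur_frame;
} CallStack;

/* Round n up to the strictest fundamental alignment. */
static size_t stack_align(size_t n) {
    size_t a = _Alignof(max_align_t);
    return (n + a - 1) / a * a;
}

int call_stack_init(CallStack *cs, size_t size) {
    cs->mem = malloc(size);
    cs->size = size;
    cs->used = 0;
    cs->cur_frame = NULL;
    return cs->mem != NULL;
}

void call_stack_destroy(CallStack *cs) {
    free(cs->mem);
    cs->mem = NULL;
}

/* Allocate a frame plus its work and env registers in one bump. */
StackFrame *call_stack_push(CallStack *cs, size_t num_work, size_t num_env) {
    size_t header = stack_align(sizeof(StackFrame));
    size_t bytes  = stack_align(header
                                + (num_work + num_env) * sizeof(StackRegister));
    if (bytes > cs->size - cs->used)
        return NULL;  /* region exhausted */
    unsigned char *base = cs->mem + cs->used;
    memset(base, 0, bytes);
    StackFrame *frame = (StackFrame *)base;
    frame->caller = cs->cur_frame;
    frame->work   = (StackRegister *)(base + header);
    frame->env    = frame->work + num_work;
    cs->used += bytes;
    cs->cur_frame = frame;
    return frame;
}

/* Returning from a call just moves the bump pointer back. */
void call_stack_pop(CallStack *cs) {
    StackFrame *frame = cs->cur_frame;
    assert(frame != NULL);
    cs->used = (size_t)((unsigned char *)frame - cs->mem);
    cs->cur_frame = frame->caller;
}
```

Cleanup on return is a pointer adjustment, with no reference count to touch and nothing for the GC to collect.
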

Should this frame ever become referenced by a garbage collectable object, then
we will GC-allocate a frame on the garbage-collected heap - as a totally normal
garbage-collectable object. The frame state will be copied into this. The work
space and environment will also be allocated from the fixed-size allocator,
and the data migrated there.
Since this frame is now garbage-collectable, we have to check its ->caller to
see if it's on the thread-local stack, or already been promoted to the heap. If
the former, we repeat the above process for it too. This is in order to uphold
the key invariant in this design: the thread-local stack may point to things in
the garbage-collectable heap, but never vice-versa.
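
The promotion step and the invariant can be sketched as follows. This is a simplification under stated assumptions: `malloc` stands in for GC allocation, a boolean marks where a frame lives, the frame carries only a caller pointer and a dummy payload, and fixing up other pointers to the old stack copy is left out. `PromoFrame`, `promote_to_heap`, and `heap_never_points_to_stack` are invented names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct PromoFrame {
    bool on_heap;               /* false: lives on a thread's call stack */
    struct PromoFrame *caller;
    int payload;                /* stands in for the rest of the state */
} PromoFrame;

/* Copy frame to the heap, then keep going along its caller chain until we
 * reach NULL or a frame that was already promoted. Returns the heap copy
 * (or the frame itself if it was already on the heap). */
PromoFrame *promote_to_heap(PromoFrame *frame) {
    PromoFrame *result = NULL;
    PromoFrame **link = &result;  /* where the next heap pointer is stored */
    while (frame != NULL && !frame->on_heap) {
        PromoFrame *copy = malloc(sizeof *copy);
        if (copy == NULL)
            abort();
        *copy = *frame;           /* migrate the frame state */
        copy->on_heap = true;
        *link = copy;
        link = &copy->caller;     /* this caller pointer gets fixed up next */
        frame = frame->caller;
    }
    *link = frame;                /* NULL or an already-promoted frame */
    assert(result == NULL || result->on_heap);
    return result;
}

/* The key invariant: no heap frame has a caller that is still on a stack. */
bool heap_never_points_to_stack(const PromoFrame *frame) {
    for (; frame != NULL; frame = frame->caller)
        if (frame->on_heap && frame->caller != NULL && !frame->caller->on_heap)
            return false;
    return true;
}
```
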
This means the reference counting and its manipulation goes away entirely, and
that frames that are heap-promoted become subject to the usual generational
rules. Frames that would never be heap-referenced never end up on the heap,
don't add to GC pressure, and can be cleaned up immediately and cheaply.
There are some details to care about, of course. Since generational collection
involves write barriers, then binds into frames on the garbage-collectable
heap will also be subject to write barriers. Is that OK? There are two cases
to consider.

- Binding of lexicals. Since most lexicals in Perl 6 point to a Scalar,
  Array, or Hash in my declarations, or point directly to a read-only
  object if they are parameters, this is relatively rare (of course, write
  barriers apply to the Scalar itself). In NQP, loads of lexicals are lowered
  to locals already, and we'll do some more of that in Rakudo too, making it
  rarer still. Long story short, we can afford write barriers on lexical
  binds.
- Binding of stuff in ->work, which basically means every write into the
  register set of the interpreter. This, we cannot afford to barrier.
  However, there are only two cases where a frame is promoted to the heap
  and has ->work. One case is when it's still executing, and so in the
  call chain of a thread. In this case, we can take care to always walk the
  objects in ->work by simply following the call chain. The second case is
  when a continuation is taken. But here, there are no binds to registers until
  the continuation is invoked again - at which point things are back in a
  thread's call chain.

Refactoring towards it

The thing that makes this a somewhat scary piece of work is that, in making call
frames potentially collectable objects, we break an assumption that has been
there since week 1 of MoarVM's development: that call frames never move. To
maximize the chances of discovering problems with this refactor, I decided that
step 1 would be to always allocate every single call frame on the heap. Only
when that is working would I move on to optimizing away most of those heap
allocations by adding the thread-local call stack.
MoarVM currently has 3 kinds of collectable:

Objects
Type objects
STables

So, I added a fourth: call frames. As a result, MVMFrame gains an MVMCollectable
at the start of the data structure - which will be present whether it's stack or
heap allocated. This will start out zeroed when a frame is born on the call stack.
This does two nice things: it gives us a way to know if a frame is GC-able or not,
and also means the write barrier - without modification - will do the right thing
on both stack and heap frames.
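
To see why a zeroed header gives both properties, consider this sketch. The flag names and values are hypothetical, as are `HdrCollectable`, `HdrFrame`, and the two functions; the frame has a single lexical slot, and a counter stands in for adding the frame to the inter-generational root set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define HDR_CF_FRAME      (1u << 0)  /* set when the frame is GC-allocated */
#define HDR_CF_SECOND_GEN (1u << 1)  /* set once promoted to the old generation */

typedef struct HdrCollectable {
    unsigned flags;
} HdrCollectable;

typedef struct HdrFrame {
    HdrCollectable  header;   /* all zero for a frame born on the call stack */
    HdrCollectable *lexical;  /* a single lexical slot, for brevity */
} HdrFrame;

/* A frame is GC-managed exactly when its header has been filled in. */
bool hdr_frame_is_gcable(const HdrFrame *f) {
    return f->header.flags != 0;
}

/* Bind a lexical using the ordinary barrier test. For a stack frame the
 * zeroed header means the old-generation check can never succeed, so the
 * barrier is a no-op without any special casing. Returns true on a hit. */
bool hdr_bind_lexical(HdrFrame *f, HdrCollectable *value, size_t *barrier_hits) {
    assert(f != NULL && barrier_hits != NULL);
    bool hit = (f->header.flags & HDR_CF_SECOND_GEN)
            && value != NULL
            && !(value->flags & HDR_CF_SECOND_GEN);
    if (hit)
        (*barrier_hits)++;  /* stand-in for adding f to the root set */
    f->lexical = value;
    return hit;
}
```
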
There were two more easy things to do. First was to add a function to allocate
a heap frame.
Second was to factor out frame destruction from reference decrement,
since the latter was going away.
Beyond that, there was nothing for it besides diving in, breaking the world, and
then trying to put it back together again. I got a good start towards it - but
the conclusion of this first step will have to wait for next week's installment!
See you then.