Last week, I did some work to greatly decrease the size of call frames, also know as invocation records, in MoarVM. In theory, a call frame is created whenever a sub, method, regex, or block is entered. In reality, scopes may be flattened away at compile time, which decreases the number of callframes needed. Further to that, dynamic optimization at runtime leads to inlining, which has the same result except it can do it with late-bound calls, even with callees in different compilation units. Even with these optimizations (and in part because of their limitations), most programs will still need to create and destroy many call frames over their lifetime, making their setup and teardown a hot path, and their size a factor in program memory performance.
The work I'll describe in this post has been funded by OETIKER+PARTNER AG, who responded to my recent funding call. In fact, they're funding around 10 hours of work per month over the space of a year, so this will be just the first of a number of posts describing work that they are making possible. So, thanks!
The MVMFrame
data structure has been there since the very earliest days of
MoarVM. My memory is hazy, but I suspect it was among the first dozen data
structures that I sketched out when starting to design and implement the VM.
Initially, call frames were reference counted and allocated out of a special
pool. Now they live either in a per-thread callsite region allocated by
incrementing a pointer and deallocated by decrementing the pointer, much like
a traditional call stack, or on the heap if they "escape" (as a result of a
closure, exception throw, or continuation). At present, that heap promotion
is always lazy (so call frames are always born on the call stack).
Therefore, the size of an MVMFrame
impacts:
- How much memory we have to allocate in the call stack region, and thus how quickly we fill up that region if we're recursing deeply
- How much memory we have to touch when setting up a frame (which impacts on
CPU cache utilization); until this work the frame memory was cleared by a
memset
call, so the size also impacted how much work that had to do - How much we have to copy if promoting the frame onto the heap
- How soon we'll have to GC again (smaller frames on the heap mean longer until we fill the nursery and have to collect)
- How much memory closures take
Over the years, MVMFrame
has grown. Various things were added in support of
additional features. However, not all of those things are used by all frames.
Additionally, some of them, while used widely, were both quite rarely read and
very cheap to compute when they were needed, meaning it was possible to just
re-calculate them on demand.
Two pointers were taken up for storing the effective handlers and bytecode.
The bytecode is the instructions to execute, and the handlers are the regions
of that bytecode covered by exception handlers. At first, these fields did not
exist in MVMFrame
. The bytecode and handlers weren't properties of a given
call, but of the code being called (known as the "static frame" in MoarVM).
So why these fields?
They were introduced with the dynamic optimizer. This produces one or more versions of the frame's bytecode specialized by callsite and type. The specialized version of the code is selected based upon what the callee passes. Since the specialized bytecode contains different (typically many less) instructions, the code offests covered by exception handlers move also, so we need to use an updated table of those when locating an exception handler.
It was certainly convenient to hang pointers to these two off the frame. But we could always
locate them just by following the spesh_cand
pointer in the frame, if it
was set, or the static_info
pointer if not. And this isn't something that we
needed to do so often that this extra dereference was going to add up. The
only common path we needed to do it was on return. But we'd be losing the
instruction to set it in the invocation, so it about balances anyway - and
that's before the considering the memory savings.
With those gone, MVMFrame
shrunk 2 pointers, or 16 bytes in a 64-bit
environment.
When a frame threw an exception, it stored the address in the program that it was thrown at into...the frame that threw it. It then referenced the frame from the exception object. And that was pretty much all the throw address field was used for. This was a very easy 8 bytes (64-bit pointer) to win: just store the address in the exception object anyway. D'oh.
Some things are used by just a small handful of frames:
- Continuation tags (used to find the base of a continuation when taking it)
- Slots to cache dynamic lexical lookups (so
$*FOO
can be found faster) - Special return handlers (a way for the VM to set up internal callbacks when returning into a particular frame, which is a technique used to avoid nested interpreter instances, which are a huge problem when combined with continuations)
- A slot to keep an invoked call capture alive
These added up to 8 pointers and one 16-bit integer. By moving them into an
"extras" data structure hung off the frame, and allocated on demand, space
equivalent to 7 pointers could be saved off the frame. With 64-bit pointers,
that's 56 bytes of savings for most frames. The CORE.setting
compilation
ends up with only 6% of frames needing this extra storage space.
With a 16-bit integer moved off into the extras I realized that, with a
little care, we could re-order things, force an enum to only use a single
byte, and save another 8 bytes off MVMFrame
, simply by not needing some
empty wasted padding space in the struct (C compilers insert this to make
sure memory reads are aligned to correct boundaries).
That's 88 bytes of savings, which is around a cache line and a half on a
typical CPU. It also meant that nearly all of the things left in MVMFrame
were being initialized on every invocation as part of the callframe setup.
Meaning? That the memset
of MVMFrame
could go away, at the cost of just
inserting a couple of instructions to manually zero things out (some of them
on rarely taken paths).
While I was working on this, and looking at profiles, I noticed that a large number of allocations came from creating a buffer to keep track of which named arguments had been used and which had not (for the sake of error reporting and slurpy argument handling). We allocated a byte array, but only really need a bit per named argument. So, I turned the field into a union of a 64-bit bit field for when there are at most 64 named arguments (which is probably just about every real world use case), and fall back to the old byte array approach otherwise.
These changes provide some memory use reductions, but more importantly are good for CPU cache locality. They also knocked 2% off the number of CPU instructions run during CORE.setting compilation; many other programs should see a similar improvement (how much depending on how much of a hot path invocation is).