A recent document listed shortcomings with the MoarVM dynamic optimizer. Rather than trying to attack them all at once, they will be taken on in groups, and some of them individually. The first stage of changes, described in this document, impacts the overall structure of the specialization process. The shortcomings it aims to address include:
- No baseline optimization; it's all or nothing
- Optimization is done on threads running code
- Bad statistics
- Playing fast and loose with aliasing
- OSR doesn't sample values written outside of the loop
And it will make progress on:
- Insufficient testing
- Closures, even directly invoked, are opaque to spesh
It's hard to describe the changes without any forward references, so here is a big picture view of what's changing.
- A specialization worker thread will be spawned at startup to do all of the specialization and JIT-compilation work.
- Each thread running code will have a log buffer that it writes into. This log will include information on hot code and logged types. When the buffer fills, it will be posted to the specialization thread by placing it into a concurrent queue.
- In debug mode, a thread will, after sending a log buffer, pause until the specialization worker has processed the buffer and produced specializations based upon the logged information. This means that, at least for a single threaded program, it will be possible to get deterministic behavior.
- There will be two kinds of specialization: certain specialization and speculative specialization. Certain specializations can never cause a deoptimization in their own right, and so may not depend on observed types and/or code objects that would entail the insertion of guards. Speculative specialization, by contrast, uses observed types and code objects.
- Selection of a specialization will be done using either a mini-interpreter or JIT-compiled specialization selectors. This will allow guards themselves to be turned into more efficient code (for example, turning a `Scalar` fetch to validate the value inside into a pointer read).
- The container-type-with-value-inside guards will go away, in favor of guards placed after `decont` (and type logging of `decont` instructions). This is a simplification, resolves potential aliasing bugs, and may also improve performance.
- The addition of logging and guarding on static frame type will allow for specialization selection and inlining in places where it is presently not possible.
No specialization means the original bytecode is run by the interpreter. Besides there being no optimizations, this mode will also write logs of the types encountered, for the sake of the specializer.
Certain specializations are keyed on interned callsites only, and so may be called whenever they match an interned `MVMCallsite`, without any other checks on the incoming arguments being performed. A certain specialization may use the values of literals, including type information from a `wval`. Since it is considered immutable once resolved, it may also use the recorded result of a `getlexstatic`. No other logged information may be used, since that would entail the insertion of guards. A single certain specialization will be produced, if needed, for the case where there is not an interned callsite; such code never receives any optimization today.
Certain specializations are most suitable for megamorphic code, or for cases that rarely show up in the type log. There is a limit on how many speculative specializations may be produced, to stop an explosion of them in megamorphic code paths; these paths will all fall back to using a certain specialization. This means that code paths that today are too type-variable to optimize will still end up with some base level of optimization, including JIT compilation.
Speculative specializations will typically include type guards on the incoming argument types, and may include further type guards or code object static frame guards inside of the optimized code. These may trigger deoptimization. Speculative specializations are much like what MoarVM does exclusively today. However, it is hoped that the way data will be collected and analyzed will lead to smarter production of speculative specializations.
It is tempting to see speculative and certain specialization as tiers. While this is somewhat true in that a speculative specialization will typically give the opportunity for far more aggressive optimization, it's also the case that a frame that the logs indicate to be most likely monomorphic, or only lightly polymorphic, will immediately be given a speculative specialization or two. For example, it's highly unlikely that the Perl 6 `infix:<+>(Int, Int)` multi candidate will ever be called with anything except a pair of `Int` objects in the vast majority of programs.
It is important to try and collect the right amount of data. Too little, and required specializations may not be produced or there may not be enough data to produce the correct ones. Too much, and it will eat memory, contain a lot of duplication, and give the specialization worker a lot to analyze. It's also important to have a design that copes with the required specializations evolving over time.
As such, each static frame will have an interpreted calls counter. It will start at zero, and be incremented each time the interpreted code is entered. It will be treated as follows (numbers are examples, and will be tweaked):
- `n < 10` means no logging at all
- `10 < n < 200` means information is logged
- `n >= 200` means no further logging is performed; we "got enough"
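The counter treatment above can be sketched as a simple predicate. This is illustrative only: the thresholds are the example numbers from the text, and `should_log` is a hypothetical name, not a MoarVM identifier.

```c
#include <stdint.h>

/* Example thresholds from the text; the real values will be tweaked. */
#define LOG_START 10
#define LOG_STOP  200

/* Decide whether this interpreted entry to a frame should write log
 * records, given the frame's interpreted calls counter n. Boundary
 * handling follows the document's example ranges literally. */
static int should_log(uint32_t n) {
    return n > LOG_START && n < LOG_STOP;
}
```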
Separately to this, the thread context also has a log entry ID. This will be incremented and copied into a slot in the frame (the "log entry ID"), and will be used when logging callsite and type information. This means that types can be correlated with callsites and other types in the same call, allowing the specializer to spot patterns or use the logs in abstract interpretations.
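As a toy illustration of that correlation mechanism (all type and field names here are invented stand-ins, not MoarVM's), stamping each frame invocation with a fresh ID lets every record logged during the same call be grouped later:

```c
/* Hypothetical stand-ins for the thread context and a frame. */
typedef struct { int next_log_entry_id; } ThreadCtx;
typedef struct { int log_entry_id;      } FrameState;

/* On entering a frame, bump the thread-wide counter and stamp the
 * frame; every record logged during this call copies the stamp. */
static void stamp_frame(ThreadCtx *tc, FrameState *f) {
    f->log_entry_id = tc->next_log_entry_id++;
}
```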
A log is thread local, hanging off the `ThreadContext`. Its entries are stored in an array of the following struct:
struct {
    MVMint32 record_kind;
    MVMint32 log_entry_id;
    union {
        struct {
            MVMStaticFrame *sf;
            MVMCallsite *cs;
        } entry;
        struct {
            MVMObject *type;
            MVMint32 flags; /* For now, just has a bit for "concrete" */
            MVMint32 bytecode_offset;
        } type;
        struct {
            MVMObject *code_object;
        } code;
        struct {
            MVMint32 bytecode_offset;
        } osr;
    };
};
Where `record_kind` is one of:
- `Entry(entry)` (when we enter a frame in the interpreter)
- `Parameter(type)` (the type of a received parameter)
- `Decont(type)` (the type of a value resulting from a `decont`)
- `Lexical(type)` (the type of a value resulting from a lexical lookup)
- `Attribute(type)` (the type of a value resulting from an attribute lookup)
- `Invoke(code)` (the code object that was invoked)
- `OSR(osr)` (an OSR point was crossed)
The union branch used is specified in the parentheses. (Note that in the OSR case, this means we'll have information logged about things outside of the loop body, overcoming a limitation of the current design. Also, hot loops will quickly fill up a buffer, so specialization for them will happen quite soon.)
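The record kinds above map naturally onto an enumeration. The constant names below are hypothetical (the actual MoarVM identifiers may differ); each comment notes which union branch is live for that kind.

```c
/* Hypothetical record-kind constants for the log entry struct. */
typedef enum {
    LOG_ENTRY,      /* entry: a frame was entered in the interpreter  */
    LOG_PARAMETER,  /* type:  type of a received parameter            */
    LOG_DECONT,     /* type:  type of the result of a decont          */
    LOG_LEXICAL,    /* type:  type of the result of a lexical lookup  */
    LOG_ATTRIBUTE,  /* type:  type of the result of an attribute read */
    LOG_INVOKE,     /* code:  the code object that was invoked        */
    LOG_OSR         /* osr:   an OSR point was crossed                */
} LogRecordKind;
```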
When the specialization worker thread installs a new specialization for a static frame, it resets its interpreted entry count back to zero. This means that further specializations for still-interpreted paths can be identified and introduced.
The worker thread may in some cases decide it wishes to do further recording to produce more type specializations as the program's workload evolves (maybe because the certain specialization is still being hit very often); to do this, it will update the specialization selector to temporarily exclude the certain specialization, resulting in the interpreter being used and logging being performed again.
A thread will be spawned immediately at startup to serve as the specialization worker. It will receive recorded metrics and further work to do from a concurrent blocking queue (meaning that everything passed to it is a MoarVM object). A new representation, `MVMSpeshLog`, will be used to send logged specialization data to the worker; the log array hanging off the thread context will be stored wrapped up in this (so the GC marking logic will be contained in the REPR).
When a log is sent to the specialization thread, the thread context's current log pointer will be nulled. It will be up to the specialization worker to put a new and empty log there (which may in fact be the same object that was used before but emptied out, to allow memory re-use). While the pointer is NULL, no log data will be recorded. Therefore, the specialization worker can control the amount of data it gets and keep it in step with its work rate. Over time, the application will reach a state where all required specializations have been produced, meaning nothing will be interpreted, no logs will be produced, and so the specialization thread will cheaply sleep, blocked by the kernel.
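A minimal model of this hand-off (all names invented for illustration; real MoarVM uses a concurrent blocking queue and an `MVMSpeshLog` REPR object): posting nulls the thread's log pointer, and logging becomes a no-op until the worker installs a fresh buffer.

```c
#include <stddef.h>

/* Toy stand-ins for the log buffer and thread context. */
typedef struct { int entries_used; } Log;
typedef struct { Log *spesh_log;   } Ctx;

static Log *queue_slot;                       /* pretend concurrent queue */
static void queue_push(Log *log) { queue_slot = log; }

/* Post the filled buffer to the worker and stop logging until the
 * worker hands a fresh (or recycled) buffer back. */
static void post_log(Ctx *tc) {
    queue_push(tc->spesh_log);
    tc->spesh_log = NULL;
}

/* Record something only if a log is currently installed. */
static int try_log(Ctx *tc) {
    if (tc->spesh_log == NULL)
        return 0;
    tc->spesh_log->entries_used++;
    return 1;
}
```

Because producers silently skip logging while their pointer is NULL, the worker throttles the data rate simply by choosing when to hand buffers back.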
For the debug case, the `MVMSpeshLog` object will contain a condvar or barrier used to make the posting thread wait, and the specialization worker thread will signal this after it has processed the buffer and reacted to its content.
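One plausible shape for that synchronization, sketched with POSIX primitives (MoarVM would use its own condvar wrappers; the type and function names here are illustrative):

```c
#include <pthread.h>

/* Hypothetical barrier carried by the posted log object in debug mode. */
typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    int             processed;  /* set once the worker handled the buffer */
} DebugBarrier;

/* Posting thread: block until the worker says it has reacted. */
static void wait_until_processed(DebugBarrier *b) {
    pthread_mutex_lock(&b->mutex);
    while (!b->processed)
        pthread_cond_wait(&b->cond, &b->mutex);
    pthread_mutex_unlock(&b->mutex);
}

/* Worker thread: mark the buffer handled and wake the poster. */
static void signal_processed(DebugBarrier *b) {
    pthread_mutex_lock(&b->mutex);
    b->processed = 1;
    pthread_cond_signal(&b->cond);
    pthread_mutex_unlock(&b->mutex);
}
```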
The specialization worker should make sure that, at regular safe intervals, it polls to check whether it should join in a GC run. Not needing to participate in GC while waiting for work falls naturally out of the blocking queue.
Initially, there are no specializations, and the interpreter runs performing its logging. Each static frame contains a specialization selector, which will be updated by the specialization worker thread. This will be done in such a way that no locking is needed (for example, by having readers perform a single pointer read, with outdated selectors freed at a safepoint).
The specialization selector will be structured as an operation tree, a lot like the multi-dispatch cache today. This will be walked to find a matching specialization. However, we can also JIT the tree into machine code. The branching nature of the tree will allow fewer tests than the current design. It will also allow far more efficient guarding, by being able to lift up an "is this a `Scalar` container" check and then, in child nodes of the "yes" branch, do a direct memory access to the value slot rather than having to do the late-bound fetch by calling via a function pointer. The composite "container containing something of type T" guards will go away and be expressed in these simpler terms.
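To make the tree idea concrete, here is a deliberately simplified sketch; the node layout, op names, and walker are all invented for illustration, and a real selector would support more guard operations than a bare type test.

```c
#include <stddef.h>

/* A tiny guard tree: interior nodes test something about the incoming
 * arguments; leaves name the specialization to select. */
typedef enum { GUARD_TYPE, GUARD_RESULT } GuardOp;

typedef struct GuardNode {
    GuardOp           op;
    const void       *expected;  /* type to compare, for GUARD_TYPE */
    int               result;    /* specialization index, for GUARD_RESULT */
    struct GuardNode *yes;       /* taken when the test passes */
    struct GuardNode *no;        /* taken when it fails */
} GuardNode;

/* Walk the tree; -1 means "no specialization applies". In MoarVM this
 * walk could itself be JIT-compiled into straight-line checks. */
static int select_spec(const GuardNode *n, const void *arg_type) {
    while (n != NULL) {
        if (n->op == GUARD_RESULT)
            return n->result;
        n = (arg_type == n->expected) ? n->yes : n->no;
    }
    return -1;
}
```

Because shared tests sit near the root, each incoming type is checked at most once on any path, which is where the "fewer tests" saving comes from.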
The guards that look at the value inside of a container will be removed. In their place will come guards at the point of decontainerization. This will address various shortcomings with regard to aliasing in the current design (where we may miss that a container is updated), and allow the elimination of fragile logic in this area. A future stage of spesh improvement will introduce more sophisticated alias analysis.
To compensate for this, type logging of `decont` instructions will be added, and type guards added after a `decont` operation. As today, guards that go unused will be eliminated. While this may at first blush seem like a performance setback - two guards instead of one - in reality the guards for value-in-container are costly, since they must call the container fetch operation. By contrast, the `decont` instruction will typically be optimized into a cheap memory dereference, and the guard to check for a type is also cheap. Furthermore, situations where the actual type of the value taken out of the container is not important currently pay for a container and value guard anyway; this situation will naturally be resolved by this change.
The target code object of an invocation will be logged, and a new guard for static frame of an invocation target will be added. This will enable us to do inlining or at least pre-selection of the specialized code variant in cases that today do not permit it.