A recent document listed shortcomings with the MoarVM dynamic optimizer. Rather than trying to attack them all at once, they will be taken on in groups, and some of them individually. The first stage of changes, described in this document, impacts the overall structure of the specialization process. The shortcomings it aims to address include:
- No baseline optimization; it's all or nothing
- Optimization is done on threads running code
- Bad statistics
- Playing fast and loose with aliasing
- OSR doesn't sample values written outside of the loop
And it will make progress on:
- Insufficient testing
- Closures, even directly invoked, are opaque to spesh
It's hard to describe the changes without any forward references, so here is a big picture view of what's changing.
- A specialization worker thread will be spawned at startup to do all of the specialization and JIT-compilation work.
- Each thread running code will have a log buffer that it writes into. This log will include information on hot code and logged types. When the buffer fills, it will be posted to the specialization thread by placing it into a concurrent queue.
- In debug mode, a thread will, after sending a log buffer, pause until the specialization worker has processed the buffer and produced specializations based upon the logged information. This means that, at least for a single threaded program, it will be possible to get deterministic behavior.
- There will be two kinds of specialization: certain specialization and speculative specialization. Certain specializations can never cause a deoptimization in their own right, and so may not depend on observed types and/or code objects that would entail the insertion of guards. Speculative specialization, by contrast, uses observed types and code objects.
- Selection of a specialization will be done using either a mini-interpreter or JIT-compiled specialization selectors. This will allow guards themselves to be turned into more efficient code (for example, turning a `Scalar` fetch to validate the value inside into a pointer read).
- The container-type-with-value-inside guards will go away, in favor of guards placed after `decont` (and type logging of `decont` instructions). This is a simplification, resolves potential aliasing bugs, and may also improve performance.
- The addition of logging and guarding on static frame type will allow for specialization selection and inlining in places where it is presently not possible.
No specialization means the original bytecode is run by the interpreter. Besides there being no optimizations, this mode will also write logs of the types encountered, for the sake of the specializer.
Certain specializations are keyed on interned callsites only, and so may be called whenever they match an interned `MVMCallsite`, without any other checks on the incoming arguments being performed. A certain specialization may use the values of literals, including type information from a `wval`. Since it is considered immutable once resolved, it may also use the recorded result of a `getlexstatic`. No other logged information may be used, since that would entail the insertion of guards. A single certain specialization will be produced, if needed, for the case where there is not an interned callsite; such code never receives any optimization today.
Certain specializations are most suitable for megamorphic code, or for cases that rarely show up in the type log. There is a limit on how many speculative specializations may be produced, to stop an explosion of them in megamorphic code paths; these paths will all fall back to using a certain specialization. This means that code paths that today are too type-variable to optimize will still end up with some base level of optimization, including JIT compilation.
Speculative specializations will typically include type guards on the incoming argument types, and may include further type guards or code object static frame guards inside of the optimized code. These may trigger deoptimization. Speculative specializations are much like what MoarVM does exclusively today. However, it is hoped that the way data will be collected and analyzed will lead to smarter production of speculative specializations.
It is tempting to see speculative and certain specialization as tiers. While this is somewhat true in that a speculative specialization will typically give the opportunity for far more aggressive optimization, it's also the case that a frame that the logs indicate to be most likely monomorphic, or only lightly polymorphic, will immediately be given a speculative specialization or two. For example, it's highly unlikely that the Perl 6 `infix:<+>(Int, Int)` multi candidate will ever be called with anything except a pair of `Int` objects in the vast majority of programs.
It is important to try and collect the right amount of data. Too little, and required specializations may not be produced or there may not be enough data to produce the correct ones. Too much, and it will eat memory, contain a lot of duplication, and give the specialization worker a lot to analyze. It's also important to have a design that copes with the required specializations evolving over time.
As such, each static frame will have an interpreted calls counter. It will start at zero, and be incremented each time the interpreted code is entered. It will be treated as follows (numbers are examples, and will be tweaked):
- `n < 10` means no logging at all
- `10 < n < 200` means information is logged
- `n >= 200` means no further logging is performed; we "got enough"
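The counter treatment above can be sketched as a simple predicate. This is illustrative only: the thresholds are the example numbers from the text, and `should_log` is a hypothetical name, not a MoarVM identifier.

```c
#include <stdint.h>

/* Example thresholds from the text; the real values will be tweaked. */
#define LOG_START 10
#define LOG_STOP  200

/* Decide whether this interpreted entry to a frame should write log
 * records, given the frame's interpreted calls counter n. Boundary
 * handling follows the document's example ranges literally. */
static int should_log(uint32_t n) {
    return n > LOG_START && n < LOG_STOP;
}
```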
Separately to this, the thread context also has a log entry ID. This will be incremented and copied into a slot in the frame (the "log entry ID"), and will be used when logging callsite and type information. This means that types can be correlated with callsites and other types in the same call, allowing the specializer to spot patterns or use the logs in abstract interpretations.
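As a toy illustration of that correlation mechanism (all type and field names here are invented stand-ins, not MoarVM's), stamping each frame invocation with a fresh ID lets every record logged during the same call be grouped later:

```c
/* Hypothetical stand-ins for the thread context and a frame. */
typedef struct { int next_log_entry_id; } ThreadCtx;
typedef struct { int log_entry_id;      } FrameState;

/* On entering a frame, bump the thread-wide counter and stamp the
 * frame; every record logged during this call copies the stamp. */
static void stamp_frame(ThreadCtx *tc, FrameState *f) {
    f->log_entry_id = tc->next_log_entry_id++;
}
```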
A log is thread local, hanging off the `ThreadContext`. Its entries are stored in an array of the following struct:
struct {
    MVMint32 record_kind;
    MVMint32 log_entry_id;
    union {
        struct {
            MVMStaticFrame *sf;
            MVMCallsite *cs;
        } entry;
        struct {
            MVMObject *type;
            MVMint32 flags; /* For now, just has a bit for "concrete" */
            MVMint32 bytecode_offset;
        } type;
        struct {
            MVMObject *code_object;
        } code;
        struct {
            MVMint32 bytecode_offset;
        } osr;
    };
};
Where `record_kind` is one of:
- `Entry(entry)` (when we enter a frame in the interpreter)
- `Parameter(type)` (the type of a received parameter)
- `Decont(type)` (the type of a value resulting from a `decont`)
- `Lexical(type)` (the type of a value resulting from a lexical lookup)
- `Attribute(type)` (the type of a value resulting from an attribute lookup)
- `Invoke(code)` (the code object that was invoked)
- `OSR(osr)` (an OSR point was crossed)
The union branch used is specified in the parentheses. (Note that in the OSR case, this means we'll have information logged about things outside of the loop body, overcoming a limitation of the current design. Also, hot loops will quickly fill up a buffer, so specialization for them will happen quite soon.)
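The record kinds above map naturally onto an enumeration. The constant names below are hypothetical (the actual MoarVM identifiers may differ); each comment notes which union branch is live for that kind.

```c
/* Hypothetical record-kind constants for the log entry struct. */
typedef enum {
    LOG_ENTRY,      /* entry: a frame was entered in the interpreter  */
    LOG_PARAMETER,  /* type:  type of a received parameter            */
    LOG_DECONT,     /* type:  type of the result of a decont          */
    LOG_LEXICAL,    /* type:  type of the result of a lexical lookup  */
    LOG_ATTRIBUTE,  /* type:  type of the result of an attribute read */
    LOG_INVOKE,     /* code:  the code object that was invoked        */
    LOG_OSR         /* osr:   an OSR point was crossed                */
} LogRecordKind;
```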
When the specialization worker thread installs a new specialization for a static frame, it resets its interpreted entry count back to zero. This means that further specializations for still-interpreted paths can be identified and introduced.
The worker thread may in some cases decide it wishes to do further recording to produce more type specializations as the program's workload evolves (maybe because the certain specialization is still being hit very often); to do this, it will update the specialization selector to temporarily exclude the certain specialization, resulting in the interpreter being used and logging being performed again.
A thread will be spawned immediately at startup to serve as the specialization worker. It will receive recorded metrics and further work to do from a concurrent blocking queue (meaning that everything passed to it is a MoarVM object). A new representation, `MVMSpeshLog`, will be used to send logged specialization data to the worker; the log array hanging off the thread context will be stored wrapped up in this (so the GC marking logic will be contained in the REPR).
When a log is sent to the specialization thread, the thread context's current log pointer will be nulled. It will be up to the specialization worker to put a new and empty log there (which may in fact be the same object that was used before but emptied out, to allow memory re-use). While the pointer is NULL, no log data will be recorded. Therefore, the specialization worker can control the amount of data it gets and keep it in step with its work rate. Over time, the application will reach a state where all required specializations have been produced, meaning nothing will be interpreted, no logs will be produced, and so the specialization thread will cheaply sleep, blocked by the kernel.
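A minimal model of this hand-off (all names invented for illustration; real MoarVM uses a concurrent blocking queue and an `MVMSpeshLog` REPR object): posting nulls the thread's log pointer, and logging becomes a no-op until the worker installs a fresh buffer.

```c
#include <stddef.h>

/* Toy stand-ins for the log buffer and thread context. */
typedef struct { int entries_used; } Log;
typedef struct { Log *spesh_log;   } Ctx;

static Log *queue_slot;                       /* pretend concurrent queue */
static void queue_push(Log *log) { queue_slot = log; }

/* Post the filled buffer to the worker and stop logging until the
 * worker hands a fresh (or recycled) buffer back. */
static void post_log(Ctx *tc) {
    queue_push(tc->spesh_log);
    tc->spesh_log = NULL;
}

/* Record something only if a log is currently installed. */
static int try_log(Ctx *tc) {
    if (tc->spesh_log == NULL)
        return 0;
    tc->spesh_log->entries_used++;
    return 1;
}
```

Because producers silently skip logging while their pointer is NULL, the worker throttles the data rate simply by choosing when to hand buffers back.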
For the debug case, the `MVMSpeshLog` object will contain a condvar or barrier used to make the posting thread wait, and the specialization worker thread will signal this after it has processed the buffer and reacted to its content.
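One plausible shape for that synchronization, sketched with POSIX primitives (MoarVM would use its own condvar wrappers; the type and function names here are illustrative):

```c
#include <pthread.h>

/* Hypothetical barrier carried by the posted log object in debug mode. */
typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    int             processed;  /* set once the worker handled the buffer */
} DebugBarrier;

/* Posting thread: block until the worker says it has reacted. */
static void wait_until_processed(DebugBarrier *b) {
    pthread_mutex_lock(&b->mutex);
    while (!b->processed)
        pthread_cond_wait(&b->cond, &b->mutex);
    pthread_mutex_unlock(&b->mutex);
}

/* Worker thread: mark the buffer handled and wake the poster. */
static void signal_processed(DebugBarrier *b) {
    pthread_mutex_lock(&b->mutex);
    b->processed = 1;
    pthread_cond_signal(&b->cond);
    pthread_mutex_unlock(&b->mutex);
}
```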
The specialization worker should make sure that, at regular safe intervals, it polls to check whether it should join in a GC run. Not needing to participate in GC while waiting for work falls naturally out of the blocking queue.
Initially, there are no specializations, and the interpreter runs performing its logging. Each static frame contains a specialization selector, which will be updated by the specialization worker thread. This will be done in such a way that no locking is needed (for example, by having readers perform a single pointer read, with outdated selectors freed at a safepoint).
The specialization selector will be structured as an operation tree, a lot like the multi-dispatch cache today. This will be walked to find a matching specialization. However, we can also JIT the tree into machine code. The branching nature of the tree will allow fewer tests than the current design. It will also allow far more efficient guarding, by being able to lift up an "is this a `Scalar` container" check and then, in child nodes of the "yes" branch, do a direct memory access to the value slot rather than having to do the late-bound fetch by calling via a function pointer. The composite "container containing something of type T" guards will go away and be expressed in these simpler terms.
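To make the tree idea concrete, here is a deliberately simplified sketch; the node layout, op names, and walker are all invented for illustration, and a real selector would support more guard operations than a bare type test.

```c
#include <stddef.h>

/* A tiny guard tree: interior nodes test something about the incoming
 * arguments; leaves name the specialization to select. */
typedef enum { GUARD_TYPE, GUARD_RESULT } GuardOp;

typedef struct GuardNode {
    GuardOp           op;
    const void       *expected;  /* type to compare, for GUARD_TYPE */
    int               result;    /* specialization index, for GUARD_RESULT */
    struct GuardNode *yes;       /* taken when the test passes */
    struct GuardNode *no;        /* taken when it fails */
} GuardNode;

/* Walk the tree; -1 means "no specialization applies". In MoarVM this
 * walk could itself be JIT-compiled into straight-line checks. */
static int select_spec(const GuardNode *n, const void *arg_type) {
    while (n != NULL) {
        if (n->op == GUARD_RESULT)
            return n->result;
        n = (arg_type == n->expected) ? n->yes : n->no;
    }
    return -1;
}
```

Because shared tests sit near the root, each incoming type is checked at most once on any path, which is where the "fewer tests" saving comes from.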
The guards that look at the value inside of a container will be removed. In their place will come guards at the point of decontainerization. This will address various shortcomings with regard to aliasing in the current design (where we may miss that a container is updated), and allow the elimination of fragile logic in this area. A future stage of spesh improvement will introduce more sophisticated alias analysis.
To compensate for this, type logging of `decont` instructions will be added, and type guards added after a `decont` operation. As today, guards that go unused will be eliminated. While this may at first blush seem like a performance setback - two guards instead of one - in reality the guards for value-in-container are costly, since they must call the container fetch operation. By contrast, the `decont` instruction will typically be optimized into a cheap memory dereference, and the guard to check for a type is also cheap. Furthermore, situations where the actual type of the value taken out of the container is not important currently pay for a container and value guard anyway; this situation will naturally be resolved by this change.
The target code object of an invocation will be logged, and a new guard for static frame of an invocation target will be added. This will enable us to do inlining or at least pre-selection of the specialized code variant in cases that today do not permit it.