@jnthn
jnthn/plan.md Secret
Last active July 19, 2017 16:41
Spesh change plan: stage 1

A recent document listed shortcomings of the MoarVM dynamic optimizer. Rather than trying to attack them all at once, they will be taken on in groups, and some of them individually. The first stage of changes, described in this document, affects the overall structure of the specialization process. The shortcomings it aims to address include:

  • No baseline optimization; it's all or nothing
  • Optimization is done on threads running code
  • Bad statistics
  • Playing fast and loose with aliasing
  • OSR doesn't sample values written outside of the loop

And it will make progress on:

  • Insufficient testing
  • Closures, even directly invoked, are opaque to spesh

Summary of changes

It's hard to describe the changes without any forward references, so here is a big picture view of what's changing.

  • A specialization worker thread will be spawned at startup to do all of the specialization and JIT-compilation work.
  • Each thread running code will have a log buffer that it writes into. This log will include information on hot code and logged types. When the buffer fills, it will be posted to the specialization thread by placing it into a concurrent queue.
  • In debug mode, a thread will, after sending a log buffer, pause until the specialization worker has processed the buffer and produced specializations based upon the logged information. This means that, at least for a single threaded program, it will be possible to get deterministic behavior.
  • There will be two kinds of specialization: certain specialization and speculative specialization. Certain specializations can never cause a deoptimization in their own right, and so may not depend on observed types and/or code objects that would entail the insertion of guards. Speculative specialization, by contrast, uses observed types and code objects.
  • Selection of a specialization will be done using either a mini-interpreter or JIT-compiled specialization selectors. This will allow guards themselves to be turned into more efficient code (for example, turning a Scalar fetch to validate the value inside into a pointer read).
  • The container-type-with-value-inside guards will go away, in favor of guards placed after decont (and type logging of decont instructions). This is a simplification, resolves potential aliasing bugs, and may also improve performance.
  • The addition of logging and guarding on static frame type will allow for specialization selection and inlining in places it is presently not possible.

Kinds of specialization

No specialization means the original bytecode is run by the interpreter. Besides performing no optimizations, this mode also writes logs of the types encountered for the sake of the specializer.

Certain specializations are keyed on interned callsites only, and so may be called whenever they match an interned MVMCallsite without any other checks on the incoming arguments being performed. A certain specialization may use the values of literals, including type information from a wval. Since it is considered immutable once resolved, it may also use the recorded result of a getlexstatic. No other logged information may be used, since that would entail the insertion of guards. A single certain specialization will be produced, if needed, for the case where there is not an interned callsite; such code never receives any optimization today.

Certain specializations are most suitable for megamorphic code, or for cases that rarely show up in the type log. There is a limit on how many speculative specializations may be produced, to stop an explosion of them in megamorphic code paths; these will all fall back to using a certain specialization. This means that code paths that today are too type-variable will still end up with some base level of optimization, including JIT compilation.

Speculative specializations will typically include type guards on the incoming argument types, and may include further type guards or code object static frame guards inside of the optimized code. These may trigger deoptimization. Speculative specializations are much like what MoarVM does exclusively today. However, it is hoped that the way data will be collected and analyzed will lead to smarter production of speculative specializations.

It is tempting to see speculative and certain specialization as tiers. While this is somewhat true in that a speculative specialization will typically give the opportunity for far more aggressive optimization, it's also the case that a frame that the logs indicate to be most likely monomorphic, or only lightly polymorphic, will immediately be given a speculative specialization or two. For example, it's highly unlikely that the Perl 6 infix:<+>(Int, Int) multi candidate will ever be called with anything except a pair of Int objects in the vast majority of programs.

Interpreter logging

It is important to try and collect the right amount of data. Too little, and required specializations may not be produced or there may not be enough data to produce the correct ones. Too much, and it will eat memory, contain a lot of duplication, and give the specialization worker a lot to analyze. It's also important to have a design that copes with the required specializations evolving over time.

As such, each static frame will have an interpreted calls counter. It will start at zero, and be incremented each time the interpreted code is entered. It will be treated as follows (numbers are examples, and will be tweaked):

  • n < 10 means no logging at all
  • 10 <= n < 200 means information is logged
  • n >= 200 means no further logging is performed; we "got enough"

Separately from this, the thread context also has a log entry ID. This will be incremented and copied into a slot in the frame (the "log entry ID"), and will be used when logging callsite and type information. This means that types can be correlated with callsites and other types in the same call, allowing the specializer to spot patterns or use the logs in abstract interpretations.

A log is thread local, hanging off the ThreadContext. Its entries are stored in an array of the following struct:

struct {
    MVMint32 record_kind;
    MVMint32 log_entry_id;
    union {
        struct {
            MVMStaticFrame *sf;
            MVMCallsite *cs;
        } entry;
        struct {
            MVMObject *type;
            MVMint32 flags; // For now, just has a bit for "concrete"
            MVMint32 bytecode_offset;
        } type;
        struct {
            MVMObject *code_object;
        } code;
        struct {
            MVMint32 bytecode_offset;
        } osr;
    };
};

Where record_kind is one of:

  • Entry(entry) (when we enter a frame in the interpreter)
  • Parameter(type) (the type of a received parameter)
  • Decont(type) (the type of value resulting from a decont)
  • Lexical(type) (the type of a value resulting from a lexical lookup)
  • Attribute(type) (the type of a value resulting from an attribute lookup)
  • Invoke(code) (the code object that was invoked)
  • OSR(osr) (an OSR point was crossed)

The union branch used is specified in the parentheses. (Note that in the OSR case, this means we'll have information logged about things outside of the loop body, overcoming a limitation today. Also, hot loops will quickly fill up a buffer, resulting in specialization happening for them quite soon.)

When the specialization worker thread installs a new specialization for a static frame, it resets its interpreted entry count back to zero. This means that further specializations for still-interpreted paths can be identified and introduced.

The worker thread may in some cases decide it wishes to do further recording to produce more type specializations as the program's workload evolves (maybe because the certain specialization is still being hit very often); to do this, it will update the specialization selector to temporarily exclude the certain specialization, resulting in the interpreter being used and logging being performed again.

The specialization worker and communication

A thread will be spawned immediately at startup to serve as the specialization worker. It will receive recorded metrics and further work to do from a concurrent blocking queue (meaning that everything passed to it is a MoarVM object). A new representation MVMSpeshLog will be used to send logged specialization data to the worker; the log array hanging off the thread context will be stored wrapped up in this (so the GC marking logic will be contained in the REPR).

When a log is sent to the specialization thread, the thread context's current log pointer will be nulled. It will be up to the specialization worker to put a new and empty log there (which may in fact be the same object that was used before but emptied out, to allow memory re-use). While the pointer is NULL, no log data will be recorded. Therefore, the specialization worker can control the amount of data it gets and keep it in step with its work rate. Over time, the application will reach a state where all required specializations have been produced, meaning nothing will be interpreted, no logs will be produced, and so the specialization thread will cheaply sleep, blocked by the kernel.

For the debug case, the MVMSpeshLog object will contain a condvar or barrier used to make the posting thread wait, and the specialization worker thread will signal this after it has processed the buffer and reacted to its content.

The specialization worker should make sure that, at regular safe intervals, it polls to see whether it should join in a GC run. Not needing to participate in GC while waiting for work falls naturally out of the blocking queue.

Specialization selection

Initially, there are no specializations, and the interpreter runs, performing its logging. Each static frame contains a specialization selector, which will be updated by the specialization worker thread. This will be done in such a way that no locking is needed (for example, by installing the new selector with a single pointer write and freeing the old one at a safepoint).

The specialization selector will be structured as an operation tree, a lot like today's multi-dispatch cache. This will be walked to find a matching specialization. However, we can also JIT the tree into machine code. The branching nature of the tree will allow fewer tests than the current design. It will also allow far more efficient guarding, by being able to lift up an "is this a Scalar container?" check and then, in child nodes of the "yes" branch, do a direct memory access to the value slot rather than having to perform the late-bound fetch by calling via a function pointer. The composite "container containing something of type T" guards will go away and be expressed in these simpler terms.

Guard simplification and decont type logging

The guards that look at the value inside of a container will be removed. In their place will come guards at the point of decontainerization. This will address various shortcomings with regard to aliasing in the current design (where we may miss that a container is updated), and allow the elimination of fragile logic in this area. A future stage of spesh improvement will introduce more sophisticated alias analysis.

To compensate for this, type logging of decont instructions will be added, and type guards added after a decont operation. As today, guards that go unused will be eliminated. While this may at first blush seem like a performance setback (two guards instead of one), in reality the guards for value-in-container are costly, since they must call the container fetch operation. By contrast, the decont instruction will typically be optimized into a cheap memory dereference, and the guard that checks for a type is also cheap. Furthermore, situations where the actual type of the value taken out of the container is not important currently pay for a container and value guard anyway; this situation will naturally be resolved by this change.

Code object logging

The target code object of an invocation will be logged, and a new guard for static frame of an invocation target will be added. This will enable us to do inlining or at least pre-selection of the specialized code variant in cases that today do not permit it.

@MasterDuke17

This may be exposing my ignorance, but would it be possible to attach spesh info to pre-comp bytecode? Or does pre-compiling not run the code long enough for spesh to have learned anything useful?