jnthn/dispatch.md Secret

## dispatch.md

      
    Raw
  

              dispatch.md
            
          
    MoarVM general dispatcher API

Overview

This proposal describes a unification that will subsume all of:

Invocation spec (deciding how to unwrap a code object to find its
underlying VM reference, or alternatively delegating to a fallback
like CALL-ME)
Standard method dispatch (today, a findmethod + invoke sequence),
along with can; the up-front construction of method caches will also
go away, which should greatly reduce the size of precomp outputs when
many classes are defined, and hopefully resolve various of the augment
propagation issues too - at least if the augment happens early enough)
Multiple dispatch (and with it the multi-dispatch cache)
Spesh plugins (and with it the plugin guard tree)
The bits of deferral support provided by setdispatcher and friends (it
will provide a much more complete solution in this area)
Specializations as produced by MoarVM spesh (and with it the argument guard
tree)
Many of the current ops that are effectively the VM's "syscalls", and so do
not really need such short op sequences. This will also make it far easier
to funnel extra optional arguments into them in the future.

In place of the above comes a single dispatch mechanism. A dispatch gives
its arguments using the existing arg op sequence, but instead of following
it by an invoke instead follows it with a dispatch, naming the dispatcher
to use. Some arguments may be aimed only at the dispatcher itself. For
example, a method dispatch will start with the method name to dispatch to.
The dispatcher will drop that first argument from the capture before it is
sent onward.
The mechanism has much in common with dispatch chains.
Spesh plugins were designed without awareness of this paper, but are
largely a realization of the same idea. However, dispatch chains alone don't
seem to have an answer to the "continued dispatch" problem that we also
need to tackle. The underlying principle still applies, however: we write a
piece of code ("the dispatch handler") that is invoked in order to decide how
to perform a dispatch, and which specifies conditions ("guards") under which
the result it calculates applies. Future dispatches consider these guards, and
chooses a result using them, only invoking the dispatch handler if the guards
go unsatisfied.
Some dispatchers will be built in to the VM; others can be written in anything
that compiles into bytecode for MoarVM. They can specify guards against the
incoming arguments (that is, assumptions that must be upheld for the dispatch
outcome to be considered valid). They can specify argument capture changes,
such as dropping incoming arguments or prepending others. And, finally, they
can defer to another dispatcher by naming it. In fact, every dispatcher that
is not built-in will defer to another, because that's how we end up in one
that knows how to actually do something.
MoarVM will provide a number of dispatchers built-in. Following the naming
convention of the BOOT types, these will all start with the prefix boot-.
They are:

boot-value - takes the initial argument of the incoming capture and
uses it as the result of the dispatch. This is useful for things like
return type checks that, assuming the value is properly typed and does
not need a decontainerization operation, just want to pass it along. (In
fact, since attribute reads can be encoded in the dispatch tree, a simple
Scalar decont can be done without any kind of call, so we'll be faster
in the non-specialized case at this. And things that really are just the
identity operation won't need to either.)
boot-constant - takes the initial argument of the incoming capture and
uses it as a constant result. After specialization, the capture value
and any pure operations producing it would be eligible for dead code
elimination, assuming they are not needed for guards.
boot-code-constant - one way to run bytecode. Takes an object
with REPR MVMCode or MVMCFunction and runs it. The thing to invoke
is the initial argument, and the rest are the arguments to the call.
The thing to invoke is treated as a constant; this is good for methods,
but less so for closures and the like.
boot-code-value - also runs bytecode. Takes an object with REPR
MVMCode or MVMCFunction and runs it. The thing to invoke is the
initial argument, and the rest are the arguments to the call. The
thing to invoke is not guarded against.
boot-syscall - the mechanism that will ultimately replace many of the
ops that exist today, and allow us to do calls to VM-provided I/O,
decoding, and so forth. This will also do the REPR-checking and similar
of those ops, meaning we'll be able to optimize it out - something we
cannot do today.
boot-by-language - looks at the first argument's HLL and sees if we
have a registered default dispatcher for that language. If so, delegates
to that. Otherwise, tries delegating to boot-code.

[Speculation: we might also have a boot-nativecall, which will replace
the special-case ops for entering a native call today.]
While boot-code is externally opaque, it may actually defer to other
dispatchers internally too. For example, code that has a specialization
will defer to a dispatcher that selects the specialization.
When we chain dispatchers, we'll look at guards that have already been
asserted and not re-assert those. Today, we can potentially end up in a
situation where we do a method dispatch, followed by a multi dispatch,
followed by a specialization selection, and each of these involves an
assertion of the types of the arguments. Spesh can collapse all of
these and either inline or produce a fastinvoke, but for the case
where the caller is not specialized, perhaps because it's too morphic,
we could expect an improvement.
Dispatcher API

Every userspace dispatcher has:

A unique ID, which is used in order to do a dispatch using it.
A dispatch callback, which is invoked when there is no matching result in
the dispatch cache. It is passed an argument capture object. It should pick
another dispatcher to delegate to as its final step (it may choose different
ones on different code-paths).
For resumable dispatches, a resume callback, which is invoked with the
dispatch resumption state if the dispatch is resumed.

Dispatcher invocation

We add the following new ops in order to perform initial dispatches:

dispatch_v str callsite ... - dispatch using the specified dispatcher,
and discard any result
dispatch_o w(obj) str callsite ... - dispatch using the specified
dispatcher, and expect an object result
dispatch_i w(int64) str callsite ... - dispatch using the specified
dispatcher, and expect an int result
dispatch_n w(num64) str callsite ... - dispatch using the specified
dispatcher, and expect a num result
dispatch_s w(str) str callsite ... - dispatch using the specified
dispatcher, and expect a string result

These ops are followed by a variable number of register indices, which
supply the arguments to pass. This is determined by looking at the
callsite object. Constant arguments are placed in registers, so there
are no direct argconst equivalents. However, constant arguments are
indicated in the callsite as such. This is strictly more powerful than
today, since there's no way to have a constant object argument today,
but that could be emitted if, for example, the child is a wval.
A method dispatch today looks like this:
decont r1, r0
findmethod r1, r1, 'foo'
prepargs <callsite>
arg_o r0
invoke_o r2, r1

Which is a 30-byte sequence under the current encoding.
With the new dispatch op, a Raku method dispatch would look something like
this (noting the name is in the callsite):
decont r1, r0
dispatch_o r2, 'raku-method', <callsite>, r1, r0

Which is a 20-byte sequence under the current encoding. If we find a way to
move the decont into the dispatch instruction, it becomes just 12 (which
provides quite an incentive to do that!) A private method dispatch would
look similar:
decont r1, r0
dispatch_o r2, 'raku-private-method', <callsite>, r1, r0

This is a notable shortening of the sequence today, because spesh plugins
are always two steps:

Resolve the plugin to a callable
Invoke that callable

So we're saving a prepargs, a speshresolve, at least one arg instruction
(since we don't have to pass the invocant to both the plugin and then the real
invocation), and the temporary it stores into. There will be similar gains for
the assignment and return value check spesh plugins used in Rakudo today.
The invoke series of ops today will use the built-in boot-by-language
dispatcher, which will:

Look at the HLL owner of the object being invoked, and see if a default
fallback dispatcher has been registered for that language. If so, it will
use that. The invokee will become the first argument to the dispatcher.
Otherwise, it will fall back to the builtin boot-code dispatcher.

Language-defined dispatchers may also, upon encountering something they do
not recognize, defer to boot-language.
Registration

In the name of dogfooding, registering a dispatcher will be done using the
boot-syscall dispatcher. The first argument is the name of the VM syscall
to make - dispatcher-register - followed by 2 required arguments (the
name of the dispatcher and the dispatch callback) and one optional argument
(the resume callback). Both of the code callbacks must be MVMCodes - that
is to say, they should already have been unwrapped from any language code
objects.
# The callsite contains the dispatcher-register constant index
const_s r0, 'raku-method'
getcode r1, the_dispatch_callback
getcode r2, the_resume_callback
dispatch_v 'boot-syscall', <callsite>, r0, r1, r2

At an nqp::op level, however, this will likely be exposed as simply
nqp::dispregister('raku-method', -> $capture { ... }).
Guarding

Within a dispatch handler callback, it is allowed to guard against:

Any object obtained from the argument capture
Any argument obtained by a guarded attribute read

In order to guard a value, it first has to be obtained via. a dispatcher-track-*
operation. This lets the dispatch mechanism know the source of a value that we guard,
so it can write the dispatch program for us. We read a value out of a capture for
guarding purposes using the dispatcher-track-arg($capture, $index)) syscall. We
can then pass it to a guard operation, also done via the boot-syscall mechanism:

dispatcher-guard-type($tracked) - guard that the value has precisely its
current type (nqp::dispguardtype)
dispatcher-guard-concreteness($tracked) - guard that the value has
its concreteness (nqp::dispguardconcreteness)
dispatcher-guard-literal($tracked) - guard that the value has its current literal
value (by object pointer comparison or obj, and value comparision for str, int, and
num) (`nqp::dispguard')
dispatcher-guard-not-literal-obj($tracked, $unwanted) - guard that the value
is never a certain other object literal (used for Raku's "this is not Nil"
assignment checking) (nqp::dispguardnotobj)

It is allowed to loop over the argument capture and install guards based upon
its content; the multi-dispatcher shall wish to do this, for example.
Guarded attribute reads

Sometimes it's useful to dig into an object's attributes and guard against
those. This is done with a guarded attribute read. This is a boot-syscall
of the form dispatcher-track-attr($obj, $class, $name) (nqp::disptrackattr).
It will guard the type and concreteness of the object being read from
(although since such guard additions check the existing set of guards, it's
not inefficient if they were already guarded against). The value of the
attribute is returned.
Capture transformation

Capture transformations are used in order to alter what we send along to
the next dispatcher. For example, in a method dispatcher we'd replace the
method name with the resolved method's code object and pass it along to the
code object invoker. The VM syscalls are:

dispatcher-drop-arg($capture, $index) - forms a new capture without the
argument at the specified index. (nqp::dispdroparg)
dispatcher-insert-arg($capture, $index, $tracked) - inserts a tracked value
into the capture at the specified index.
dispatcher-insert-arg-literal-obj($capture, $index, $object) - inserts a literal
object into the capture at the specified index (this means that the dispatch program
will record this as a constant)
dispatcher-insert-arg-literal-str($capture, $index, $str) - inserts a literal
string into the capture at the specified index (this means that the dispatch program
will record this as a constant)

All of these return a new capture and a "new" callsite (typically an interned
one that fits the new shape), and never mutate an existing one in place.
Further, while they are called against a capture object when the dispatcher
is running, they may be totally elided once the dispatch guard/operation chain
has been set up.
Passing control to another dispatcher

The dispatcher-delegate syscall (nqp::dispdelegate) takes the name of
a dispatcher and an argument capture. Control will be passed on to that
dispatcher once the current one returns. There must be exactly one call to
this; it is an error to not call it, and an error to call it more than
once. Either will result in an exception throw.
Dispatch state

Dispatch state is used in order to preserve data that is used in deferral.
This initial dispatch state set by the dispatch callback must be one of:

The incoming argument capture
A derived argument capture
A value that is guarded against

That way, the dispatch state can easily be derived during the guard tree
processing. If the runtime can prove that deferral will never happen with
this dispatch state, then it is free to elide storing it at all (a lot
like it can elide takedispatcher today).
Example Raku dispatchers

The basic Rakudo code object dispatcher

This replaces the invocation spec's role today.
nqp::dispregister('raku-invoke',
    -> $capture {
        # Obtain the target of the invocation.
        my $obj := nqp::capturepos($capture, 0);
        nqp::dispguardtype($obj);
        nqp::dispguardconcreteness($obj);
        if nqp::istype($obj, Code) && nqp::isconcrete($obj) {
            # Rakudo code object. Unwrap it and replace it in the capture.
            my $do := nqp::dispgetattr($obj, Code, '$!do');
            my $delegate-capture := nqp::dispreplacearg($capture, 0, $do);
            nqp::dispdelegate('boot-code', $delegate-capture)
        }
        else {
            # Need to do CALL-ME on it.
            my $delegate-capture = nqp::dispinsertstr($capture, 0, 'CALL-ME');
            nqp::dispdelegate('raku-method-call', $delegate-capture)
        }
    }
);

The CALL-ME case has potential to be far more efficient than is possible
today. In principle, we'll be able to inline the CALL-ME method, if it's
small enough. (Also, this should let us eliminate the pesky "Cannot invoke
object with invocation handler in this context" problem.)
Method call dispatch

This is invoked with a capture where the method name is the first argument,
followed by decont'd invocant, followed by the normal method arguments.
nqp::dispregister('raku-method-call',
    -> $capture {
        # Obtain the method name and invocant. Guard against the literal name,
        # and against the type (but not concreteness) of the method. In the
        # case we were given an argconst the first will be a no-op.
        my str $name := nqp::captureposarg_s($capture, 0);
        nqp::dispguardstr($name);
        my $invocant := nqp::capturepos($capture, 1);
        nqp::dispguardtype($invocant);

        # Use the MOP to resolve the method.
        my $meth = $invocant.HOW.find_method($invocant, $name);
        unless nqp::isconcrete($meth) {
            # Really would obtain/throw the typed exception here.
            nqp::die("No such method '$name'");
        }

        # The dispatch state for resumption is the dispatch capture, which
        # holds the name and arguments.
        nqp::dispstate($capture);

        # Before delegating onrward, we drop the name argument and replace
        # the invocant argument with the target od hte dispatch.
        my $delegate-capture := nqp::dispreplacearg(
                nqp::dispdroparg($capture, 0), 0, $meth);

        # Is this a multi-method?
        if nqp::istype($meth, Method) && $meth.is_dispatcher {
            # Yes, so delegate to multiple dispatch. We need to tack on the
            # resolved thing to invoke.
            nqp::dispdelegate('raku-multi-dispatch', $delegate-capture)
        }
        else {     
            # No, so produce an invocation result, containing the method to
            # invoke and the argument capture sans name.
            nqp::dispdelegate('raku-invoke',  $delegate-capture)
        }
    },
    'resume', -> $state, $args {
        # Discussed later
    }
));

General dispatch

This is standard invocation in Raku. It may be to a sub or multi-sub, but it's
possible it'll also be to a method. This matters because deferral is meant to
work even if it is not invoked with method syntax. It is invoked with a capture
where the first thing is the sub to call, and the rest are the arguments. This
is not resumable.
nqp::dispregister('raku-call', -> $capture {
    # Guard on the type of what we're invoking.
    my $target := nqp::captureposarg($capture, 0);
    nqp::guarddisptype($target);

    # It could be a method call in disguise. We need to send this through the
    # indirect method dispatcher, so that deferral works.
    if nqp::istype($target, Method) {
        nqp::dispdelegate('raku-indirect-method-call', $target);
    }

    # Otherwise, it might be a multiple dispatch.
    elsif nqp::istype($target, Routine) && $target.is_dispatcher {
        nqp::dispguardconcreteness(nqp::dispgetattr($target, Routine, '@!dispatchees'));
        nqp::dispdelegate('raku-multi-dispatch', $target);
    }

    # Otherwise, it's a straight invocation of what we have.
    else {
        nqp::dispdelegate('raku-invoke', $target)
    }
});

Multiple dispatch

TODO
Wrap dispatch

TODO
Resumed dispatches

The following kinds of dispatch involve resumption:

Method dispatch (continuing to walk the MRO)
Multiple dispatch (continuing to walk the candidates)
Wrap dispatch (continuing to the next wrapper)

Furthermore, these fall back on their enclosing dispatcher in some cases (but
not all). For example, a call to a wrapped multi method will need to exhaust
the wrap dispatcher, then the multi dispatcher, then the method dispatcher.
Resuming a dispatch is relatively uncommon, and we'd like the overhead of this
feature to be as low as possible in the case it is not called upon. At the
same time, the cost of deferral today is very high. For example:
class P { method m() { } }
class C is P { method m() { callsame() } }
for ^1_000_000 {
    C.m
}

Runs in 4.36s, while if we explicitly name the parent class (this form is
optimized through a spesh plugin):
class P { method m() { } }
class C is P { method m() { self.P::m } }
for ^1_000_000 {
    C.m
}

Then it runs in 0.066s, presumably thanks to full inlining by spesh. That's
not just a little faster, it's a factor of more than 65x faster! Of course,
this assumes the method does nothing, so in a realistic application it'd most
likely be amortized somewhat. But still, when modules like OO::Monitors use
wrap and callsame, there's some motivation to give it a decent speedup.
Making the problem harder still is the fact that:

We can either retain the current args or specify new args - but in the
case we specify new args, and it's a method dispatch, we need to include
the original invocant too.
One can interact with the dispatch process by calling lastcall.
One can obtain the next dispatch target using nextcallee.
Things like callsame are just subroutines that you call. OO::Monitors
actually uses this fact to work around a pre-comp bug by storing the
CORE.setting &callsame away in its local lexpad, in order that it
doesn't get lost. Even without that fun, this is still a source of
trouble, in that any callsite inside of callsame would be megamorphic
even though the actual dispatch being resumed is likley to be far less
morphic.

Dispatch state

A dispatcher that is resumable can store dispatch state. Dispatch state
is actually an argument capture derived from that supplied during the
initial dispatch. In the case the specializer can prove that this state
shall not be needed, it can completely skip the work to create it. Otherwise,
it knows it must at least keep enough information around to construct it
on demand. (Hopefully this will give us a lot more freedom than we have
today, where we must mark anything using callsame and friends as not
possible to inline. In principle, this new dispatch mechanism will allow
us to lift this limitation.)
Upon resumption

When we wish to resume a dispatch, we provide an argument. The proposed use
of this for Raku's dispatchers will be:

An argument capture when we are doing callwith or nextwith
Sentinel objects for each of callsame/nextsame, nextcallee, and
lastcall

Resuming a dispatch for the first time at a given callsite will:

Locate the dispatcher that we shall resume.
Invoke its resume callback with the current dispatch state and the
argument.
Expect it to update the dispatch state and delegate to a dispatcher
(ultimately bottoming out in a VM one).

The resume callback, much like the dispatch one, can guard against its
incoming arguments: of note, the dispatch state and the resume argument.
Furthermore, it can set an updated dispatch state. As with dispatch,
these are all considered as guard operations (including the formation of
the new dispatch state), which means that when the guards are met, the
resume callback need not be run. (We assume there is a mechanism to ask
for the state to be stored against the caller's callsite, not our current
one, to avoid the megamorphism inside of callsame and friends.)
How the dispatch state is used

When we defer, the set of candidates we shall iterate through is already
preordained based on the initial argument capture. This means we can:

Find the candidate list we'll work our way through.
Add guards expressing what it depends on.
Prepend those to the dispatch state capture, except the one we will be
now most immediately calling.
And before that, a sentinel object that indicates we already decided on
the candidates, and perhaps another that separates them from the actual
arguments.

Let's take the slow case mentioned earlier as an example.
class P { method m() { } }
class C is P { method m() { callsame() } }
for ^1_000_000 {
    C.m
}

The first time around, there's no guard set. The C.m call triggers the
dispatch callback. This finds the method, and stashes away the original
argument capture as dispatch state, which is prepended with the name. We
run C.m. It calls callsame. That triggers a resumption, which is
passed the dispatch state and the sentinel indicating we're doing a
callsame. Again, the guard set at this callsite (assuming we manage to
attach it one-up, at the site of the callsame call) is empty, so we
run the resume callback. It finds all of the ms in MRO. In this case
there's only one, so it just replaces the method name in the dispatch
state with the "end of candidates" sentinel (effectively, the remaining
candidates are the empty list). It establishes a guard against the type
of the invocant in the dispatch state and the callsame sentinel to make
sure we're safe, and then delegates the dispatch so that we invoke the
P.m.
The second time around the loop is rather easier. Since m is a constant,
the only thing we need to do in the guard set to choose a candidate is to
check the type of the invocant is C. It is. The guard tree then tells us
to fetch the $!code attribute of m. We do so. (Or we might have just
cached that constant in the guard tree.) Either way, we end up in m,
without having to run dispatch. We see the callsame. We do actually
have to call it, but its resume call then looks at the guards established
last time around and winds up following a path that invokes P.m. Thus
we don't call the resume callback again in this case. We also didn't have
to allocate a dispatcher object as we do today. It remains to be seen to what
degree we can avoid allocations of the various derived captures.
We proceed this way for some further iterations. Meanwhile, at some point, the
code is considered hot and the specializer gets to work. We first end up in
C.m. We inline callsame, and find the guard set that goes with the resume.
We walk it, which leads us to C.p, which is tiny, so we inline it. Since we
can see callsame passes the sentinel for doing a callsame, that guard goes
away. At this point, the rest pretty much stays as it is. However, we can
compile the checking guards into the body of C.m, meaning that we don't have
to interpret them, and they can instead be JITted.
Later, we inline the optimized C.m into the body of the loop. Here the party
really gets started, because we can now see precisely what dispatch state is
set, and we know the types of the values that go into it. Given the simple
control flow, this means that we know what it will be upon resumption too, and
thus we can prove and delete all of the resumption guards. That means we'll
have optimized this poor naive benchmark into an empty loop body - much like we
could with the self.P::m case, and thus it should run rather fast.
Nesting dispatchers

Roughly:

When a resume is invoked and it has run out of candidates, it can delegate to
boot-next-dispatcher. Not all need do this; a wrap dispatcher probably will
as it could be in a method dispatcher or multi dispatcher, but a method
dispatcher would have nowhere to go.
Dispatchers can, when setting up a delegation to another dispatcher, indicate
if that dispatcher should fall back to this one or not. Thus when a method
dispatcher delegates to a multi dispatcher, it can say "if you fallback, it
is to me". It remains to be seen precisely how well we can get wrap
dispatchers into this picture too without too much repetition.

Implementation notes

Overall plan


Implement enough that we can migrate spesh plugins. Migrate them. Maybe
at this part-way point we can even merge; on the other hand, we may be
polishing something that isn't exercised on all the use-cases.

Get the new dispatch instruction added and handled by validation.
Add NQP code to generate it. For now, it just gives an NYI panic.
(DONE)
Implement the new constant value flag in callsites. (DONE)
Implement cache storage. A quick, cheap way for bytecode instructions
that want to cache stuff against an instruction to do so. We'll use
it for dispatch but also for getlexstatic_o (and will spectest it
against this). May find further use cases in the future. (Current
idea: keep a pointer array in parallel with the bytecode, work out
the density we need (to a power of two) and use the instruction offset
to decide on that.  (DONE)
Refactor getlexstatic_o optimization to look at the inline cache
rather than needing a value log entry. To prove spesh can get at it. (DONE)
Detailed design of how we'll handle argument captures going forward. (DONE)
Implement dispatcher registry. (DONE)
Refactor call stack so that it can hold more than just frames, and so it always
hangs on to even promoted frames. (DONE, modulo a deopt issue being uncovered)
Implement enough to be able to produce a capture object for the dispatch
callback and invoke a dispatcher with it. (DONE)
Basic design for the dispatch program execution enough for boot-value,
boot-constant, and boot-code (the terminals), along with some simple
guards. (DONE)
Implement boot-value being invoked and recording a dispatch program, but
not actually storing it yet. (DONE)
Implement boot-value actually running the dispatch program, so as not to run
the dispatch callback. (DONE)
Implement the boot-constant dispatcher. (DONE)
Implement the boot-code-constant dispatcher. (DONE)
Implement the VM "syscall" dispatch mechanism. (DONE)
Implement registering a userspace dispatcher and calling it (but it can't do
anything useful yet). (DONE)
Add support for delegating to another dispatcher for userland dispatchers. (DONE)
Add support for setting guards in userland dispatchers. (DONE)
Add support for capture transformations in userland dispatchers. (DONE)
Try switching some spesh plugins. So far, should only work with
spesh disabled. (DONE)
Properly handle non-internable callsites in dispatch programs, and revise
rules for what we can and cannot intern. (DONE)
Implement flattening prior to dispatch, selecting a fitting callsite. (DONE)
Make sure that, with the exception of things depending on multiple
dispatch, the things migrated from spesh plugins to the dispatcher work out. (DONE)
Eliminate all spesh plugins in Raku. (DONE)
Eliminate spesh plugins in MoarVM, and their NQP tests. (DONE)
Teach spesh about dispatch and get it to do the rewrite of a dispatch
instruction into guards + linking or inlining.


Switch Rakudo method, multiple, and wrap dispatch to use the new scheme.
Update callsame and friends to use the new model too. These do need
resumption. Make sure we handle all the dispatcher nesting cases.
Switch NQP method and multiple dispatch to use the new scheme. These
don't need resumption.
Elimiante last remaining remenants of findmethod and other method caching
infrastructure; nqp::can goes along with it.
Re-tune things that need it, exploit new optimization opportunities.
Figure out how to un-break other backends, to the degree they got broken.
Figure out how to replace the spesh arg guards with this too, if that was
not already done.

Code stucture


MVM_disp_registry_* (src/disp/registry.c) - the dispatcher registry,
which is populated with the boot (VM-provided) ones at startup.
MVM_disp_boot_* (src/disp/boot.c) - implementations of the bootstrap
dispatchers that are provided by the VM
MVM_disp_record_* (src/disp/record.c) - recording of new dispatch
chain entries, including capture transforms and guard operations
MVM_disp_tree_* (src/disp/tree.c) - implementation of the dispatch
tree for faster lookups, both building the tree and walking it

Dispatch guard sets


Store the list of guard ops as the primary data. For each, we have the set
of operations, followed by a final result (constant, run C function, set up a
frame and invoke bytecode). This bit is along the lines of how spesh plugins
do it, except we plan to de-dupe them too, and are involved with captures.
But...
Steal the spesh arg guard tree approach, where we use a tree structure to
lift common prefixes. Just rebuild the tree when we get new guard ops, again
like we do in the spesh arg guard tree (it works based on us always having
the type tuples there).
The guard tree produces an integer result, which we use to look up the final
action, but also in unspecialized code to record where we ended up going,
for the purposes of spesh (again, much like spesh plugins do).
Store them off the spesh data structure on a static frame. Maybe we will
preallocate that now in any frame that contains a dispatch instruction (or
an invoke, which devolves to that), to avoid the check every time.
The same binary search approach for bytecode addresses can also work.

Capture operations

Argument representation

The setup phase of a dispatch deals in argument captures, and transformations
on those. It will receive an actual Capture object. However, typically, we'll
evaluate the "dispatch program" without forming such an object, often ending
up invoking code that then receives arguments as parameters.
MoarVM to date uses an arguments buffer that is a designated part of the work
area of a call frame. Argument instructions copy from registers into that area.
In the interpreter, that means one instruction to dispatch per argument we pass,
plus we need to take care of GC-walking that area. On the parameter handling
side, we then form an argument processing context. When dealing with captures of
arguments, or flattening arguments, those can also be referred to by the
argument processing area. Flattening arguments need us to allocate an argument
buffer too.
The dispatch instruction is different: it lists the registers that hold the
arguments. This means no argument buffer to allocate space for nor to mark in
the GC. It also means no instructions to interpret per argument on the callee
side, and that there's no (immediate) need to copy the arguments anywhere.
Ideally, we'd like to achieve:

No copying of arguments other than to their final destination in the common
case. Granted we may need to read them while interpreting guards. But we'd
like to, when possible, avoid having an argument buffer that we temporarily
copy all of the args into. Of course, in some situations, that won't happen:
flattening and when we need to go via. the dispatch setup phase and so have
the arguments in a capture object.
Argument access in a callee being polymorphic over "direct from the dispatch
instruction" or "via. flattening or a capture". Even after specialized, we'd
ideally retain the polymorphism here.
That polymorphism is cheap (e.g. free of branch instructions).

This can be achieved by modeling arguments to bind as:

A callsite object (which we don't care about in the specialized case)
An array args of MVMRegister which holds arguments
An array map of MVMuint16 which tells us which element in the argument
array to look at for a particular argument

In the case of the arguments coming directly from the dispatch instruction,
args is the work area of the caller, and map is the set of mappings in
the dispatch bytecode instruction. (Note that it will probably be a couple
of positions into that, since the first argument(s) are normally aimed only at
the dispatcher, not at the final callee.) In the case of the JIT, this can live
in the data segment.
But in the case we already expanded the arguments into a buffer (because we
formed a capture object or because of flattening), and so are not reading from
the work area but an already organized set of arguments, what should map be?
It turns out that it can be the identity array, where element 0 is 0, element
1 is 1, etc. We can have one of these globally, perhaps at some nominal size
and safepoint-reallocated with a bigger one if the program happens to do huge
flattenings.
This means that argument access to argument n for parameter binding is
args[map[n]], regardless of the source. This provides the desired bit of
polymorphism without a branch, should JIT very easily too, and the spesh/JIT
is valid whether we call it via. flattening or not.
Lifetimes

There are two lifetimes to consider for arguments:

The arguments being passed are consumed only by parameter instructions. In
this case, we know their consumption is bounded by the lifetime of the
callee's execution. Thus, we can keep any intemediate state (for example, of
expanded flattening arguemnts) on the call stack.
We need a capture object. In this case, we have to assume the arguments will
escape, and so should form a capture object.

Capture objects

When we need to reify a capture as an object, it contains:

The callsite
An argument buffer, allocated in the FSA

Any flattening arguments will have already been expanded. This will be
implemented as a new representation, rather than brining the baggage of
argument processing as found in the current CallCapture representation
(which will go away after these changes). In the interim, ops on a
CallCapture will also work on a new Capture.
Capture representation during a dispatch program

At the point we perform a dispatch, we have:

A callsite
A list of registers in the calling frame containing arguments

The common case is that the dispatch program (that is, the set of guards
and transforms set up by the dispatch callback)  will find a resolution
without having to call the dispatch callback. While the dispatch call
will produce intermediate capture objects as it goes about its work, we
wish to elide all of that work when evaluating the resulting dispatch
program. We'd also like to do as little copying around of arguments as
possible; if we end up making a call, ideally the destination of the call
will will just be pulling its arguments out of the work area of its caller
via. the mappings in the dispatch instruction.
Due to the way dispatchers work, capture transformations will center almost
almost entirely around the start of the argument list (or at least, it makes
sense to put arguments aimed just at the dispatcher first). Thus, we can
consider the argument list as being "split" into two parts:

A first part that exists as actual values for the dispatch program to use
A second part that exists as indexes into work registers

The dispatch program in a given dispatch will be analyzed to work out the
first arg left untouched by the dispatch process, and that will be the
split point for the entire dispatch. It will then allocate itself the
required amount of space in its work area on the callstack for doing the
argument manipulations before that point. If the dispatch results in an
actual call to some frame, then assuming the arguments that are passed only
consist of a tail of those in the dispatch instruction, then a pointer
into the dispatch instruction's mapping list will be used as the map for
use in parameter binding.
Dispatcher representation at runtime

We know that:

A resumable dispatcher needs storage of its dispatch state - at least, in
principle. Most efficient is to do just enough to go back and recover it
later. Dispatchers without state don't need to persist anything.
Even more fun is that one dispatch instruction can produce a stack of any
number of dispatches before we actually reach the point of running code, so
there's no direct mapping between them and frames.
Their lifetime is until we are done with the dispatch.
Need to make sure we "take" them in continuations too.

Calling is generally costly in MoarVM, and MVMFrame is quite heavyweight.
It includes things we have to preserve if we're an outer scope, as well as
things that we only care about when executing. Thus, there may be some value
to separate "callstack" from current notion of "frame". Callstack becomes a
region of memory containing the things we need as we run. A callstack entry
may reference a "full-blown" frame, but isn't automatically one.
The callstack would have entries that are:

Frames (including their work space, so we can do a single allocation here).
Specialized and unspecialized frames may have different layouts. This will
be more natural once spesh is folded into the dispatch chain mechanism.
Dispatchers (which may also have work space).
Continuation tags.
Special return handlers.

For now, we'd probably stick with frames otherwise living a life much like
they do today. A bit further into the future, however, it'd be good to give
that a revisit, so we can have less Raku-specific knowledge in the VM and
handle some things in much more elegant ways. This would also be a key step
towards the re-focus on interpreter cooperation that has been speculated.