atrick/resilient-dispatch.md

## resilient-dispatch.md

      
Analyzing dispatch performance

When it comes to resilient performance we need to prioritize
future flexibility and ABI simplicity over perceived performance
advantages. Being able to react to future performance issues that
arise with specific apps is far more important than benchmarking. Time
not spent implementing complex ABI mechanisms is time that will be
spent optimizing more critical performance areas.

It's easy to say we should simply measure performance and let the data
decide, but it's dangerous to make long term decisions based on
one-off, poorly designed experiments. We can still reason about the
relative strengths of each dispatch technique, keeping in mind many
dimensions of performance.

The primary "performance" concerns are, in roughly decreasing priority:

a. app binary size
b. library binary size
c. load time
d. runtime dispatch overhead
e. runtime metadata size

These all heavily depend on the workload, linker implementation, build
mode, and future hardware.

Within runtime dispatch overhead (d), we can reason about these
contributions to overhead roughly in decreasing priority, considering
that some costs can be amortized:

d1. depth of dependent loads
d2. # external loads
d3. # indirect calls
d4. # external calls
d5. # internal loads, calls, branches
d6. cost of the non-hoistable portion of dispatch

Background

Previously we considered these three approaches:


1. Expose vtables as ABI.
2. Export per-method dispatch entry points.
3. Export per-class method lookup.


I argued against #1, exposing vtables, because


- It complicates the ABI and reduces flexibility.
- It requires an additional mechanism to compute the vtable base, and
  I strongly disagree with solutions that unconditionally introduce a
  dependent load. (d1++)
- It forces future runtime implementations to explode vtables by
  num_base_methods * num_subclasses. (e++)
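
To get a feel for the last point, here is a back-of-the-envelope sketch. The method count, subclass count, and entry size are made-up numbers picked only to show the scaling, not measurements of any real library.

```swift
// Hypothetical sizes, chosen only to show how the cost scales.
let numBaseMethods = 50   // methods declared by the base class
let numSubclasses = 200   // subclasses that inherit them
let bytesPerEntry = 8     // one pointer per vtable entry on a 64-bit target

// If flat vtables are ABI, each subclass must carry its own copy of every
// inherited entry, whether or not it overrides anything.
let inheritedEntries = numBaseMethods * numSubclasses
let inheritedBytes = inheritedEntries * bytesPerEntry

print("inherited entries: \(inheritedEntries), bytes: \(inheritedBytes)")
```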


I argued against #2, per-method dispatch, because


- It exports more symbols. (b++, c++)
- It has no obvious support for super. We still need some more
  complex dispatch for super calls.
- Amortized performance. Method lookup is not hoistable. (d6++)


Recent Considerations

We need to reconsider the argument for approach #3 now because Jordan
and Slava pointed out that we can't rely on the key feature: constant,
availability-sorted method indices. That means #3 now requires
additional per-method symbols and a dependent external load (d1++). An
alternative floated by Jordan is to use a constant method name hash,
which complicates the ABI and method lookup (d5+++, d6+++).

Taking this into consideration, #2, per-method dispatch, comes out
ahead in my opinion. It has:


- The simplest ABI.
- The simplest implementation (consistent with non-resilient dispatch).
- The smallest app binary size (a--).
- Likely superior performance in all dimensions except library size
  (b++), load time (c++), and non-hoistable runtime overhead (d6++).


Hoistability is a relative non-issue. ABI simplicity, code size, and
the cost of external loads easily outweigh that.
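
The hoistability tradeoff between #2 and #3 can be sketched with a toy metadata model. Everything here (`Metadata`, `Object`, `LibraryLayout`, `dispatchFoo`, `lookupFoo`) is a hypothetical stand-in, not the runtime's actual layout or entry points.

```swift
// Toy model of class metadata. Every name below is hypothetical.
final class Metadata {
    let superclass: Metadata?
    let vtable: [(Object) -> String]

    init(superclass: Metadata?, vtable: [(Object) -> String]) {
        self.superclass = superclass
        self.vtable = vtable
    }
}

final class Object {
    let isa: Metadata
    init(isa: Metadata) { self.isa = isa }
}

// --- Library side ---
enum LibraryLayout {
    // Private slot index; the library may renumber it in a later version.
    static let fooSlot = 0
}

// Approach #2: exported per-method dispatch entry point.
// The isa load, vtable load, and indirect call all stay inside the library.
func dispatchFoo(_ object: Object) -> String {
    return object.isa.vtable[LibraryLayout.fooSlot](object)
}

// Approach #3: exported per-class lookup, shown only for contrast.
func lookupFoo(in metadata: Metadata) -> (Object) -> String {
    return metadata.vtable[LibraryLayout.fooSlot]
}

// --- Client side ---
let baseMetadata = Metadata(superclass: nil, vtable: [{ _ in "Base.foo" }])
let subMetadata = Metadata(superclass: baseMetadata, vtable: [{ _ in "Sub.foo" }])
let objects = [Object(isa: subMetadata), Object(isa: subMetadata)]

// #2: the lookup is repeated inside every call (d6).
let viaDispatch = objects.map { dispatchFoo($0) }

// #3: a client that knows all objects share one isa can hoist the lookup.
let hoistedFoo = lookupFoo(in: subMetadata)
let viaLookup = objects.map { hoistedFoo($0) }

print(viaDispatch, viaLookup)
```

In this sketch the call site for #2 is a single direct call, which is where the app size advantage (a--) would come from; the hoisted form for #3 only pays off when the caller can prove the isa is loop-invariant.
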

The more interesting performance question is whether introducing
per-method symbols will have a significant impact on library size and
load time. I submit that it is not a serious problem because:


- This only adds one symbol for every publicly declared method;
  inherited methods won't add symbols. That seems within
  normal expectations and no worse than ObjC or C++.
- Symbol stripping, a trie-supported symbol table, and launch closures
  should mostly mitigate these costs.


Furthermore, a straightforward implementation of #3, per-class method
lookup, would require the same number of symbols. That could be
avoided by implementing #3 with a hash-based lookup, but that would be
much more complicated and costly in terms of runtime overhead (d6++).

Super call dispatch

Assuming that #2, per-method dispatch functions, has the least
complexity and best performance, the only remaining problem is support
for super calls.

Simply passing a constant flag to distinguish direct vs. dynamic
dispatch is insufficient to handle resilient base classes because
overrides can be added resiliently. Furthermore, at compile time we
don't always statically know the current superclass, since a
superclass override could be injected resiliently:
// Module A
public class Base {
  public func foo() { … }
}

public class Sub: Base {
  public override func foo() { … }
}

// Module B
extension Sub {
  func bar() { super.foo() }
}

[Jordan's example: If a new superclass is injected between 'Base' and
'Sub', then the extension of 'Sub' in module B should switch to
calling that new class's implementation of 'foo'. We probably also
want to allow the reverse situation: a subclass method overrides a
superclass method without changing the type signature, and at some
point the override is no longer necessary, so the implementer
removes it.]

Consequently it's necessary to dynamically load the superclass isa
and pass that either to a lookup helper or to a dispatch method.
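
A toy simulation of Jordan's example shows why the superclass has to be loaded at run time. `ClassMetadata` and `resolveFoo` are hypothetical stand-ins for runtime structures, and "Mid" is an invented name for the injected class.

```swift
// Toy metadata model. All names are hypothetical stand-ins for runtime structures.
final class ClassMetadata {
    let name: String
    var superclass: ClassMetadata?
    let foo: (() -> String)?   // nil means this class inherits foo

    init(name: String, superclass: ClassMetadata?, foo: (() -> String)?) {
        self.name = name
        self.superclass = superclass
        self.foo = foo
    }
}

// Walk up from `start` to the nearest class that implements foo.
func resolveFoo(from start: ClassMetadata?) -> String {
    var current = start
    while let cls = current {
        if let implementation = cls.foo { return implementation() }
        current = cls.superclass
    }
    return "no implementation"
}

// Library version 1: Base <- Sub.
let base = ClassMetadata(name: "Base", superclass: nil, foo: { "Base.foo" })
let sub = ClassMetadata(name: "Sub", superclass: base, foo: { "Sub.foo" })

// Static binding: module B remembers the superclass it saw at compile time.
let compiledSuper = base
let staticBefore = resolveFoo(from: compiledSuper)
// Dynamic binding: module B loads Sub's superclass when the call executes.
let dynamicBefore = resolveFoo(from: sub.superclass)

// Library version 2 injects Mid between Base and Sub; module B is not recompiled.
let mid = ClassMetadata(name: "Mid", superclass: base, foo: { "Mid.foo" })
sub.superclass = mid

let staticAfter = resolveFoo(from: compiledSuper)
let dynamicAfter = resolveFoo(from: sub.superclass)

print(staticBefore, dynamicBefore, staticAfter, dynamicAfter)
```

Only the dynamic form picks up the injected class; the static binding keeps calling Base's implementation after the library changes.
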

Common SIL Implementation Tasks

Regardless of how we implement super dispatch, we likely need these
SIL extensions:

A SIL instruction to dynamically retrieve a superclass:

%typePtr = super_metatype $Sub.Type

A SIL instruction to resolve a class method based on a type
pointer instead of self:

%method = class_metatype_method %typePtr : $Super, #Super.method!1 : $@convention(method)

Method Dispatch Calling Convention Hack

It's tempting to hack the existing per-method dispatch functions to
handle super dispatch.
The obvious solution is to pass an extra isa argument to the method
dispatch function in a way that is mostly ABI compatible with an
internal method call. This would also affect normal method dispatch:
// method call from any subclass of `A` to self.foo
%self_isa = value_metatype $A.Type, %self
%foo = function_ref @A.foo : $@convention(external_method) ...
apply %foo(self, args..., %self_isa)

// method call from `B` to super.foo. `super` may be `A` or some
// intermediate subclass.
%super_isa = super_metatype $B.Type
%foo = function_ref @A.foo : $@convention(external_method) ...
apply %foo(self, args..., %super_isa)

This approach unnecessarily penalizes normal dispatch in order to
handle the extremely rare case of cross-module super calls:


- It forces an isa load on the caller side, increasing app size. (a++)
- It forces a thunk to the actual method body and forces either
  unconditional dynamic dispatch or an extra isa check before doing
  direct dispatch, increasing library size and dispatch cost. (b++, d6++)


This also requires implementing a new SIL calling convention so that
isa can be passed in a register without shuffling arguments. Note
that even normal method dispatch will now need to access isa
immediately on the callee side.
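
The caller-side cost of this hack can be seen in a small model. The types and the `dispatchFoo` signature are hypothetical; the sketch only mirrors the shape of the two SIL call sequences above.

```swift
// Toy model; every name is hypothetical.
final class ClassMetadata {
    let superclass: ClassMetadata?
    let foo: (Object) -> String

    init(superclass: ClassMetadata?, foo: @escaping (Object) -> String) {
        self.superclass = superclass
        self.foo = foo
    }
}

final class Object {
    let isa: ClassMetadata
    init(isa: ClassMetadata) { self.isa = isa }
}

// Exported entry point under the hack: the caller must always pass an isa.
func dispatchFoo(_ object: Object, _ isa: ClassMetadata) -> String {
    return isa.foo(object)
}

let classA = ClassMetadata(superclass: nil, foo: { _ in "A.foo" })
let classB = ClassMetadata(superclass: classA, foo: { _ in "B.foo" })
let object = Object(isa: classB)

// Normal dispatch: the caller now pays for an isa load it did not need before.
let normalResult = dispatchFoo(object, object.isa)

// Super dispatch from B: the caller loads B's superclass instead.
let superResult = dispatchFoo(object, classB.superclass!)

print(normalResult, superResult)
```
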

Method Lookup Implementation

Superclass dispatch is naturally implemented with a per-class method
lookup, as in approach #3. We just need an alternative to relying on a
constant method index.

We have already ruled out:

- exporting a second per-method symbol
- exporting a method descriptor as the primary per-method symbol,
  which normal dispatch would need to load from.


But, since we have now exported the method dispatch function, that can
serve as an adequate lookup key. The per-class method lookup function
and dispatch function are always associated with the same class
definition. So, the lookup only needs to search for keys directly
defined in its class (see Extensions). If the method is
not open and there are no known overrides, then the dispatch_function
could be returned without even loading a function pointer.
// method call from any subclass of `A` to self.foo
// (this is now an ideal fast path)
%foo = function_ref @A.foo : $@convention(external_method) ...
apply %foo(self, args...)

// method call from `B` to super.foo. `super` may be `A` or some
// intermediate subclass.
%super_isa = super_metatype $B.Type
%A_foo  = function_ref @A.foo : $@convention(external_method) ...
%lookup = function_ref @A.method_lookup : $@convention(external_method) ...
%method = apply %lookup(%super_isa, %A_foo)
%result = apply %method(self, args...)

Keep in mind that cross-module super dispatch is expected to be
extremely rare--it's not a critical path and won't have any pervasive
performance impact. It would be quite simple to implement a
linear-time search with immediate comparisons that doesn't require a
lookup table. Typically, the search space will be very small--10 or so
methods. Note that 10 compare-and-branches is typically considered as
fast as an indirect call. Some outliers (e.g. corelibs NSObject) could
have 100s of methods. If this ever becomes a performance problem, then
we can address that without affecting the ABI by:


- Storing a method descriptor in the text segment at a negative offset
  from the dispatch function, with LLVM support.
- Implementing a hash table or some other cached lookup for cases with
  more than X methods.
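
A minimal sketch of the linear-time lookup, assuming the dispatch function's address serves as the key. Swift function values can't be compared, so the hypothetical `DispatchKey` stands in for that address; `AKeys`, `aMethodLookup`, and the flat vtable layout are likewise assumptions made for illustration.

```swift
// Stand-in for the address of an exported dispatch function. Swift function
// values cannot be compared, so this sketch uses an opaque, hypothetical key.
struct DispatchKey: Equatable {
    let symbol: String
}

typealias Method = () -> String

final class ClassMetadata {
    let superclass: ClassMetadata?
    let vtable: [Method]   // library-private layout; slots for A's methods come first

    init(superclass: ClassMetadata?, vtable: [Method]) {
        self.superclass = superclass
        self.vtable = vtable
    }
}

enum AKeys {
    static let foo = DispatchKey(symbol: "A.foo")
    static let bar = DispatchKey(symbol: "A.bar")
    // Methods declared directly by A, in library-private slot order.
    static let declared = [foo, bar]
}

// Per-class lookup for A: a linear scan over A's own keys, one comparison
// each, followed by a load from whichever isa the caller supplied.
func aMethodLookup(_ isa: ClassMetadata, _ key: DispatchKey) -> Method? {
    for (slot, candidate) in AKeys.declared.enumerated() where candidate == key {
        return isa.vtable[slot]
    }
    return nil
}

// A <- Mid <- B, where Mid and B both override foo.
let aClass = ClassMetadata(superclass: nil, vtable: [{ "A.foo" }, { "A.bar" }])
let midClass = ClassMetadata(superclass: aClass, vtable: [{ "Mid.foo" }, { "A.bar" }])
let bClass = ClassMetadata(superclass: midClass, vtable: [{ "B.foo" }, { "A.bar" }])

// super.foo() called from B: load B's superclass, then look up A.foo in it.
let superIsa = bClass.superclass!
let superFoo = aMethodLookup(superIsa, AKeys.foo)
print(superFoo?() ?? "not found")
```

Because the scan lives entirely inside the library, it could later be swapped for a hash table or descriptor-based lookup without changing what clients call.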


Extensions

[John's thought on extensions]
To ensure that we have a lookup function in the correct module and
class for every possible super call, a module will need to export both
a per-class lookup function and a dispatch function whenever it
defines an "ABI-new" public non-final class method. An "ABI-new"
method is either a non-override, a method that was already defined in
a previous version of the library, or an override that in some way
affects the signature.

Witness tables

The really sad news, for anyone reading this far, is that this
discussion also applies to protocol methods. We can't rely on
availability-sorted witness tables either. This means that protocol
dispatch will have some resilience overhead--about the same as a
virtual call. Slava suggested exporting a dispatch symbol per method
declared by a Requirement, which is analogous to vtable approach #2.
There is a simplicity advantage to using the same resilience
mechanism for both kinds of dispatch.

The only reasonable alternative I can think of is to continue exposing
witness tables as ABI, but export a symbol for each protocol method
which would store the witness method index. I suspect that the
dependent GOT load will be more expensive than directly calling into a
thunk, but we may want to leave time in the schedule to evaluate this
performance tradeoff before ABI freeze.
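
The two witness-table options can be contrasted in a toy model. `WitnessTable`, `RequirementLayout`, `exportedDoubledIndex`, and `dispatchDoubled` are hypothetical names; the sketch shows only where the loads happen, not their real cost.

```swift
// Toy witness table. Every name here is hypothetical.
typealias Witness = (Int) -> Int

struct WitnessTable {
    let entries: [Witness]
}

// --- Library side ---
enum RequirementLayout {
    // Private to the library; may change between versions.
    static let doubledSlot = 1
}

// Alternative A: export the index as data for clients to load.
let exportedDoubledIndex = RequirementLayout.doubledSlot

// Alternative B: export a dispatch thunk and keep the index private.
func dispatchDoubled(_ table: WitnessTable, _ value: Int) -> Int {
    return table.entries[RequirementLayout.doubledSlot](value)
}

// --- Client side ---
let table = WitnessTable(entries: [{ $0 + 1 }, { $0 * 2 }])

// A: load the index, use it for a dependent load from the table, then call indirectly.
let viaIndex = table.entries[exportedDoubledIndex](21)

// B: call the thunk directly; the loads happen on the library side.
let viaThunk = dispatchDoubled(table, 21)

print(viaIndex, viaThunk)
```
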

I suspect that the best answer here is to allow for @inlinable witness
tables in performance critical cases. However, we don't have a plan
for that in Swift 5. This is a potential performance risk.