dr2chase/18597-args-results-in-registers.md

## 18597-args-results-in-registers.md

      
# Proposal: Passing Go arguments and results in registers.

Author(s): David Chase
Last updated: 2017-01-10
Discussion at https://golang.org/issue/18597.
## Abstract

Modify the compiler, assembler, linker and runtime so that (some) arguments to function and method calls and (some) results returned from function and method calls will use registers instead of stack memory.
## Background

Go currently uses stack-allocated memory for passing arguments to, and returning results from, functions and methods.
This is simpler and more uniform, but somewhat slower (most other compiled programming languages use registers instead).
Prototyping suggests that using registers instead would yield a 5-10% general performance improvement, depending on the target platform (5% expected on amd64, 10% expected on ppc64).
Go has additional constraints that compiled languages like C, C++, and Fortran do not.
Simplicity has value in the Go implementation.
It’s not the only value, but it counts.
The existing implementation is very simple, and easy for the garbage collector and stack backtrace generator to deal with.
Go has a non-conservative garbage collector, which means that when execution is at a “safe point” every live pointer must be located exactly, in particular those passed as function parameters or returned as function results.
Go stacks are relocated as they grow, and the stack bound check that triggers growth occurs in function prologues; in practice this means that registers may need to be spilled (for stack relocation) before a frame has been constructed.
These stack checks are inserted by the assembler itself (either as a backend to the compiler or run on assembly-language source).
The garbage collector’s need to locate pointers also complicates use of callee-saved registers (this relates to the possibility of using existing ABIs) because the function that ultimately spills a callee-saved register does not know if the contents of that register are a pointer or not.
This is not an insurmountable complication, but it is a complication.
Go allows a function to return multiple results.
While this is conceptually no different from returning a structure, the lighter-weight syntax and certain programming idioms make it far more common, which suggests that it would be beneficial to use an efficient way of returning multiple results.
Go requires that stacks can be unwound and described cleanly and quickly, in particular for garbage collection.
Parts of the Go runtime assume that arguments are stored contiguously in stack memory (e.g., runtime/panic.go:deferproc).
Within Go, a function may be invoked

- directly from compiled Go code;
- directly from hand-written assembly language;
- indirectly via a function pointer which also initializes an environment pointer;
- via a “go” statement, which constructs a memory image of a call frame, then allocates and starts a new goroutine that uses that frame but discards the final result;
- via a “defer” statement, which is very similar to a “go” statement;
- via reflection, which constructs a memory image of a call frame, invokes the function, then receives and returns the final result.
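
As a source-level illustration of the invocation paths listed above, the following sketch calls one function through a function value, a go statement, a defer statement, and reflection. The helper names (`viaClosure`, `viaGo`, `viaDefer`, `viaReflect`, `addTo`) are made up for this example, and the call from hand-written assembly is omitted because it cannot be shown in a single Go file.

```go
package main

import (
	"fmt"
	"reflect"
)

// add is the callee reached through every invocation style below.
func add(a, b int) int { return a + b }

// viaClosure calls through a function value. The closure captures n,
// which it reaches at run time through its environment pointer.
func viaClosure(n int) int {
	addN := func(x int) int { return add(x, n) }
	return addN(1)
}

// addTo exists because a go statement discards results;
// the sum is delivered over a channel instead.
func addTo(ch chan<- int, a, b int) { ch <- add(a, b) }

// viaGo starts the call in a new goroutine. The arguments are evaluated
// here and handed to the new goroutine.
func viaGo(a, b int) int {
	ch := make(chan int)
	go addTo(ch, a, b)
	return <-ch
}

// viaDefer evaluates the arguments at the defer statement and runs the
// call when the function returns. The deferred call overwrites the
// named result after "return 0" has set it.
func viaDefer(a, b int) (sum int) {
	store := func(x, y int) { sum = add(x, y) }
	defer store(a, b)
	return 0
}

// viaReflect builds the argument list as reflect.Values, calls add,
// and unpacks the single result.
func viaReflect(a, b int) int {
	args := []reflect.Value{reflect.ValueOf(a), reflect.ValueOf(b)}
	out := reflect.ValueOf(add).Call(args)
	return int(out[0].Int())
}

func main() {
	fmt.Println(add(1, 2)) // direct call from compiled Go code
	fmt.Println(viaClosure(2))
	fmt.Println(viaGo(1, 2))
	fmt.Println(viaDefer(1, 2))
	fmt.Println(viaReflect(1, 2))
}
```

Each of these paths has to agree with the callee about where arguments and results live, which is why a register-based convention touches all of them.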

Go currently has an uneasy relationship with debuggers; its use of goroutines (m-on-n green threads) and relocating stacks means that conventional debuggers for C/C++ can get “confused” and the current compiler/assembler’s DWARF lacks many details that would be helpful to debuggers.
Delve is somewhat more Go-friendly but still suffers from inadequate Go compiler DWARF support; this needs to improve, and use of registers for args ought not impede it.
See for example: golang/go#18247
## Proposal

### Compiler changes
The compiler changes are mostly confined to the SSA back end, and the prototype contains most of the tricky ones already.
The opcodes are as generic as possible to reduce the amount of machine-dependent code, with the actual register mapping specified in the per-machine opcode files; that mapping is consumed by the SSA opcode emitter (gc/ssa.go) to determine whether a register is used, and by the register allocator to bind to a machine-specific register.
To get stack splitting and rescheduling right, the stack<->register mapping information is passed along to the assembler.
Existing SSA combine/select ops can be used to create and dismantle structure-typed values (in particular, interfaces, strings, and slices).
The rule for assigning registers is to flatten arguments and results down to a sequence of primitive-typed values (floatX, integerY, pointer, boolean) tied to stack locations for spilling, and sequentially assign each to either a floating point or general-purpose register, as appropriate to the type.
For example, a complex64 would use a pair of floating-point registers and an interface would use a pair of general-purpose registers.
In the prototype, the number and assignment of registers to arguments/results (they use the same registers) appears in the ArchOps.go file, like so (AMD64):
```go
// This slice determines the registers used for integer args/results
// Constraints on registers: normal entry seq uses CX to get G register.
// GC-inserted return tripwires use BX, CX, DX, R8, so that limits use for return values.
var argIregs = []string{"R10", "R11", "R12"} // plan for also , "R13", "R14", "R15"
// This slice determines the registers used for floating args/results
var argFregs = []string{"X0", "X1", "X2"} // plan for also , "X3", "X4", "X5"
```
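
The flatten-then-assign rule described above can be sketched as a small model. This is not the compiler's code: `flatten` and `assignRegs` are hypothetical names, the `reflect` package stands in for the compiler's type information, and two behaviours are assumptions rather than statements from the proposal (arrays are flattened element by element, and a value stays in its stack slot once the registers of its class run out).

```go
package main

import (
	"fmt"
	"reflect"
	"strings"
)

// Register lists copied from the AMD64 prototype excerpt.
var argIregs = []string{"R10", "R11", "R12"}
var argFregs = []string{"X0", "X1", "X2"}

// flatten appends the register class of each primitive in t, in order:
// "f" for floating point, "i" for integers, booleans, and pointers.
func flatten(t reflect.Type, out []string) []string {
	switch t.Kind() {
	case reflect.Float32, reflect.Float64:
		return append(out, "f")
	case reflect.Complex64, reflect.Complex128:
		return append(out, "f", "f") // real and imaginary parts
	case reflect.String:
		return append(out, "i", "i") // data pointer, length
	case reflect.Slice:
		return append(out, "i", "i", "i") // data pointer, length, capacity
	case reflect.Interface:
		return append(out, "i", "i") // type word, data word
	case reflect.Struct:
		for i := 0; i < t.NumField(); i++ {
			out = flatten(t.Field(i).Type, out)
		}
		return out
	case reflect.Array: // assumption: flattened element by element
		for i := 0; i < t.Len(); i++ {
			out = flatten(t.Elem(), out)
		}
		return out
	default:
		return append(out, "i") // ints, bools, pointers, maps, chans, funcs
	}
}

// assignRegs takes a function value, flattens its parameters, and gives
// each primitive the next free register of its class.
// Assumption: "stack" is used once a class has no registers left.
// It panics if fn is not a function.
func assignRegs(fn interface{}) string {
	t := reflect.TypeOf(fn)
	var prims []string
	for i := 0; i < t.NumIn(); i++ {
		prims = flatten(t.In(i), prims)
	}
	var locs []string
	ni, nf := 0, 0
	for _, p := range prims {
		switch {
		case p == "f" && nf < len(argFregs):
			locs = append(locs, argFregs[nf])
			nf++
		case p == "i" && ni < len(argIregs):
			locs = append(locs, argIregs[ni])
			ni++
		default:
			locs = append(locs, "stack")
		}
	}
	return strings.Join(locs, " ")
}

func main() {
	fmt.Println(assignRegs(func(x int, y float64, s string) {}))
	fmt.Println(assignRegs(func(c complex64, e error, p *int) {}))
}
```

The integer and floating-point counters advance independently, so a float parameter placed between two integers does not consume a general-purpose register.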

The instruction scheduler is modified to place the generic "store"-into-register argument-passing ops as close to the call as possible, because they don’t reduce register pressure (unlike true stores).
It may be necessary to modify the register allocator and machine-specific calls to take registers as inputs if this is not adequate to prevent the register allocator from stumbling on the argument registers.
Before-and-after examples needed here for normal code
### Go and defer statements
The go statement takes a function pointer and calls it in a new goroutine.
To do this it assembles a fake “caller’s stack frame” to start the function call.
Because the called function may expect arguments in registers, that frame alone is not adequate, and a shim must be created that will call the targeted function after loading the argument registers.
Because results are discarded there is no need to deal with storing results into a frame.
See runtime/proc.go:newproc and runtime/panic.go:deferproc, gopanic, and runtime/asm_ARCH.s:reflectcall.
Probably the most efficient way to implement the shim is to make it take an additional parameter that is the function to be called and rewrite the go/defer to pass that parameter.
The shim will (in-place) load the appropriate registers, then branch to the desired function without creating a new frame.
This probably requires a cloned version of reflectcall that does not bother with return values, since those are discarded and would also be in the wrong place because of the additional argument on the stack.
### Runtime/reflection changes and calls through reflection
Reflection ultimately calls tricky assembly language.
For the call side, an extra parameter supplies the offsets and sizes of the various register arguments; for the result side a similar extra parameter will provide information about where to store.
Most likely this will use a 16-bit encoding, with 4 bits for the type and 12 bits for the offset.
The likely type codes are 0 = absent, 1-6 = {1, 2, 4 byte} x {signed, unsigned}, 7 = 64-bit integer, 8 = 32-bit float, and 9 = 64-bit float, with 10-15 unused for now (128-bit values seem like a possibility for the future).
Function type data would/could include a slot for this, but it would be lazily initialized to save space (because reflection is relatively rare).
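
A minimal sketch of the 16-bit descriptor described above, under two assumptions that the proposal does not settle: the type code sits in the high 4 bits with the offset in the low 12, and codes 1-6 are ordered by size with signed before unsigned. The names `argKind`, `encode`, and `decode` are hypothetical.

```go
package main

import "fmt"

// argKind is the 4-bit type code.
// Assumption: the order chosen for codes 1-6 is illustrative only.
type argKind uint16

const (
	absent   argKind = 0
	int8T    argKind = 1
	uint8T   argKind = 2
	int16T   argKind = 3
	uint16T  argKind = 4
	int32T   argKind = 5
	uint32T  argKind = 6
	int64T   argKind = 7 // "64i"
	float32T argKind = 8 // "32f"
	float64T argKind = 9 // "64f"
)

// encode packs a type code and a frame offset into one 16-bit value.
// Assumption: type in the high 4 bits, offset in the low 12 bits.
func encode(k argKind, offset int) (uint16, error) {
	if k > float64T {
		return 0, fmt.Errorf("type code %d is unused", k)
	}
	if offset < 0 || offset > 0xFFF {
		return 0, fmt.Errorf("offset %d does not fit in 12 bits", offset)
	}
	return uint16(k)<<12 | uint16(offset), nil
}

// decode splits a descriptor back into its type code and offset.
func decode(d uint16) (argKind, int) {
	return argKind(d >> 12), int(d & 0xFFF)
}

func main() {
	d, err := encode(float64T, 24)
	if err != nil {
		panic(err)
	}
	k, off := decode(d)
	fmt.Printf("descriptor=%#04x kind=%d offset=%d\n", d, k, off)
}
```

With 12 bits of offset, a descriptor can address the first 4096 bytes of an argument frame, which bounds how far into the frame a register argument's spill slot may sit under this layout.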
### Assembler changes
The assembler(s) need to include spill/unspill code to surround calls to the stack overflow code (which may not be stack overflow; can also be reschedule or GC).
The proposed implementation is to have the compiler generate appropriate boilerplate from the Go declarations of assembly-language functions, and the assembly will call the per-function boilerplate.
The assembler needs some way of knowing whether a function deals with this explicitly or if it should automatically insert the necessary code fragments, so there needs to be an additional function flag (along the lines of “nosplit”).
For example, for assembly function F, F.morestack performs the necessary spills, calls morestack, unspills, and jumps to entry of F.
F.entry performs spills, then jumps back into F (alternatively, on suitable platforms, there could be one per function type).
F.result loads result registers from stack, jumps back to epilogue.
One debugger + backtrace complication is that threads that are blocked because of rescheduling, GC, or stack growth will be seen as blocked in F.morestack unless that is special-cased away.
### CALLs in assembly language
CALLs within assembly language are problematic, but they are also uncommon outside the runtime.
The main problem is that the assembler doesn't know function types, so it doesn't know the assignment of integers, and in some rare cases the call is truly indirect (there are cases in the runtime where an indirect call is used to "fool the linker").
Possible choices:

1. repair calls by hand (least desirable)
2. for all functions, compiler generates before-and-after fixup stubs, and for direct calls the assembler surrounds a call to F with calls to F.before and F.after. This is no help for indirect calls, and also results in code generation of many stubs that will go unused.
3. a translation tool can scan assembly declarations in .go files and CALLs in .s files, for each package generate the per-function-type stubs needed for direct calls in .s files, and surround calls to F with Fsig.before and Fsig.after. This avoids the need to generate per-function stubs that will almost always go unused. The stub name could describe its effect, using a sequence of G(general), F(float), and P(pad) followed by a byte count (1-8).
   - a. the tool could generate these as it runs, per-package
   - b. the assembler could expand such calls inline
   - c. the linker could generate the stubs as needed

   The advantage of b and c is that the annotation for indirect calls could use the same stubs, though they'd need to be hand inserted.
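
One possible reading of the stub naming idea above, as a sketch only: each argument contributes a letter (G or F) followed by its size in bytes, and P entries record alignment padding between arguments. The proposal does not define the scheme in this detail, so the `slot` type, the `stubName` function, and the rule that each argument is aligned to its own size are all assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// slot describes one flattened argument (hypothetical type).
type slot struct {
	class byte // 'G' for general-purpose, 'F' for floating point
	size  int  // size in bytes; assumed to be 1, 2, 4, or 8
}

// stubName walks the arguments in frame order and emits a letter plus a
// byte count for each, inserting P<n> wherever n bytes of padding are
// needed. Assumption: every argument is aligned to its own size.
func stubName(slots ...slot) string {
	var b strings.Builder
	off := 0
	for _, s := range slots {
		if pad := (s.size - off%s.size) % s.size; pad > 0 {
			fmt.Fprintf(&b, "P%d", pad)
			off += pad
		}
		fmt.Fprintf(&b, "%c%d", s.class, s.size)
		off += s.size
	}
	return b.String()
}

func main() {
	// A signature like func(int32, float64) under the assumptions above.
	fmt.Println(stubName(slot{'G', 4}, slot{'F', 8}))
}
```

Because the name depends only on the flattened argument layout, functions with different Go types but the same layout would share a stub, which fits the goal of generating per-function-type rather than per-function stubs.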

Note that all functions will be new-style; either the compiler will create them that way, or the assembler will have to insert fixup code at function entry and exits to force them to be new-style.
## Rationale

This proposal elects not to use standard ABIs, for several reasons.
First, there is a diversity of standard ABIs, so it is a lot of work (and a lot of testing) to do them all.
Second, only one of those ABIs (Arm64) is a reasonable match for usual-case code in Go; the common problem with the others is the lack of support for returning multiple values in registers.
Third, because there is a need for someplace to spill during stack growth, and for that someplace to be understandable by the garbage collector, it is exceedingly convenient to reuse the memory already allocated in the existing calling convention.
Fourth, the main justification for using standard ABIs is to make life easier for debuggers, but those all claim to use DWARF to locate arguments and variables (results, however, are not easily expressible in DWARF according to recent discussions) and Go’s non-standard way of managing stacks forces special treatment anyway.
Go’s non-standard threads virtually guarantee the need for a medium-weight shim in any call-out to other languages, so those benefits from standard calling conventions are also not available.
This proposal passes structures field-by-field instead of as a memory image, because that seemed simpler and on platforms of interest there are enough registers (at least 6 integer and 6 float) to put everything into registers in the usual case.
## Compatibility

[A discussion of the change with regard to the
compatibility guidelines.]
## Implementation

[A description of the steps in the implementation, who will do them, and when.
This should include a discussion of how the work fits into Go's release cycle.]
Do most work and testing on AMD64 and PPC64le; PPC64 may be first real target.
Initial work will continue with the existing pragma-based prototype as long as possible so that only targeted methods will use the new calling conventions.

1. Handle results-in-registers.
2. Handle aggregate arguments.
3. Handle aggregate results.
4. Get DWARF right for arguments in registers.
   Not sure of story for structure-typed arguments.
5. Get DWARF right for results in registers.

Because deferproc and newproc don’t return the result of their calls, “enhance” means that they put arguments both in memory and in registers, and so they are indifferent to which convention their callee uses.

6. Enhance deferproc (per-platform assembly language?).
7. Enhance newproc (per-platform assembly language?).
8. Emit assembler stubs from Go declarations (but do not use).

At this point, it is probably necessary to attempt to use the new calling conventions for all calls.

9. Modify assembler to use stubs.
10. Modify any CALL instructions in assembly language that target Go functions.
11. Enhance reflection calls (per-platform assembly language?).

## Open issues (if applicable)

[A discussion of issues relating to this proposal for which the author does not
know the solution.
This section may be omitted if there are none.]