aarzilli/locktorecv.md

Proposal: runtime: allow N goroutines to be simultaneously locked to the same OS thread

Summary

The aim of this proposal is to extend the LockOSThread mechanism so that it is possible to create an N:1 association between N goroutines and 1 OS thread, by adding the following function to the runtime package:
func LockToRecvOSThread[T any](ch <-chan T)

When LockToRecvOSThread is called, the calling goroutine will be locked to the same thread as the goroutine receiving on ch, which is assumed to already be locked to an OS thread.
Programs could use this interface to work around the performance problems described in issue #21827 in some cases.
Motivation

Operating systems and C libraries will sometimes insist that all calls be made from the same OS thread. Examples of this include macOS's GUI API, OpenGL and ptrace on Linux.
This introduces some difficulties in Go, where a goroutine may normally be scheduled on any thread in a pool of OS threads. The recommended way to deal with such APIs is to call LockOSThread, which creates a 1:1 association between a goroutine and a thread.
Users of LockOSThread, however, also opt out of Go's concurrency paradigm: all calls to the special API have to happen in the same goroutine, and the solution to the problem has to be written in a sequential style even if a concurrent solution would otherwise be easier to implement.
This sometimes leads users of LockOSThread to create what I will call a "syscall server" goroutine:
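// syscallServer starts a goroutine locked to one OS thread and returns a
// channel; every function sent on the channel runs on that thread.
// (Requires import "runtime".)
func syscallServer() chan<- func() {
	ch := make(chan func())
	go func() {
		runtime.LockOSThread()
		defer runtime.UnlockOSThread()
		// Serve requests until the channel is closed.
		for fn := range ch {
			fn()
		}
	}()
	return ch
}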

This arrangement allows programs that need to use LockOSThread to be written in a normal concurrent style. Delve has one such facility and github.com/faiface/mainthread is a library implementing this technique in a general way.
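As an illustration of how callers typically use such a server, here is a minimal sketch. The callOnServer helper is hypothetical (it is not taken from Delve or from github.com/faiface/mainthread); it simply wraps a request so that the calling goroutine blocks until the closure has run on the locked thread and then receives its result.

```go
package main

import (
	"fmt"
	"runtime"
)

// syscallServer is the pattern from the text: one goroutine pinned to an OS
// thread runs every closure sent on the returned channel.
func syscallServer() chan<- func() {
	ch := make(chan func())
	go func() {
		runtime.LockOSThread()
		defer runtime.UnlockOSThread()
		for fn := range ch {
			fn()
		}
	}()
	return ch
}

// callOnServer is a hypothetical helper: it sends fn to the server and
// blocks until fn has finished, handing the result back to the caller.
func callOnServer[T any](ch chan<- func(), fn func() T) T {
	done := make(chan T)
	ch <- func() { done <- fn() }
	return <-done
}

func main() {
	server := syscallServer()
	defer close(server)

	// Any goroutine can issue requests; each one runs on the locked thread.
	sum := callOnServer(server, func() int { return 2 + 3 })
	fmt.Println(sum) // prints 5
}
```

Note that every request costs two channel operations (one to submit the closure, one to return the result), each of which crosses between an ordinary goroutine and the thread-locked one.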
The problem with this approach is that the interaction between LockOSThread and channel sends imposes a much greater performance penalty on the program than would be expected from simply using a channel. Normal use of a channel is fast because it doesn't involve the OS scheduler and the Go scheduler has specially optimized paths for it. But when LockOSThread is involved, both the full cost of the Go scheduler and the full cost of the OS scheduler are paid every time a send on the channel happens.
On some workloads this overhead can be as high as 60% of the execution time.
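One rough way to observe the effect is to time the same request/response loop against a worker that is locked to a thread and one that is not. The sketch below is only an assumption about how such a measurement could be set up; the roundTrips function and the iteration count are illustrative, and the numbers it prints depend heavily on the machine, OS and Go version. It is not meant to reproduce the figure quoted above.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// roundTrips starts a worker goroutine, optionally locked to an OS thread,
// and measures how long n request/response round trips over channels take.
func roundTrips(n int, lock bool) time.Duration {
	req := make(chan func())
	done := make(chan struct{})
	go func() {
		if lock {
			runtime.LockOSThread()
			defer runtime.UnlockOSThread()
		}
		for fn := range req {
			fn()
			done <- struct{}{}
		}
	}()

	start := time.Now()
	for i := 0; i < n; i++ {
		req <- func() {} // an empty request isolates the scheduling cost
		<-done
	}
	elapsed := time.Since(start)
	close(req)
	return elapsed
}

func main() {
	const n = 100000
	fmt.Println("unlocked worker:", roundTrips(n, false))
	fmt.Println("locked worker:  ", roundTrips(n, true))
}
```

Because the requests do no work, any difference between the two timings comes from scheduling rather than from the calls themselves.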
The proposed LockToRecvOSThread API would eliminate the need for a syscall server and allow programs that use an API requiring LockOSThread to place as many goroutines as they need on the blessed thread, improving their performance.
Case study: Delve

It could be argued that the syscall server is simply a common anti-pattern: you picked the wrong level of granularity for a costly operation (which you could have reasonably expected to be costly).
However, I have one example, Delve, where (I think) getting rid of the syscall server would reduce the code quality.
Delve supports call injection: the user can write something like:
(dlv) call f(x) + i

and Delve will evaluate the f(x) + i expression by resuming execution of the target program as needed to evaluate the call to function f. When this happens, three goroutines are primarily involved:

the aforementioned "syscall server" goroutine
a goroutine running the event loop of Delve, which takes care of stop, resume and breakpoint handling (which could happen concurrently with the call injection)
a goroutine running a recursive AST interpreter for the expression passed to call; this goroutine will occasionally ask the "event loop" goroutine to resume the target process so that call injections can make progress.

Both the event loop goroutine and interpreter goroutine will issue requests to the syscall server to read and write memory and to resume execution of the target program.
To remove the need for a syscall server, the expression interpreter would have to be rewritten, likely in a continuation-passing style, which would make it considerably less understandable.
Behavior of LockToRecvOSThread

The goroutine calling LockToRecvOSThread will be locked to the same thread as the goroutine that is currently receiving on ch:

if ch is nil or closed, LockToRecvOSThread will panic
if no goroutine is receiving from ch, LockToRecvOSThread will wait until one appears
if the goroutine calling LockToRecvOSThread is already locked to a thread, LockToRecvOSThread will panic
if the goroutine receiving from ch did not call LockOSThread, LockToRecvOSThread will panic

The thread lock acquired by LockToRecvOSThread can be released by calling UnlockOSThread.
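To make the intended call pattern concrete, here is a sketch of how a program might look if the proposal were accepted. LockToRecvOSThread does not exist in the runtime today, so the version below is a no-op placeholder declared in the example itself: the code compiles and runs, but nothing is actually moved onto the owner's thread. The runOnSharedThread function and its parameters are likewise hypothetical.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// LockToRecvOSThread is a PLACEHOLDER for the proposed runtime function.
// It does not exist in the Go runtime; this no-op stub only lets the
// sketch compile and does not lock anything.
func LockToRecvOSThread[T any](ch <-chan T) {}

// runOnSharedThread shows the intended shape: an owner goroutine locks a
// thread and parks in a receive on ch, while workers join that thread.
func runOnSharedThread(workers int, work func(id int)) {
	ch := make(chan struct{})
	ownerDone := make(chan struct{})

	go func() {
		defer close(ownerDone)
		runtime.LockOSThread()
		defer runtime.UnlockOSThread()
		<-ch // stay receiving so callers can find this thread
	}()

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			LockToRecvOSThread(ch) // proposed: join the owner's thread
			// Per the proposal, UnlockOSThread releases the lock. With
			// the stub nothing is locked, so this call is a no-op.
			defer runtime.UnlockOSThread()
			work(id) // calls to the thread-bound API would go here
		}(i)
	}
	wg.Wait()

	// Release the owner by sending rather than closing, since the
	// proposal says a closed channel makes LockToRecvOSThread panic.
	ch <- struct{}{}
	<-ownerDone
}

func main() {
	runOnSharedThread(3, func(id int) {
		fmt.Println("worker", id, "would call the thread-bound API here")
	})
}
```

Compared with the syscall server, each worker here calls the special API directly instead of packaging every call as a closure and sending it over a channel.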
Possible problems

Because of how LockOSThread is currently implemented, this proposal requires a non-trivial change to the Go scheduler, a complex, performance-sensitive piece of code that few people understand well.
It could be that this change will degrade the performance of other programs.
It could be that the assumptions about what causes the negative interaction between channel operations and LockOSThread are wrong and this does not solve the problem.
It could be that most programs that have the problem described in issue #21827 can be easily refactored to have a different granularity of requests to the "syscall server" goroutine and this is only actually useful to Delve.
It could be that the problem in issue #21827 can be solved by other optimizations that do not require a new API or changes to user programs.