Crystal MTs (Multi Threaded schedulers)

Goals:

  • developers must only care about fibers and channels;
  • the Crystal runtime should start 1 up to N threads;
  • each thread can take & resume any fiber in the loop;
  • channels as user-code sync primitives;
  • also propose fiber-aware mutex/monitors;
  • additional sync primitives (e.g. IO events, ...)

Issues (performance):

  • thread synchronization is slow;
  • locks create contention points: one thread can be suspended while holding a lock, blocking other threads from progressing.

Thread Safety:

  • Schedulers must be thread-safe;
  • Event Loop must be thread-safe;
  • Garbage Collector must be thread-safe;
  • lazily initialized Crystal constants must be thread-safe;
  • sync primitives (e.g. Socket) must be thread-safe.

Proposal: Stealing Schedulers

I propose an implementation using stealing schedulers, following the "Thread Scheduling for Multiprogrammed Multiprocessors" (2001) paper by Nimar S. Arora, Robert D. Blumofe and C. Greg Plaxton. The information below roughly describes the algorithm, with some additional details specific to adapting it for Crystal fibers.

Some details also come from a proposal to Rust's May library to implement stealing schedulers & parking threads: Xudong-Huang/may#32

Main Loops

Crystal will start N threads; each thread will be capable of resuming any fiber, except for a thread's main fiber: it represents the thread's main stack, and should be tied to the thread's scheduler, to avoid weird situations where a thread's stack is resumed by another thread.

Each Scheduler will have a main loop, running in the thread's main stack (i.e. the thread's main fiber). It shall be resumed as soon as a fiber is enqueued, to avoid holding on to the current fiber, which another thread could otherwise be trying to resume (e.g. once stolen, or on a ready event, ...).

The scheduler's main function will first try to process its internal queue, then try to steal fibers from a random scheduler, then fall back to running the event loop in a non-blocking manner (i.e. without waiting for events to be ready) to fill the internal queue, and eventually give up by yielding CPU time to another thread.
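A rough sketch of that loop follows; every name involved (Runnables, EventLoop, steal_from_random_scheduler, Thread.yield) is an illustrative placeholder, not an actual Crystal API:

```crystal
class Scheduler
  # hypothetical dependencies; see the Runnables sketch further below
  @runnables = Runnables(Fiber).new
  @event_loop = EventLoop.new

  def run_loop : Nil
    loop do
      if fiber = @runnables.pop                  # 1. local queue first
        fiber.resume
      elsif fiber = steal_from_random_scheduler  # 2. steal from another queue
        fiber.resume
      elsif @event_loop.run(blocking: false)     # 3. poll ready events (don't wait)
        # ready fibers were enqueued; loop again to pick one up
      else
        Thread.yield                             # 4. nothing to do: yield CPU time
      end
    end
  end
end
```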

An alternative is to run the event loop in a specific thread, but that thread could pop many resumable fibers out of the event loop (as soon as they're ready) and create a contention point: a single queue that all schedulers take fibers from. Running the event loop periodically in whatever thread is starving should instead dispatch the ready events into different queues, which other starving threads can then steal from. Still, the different strategies will have to be tested and benchmarked under different scenarios.

Scheduler operations such as reschedule and yield should take fibers directly from the internal queue instead of going through the main loop, in order to swap directly from the current fiber to the next one. The main loop should only be resumed when the queue is empty, to avoid holding on to the current fiber for too long.
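For example, reschedule could look like the following sketch (swapcontext and resume_main_loop being hypothetical placeholders):

```crystal
class Scheduler
  def reschedule : Nil
    if fiber = @runnables.pop
      swapcontext(fiber)  # swap directly from the current fiber to the next one
    else
      resume_main_loop    # empty queue: return to the scheduler's main loop
    end
  end
end
```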

In situations where there are more threads than runnable fibers, a thread should eventually decide to park itself until fibers are enqueued, which should unpark at least one thread. This is implemented with a mutex/condition variable and a counter of parked threads. Enqueuing one or many fibers can then check whether there are parked threads, and signal (or broadcast) the condition variable to resume their main loops. The counter and checks are required to avoid costly signal/broadcast operations on each and every enqueue, performing them only when they're useful.

The counter is also required to prevent the last remaining thread from parking itself, otherwise the program would hang forever. That thread should instead run the event loop in a blocking manner, until there are ready events to resume normal operations.
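Here is a minimal sketch of the parking mechanism, assuming the stdlib's internal Thread::Mutex and Thread::ConditionVariable wrappers; the surrounding checks described above are only hinted at in comments:

```crystal
class Scheduler
  @@mutex = Thread::Mutex.new
  @@cond = Thread::ConditionVariable.new
  @@parked = Atomic(Int32).new(0)

  # Called by a scheduler's main loop when there is nothing left to do.
  # A real implementation must re-check the queues under the lock (a fiber
  # could have been enqueued in the meantime) and never park the last
  # running thread (it must run the event loop, blocking, instead).
  def park : Nil
    @@mutex.synchronize do
      @@parked.add(1)
      @@cond.wait(@@mutex)
      @@parked.sub(1)
    end
  end

  # Called after enqueueing one (or many) fibers.
  def unpark_one : Nil
    # skip the costly lock + signal while no thread is parked
    return if @@parked.get == 0
    @@mutex.synchronize { @@cond.signal } # broadcast when enqueueing many
  end
end
```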

Runnables Queue

Each thread will have its own Scheduler, each with its own dedicated queue of fibers to resume. Other schedulers can steal fibers from that queue when their own queue is empty.

Following the aforementioned paper, each scheduler will be able to push/pop from the bottom of its own queue, and to shift from the top of other queues to steal fibers. This allows the following assumptions: the push/pop operations can never be called concurrently, while the shift operation must always assume concurrent callers. These constraints reduce concurrent accesses to a minimum (i.e. only when a thread has nothing to do), thus avoiding thread synchronization until we have to (e.g. limiting failed atomic operations).
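The following is a minimal sketch of that deque, adapted from the paper's pseudo-code: the top index and an ABA tag are packed into a single atomic word, so both can be swapped with one CAS. It's only illustrative; a real implementation must also care about memory ordering and the cross-thread visibility of @bottom, which this sketch glosses over:

```crystal
class Runnables(T)
  TOP_MASK = 0xffffffff_u64

  def initialize(capacity : Int32 = 8192)
    @deque = Array(T?).new(capacity, nil)
    @bottom = 0                       # owner-only end (push/pop)
    @age = Atomic(UInt64).new(0_u64)  # top index (low bits) + ABA tag (high bits)
  end

  # Owner only: never called concurrently.
  def push(item : T) : Nil
    @deque[@bottom] = item
    @bottom += 1
  end

  # Owner only: races thieves only for the last remaining item.
  def pop : T?
    return nil if @bottom == 0
    @bottom -= 1
    local_bot = @bottom
    item = @deque[local_bot]
    old_age = @age.get
    old_top = (old_age & TOP_MASK).to_i32
    return item if local_bot > old_top

    # one or zero items left: reset bottom to 0 and bump the ABA tag
    @bottom = 0
    new_age = ((old_age >> 32) &+ 1) << 32
    if local_bot == old_top
      _, won = @age.compare_and_set(old_age, new_age)
      return item if won              # won the race for the last item
    end
    @age.set(new_age)
    nil
  end

  # Thieves: safe to call concurrently from any thread.
  def shift : T?
    old_age = @age.get
    old_top = (old_age & TOP_MASK).to_i32
    return nil if @bottom <= old_top
    item = @deque[old_top]
    new_age = (old_age & ~TOP_MASK) | (old_top.to_u64 &+ 1)
    _, won = @age.compare_and_set(old_age, new_age)
    won ? item : nil                  # nil on a lost race
  end
end
```

A shift that returns nil because of a lost CAS simply means another thief won the race; the stealing scheduler should then pick another victim, or fall back to the event loop.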

The paper assumes the underlying array to be "infinite". But Crystal being garbage collected, and the algorithm guaranteeing that the push/pop operations are never called concurrently and are the only ones that can grow the queue, the implementation can realloc/copy the array as needed, or simply rely on mmap (or VirtualAlloc with manual commits) to avoid the costly copy operations.

Schedulers will steal fibers from the top of the array, creating a hole at its beginning. Whenever the queue is empty, its bottom must be reset to 0 (its top), removing the hole. Since the queue is consumed from both the top and the bottom, it should always become empty sooner or later, preventing the array from ever growing. Yet, enqueueing an endless stream of fibers that schedulers can't fully process could eventually fill a lot of memory, if not all of it. Scheduler algorithms should thus prefer emptying the runnables queues over enqueueing many fibers at once (i.e. avoid processing the event loop again and again).

See the paper and its proof in "Verification of a Concurrent Deque Implementation" (1999) for additional details on the algorithm.

Issues

Fiber State

A running fiber can be enqueued before its context is saved, for example when using Fiber.yield, or when adding the current fiber to a wait list from which another thread could immediately pick it up and resume it.

Since there is a time window between enqueue and swapcontext, there is a possibility for a fiber to be resumed before its context is saved.

I propose that the makecontext & swapcontext functions set & unset a resumable flag telling whether a fiber can be resumed or not, such that schedulers can spin on the flag to wait until the fiber is indeed resumable.

This resumable flag will also be able to hold a dead state, set when the fiber's proc has returned, meaning the fiber can never be resumed again. Schedulers must error out when this state is reached.
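A minimal sketch of such a flag follows, with illustrative names; a real implementation would flip the state from inside the makecontext/swapcontext primitives:

```crystal
enum FiberState
  Running   # currently executing, context not saved yet
  Resumable # context saved, safe to resume from another thread
  Dead      # proc returned, resuming is an error
end

class FiberFlag
  @state = Atomic(FiberState).new(FiberState::Running)

  # called by swapcontext, right after the context has been saved
  def set_resumable : Nil
    @state.set(FiberState::Resumable)
  end

  # called when the fiber's proc returns
  def set_dead : Nil
    @state.set(FiberState::Dead)
  end

  # called by a scheduler right before resuming the fiber: spin while the
  # context is still being saved, error out if the fiber already died
  def wait_resumable : Nil
    loop do
      case @state.get
      when .resumable? then return
      when .dead?      then raise "BUG: can't resume a dead fiber"
      end
    end
  end
end
```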

main(), main user code & schedulers' main loop

Each scheduler has a main loop that blocks execution. This is perfect for the started scheduler threads, but we can't block the main thread, which must run the program.

The main thread runs the main user code, which has both core/stdlib initializations (e.g. class variables) and the actual user code of the running program. So:

  1. starting the scheduler's main loop will block the program from ever executing (oops);
  2. not starting the scheduler's main loop means we don't have a main loop to resume to, and the main thread's main fiber could be resumed by any thread (bad).

A solution to this problem is to explicitly initialize the multithreaded scheduler logic, and everything it depends on, in Crystal's main function (right after GC.init), then to start the scheduler threads and to spawn a fiber that runs the main user code, including the at_exit handlers (they may require fibers/events).

This imposes the following requirements on schedulers and everything they depend on (event loop, ...):

  • they mustn't depend on anything that will only be initialized in the main user code, because it won't be initialized until later on; for example @@eb = Event::Base.new would probably cause @@eb.add_event to segfault!
  • lazily initialized (class) variables that can be accessed concurrently must be thread-safe; for example def self.eb; @@eb ||= Event::Base.new; end can lead to multiple initializations (see the sketch below)!
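A minimal sketch of a thread-safe lazy initialization, simply taking a mutex on every access (a lock-free fast path is possible, but requires careful memory ordering); the EventLoop module name is only illustrative:

```crystal
module EventLoop
  @@eb : Event::Base?
  @@eb_mutex = Thread::Mutex.new

  # the mutex guarantees that Event::Base.new runs at most once, even when
  # many threads call .eb concurrently during initialization
  def self.eb : Event::Base
    @@eb_mutex.synchronize { @@eb ||= Event::Base.new }
  end
end
```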

Once the schedulers have been started, the main thread can be paused, using a thread mutex/condition variable, and eventually resumed when the fiber that ran the main user code signals the condition variable.

Instead of merely waiting until the program is done, the main thread could use a timed wait on the condition variable to periodically collect pending fiber stacks (instead of having a dedicated fiber for that); it could also become the only thread to receive signals, avoiding scheduler threads being interrupted; and maybe handle even more things.

Once resumed, the main thread should tell the scheduler threads to stop, then wait for them to properly exit, before exiting the program.
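Putting it together, Crystal's main function could look like the following sketch, where Scheduler.init, Scheduler.shutdown and run_main_user_code are hypothetical placeholders:

```crystal
def main
  GC.init
  Scheduler.init # schedulers, event loop, start the N scheduler threads

  mutex = Thread::Mutex.new
  cond = Thread::ConditionVariable.new
  done = false

  spawn do
    run_main_user_code # the actual program, including at_exit handlers
    mutex.synchronize do
      done = true
      cond.signal # wake the paused main thread up
    end
  end

  mutex.synchronize do
    until done
      cond.wait(mutex) # park the main thread (or timedwait to do some chores)
    end
  end

  Scheduler.shutdown # tell scheduler threads to stop, wait for them to exit
end
```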

Additional Papers

  • "Empirical Studies of Competitive Spinning for a Shared-Memory Multiprocessor" (1991)
  • "Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms" (1996)
  • "Scheduling Multithreaded Computations by Work Stealing" (1999)
  • "Verification of a Concurrent Deque Implementation" (1999)
  • "Thread Scheduling for Multiprogrammed Multiprocessors" (2001)
  • "An optimistic approach to lock-free FIFO queues" (2008)
  • "Flat Combining and the Synchronization-Parallelism tradeoff" (2010)