Crystal MTs (Multi Threaded schedulers)

Goals:

  • developers must only care about fibers and channels;
  • the Crystal runtime should start 1 up to N threads;
  • each thread can take & resume any fiber in the loop;
  • channels as user-code sync primitives;
  • also propose fiber-aware mutex/monitors;
  • additional sync primitives (e.g. IO events, ...)

Issues (performance):

  • thread synchronization is slow;
  • locks create contention points: one thread can be suspended while holding a lock, blocking other threads from progressing.

Thread Safety:

  • Schedulers must be thread-safe;
  • Event Loop must be thread-safe;
  • Garbage Collector must be thread-safe;
  • lazily initialized Crystal constants must be thread-safe;
  • sync primitives (e.g. Socket) must be thread-safe.

Proposal: Stealing Schedulers

I propose an implementation using stealing schedulers, following the "Thread Scheduling for Multiprogrammed Multiprocessors" (2001) paper by Nimar S. Arora, Robert D. Blumofe and C. Greg Plaxton. The information below roughly describes the algorithm, with some additional details specific to adapting it for Crystal fibers.

Some details also come from a proposal to Rust's May library to implement stealing schedulers & parking threads: Xudong-Huang/may#32

Main Loops

Crystal will start N threads; each thread will be capable of resuming any fiber, except for a thread's main fiber: it represents the thread's main stack, and should be tied to the thread's scheduler, to avoid weird situations where a thread's stack is resumed by another thread.

Each Scheduler will have a main loop, running in the thread's main stack (i.e. the thread's main fiber). It shall be resumed as soon as a fiber is enqueued, to avoid holding on to the current fiber, which another thread could otherwise be trying to resume (e.g. once stolen, or on a ready event, ...).

The scheduler's main function will first try to process its internal queue, then try to steal fibers from a random scheduler, then fall back to running the event loop in a non-blocking manner (i.e. without waiting for events to be ready) to fill the internal queue, and eventually give up by yielding CPU time to another thread.
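A rough sketch of that loop follows; every name involved (Runnables, EventLoop, steal_from_random_scheduler, Thread.yield) is an illustrative placeholder, not an actual Crystal API:

```crystal
class Scheduler
  # hypothetical dependencies; see the Runnables sketch further below
  @runnables = Runnables(Fiber).new
  @event_loop = EventLoop.new

  def run_loop : Nil
    loop do
      if fiber = @runnables.pop                  # 1. local queue first
        fiber.resume
      elsif fiber = steal_from_random_scheduler  # 2. steal from another queue
        fiber.resume
      elsif @event_loop.run(blocking: false)     # 3. poll ready events (don't wait)
        # ready fibers were enqueued; loop again to pick one up
      else
        Thread.yield                             # 4. nothing to do: yield CPU time
      end
    end
  end
end
```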

An alternative is to run the event loop in a specific thread, but that thread could pop many resumable fibers out of the event loop (as soon as they're ready) and create a contention point: a single queue that all schedulers take fibers from. Running the event loop periodically in whatever thread is starving should instead dispatch the ready events into different queues, which other starving threads can then steal from. Still, the different strategies will have to be tested and benchmarked under different scenarios.

Scheduler operations such as reschedule and yield should take fibers directly from the internal queue instead of going through the main loop, in order to swap directly from the current fiber to the next one. The main loop should only be resumed when the queue is empty, to avoid holding on to the current fiber for too long.
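For example, reschedule could look like the following sketch (swapcontext and resume_main_loop being hypothetical placeholders):

```crystal
class Scheduler
  def reschedule : Nil
    if fiber = @runnables.pop
      swapcontext(fiber)  # swap directly from the current fiber to the next one
    else
      resume_main_loop    # empty queue: return to the scheduler's main loop
    end
  end
end
```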

In situations where there are more threads than runnable fibers, a thread should eventually decide to park itself until fibers are enqueued, which should unpark at least one thread. This is implemented with a mutex/condition variable and a counter of parked threads. Enqueuing one or many fibers can then check whether there are parked threads, and signal (or broadcast) the condition variable to resume their main loops. The counter and checks are required to avoid costly signal/broadcast operations on each and every enqueue, performing them only when they're useful.

The counter is also required to prevent the last remaining thread from parking itself, otherwise the program would hang forever. That thread should instead run the event loop in a blocking manner, until there are ready events to resume normal operations.
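Here is a minimal sketch of the parking mechanism, assuming the stdlib's internal Thread::Mutex and Thread::ConditionVariable wrappers; the surrounding checks described above are only hinted at in comments:

```crystal
class Scheduler
  @@mutex = Thread::Mutex.new
  @@cond = Thread::ConditionVariable.new
  @@parked = Atomic(Int32).new(0)

  # Called by a scheduler's main loop when there is nothing left to do.
  # A real implementation must re-check the queues under the lock (a fiber
  # could have been enqueued in the meantime) and never park the last
  # running thread (it must run the event loop, blocking, instead).
  def park : Nil
    @@mutex.synchronize do
      @@parked.add(1)
      @@cond.wait(@@mutex)
      @@parked.sub(1)
    end
  end

  # Called after enqueueing one (or many) fibers.
  def unpark_one : Nil
    # skip the costly lock + signal while no thread is parked
    return if @@parked.get == 0
    @@mutex.synchronize { @@cond.signal } # broadcast when enqueueing many
  end
end
```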

Runnables Queue

Each thread will have its own Scheduler, each with its own dedicated queue of fibers to resume. Other schedulers can steal fibers from that queue when their own queue is empty.

Following the aforementioned paper, each scheduler will be able to push/pop from the bottom of its own queue, and to shift from the top of other queues to steal fibers. This allows the following assumptions: the push/pop operations can never be called concurrently, while the shift operation must always assume concurrent callers. These constraints reduce concurrent accesses to a minimum (i.e. only when a thread has nothing to do), thus avoiding thread synchronization until we have to (e.g. limiting failed atomic operations).
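The following is a minimal sketch of that deque, adapted from the paper's pseudo-code: the top index and an ABA tag are packed into a single atomic word, so both can be swapped with one CAS. It's only illustrative; a real implementation must also care about memory ordering and the cross-thread visibility of @bottom, which this sketch glosses over:

```crystal
class Runnables(T)
  TOP_MASK = 0xffffffff_u64

  def initialize(capacity : Int32 = 8192)
    @deque = Array(T?).new(capacity, nil)
    @bottom = 0                       # owner-only end (push/pop)
    @age = Atomic(UInt64).new(0_u64)  # top index (low bits) + ABA tag (high bits)
  end

  # Owner only: never called concurrently.
  def push(item : T) : Nil
    @deque[@bottom] = item
    @bottom += 1
  end

  # Owner only: races thieves only for the last remaining item.
  def pop : T?
    return nil if @bottom == 0
    @bottom -= 1
    local_bot = @bottom
    item = @deque[local_bot]
    old_age = @age.get
    old_top = (old_age & TOP_MASK).to_i32
    return item if local_bot > old_top

    # one or zero items left: reset bottom to 0 and bump the ABA tag
    @bottom = 0
    new_age = ((old_age >> 32) &+ 1) << 32
    if local_bot == old_top
      _, won = @age.compare_and_set(old_age, new_age)
      return item if won              # won the race for the last item
    end
    @age.set(new_age)
    nil
  end

  # Thieves: safe to call concurrently from any thread.
  def shift : T?
    old_age = @age.get
    old_top = (old_age & TOP_MASK).to_i32
    return nil if @bottom <= old_top
    item = @deque[old_top]
    new_age = (old_age & ~TOP_MASK) | (old_top.to_u64 &+ 1)
    _, won = @age.compare_and_set(old_age, new_age)
    won ? item : nil                  # nil on a lost race
  end
end
```

A shift that returns nil because of a lost CAS simply means another thief won the race; the stealing scheduler should then pick another victim, or fall back to the event loop.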

The paper assumes the underlying array to be "infinite". But Crystal being garbage collected, and the algorithm guaranteeing that the push/pop operations are never called concurrently and are the only ones that can grow the queue, the implementation can realloc/copy the array as needed, or simply rely on mmap (or VirtualAlloc with manual commits) to avoid the costly copy operations.

Schedulers will steal fibers from the top of the array, creating a hole at its beginning. Whenever the queue is empty, its bottom must be reset to 0 (its top), removing the hole. Since the queue is consumed from both the top and the bottom, it should always become empty sooner or later, preventing the array from ever growing. Yet, enqueueing an endless stream of fibers that schedulers can't fully process could eventually fill a lot of memory, if not all of it. Scheduler algorithms should thus prefer emptying the runnables queues over enqueueing many fibers at once (i.e. avoid processing the event loop again and again).

See the paper and its proof in "Verification of a Concurrent Deque Implementation" (1999) for additional details on the algorithm.

Issues

Fiber State

A running fiber can be enqueued before its context is saved, for example when using Fiber.yield, or when adding the current fiber to a wait list from which another thread could immediately pick it up and resume it.

Since there is a time window between enqueue and swapcontext, there is a possibility for a fiber to be resumed before its context is saved.

I propose that the makecontext & swapcontext functions set & unset a resumable flag telling whether a fiber can be resumed or not, such that schedulers can spin on the flag to wait until the fiber is indeed resumable.

This resumable flag will also be able to hold a dead state, set when the fiber's proc has returned, meaning the fiber can never be resumed again. Schedulers must error out when this state is reached.
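A minimal sketch of such a flag follows, with illustrative names; a real implementation would flip the state from inside the makecontext/swapcontext primitives:

```crystal
enum FiberState
  Running   # currently executing, context not saved yet
  Resumable # context saved, safe to resume from another thread
  Dead      # proc returned, resuming is an error
end

class FiberFlag
  @state = Atomic(FiberState).new(FiberState::Running)

  # called by swapcontext, right after the context has been saved
  def set_resumable : Nil
    @state.set(FiberState::Resumable)
  end

  # called when the fiber's proc returns
  def set_dead : Nil
    @state.set(FiberState::Dead)
  end

  # called by a scheduler right before resuming the fiber: spin while the
  # context is still being saved, error out if the fiber already died
  def wait_resumable : Nil
    loop do
      case @state.get
      when .resumable? then return
      when .dead?      then raise "BUG: can't resume a dead fiber"
      end
    end
  end
end
```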

main(), main user code & schedulers' main loop

Each scheduler has a main loop that blocks execution. This is perfect for the started scheduler threads, but we can't block the main thread, which must run the program.

The main thread runs the main user code, which has both core/stdlib initializations (e.g. class variables) and the actual user code of the running program. So:

  1. starting the scheduler's main loop will block the program from ever executing (oops);
  2. not starting the scheduler's main loop means we don't have a main loop to resume to, and the main thread's main fiber could be resumed by any thread (bad).

A solution to this problem is to explicitly initialize the multithreaded scheduler logic, and everything it depends on, in Crystal's main function (right after GC.init), then to start the scheduler threads and to spawn a fiber that runs the main user code, including the at_exit handlers (they may require fibers/events).

This imposes the following requirements on schedulers and everything they depend on (event loop, ...):

  • they mustn't depend on anything that will only be initialized in the main user code, because it won't be initialized until later on; for example @@eb = Event::Base.new would probably cause @@eb.add_event to segfault!
  • lazily initialized (class) variables that can be accessed concurrently must be thread-safe; for example def self.eb; @@eb ||= Event::Base.new; end can lead to multiple initializations (see the sketch below)!
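A minimal sketch of a thread-safe lazy initialization, simply taking a mutex on every access (a lock-free fast path is possible, but requires careful memory ordering); the EventLoop module name is only illustrative:

```crystal
module EventLoop
  @@eb : Event::Base?
  @@eb_mutex = Thread::Mutex.new

  # the mutex guarantees that Event::Base.new runs at most once, even when
  # many threads call .eb concurrently during initialization
  def self.eb : Event::Base
    @@eb_mutex.synchronize { @@eb ||= Event::Base.new }
  end
end
```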

Once the schedulers have been started, the main thread can be paused, using a thread mutex/condition variable, and eventually resumed when the fiber that ran the main user code signals the condition variable.

Instead of merely waiting until the program is done, the main thread could use a timed wait on the condition variable to periodically collect pending fiber stacks (instead of having a dedicated fiber for that); it could also become the only thread to receive signals, avoiding scheduler threads being interrupted; and maybe handle even more things.

Once resumed, the main thread should tell the scheduler threads to stop, then wait for them to properly exit, before exiting the program.
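Putting it together, Crystal's main function could look like the following sketch, where Scheduler.init, Scheduler.shutdown and run_main_user_code are hypothetical placeholders:

```crystal
def main
  GC.init
  Scheduler.init # schedulers, event loop, start the N scheduler threads

  mutex = Thread::Mutex.new
  cond = Thread::ConditionVariable.new
  done = false

  spawn do
    run_main_user_code # the actual program, including at_exit handlers
    mutex.synchronize do
      done = true
      cond.signal # wake the paused main thread up
    end
  end

  mutex.synchronize do
    until done
      cond.wait(mutex) # park the main thread (or timedwait to do some chores)
    end
  end

  Scheduler.shutdown # tell scheduler threads to stop, wait for them to exit
end
```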

Additional Papers

  • "Empirical Studies of Competitive Spinning for a Shared-Memory Multiprocessor" (1991)
  • "Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms" (1996)
  • "Scheduling Multithreaded Computations by Work Stealing" (1999)
  • "Verification of a Concurrent Deque Implementation" (1999)
  • "Thread Scheduling for Multiprogrammed Multiprocessors" (2001)
  • "An optimistic approach to lock-free FIFO queues" (2008)
  • "Flat Combining and the Synchronization-Parallelism tradeoff" (2010)