Fibers

Fibers are an abstraction over sequential computation, similar to threads but at a higher level. There are two ways to think about this model: by example, and abstractly from first principles. We'll start with the example.

(credit here is very much due to Fabio Labella, whose incredible Scala World talk describes these ideas far better than I can)

Callback Sequentialization

Consider the following three functions:

def connect(address: InetAddress)(onConnect: Connection => Unit): Unit = ???

def write(conn: Connection, bytes: ByteBuffer)(onComplete: () => Unit): Unit = ???

def close(conn: Connection)(onClose: () => Unit): Unit = ???

We should all be relatively familiar with this pattern. This is literally how NodeJS works, but this kind of thing appears in many different forms throughout the programming landscape. The idea in each case here is that these functions are asking the runtime to perform some action (e.g. writing bytes to a socket), and then registering a listener which will be invoked when that action is complete.

When we write code which uses these functions, it tends to look a bit like this (ignoring errors):

def doThings(onComplete: () => Unit): Unit = 
  connect(remoteAddr) { conn =>
    write(conn, SynBytes) { () =>
      write(conn, HelloWorldBytes) { () =>
        close(conn)(onComplete)
      }
    }
  }

Ignore for a moment how ugly this is (we'll solve that problem in a moment), and instead focus on the conceptual nature of what we just did. We have taken four asynchronous actions – connect, two writes, and close – and joined them together one after the other, creating a sequential process composed of each action in turn.

This is a fiber. When we view this as a fiber, it becomes easier to see how the following is not only possible but conceptually quite natural:

def connectIO(addr: InetAddress): IO[Connection] = 
  IO.async_(cb => connect(addr)(conn => cb(Right(conn))))

def writeIO(conn: Connection, bytes: ByteBuffer): IO[Unit] =
  IO.async_(cb => write(conn, bytes)(() => cb(Right(()))))

def closeIO(conn: Connection): IO[Unit] =
  IO.async_(cb => close(conn)(() => cb(Right(()))))


val doThingsIO: IO[Unit] =
  for {
    conn <- connectIO(remoteAddr)
    _ <- writeIO(conn, SynBytes)
    _ <- writeIO(conn, HelloWorldBytes)
    _ <- closeIO(conn)
  } yield ()

Cats Effect gives us the abstractions necessary to take code which logically represents a fiber (such as the original definition, doThings) and arrange and write that code in a nicely sequential style. This allows us to think in terms of fibers (like doThingsIO) and actions (like writeIO) rather than worrying about low-level details like threads and callbacks.
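
We can also make the fiber entirely explicit by handing doThingsIO to the runtime ourselves. The following is a minimal sketch, assuming the definitions above are in scope and using Cats Effect 3's IOApp (the object name is arbitrary):

import cats.effect.{IO, IOApp}

object Main extends IOApp.Simple {
  // start schedules doThingsIO as its own fiber, running concurrently with
  // this one; join (semantically) blocks until that fiber completes
  val run: IO[Unit] =
    for {
      fiber <- doThingsIO.start
      _     <- IO.println("doThingsIO is now running on its own fiber")
      _     <- fiber.join
    } yield ()
}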

Abstraction Tower

The "first principles" view of fibers starts from the observation that sequential evaluation is really a question of scheduling. The machine is going to perform certain actions when we instruct it to, and the order and timing which we impose on those instructions creates the illusion of sequentiality within higher level domains.

Each domain is responsible for implementing scheduling such that the next-higher level perceives effects such as sequentialization and blocking exactly as such, rather than in terms of lower-level ideas such as interrupts and paging. It looks a little bit like this:
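
Reading the tower from the top down, roughly:

  fibers            (scheduled by the IO runtime onto threads)
  threads           (scheduled by the kernel onto processors)
  processors        (scheduled by the hardware onto its internal units)
  hardware units    (fetch, decode, adder, writeback, and so on)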

The concepts of "sequence" and "blocking" are reified within each subsequent level in terms of lower-level primitives within the preceding level. Thus, from the perspective of a lower level, these concepts in the higher level are essentially an illusion, while in the higher level domain they are quite real and make up the fundamental building blocks of programs.

Hardware Units

The very first layer of this illusion starts in the hardware itself. Despite the lies we tell ourselves about assembly code, processors do not actually execute the instructions we give them in the order we dole them out. (note: I'm going to ignore most of the shenanigans done by modern processors in the interest of simplicity; there are several intermediate layers within the hardware layer which themselves could be split within the diagram, but I'm choosing to gloss over this)

The reason for this is simply the nature of hardware. The evaluation of any given assembly instruction (say, iadd %r1 %r2 %r3, which is a fictitious instruction that adds the integers in registers 1 and 2, storing the result in register 3) actually requires several intermediary steps within the processor. Each of those steps is, itself, an independent unit within the silicon. For example, the iadd instruction might be implemented by the following sequence of hardware units:

The first unit fetches the bytes which make up the instruction, the second decodes those bytes to determine what the hardware should actually do, the third is a specialized processor unit which just performs two's complement addition between two integers, while the final unit writes the result from the previous stage into the output register (%r3 in this case).

This is a very typical (albeit immensely simplified) flow which applies to almost all instructions within a typical RISC processor. Note that x86 is actually a CISC architecture, meaning that its assembly is much higher level and thus each instruction corresponds to many, many hardware stages within the pipeline, and so what we're describing in this section applies even more dramatically in practice than in this simplified description.

Here's the interesting bit in all of this: when a hardware unit (say, the decoder) is done working on a particular instruction, it can be tasked with working on the next instruction without interfering with the subsequent stage of the previous instruction. Each processor only has one decoder, but there's no particular reason that it needs to sit idle while waiting for any other hardware units. So in other words, imagine we have the following (silly) program:

iadd %r1 %r2 %r3
iadd %r1 %r4 %r6
iadd %r2 %r5 %r7
iadd %r8 %r9 %r10

If we evaluate these instructions naively, the flow will look something like this:
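
Roughly, with each instruction run to completion before the next begins:

  cycles  1-4    instruction 1: fetch, decode, add, writeback
  cycles  5-8    instruction 2: fetch, decode, add, writeback
  cycles  9-12   instruction 3: fetch, decode, add, writeback
  cycles 13-16   instruction 4: fetch, decode, add, writeback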

In this flow, the processor "focuses" on each instruction in turn, fully completing its execution before moving on to the next one, and requiring sixteen full cycles to evaluate four instructions. However, as mentioned earlier, the decoder (for example) doesn't really need to wait for the current instruction to complete before starting to decode the next one. In fact, all four of these instructions are entirely independent, which means that the processor can safely squish it all together and keep each hardware unit fully utilized during each stage of the evaluation:
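
Roughly, with each column representing one cycle and each row one hardware unit:

  cycle        1    2    3    4    5    6    7
  fetch        i1   i2   i3   i4
  decode            i1   i2   i3   i4
  add                    i1   i2   i3   i4
  writeback                   i1   i2   i3   i4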

Now instead of sixteen cycles to evaluate four instructions, we only need seven. The fetching of instruction 2 can begin while we decode instruction 1, and similarly the decoding for instruction 2 can begin while we perform the addition in instruction 1.

What the processor is doing here is automatically parallelizing the execution of "sequential" assembly instructions. This optimization (known as "pipelining") is so powerful, in fact, that processors will perform very complex analyses and even completely reorder and rearrange blocks of assembly in order to better align them to utilize the hardware more efficiently.

All of this is invisible at the higher level of the program though. From the perspective of the assembly programmer, each instruction is performed sequentially, one after the other. The CPU is only allowed to play these tricks in so far as it can maintain the illusion of strong ordering.

Thus, sequential evaluation is a trick, even at the "lowest" levels of assembly. The hardware conspires to create this illusion for the next layer up.

Processors

The illusion doesn't stop there though. Each processor represents an independent execution unit, capable of evaluating a stream of instructions independently in parallel from the other processors. As we know from the previous section, even the sequentialization of those basic instruction streams is an illusion, but from this layer onward, it isn't an illusion that we need to be concerned with. By raising the level of abstraction from hardware units (such as the decoder) to assembly instructions, we have simplified all subsequent layers.

We're going to keep playing that same trick, over and over again. In the case of the processor layer, the goal is to abstract away from the concrete, physically-set number of processor cores and create the illusion that an unbounded number of threads may be evaluating simultaneously. In a sense, the goal is to take our n CPUs and allow higher-level programmers to treat the system as if it has an infinite number of them.

This is achieved using preemptive multitasking within the operating system kernel. An abstraction called a thread is exposed by the kernel to developers writing applications in user-land. Threads are independent sequences of instructions, and the system can have a very large number of these. Each thread is submitted to the kernel, which submits them to its own internal scheduler. This scheduler is responsible for paging the memory state (including the call stack and program counter) of the thread into the processor cache and registers, then requesting that the processor perform a given sequence of instructions, before interrupting that sequence once again, paging the register and cache state back out into memory and repeating the process with some other thread which has been patiently waiting its turn.

The kernel is very much in control of this process, dictating the timing of interrupts and controlling and optimizing the data being paged into and out of the processor state, even to the point of rewriting register accesses and massaging thread state so as to avoid the most expensive elements of this whenever possible.

The result is, once again, an illusion presented to the developers of user-land applications. The illusion is that threads operate entirely in parallel and have their own independent register space, call stack, cache lines, and so on. In reality, only n threads are actually executing simultaneously (where n is the number of processors), and all of them are sharing the registers and cache space within the processor, but these details are entirely invisible to application developers, and for the most part, they really aren't relevant.

This is also the first layer at which the concept of blocking becomes meaningful (I'm not counting "turning off your computer" as blocking). When a thread performs a hardware instruction which involves considerable latency on results, the kernel will deschedule that thread until the underlying hardware is able to provide the required data. An easy example of this is writing to a network socket, an operation which takes a considerable amount of time due to the requisite interactions with the I/O bus, networking hardware, and such. When such instructions are evaluated, the kernel preempts the thread and removes it from its assigned CPU until the underlying hardware is ready to continue. At this point, the kernel once again makes the thread available to the processor scheduler, and the cycle continues.

From the perspective of the kernel, nothing is ever really "blocked". After all, the processors are still there, and still doing things. It's not like there are electrons stuck in suspended animation within silicon pathways! However, from the perspective of the user-land application developer, the thread has been blocked in its execution, because it didn't make any progress within its sequence until some later point in time.

This is an incredibly potent idea: "blocking" at a higher level of abstraction is nothing more than "descheduling" at a lower level. We repeat this trick once again within user space to create fibers.

Threads

Threads are the most basic unit of parallelism with which most developers are quite familiar. We can request that the kernel create a very large number of them (certainly far more of them than we have physical CPUs!), and they all maintain independent memory space. We never have to worry about variables which are within the lexical scope of one thread leaking over into another thread, randomly. Even just reading that sentence feels weird to us, so complete is the illusion that the kernel has created.

However, threads are still not a particularly good unit of abstraction for writing programs, particularly when those programs are trying to achieve very high-level goals, such as handling and responding to hundreds of thousands of incoming HTTP requests per second.

One serious problem is that threads are very heavyweight. The kernel attaches a lot of state to threads, simply as a function of the way that scheduling and paging works. Additionally, higher level runtimes (such as the JVM) attach their own state to threads. Most notably, the JVM treats every thread as a garbage collection root, meaning that the GC will scan the heap starting from each thread individually whenever it attempts to free unreferenced memory. The more roots you have, the more work the GC needs to do in each cycle, and thus the slower everything gets.

Thus, while we can certainly have a lot more threads than processors, we cannot have anywhere near as many threads as HTTP requests. Another layer of abstraction is required in this tower.

The IO monad implements a user-space scheduler which provides this layer of abstraction. Each fiber is made up of a sequence of actions, where each action is either synchronous (delay) or asynchronous (async). Synchronous actions are defined using values of type () => A (where A is the result), while asynchronous actions are defined using callback registrations of type (Either[Throwable, A] => Unit) => Unit (where A is the result). These asynchronous actions are the most interesting part.
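
Concretely, the two shapes look something like this (a sketch which reuses the callback-based connect and remoteAddr from the top of this gist):

import cats.effect.IO

// a synchronous action: once the thunk starts running, IO cannot regain
// control until it produces its A
val readClock: IO[Long] = IO.delay(System.nanoTime())

// an asynchronous action: we start the activity, register a callback, and
// hand control back to the runtime immediately
val connectAction: IO[Connection] =
  IO.async_ { (cb: Either[Throwable, Connection] => Unit) =>
    connect(remoteAddr)(conn => cb(Right(conn)))
  }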

A synchronous action runs to completion (producing an A) and cannot be controlled by IO. Once you start a synchronous action (by calling the function of type () => A), you cannot reassert control until the value (A) is returned to you. This is quite different from an asynchronous action, where you request that some activity is started and register a callback which will be invoked when that activity is complete. In such a model, control is returned back to the runtime (IO) as soon as the process of starting the activity is complete. Think of this like the difference between Thread#start() and Runnable#run().

Most things within the IO runtime are modeled in terms of callbacks. This modeling gives the IO runtime full control over how dependent sequences of callbacks can be scheduled and run on the underlying threads. Thus, IO takes care of the details of allocating and assigning kernel threads. Fibers are scheduled on these carrier threads for a period of time, until they either yield their control entirely (such as when making a network call) or when the runtime decides that the fiber has been running for too long and the others need a turn (in the case of auto-yielding).

Within the higher-level model of the fiber, users don't need to know or care how many threads are involved in running their program (much less the number of CPUs!). They can simply allocate as many fibers as necessary to conform to their task (e.g. one per HTTP request), and the IO runtime takes care of optimally distributing work over the underlying carrier threads.

As with the kernel and the illusion of blocking a thread, blocking a fiber is also an illusion created by the IO runtime. When a fiber performs some asynchronous action which takes a long time to return (such as a network call), that fiber will be taken off of the carrier thread and some other fiber will be allowed to execute. When the callback is finally invoked (i.e. the network call returns with some value), IO will place the fiber back onto the scheduling queue to be picked up by a carrier thread as soon as one is available.
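
A small sketch of how this looks from the user's perspective (the sleep here stands in for any long-running asynchronous action, such as a network call):

import scala.concurrent.duration._
import cats.effect.{IO, IOApp}

object SemanticBlocking extends IOApp.Simple {
  val run: IO[Unit] =
    for {
      // this fiber is descheduled for a second; no carrier thread is tied
      // up while it waits
      sleeper <- (IO.sleep(1.second) *> IO.println("woke up")).start
      // meanwhile, other fibers are free to run on the same carrier threads
      _       <- IO.println("doing other work in the meantime")
      _       <- sleeper.join
    } yield ()
}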

In practice, this is considerably simpler than it seems. Carrier threads simply live within a thread pool, and "placing the fiber back onto the scheduling queue" is usually no more complex than calling the equivalent of pool.execute(() => fiber.continue(result)) (where result is the value which was passed to the callback).
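
As a toy illustration only (the fiber type and its continue method are hypothetical stand-ins for the runtime's internal machinery):

import java.util.concurrent.{Executors, ExecutorService}

// hypothetical: the runtime's internal handle for a suspended fiber
final class SuspendedFiber {
  def continue(result: Any): Unit = () // run the fiber's next stage here
}

val carriers: ExecutorService =
  Executors.newFixedThreadPool(Runtime.getRuntime.availableProcessors())

// the callback handed to the asynchronous action: when it fires, resuming
// the fiber is just a matter of re-submitting it to the carrier thread pool
def resume(fiber: SuspendedFiber)(result: Any): Unit =
  carriers.execute(() => fiber.continue(result))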

Conclusion

The point to all of this abstraction is to smooth out the details of fiddling with electrical potentials and gates all the way up to the point where we can reasonably think about how to properly handle hundreds of thousands of simultaneous network requests, or parallelize over data which has millions of independent units, or coordinate complex state problems without worrying about inefficiency. We can even continue layering further levels of abstraction on top of fibers, such as the sequential streams model provided by libraries like Fs2 and Monix.

The important takeaway is simply this: at each layer of abstraction, sequencing and blocking are illusions created by scheduling within the previous layer. We don't really have to think about this 99% of the time, and that's exactly the point of the abstraction. By accepting the illusion, we are given a programming model which matches our intuitions about how the runtime should behave without imposing an impedance mismatch with our problem domain (such as the fact that we are not allowed to have as many threads as we would expect to have HTTP requests in a typical service at scale).
