@djspiewak
Last active January 26, 2024 15:38

What follows are some of my (very) rough thoughts on what we can and should do with respect to CPS transformation in Scala at the language level. I'll try to start with some motivation behind my thinking, as well as some rambling observations on the nature of the problem space, but don't expect too much coherence here. :-)

The Problem

Async programming is hard.

Okay let's actually be more specific than that. High-performance I/O is hard. Signal multiplexing is a powerful technique for achieving high(er) performance I/O, particularly network I/O, but the tradeoff is that, in order to utilize it, the user-space programming model must allow for suspension and resumption of sequential continuations (often called "fibers" or "coroutines"). Achieving this type of programming model without significant tradeoffs in usability is the exceptionally hard part.

If that wasn't bad enough though, these problems are inextricably intertwined with another set of problem spaces which are, themselves, very difficult. In no particular order:

  • m:n work scheduling
  • Resource management and interruption
  • Timers
  • Fairness

That's at the lowest level. If you move up one more rung on the ladder you will need to answer questions about error propagation, tracing, composability, and much more.

All of this is to say that there are two broad categories that we are attempting to solve for: usability of a user-space programming model involving suspensions and continuations, and functionality of the underlying runtime system.

Monads are Fundamental, Syntax is the Problem

I've seen quite a few assertions from folks saying something like the following (slight paraphrase):

There are a lot of us who do not believe that monads are the answer.

First off, good! Question the status quo, always.

Secondly though, let's take a step back and remember what monads are. Fundamentally, monads are nothing more than runtime-dependent sequential computation. In other words, they're the same basic primitive as semicolon (;) in C-like languages. It is definitely incorrect to say that runtime-dependent sequential computation is not "the answer", since sequential computation is fundamentally at the heart of everything we do. In a sense, claiming that monads are not "the answer" is a bit like claiming that computer programming is not "the answer". I'm curious as to what the question is, in that case.

More concretely, any solution to the problem of user-land asynchronous programming is going to require some sort of control over the nature of sequential computation. This is just fundamental. For example:

foo(); bar(); baz();

If we imagine that bar is an asynchronous I/O operation, it follows that the sequentialization between it and baz (and possibly also between it and foo) must be different than the runtime primitive sequencing built into the CPU (and subsequently expressed in the JVM). Control over sequentialization is really nothing more or less than the essence of monads, which in turn implies that monads are going to be involved somewhere, whether we see them or not.
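To make that concrete, here is a minimal sketch of what taking control over sequentialization looks like, assuming Cats Effect's IO and hypothetical signatures in which foo and baz are synchronous while bar is asynchronous:

import cats.effect.IO

def foo(): Unit = ???
def bar(): IO[Unit] = ???   // an asynchronous I/O operation
def baz(): Unit = ???

// the two semicolons become two flatMaps: sequencing is now explicit and
// under the control of the IO runtime, rather than baked into the CPU
IO(foo()).flatMap(_ => bar()).flatMap(_ => IO(baz()))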

That last bit is important: whether we see them or not. There's nothing about the monadic formulation which compels us to write >>= or flatMap or for or any of that. It's really more about the way we model the computation and less about the syntax we use to manipulate that model, and this is actually a very good thing.

The reason this is a very good thing is it saves us an enormous amount of effort as we consider the possible solution space. We already have extremely mature and robust solutions to all of the problems associated with asynchronous programming in terms of monadic interfaces. These solutions work very well, are performant, are proven in production at massive scale in many companies over a long period of time, and have given rise to an emergent and growing composable ecosystem. I don't think anyone has objected to any of these properties.

What everyone seems to (justifiably) object to is the syntax used to compose programs written in a monadic style. It's very foreign, particularly to developers who were trained in and are accustomed to languages which have a direct and imperative syntax. Even relatively shallow syntactic approaches such as do-notation or for-comprehensions are insufficient to mask how very much other this style of programming is.
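For example, even with for-comprehension sugar, the three-statement program above (reusing the hypothetical foo/bar/baz signatures) reads very differently from its imperative counterpart:

for {
  _ <- IO(foo())   // every statement must be lifted and explicitly bound
  _ <- bar()
  _ <- IO(baz())
} yield ()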

This is the essence of the objection "monads are not the answer". It has nothing to do with the monads themselves, and in fact any alternative solution within this space will simply end up reinventing monads under the surface. So instead of focusing on the monadic construct as the problem to solve, we should focus on solving the interaction model (which is to say, the syntax) which users employ when building programs in this fashion.

Put more succinctly: the goal of this proposal is to describe how we can build a richer and more comfortable syntax around monadic types, rather than trying to replace these types altogether. The goal is to be extensible, supporting the rich variety of effect types which commonly occur in Scala programs today. Performance optimization is considered to be a secondary goal, which will be mentioned more at the end.

Desired Syntax

I think everyone can agree that direct syntax is the most obvious way to construct sequential programs. Not to be too reductive, but if what you are trying to express is foo() followed by bar() followed by baz(), then there is effectively nothing which will be as concise, familiar, or easy to read as the following:

foo()
bar()    // imagine this is an asynchronous operation
baz()

This is entirely possible! However, syntax like this gives up one of the largest advantages provided by effect systems: higher-level composability. All major effect systems provide powerful and composable functionality on top of the effect, such as Cats Effect's race, guarantee, start, timeout, and such. With an entirely implicit "auto-coloring" approach, we lose access to this type of functionality since any value of type IO[A] would get silently coerced to an A. To make matters worse, this kind of implicit approach would run into unavoidable ambiguities with computations of type IO[IO[A]] and similar. Thus, rather than implicitly auto-coloring, most languages with this type of feature use something more like the following:

foo()
bar().await
baz()
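
This explicit marker is also exactly what preserves the composability discussed above: since bar() still denotes a first-class effect value up until the .await, the full combinator vocabulary remains available. A small sketch, assuming cats-effect-cps style syntax and IO's timeout combinator:

import scala.concurrent.duration._

foo()
bar().timeout(5.seconds).await   // combinators still compose before the .await
baz()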

(note: Unison is notable as a language which does not require await-style syntax for its CPS transformation. Instead, it uses fully type-directed auto-coloring. This is only possible because of its richer type reconstruction algorithm, which in turn is enabled by the fact that its type system is considerably less powerful than Scala's. For these reasons, I would not consider Unison's approach to be viable within Scala without a pretty radical reworking of the language)

More concretely, await represents a "method" on any F[A] where F is an "effect type" (more on this definition below) which "returns" the A or throws an exception (if the effect produces an error rather than a result). Of course, it doesn't actually await the results of the computation, which may require a long period of time, and instead it transforms the overall scope into continuation-passing style. Something like the following, for example (using Cats Effect types for now):

Sync[F].delay(foo()) flatMap { _ =>
  bar() flatMap { _ =>
    Sync[F].delay(baz())
  }
}

As a minor digression, the advantages of this style of syntax become much more clear when used within compound expressions. Consider:

val x = foo().await + bar().await
baz(x).await

This would transform into the following:

foo() flatMap { fooResult =>
  bar() flatMap { barResult =>
    val x = fooResult + barResult
    baz(x)
  }
}

Prior Art

Making things more concrete, this type of syntax is already supported by projects like dotty-cps-async and its Cats Effect specific wrapper, cats-effect-cps. This idea also isn't particularly new, and projects such as Monadless have been exploring this space for quite some time. dotty-cps-async is notable primarily for the fact that it supports a much broader range of transformations and strictly avoids incidentally changing the meaning of constructs in fundamental ways (such as order of operation assumptions).

All of these adapters share a common root idea: within some macro-enabled syntactic block, usually denoted by a higher-order async method, transform all constructs into something encoded in terms of monads and monadic extensions. Monadless achieves this by assuming a particular virtual syntactic interface (similar to how for-comprehensions are desugared), while dotty-cps-async relies on a type-directed transformation backed by a granular typeclass hierarchy (see here).
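To give a feel for the type-directed approach, the following is a simplified sketch of a granular typeclass hierarchy in the spirit of dotty-cps-async's (not the library's actual interface): plain sequencing requires only monadic operations, while constructs such as try/catch or callback-based awaits demand progressively richer capabilities.

// simplified sketch, in the spirit of (but not identical to) dotty-cps-async
trait CpsMonad[F[_]]:
  def pure[A](a: A): F[A]
  def flatMap[A, B](fa: F[A])(f: A => F[B]): F[B]

// required when the transformed block contains try/catch/finally
trait CpsTryMonad[F[_]] extends CpsMonad[F]:
  def error[A](e: Throwable): F[A]
  def handleError[A](fa: F[A])(f: Throwable => F[A]): F[A]

// required to adopt callback-style asynchronous results into F
trait CpsAsyncMonad[F[_]] extends CpsTryMonad[F]:
  def adoptCallbackStyle[A](source: (Either[Throwable, A] => Unit) => Unit): F[A]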

Note that Scala 2's -Xasync is somewhat novel in that it provides a lower-level compiler transformation which can be utilized by user-defined macros, and those macros are in turn able to utilize whatever adapter interface they require. This has some benefits, though in practice it is a very limited approach with some significant drawbacks, particularly when dealing with effect types which are richer than Future in their evaluation semantics.

In general, from a functional standpoint, I believe that dotty-cps-async represents the most advanced implementation of this idea within Scala today. It does have some notable limitations, which I believe can be overcome with direct integration support from the underlying language, but it is easily 80-90% of the way to a system which works and integrates well with the language as it stands today.

Interaction with Implicit Capabilities

It is worth noting that dotty-cps-async (and cats-effect-cps) are both designed from the assumption that the primary mechanism for encoding functionality should be implicit capabilities. This technique goes by a large number of different names, including "capability traits", "tagless", and "MTL style" (among others), but ultimately it just boils down to expressing the functionality of some type within a lexical scope by leveraging implicit values and parameters. The Cats Effect hierarchy gives some idea of the different levels of hierarchical functionality which can be exposed, where the transformation of some constructs (such as await within while) requires more advanced capabilities than simpler constructs (such as variable assignment).
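As a brief illustration of this style (using Cats Effect's actual typeclasses, though the program logic here is hypothetical), each implicit constraint grants the lexical scope exactly the capabilities it requires:

import cats.effect.kernel.{Sync, Temporal}
import cats.syntax.all._
import scala.concurrent.duration._

// Sync grants the ability to suspend side effects within this scope
def greet[F[_]: Sync]: F[Unit] =
  Sync[F].delay(println("hello"))

// Temporal additionally grants time-based capabilities such as sleep
def greetLater[F[_]: Sync: Temporal]: F[Unit] =
  Temporal[F].sleep(1.second) >> greet[F]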

In a very real sense, this is effectively the same as Martin's current thinking around the future of effects within Scala, with the major difference being that Cats and Cats Effect expose a more comprehensive set of capabilities than have been considered by Martin's proposals thus far. In an ideal world, all three of these efforts will converge, with Martin's research leading to better type inference and expressiveness around implicit capabilities, CPS transformations (such as dotty-cps-async) moving into the core compiler, and mature runtimes like Cats Effect solving the substrate runtime layer.

Resolving Limitations

The primary limitation of transformations such as dotty-cps-async is the macro-based lexical block. Expanding the above example slightly:

async[IO] {
  foo()
  bar().await
  baz()
}

The above code runs today with cats-effect-cps, but the need for the outer async block is irksome. This is ultimately what triggers the CPS transformation on the block passed to it, with the results being produced as an IO[A] (where A is the type produced by the expression baz()). This style composes quite well with the language and its ecosystem today, but I think we can do a little better.

The lexical limitation becomes more apparent when we consider a program involving multiple functions, each of which has some async functionality:

def a: IO[A] =
  async[IO] {
    b.await + c.await
  }

def b: IO[Int] = 
  async[IO] { 
    foo()
    bar().await
    baz()
  }

def c: IO[Int] =
  async[IO] {
    other().await * thing().await
  }

Dipping "in and out" of async mode is somewhat annoying. In a sense, this whole program should be async, but we have no way of directly expressing this. I believe this is the limitation of macro-based approaches which can be resolved through direct compiler support. Something like the following could be one solution:

def a: IO[A] =
  async[IO] {
    b.await + c.await
  }

def b: IO[Int] @async = { 
  foo()
  bar().await
  baz()
}

def c: IO[Int] @async = {
  other().await * thing().await
}

Better! Note that this would probably still result in the same bytecode as the explicit async form, so as to avoid breaking separate compilation, but the syntax itself has been streamlined. We could also take a number of different approaches here, including an @async annotation on the method itself (rather than the return type).

Critically, this continues to interoperate smoothly with the rest of the language and pre-existing ecosystems built around types like Future and IO.

Can we go further? Consider something like the following:

def a: IO[A] =
  async[IO] {
    b.await + b.await
  }

def b: Int @async = { 
  foo()
  bar().await
  baz()
}

def c: Int @async = {
  other().await * thing().await
}

The elimination of the "wrapping" IO on b and c is pleasing, but it's not entirely clear how this would work. A major challenge in Scala is the fact that this type of transformation must be compatible with a large set of underlying runtime types, including Future, IO, ZIO, and more. Assuming it must be just one of these types would create significant churn within the ecosystem, in addition to simply overlooking the complexities associated with the underlying runtime system (as noted in the preamble). For example, Future is considerably less robust and much less performant than IO, but IO is not part of the standard library; a large set of libraries and frameworks work in terms of the former, while a disjoint set works in terms of the latter.

Ultimately, it is probably preferable to avoid creating "too much" magic, and instead rely on explicit return types. Making the async syntax itself significantly more lightweight by binding it to the function scope (through the use of the @async annotation) is a significant improvement in ergonomics and brings this type of CPS transformation into line with the approach taken in other languages such as Rust and JavaScript.

Performance Considerations

Given that the compiler will have some understanding of the nature of these types of computations, is it possible to improve on the performance of existing effect systems? Kotlin's coroutines suggest that the answer to this should be "yes", but the details are subtle.

For starters, it is important to understand where systems such as Cats Effect sacrifice performance. In particular, both Cats Effect and ZIO are entirely allocation-bound. These systems can be thought of as three distinct layers:

  1. A surface API which captures intended program behavior as objects on the heap (e.g. new IO.Delay(...))
  2. A coroutine interpreter which provides stack safety, resource safety, error handling, and asynchronous continuations
  3. An m:n thread scheduler which distributes work from the previous layers across an optimal number of kernel threads

These layers are not entirely cleanly delimited, as there is some coupling and interaction between the second and third for performance reasons, but in broad strokes, this is how things work.
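To illustrate why the first layer is allocation-bound, here is a loose sketch (emphatically not Cats Effect's actual internals) of how a reified effect type captures program structure as objects on the heap:

// every combinator call allocates a node describing the computation;
// the layer-2 interpreter later walks this structure to evaluate it
sealed trait IO[+A]:
  def flatMap[B](f: A => IO[B]): IO[B] = IO.FlatMap(this, f)

object IO:
  def delay[A](thunk: () => A): IO[A] = Delay(thunk)

  final case class Delay[A](thunk: () => A) extends IO[A]
  final case class FlatMap[A, B](source: IO[A], f: A => IO[B]) extends IO[B]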

For reference, the second layer is the only one replaced by the Project Loom "virtual threads" enhancement to the JVM in Java 19. However, in replacing the second layer, Loom also forces a specific choice in the third layer (you must use the fork-join thread pool), and due to some prescriptive API choices, makes it much more difficult to provide a compositional solution in the first layer. CPS transformations such as async/await either supplement the first layer with direct syntax (as is the case with dotty-cps-async) or replace the first and second layer altogether (as is the case with Kotlin coroutines), which often causes some unintended consequences.

Ironically, the second layer (the coroutine interpreter) is objectively not the performance bottleneck in these systems. Significant work has been done to optimize the encoding and interpretation within this layer of all three major runtimes, and given the JIT-compiled assembly which is emitted by these interpreter loops, it is not at all clear that a direct implementation within the compiler (or runtime, for that matter) can do materially better.

In general, most of the performance cost associated with effect systems is in the first or third layers. The third layer is where costs such as page faults and contention are paid, while the first layer is where memory pressure, allocation costs, and GC load are generated. There is an important distinction here though: while the third layer can be materially improved (and has been, particularly in more recent releases of Cats Effect and ZIO), the first layer cannot.

There is another distinction here, as well, which is that the costs associated with page faults and contention tend to be about an order of magnitude worse than the costs associated with memory pressure and allocation, so in a sense, the performance cost of the first layer, while unfortunately unavoidable, is also simply less meaningful than the performance cost of the third layer. Again, ironically, the first layer is also the one most easily measured in microbenchmarks, and so it gets the majority of the attention in these types of conversations.

The problem with the first layer above is the fact that it is tied to the API. For example, the mechanism by which we compose two IOs sequentially is flatMap, which in turn promises an IO as a result. The original two IOs can be recomposed with other IOs, and the IO produced from flatMap may be manipulated in countless ways through other mechanisms. This API is very powerful and composable, but it also promises something much stronger than what is needed in all cases: it promises IO values which are manipulable.

Concretely:

val a: IO[A] = ???
val b: IO[B] = ???
val c: IO[B] = a.flatMap(_ => b)
val d: IO[A] = b.flatMap(_ => a)
val e: IO[B] = d.flatMap(_ => c)

Each of these individual IOs is a first-class value, at parity in capability with the others. This capability can be seen in the definitions of d and e, which recompose a, b, and even d and c in novel ways to produce a new program just as valid as the previous ones. From a functional programming standpoint this makes tremendous sense, and as a principle, it is at the root of much of the power of modern effect systems.

It is, however, also the origin of all of the first layer's performance costs. Consider the following imperative alternative to the above:

val a = ???
val b = ???
val c = ???

We cannot even express an equivalent to d or e in the earlier snippet, because the code associated with the definitions of a, b, and c is not reified as a value on the heap. Thus, direct style is fundamentally less expressive, but that reduction in expressiveness also allows us to elide some costs. Specifically, it is possible to avoid the intermediate allocations (a and b) since they cannot be manipulated as such, and instead focus on c.

Another way of looking at this is in terms of for-comprehensions:

for {
  a <- foo
  b <- bar(a)
  c <- baz(b)
} yield c

Why do we need the results of bar(a) as a value? By that I don't mean b, I mean the IO[B] which is subsequently flatMapped to produce b. We don't. Its sole purpose is to serve as a vehicle for evaluation. Thus, if we have a structural guarantee within the program that the lack of reification of the bar(a) value is invisible to the user, we could safely remove that allocation altogether.
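For reference, the comprehension desugars into the following, which makes those purely-instrumental intermediate values visible:

foo.flatMap { a =>
  bar(a).flatMap { b =>   // this IO[B] is never observed as a value...
    baz(b).map(c => c)    // ...it exists solely to drive evaluation
  }
}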

This is some of the intuitive motivation behind recent experiments in opaque macro-based encodings (see: Flavio Brasil).

Unfortunately, there are some serious tradeoffs associated with this type of thing. Because there would be no intermediate value representing bar(a) in the above, there is also no mechanism through which the second and third layers of the runtime can preempt, interrupt, or otherwise control execution. This is problematic primarily because of fairness, but also because of resource leaks.

Fairness is a significant concern of any coroutine runtime since it is very common for a single coroutine to run for quite a long period of time while many other coroutines are waiting to execute. The number of coroutines within a process is typically on the order of hundreds of thousands or even millions, while the number of kernel threads is often in the single or double digits. Newly allocated coroutines must be able to get time on the scheduler even while previously allocated long-running coroutines are still outstanding. Without this property, systems at resource saturation will become non-responsive, introducing jitter, latency, and error into end-to-end metrics.

Cats Effect and ZIO (and Future with the batching executor) all resolve this problem by allowing any given coroutine a fixed number of steps (default: 1024) before it is compelled to yield control to other outstanding work items. If there are no outstanding work items, the coroutine is allowed to pick up where it left off. Otherwise, it goes to the back of the queue and others get their turn. These steps are, intuitively, flatMap operations. The higher the granularity of flatMaps, the sooner the coroutine will yield, the more rapidly it will observe outside signals (such as cancelation), and generally the better everything will work.
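A highly simplified sketch of this auto-yielding mechanism, modeling a fiber as a plain iterator of steps (no real runtime is structured this way):

import scala.collection.mutable

final class Scheduler(budget: Int = 1024):
  private val queue = mutable.Queue.empty[Iterator[Unit]]

  def submit(fiber: Iterator[Unit]): Unit = queue.enqueue(fiber)

  def run(): Unit =
    while queue.nonEmpty do
      val fiber = queue.dequeue()
      var steps = 0
      while steps < budget && fiber.hasNext do
        fiber.next()          // one flatMap boundary = one step
        steps += 1
      if fiber.hasNext then
        queue.enqueue(fiber)  // budget exhausted: back of the queue

Even a fiber with millions of remaining steps can only monopolize a worker for one budget's worth of work before everyone else gets a turn.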

This property is defeated by any mechanism which naively removes flatMaps, such as in the example earlier with a and b. If we simply say that there is no allocation until c, then the whole thing counts as a single step, rather than three separate steps. Multiply this by the scope of an entire program and it represents a significant degradation in the runtime's ability to enforce fairness across outstanding work. This is a problem which has already caused significant discussion around early builds of Project Loom (particularly since Thread.yield is not implemented for virtual threads), and it has a serious impact on the responsiveness of systems built with runtimes like Kotlin's suspend functions.

None of this is to say that we should abandon CPS optimization! However, the transformation must be performed with some care to ensure that step tracking (and ideally also other mechanisms such as tracing and granular interruption) are preserved.

Next Steps

Ultimately, I do not believe that CPS transformation support in Scala should seek to supplant effect systems. Instead, it should be focused on solving the actual problem cited by users of these systems: the lack of direct syntax. Providing a reliable, type-directed syntactic transformation of programs written in a direct style has the potential to allow users to access the power of effect systems without corrupting their ability to evaluate and optimize execution, and doing so in a fashion which preserves separate compilation ensures interoperation and composition with existing mature ecosystems built up around these types.

The right place to start is with the typeclass hierarchy and transformation calculus of dotty-cps-async. This hierarchy is strongly inspired by Cats and Cats Effect, so that would also represent a good source of prior art. Either way, the observation is that we can characterize the capabilities of effect types and build machinery in terms of those capabilities. This in turn is the essence of Martin's proposal for effect capabilities in Scala.

Longer term, we can explore extensions of this approach (such as a batched alternative to flatMap) which allow for optimization of "hidden" intermediate stages within transformed direct-style blocks, but performance data suggest that this is not an urgent need and the greater wins are syntactic rather than semantic in nature.

@armanbilge

(Following from scala/scala3#16550 (comment))

> I never understood clearly the meaning of 'bind' for instance

@ScalaWilliam following the idea that monads are just an encoding of ;, perhaps this example helps:

def foo(x: Int): F[A] = ???
foo(42).flatMap(a => bar(a))

// vs

def foo(x: Int): A = ???
// bind to a
val a = foo(42); bar(a)

@ScalaWilliam

@armanbilge I guess bind is not intuitive because you'd say bind <something> to a, but when you read it in the code it is usually written bind a = <something>. JS's .then() or Scala's .andThen() are quite a bit more intuitive.

@djspiewak I'd consider not using .await unless it's explicitly in async contexts. It also sounds like a command (as it's a verb), unlike ThoughtWorks' each, which is more neutral and more applicable to Option, which is also a monad.

@robmwalsh

> @armanbilge I guess bind is not intuitive because you'd say bind <something> to a, but when you read it in the code it is usually written bind a = <something>. JS's .then() or Scala's .andThen() are quite a bit more intuitive.

Whenever I see "bind" I think "from", so (in a for comprehension) I read
b <- foo(a)
As "b from foo(a)"
Perhaps it's not entirely right but it's served me fairly well.

If you're actually writing out flatmaps, I think the nesting becomes the bigger source of the problem.
