Concurrency in Swift: One approach

Author: Chris Lattner

Introduction

This document is published in the style of a "Swift evolution manifesto", outlining a long-term view of how to tackle a very large problem. It explores one possible approach to adding a first-class concurrency model to Swift, in an effort to catalyze positive discussion that leads us to a best-possible design. As such, it isn't an approved or finalized design prescriptive of what Swift will end up adopting. It is the job of public debate on the open source swift-evolution mailing list to discuss and iterate towards that ultimate answer, and we may end up with a completely different approach.

We focus on task-based concurrency abstractions commonly encountered in client and server applications, particularly those that are highly event driven (e.g. responding to UI events or requests from clients). This does not attempt to be a comprehensive survey of all possible options, nor does it attempt to solve all possible problems in the space of concurrency. Instead, it outlines a single coherent design thread that can be built over the span of years to incrementally drive Swift to further greatness.

Concurrency in Swift 1...4

So far, Swift has been carefully designed to avoid most concurrency topics, because we specifically did not want to cut off any future directions. Instead, Swift programmers use OS abstractions (like GCD, pthreads, etc.) to start and manage tasks. The design of GCD and Swift's trailing closure syntax fit well together, particularly after the major update to the GCD APIs in Swift 3.

While Swift has generally stayed away from concurrency topics, it has made some concessions to practicality. For example, ARC reference count operations are atomic, allowing references to classes to be shared between threads. Weak references are also guaranteed to be thread atomic, Copy-On-Write (🐮) types like Array and String are sharable, and the runtime provides some other basic guarantees.

Goals and non-goals of this manifesto

Concurrency is a broad and sweeping concept that can cover a wide range of topics. To help scope this down a bit, here are some non-goals for this proposal:

  • We are focusing on task-based concurrency, not data parallelism. This is why we focus on GCD and threads as the baseline, while completely ignoring SIMD vectorization, data-parallel for loops, etc.
  • In the systems programming context, it is important for Swift developers to have low-level opt-in access to something like the C or C++ memory consistency model. This is definitely interesting to push forward, but is orthogonal to this work.
  • We are not discussing APIs to improve existing concurrency patterns (e.g. atomic integers, better GCD APIs, etc).

So what are the actual goals? Well, because it is already possible to express concurrent apps with GCD, our goal is to make the experience far better than it is today by appealing to the core values of Swift: we should aim to reduce the programmer time necessary to get from idea to a working and efficient implementation. In particular, we aim to improve the concurrency story in Swift along these lines:

  • Design: Swift should provide (just) enough language and library support for programmers to know what to reach for when concurrent abstractions are needed. There should be a structured "right" way to achieve most tasks.
  • Maintenance: The use of those abstractions should make Swift code easier to reason about. For example, it is often difficult to know what data is protected by which GCD queue and what the invariants are for a heap based data structure.
  • Safety: Swift's current model provides no help for race conditions, deadlock and other concurrency problems. Completion handlers can get called on a surprising queue. These issues should be improved, and we would like to get to a "safe by default" programming model.
  • Scalability: Particularly in server applications, it is desirable to have hundreds of thousands of tasks that are active at a time (e.g. one for every active client of the server).
  • Performance: As a stretch goal, it would be great to improve performance, e.g. by reducing the number of synchronization operations performed, and perhaps even reducing the need for atomic accesses on many ARC operations. The compiler should be aided by knowing how and where data can cross task boundaries.
  • Excellence: More abstractly, we should look to the concurrency models provided by other languages and frameworks, and draw together the best ideas from wherever we can get them, aiming to be better overall than any competitor.

That said, it is absolutely essential that any new model coexists with existing concurrency constructs and existing APIs. We cannot build a conceptually beautiful new world without also building a pathway to get existing apps into it.

Why a first class concurrency model?

It is clear that the multicore world isn't the future: it is the present! As such, it is essential for Swift to make it straight-forward for programmers to take advantage of hardware that is already prevalent in the world. At the same time, it is already possible to write concurrent programs: since adding a concurrency model will make Swift more complicated, we need a strong justification for that complexity. To show opportunity for improvement, let's explore some of the pain that Swift developers face with the current approaches. Here we focus on GCD since almost all Swift programmers use it.

Asynchronous APIs are difficult to work with

Modern Cocoa development involves a lot of asynchronous programming using closures and completion handlers, but these APIs are hard to use. This gets particularly problematic when many asynchronous operations are used, error handling is required, or control flow between asynchronous calls is non-trivial.

There are many problems in this space, including the "pyramid of doom" that frequently occurs:

func processImageData1(completionBlock: @escaping (_ result: Image) -> Void) {
    loadWebResource("dataprofile.txt") { dataResource in
        loadWebResource("imagedata.dat") { imageResource in
            decodeImage(dataResource, imageResource) { imageTmp in
                dewarpAndCleanupImage(imageTmp) { imageResult in
                    completionBlock(imageResult)
                }
            }
        }
    }
}

Error handling is particularly ugly, because Swift's natural error handling mechanism cannot be used. You end up with code like this:

func processImageData2(completionBlock: @escaping (_ result: Image?, _ error: Error?) -> Void) {
    loadWebResource("dataprofile.txt") { dataResource, error in
        guard let dataResource = dataResource else {
            completionBlock(nil, error)
            return
        }
        loadWebResource("imagedata.dat") { imageResource, error in
            guard let imageResource = imageResource else {
                completionBlock(nil, error)
                return
            }
            decodeImage(dataResource, imageResource) { imageTmp, error in
                guard let imageTmp = imageTmp else {
                    completionBlock(nil, error)
                    return
                }
                dewarpAndCleanupImage(imageTmp) { imageResult, error in
                    guard let imageResult = imageResult else {
                        completionBlock(nil, error)
                        return
                    }
                    completionBlock(imageResult, nil)
                }
            }
        }
    }
}

Partially because asynchronous APIs are onerous to use, there are many APIs defined in a synchronous form that can block (e.g. UIImage(named: ...)), and many of these APIs have no asynchronous alternative. Having a natural and canonical way to define and use these APIs will allow them to become pervasive. This is particularly important for new initiatives like the Swift on Server group.

What queue am I on?

Beyond being syntactically inconvenient, completion handlers are problematic because their syntax suggests that they will be called on the current queue, but that is not always the case. For example, one of the top recommendations on Stack Overflow is to implement your own custom async operations with code like this (Objective-C syntax):

- (void)asynchronousTaskWithCompletion:(void (^)(void))completion;
{
  dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^{

    // Some long running task you want on another thread

    dispatch_async(dispatch_get_main_queue(), ^{
      if (completion) {
        completion();
      }
    });
  });
}

Note how it is hard coded to call the completion handler on the main queue. This is an insidious problem that can lead to surprising results and bugs like race conditions. For example, since a lot of iOS code already runs on the main queue, you may have been using an API built like this with no problem. However, a simple refactor to move that code to a background queue will introduce a really nasty problem where the code will queue hop implicitly - introducing subtle undefined behavior!

There are several straight-forward ways to improve this situation like better documentation or better APIs in GCD. However, the fundamental problem here is that there is no apparent linkage between queues and the code that runs on them. This makes it difficult to design for, difficult to reason about and maintain existing code, and makes it more challenging to build tools to debug, profile, and reason about what is going wrong, etc.
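As a hedged illustration of the "better APIs" option, here is a minimal Swift sketch in which the caller names the queue that the completion handler runs on, rather than the API silently hopping to the main queue. The function name asynchronousTask, its completionQueue parameter, and the queue label are assumptions made up for this example, not an existing API; only standard Dispatch calls are used.

```swift
import Dispatch

// Hypothetical helper: the caller chooses the queue the completion handler
// runs on, so the queue hop is visible at the call site.
func asynchronousTask(completionQueue: DispatchQueue,
                      completion: @escaping (Int) -> Void) {
    DispatchQueue.global().async {
        // Stand-in for some long-running task you want on another thread.
        let result = (1...100).reduce(0, +)
        completionQueue.async {
            completion(result)
        }
    }
}

// Usage: ask for the result on a specific (hypothetical) worker queue.
let workerQueue = DispatchQueue(label: "example.worker")
let done = DispatchSemaphore(value: 0)
var observed = 0

asynchronousTask(completionQueue: workerQueue) { value in
    // Traps if the handler is ever invoked on an unexpected queue.
    dispatchPrecondition(condition: .onQueue(workerQueue))
    observed = value
    done.signal()
}
done.wait()
```

This makes the hop explicit, but it is still only a convention: nothing in the language ties the closure body to workerQueue, which is the linkage problem described above.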

Shared mutable state is bad for software developers

Let's define "Shared mutable state" first: "state" is simply data used by the program. "Shared" means the data is shared across multiple tasks (threads, queues, or whatever other concurrency abstraction is used). State shared by itself is not harmful: so long as no-one is modifying the data, it is no problem having multiple readers of that data.

The concern is when the shared data is mutable, and therefore someone is changing it while other tasks are looking at it. This opens an enormous can of worms that the software world has been grappling with for many decades now. Given that there are multiple things looking at and changing the data, some sort of synchronization is required or else race conditions, semantic inconsistencies and other problems arise.

The natural first tools to reach for are mutexes or locks. Without attempting to survey the full body of work around this, I'll claim that locking and mutexes introduce a number of problems: you need to ensure that data is consistently protected by the right locks (or else bugs and memory safety issues result), determine the granularity of locking, avoid deadlocks, and deal with many other problems. There have been a number of attempts to improve this situation, notably synchronized methods in Java (which were later imported into Objective-C). This sort of thing improves the syntactic side of the equation but doesn't fix the underlying problem.

Once an app is working, you then run into performance problems, because mutexes are generally very inefficient - particularly when there are many cores and threads. Given decades of experience with this model, there are a number of attempts to solve certain corners of the problem, including readers-writer locks, double-checked locking, low-level atomic operations and advanced techniques like read/copy/update. Each of these improves on mutexes in some respect, but the incredible complexity, unsafety, and fragility of the resulting model is itself a sign of a problem.

With all that said, shared mutable state is incredibly important when you're working at the level of systems programming: e.g. if you're implementing the GCD API or a kernel in Swift, you absolutely must be able to have full ability to do this. This is why it is ultimately important for Swift to eventually define an opt-in memory consistency model for Swift code. While it is important to one day do this, doing so would be an orthogonal effort and thus is not the focus of this proposal.

I encourage anyone interested in this space to read Is Parallel Programming Hard, And, If So, What Can You Do About It?. It is a great survey developed by Paul E. McKenney who has been driving forward efforts to get the Linux kernel to scale to massively multicore machines (hundreds of cores). Besides being an impressive summary of hardware characteristics and software synchronization approaches, it also shows the massive complexity creep that happens when you start to care a lot about multicore scalability with pervasively shared mutable state.

Shared mutable state is bad for hardware

On the hardware side of things, shared mutable state is problematic for a number of reasons. In brief, the present is pervasively multicore - but despite offering the ability to view these machines as shared memory devices, they are actually incredibly non-uniform (NUMA).

To oversimplify a bit, consider what happens when two different cores are trying to read and write the same memory data: the cache lines that hold that data are arbitrated by (e.g.) the MESI protocol, which only allows a cache line to be mutable in a single processor's L1 cache. Because of this, performance quickly falls off of a cliff: the cache line starts ping-pong'ing between the cores, and mutations to the cache line have to be pushed out to other cores that are simply reading it.

This has a number of other knock on effects: processors have quickly moved to having relaxed consistency models which make shared memory programming even more complicated. Atomic accesses (and other concurrency-related primitives like compare/exchange) are now 20-100x slower than non-atomic accesses. These costs and problems continue to scale with core count, yet it isn't hard to find a large machine with dozens or hundreds of cores today.

If you look at the recent breakthroughs in hardware performance, they have come from hardware that has dropped the goal of shared memory. GPUs have been extremely successful at scaling to extremely high core counts, notably because they expose a programming model that encourages the use of fast local memory instead of shared global memory. Supercomputers frequently use MPI for explicitly managed memory transfers, etc. If you explore this from first principles, the speed of light and wire delay become an inherently limiting factor for very large shared memory systems.

The point of all of this is that it is highly desirable for Swift to move in a direction where Swift programs run great on large-scale multi-core machines. With any luck, this could unblock the next step in hardware evolution.

Shared mutable state doesn't scale beyond a single process

Ok, it is somewhat tautological, but any model built on shared mutable state doesn't work in the absence of shared memory.

Because of this, the software industry has a complexity explosion of systems for interprocess communication: things like sockets, signals, pipes, MIG, XPC, and many others. Operating systems then invariably introduce variants of the same abstractions that exist in a single process, including locks (file locking), shared mutable state (memory mapped files), etc. Beyond IPC, distributed computation and cloud APIs then reimplement the same abstractions in yet-another way, because shared memory is impractical in that setting.

The key observation here is simply that this is a really unfortunate state of affairs. A better world would be for app developers to have a way to build their data abstractions, concurrency abstractions, and reason about their application in the large, even if it is running across multiple machines in a cloud ecosystem. If you want your single process app to start running in an IPC or distributed setting, you should only have to teach your types how to serialize/code themselves, deal with new errors that can arise, then configure where you want each bit of code to run. You shouldn't have to rewrite large parts of the application - certainly not with an entirely new technology stack.
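To make "teach your types how to serialize/code themselves" concrete, here is a minimal sketch using the Codable support in Foundation. The AddEntry message type and the roundTrip helper are hypothetical names invented for this example, and JSON is used only because an encoder for it ships with Foundation; nothing here is meant to suggest a particular wire format for a future distributed model.

```swift
import Foundation

// Hypothetical message type: conforming to Codable is all that is needed
// for it to cross a process or machine boundary as bytes.
struct AddEntry: Codable, Equatable {
    let entry: String
    let position: Int
}

// Encode a value to bytes and decode it again, as a receiving process would.
func roundTrip<T: Codable>(_ value: T) throws -> T {
    let wire = try JSONEncoder().encode(value)
    return try JSONDecoder().decode(T.self, from: wire)
}

let message = AddEntry(entry: "Hello", position: 0)
// try? yields nil if encoding or decoding fails - one of the "new errors"
// that can arise once a call leaves the process.
let received = try? roundTrip(message)
```

The type itself stays the same whether the message is delivered in-process or over a connection; only the transport and the error handling around it change.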

After all, app developers don't design their API with JSON as the input and output format for each function, so why should cloud developers?

Overall vision

This manifesto outlines several major steps to address these problems, which can be added incrementally to Swift over the span of years. The first step is quite concrete, but subsequent steps get increasingly vague: this is an early manifesto and there is more design work to be done. Note that the goal here is not to come up with inherently novel ideas, it is to pull together the best ideas from wherever we can get them, and synthesize those ideas into something self-consistent that fits with the rest of Swift.

The overarching observation here is that there are four major abstractions in computation that are interesting to build a model on top of:

  • traditional control flow
  • asynchronous control flow
  • message passing and data isolation
  • distributed data and compute

Swift already has a fully-developed model for the first point, incrementally refined and improved over the course of years, so we won't talk about it here. It is important to observe that the vast majority of low-level computation benefits from imperative control flow, mutation with value semantics, and yes, reference semantics with classes. These concepts are the important low-level primitives that computation is built on, and reflect the basic abstraction of CPUs.

Asynchrony is the next fundamental abstraction that must be tackled in Swift, because it is essential to programming in the real world where we are talking to other machines, to slow devices (spinning disks are still a thing!), and looking to achieve concurrency between multiple independent operations. Furthermore, the latency of apparently identical operations is sometimes subject to significant jitter; examples include networks dropping a packet (forcing a retry after a timeout) and fast path/slow path optimizations (e.g. caches).

Fortunately, Swift is not the first language to face these challenges: the industry as a whole has fought this dragon and settled on async/await as the right abstraction. We propose adopting this proven concept outright (with a Swift spin on the syntax). Adopting async/await will dramatically improve existing Swift code, dovetailing with existing and future approaches to concurrency.

The next step is to define a programmer abstraction to define and model the independent tasks in a program, as well as the data that is owned by those tasks. We propose the introduction of a first-class actor model, which provides a way to define and reason about independent tasks that communicate between themselves with asynchronous message sending. The actor model has a deep history of strong academic work and was adopted and proven in Erlang and Akka, which successfully power a large number of highly scalable and reliable systems. With the actor model as a baseline, we believe we can achieve data isolation by ensuring that messages sent to actors do not lead to shared mutable state.

Speaking of reliable systems, introducing an actor model is a good opportunity and excuse to introduce a mechanism for handling and partially recovering from runtime failures (like failed force-unwrap operations, out-of-bounds array accesses, etc). We explore several options that are possible to implement and make a recommendation that we think will be a good fit for UI and server applications.

The final step is to tackle whole system problems by enabling actors to run in different processes or even on different machines, while still communicating asynchronously through message sends. This can extrapolate out to a number of interesting long term possibilities, which we briefly explore.

Part 1: Async/await: beautiful asynchronous APIs

NOTE: This section is concrete enough to have a fully baked proposal. From a complexity perspective, it is plausible to get this into Swift 5; we just need to determine whether it is desirable and, if so, debate and refine the proposal as a community.

No matter what global concurrency model is settled on for Swift, it is hard to ignore the glaring problems we have dealing with asynchronous APIs. Asynchronicity is unavoidable when dealing with independently executing systems: e.g. anything involving I/O (disks, networks, etc), a server, or even other processes on the same system. It is typically "not ok" to block the current thread of execution just because something is taking a while to load. Asynchronicity also comes up when dealing with multiple independent operations that can be performed in parallel on a multicore machine.

The current solution to this in Swift is to use "completion handlers" with closures. These are well understood but also have a large number of well understood problems: they often stack up a pyramid of doom, make error handling awkward, and make control flow extremely difficult.

There is a well-known solution to this problem, called async/await. It is a popular programming style that was first introduced in C# and was later adopted in many other languages, including Python, JavaScript, Scala, Hack, Dart, etc. Given its widespread success and acceptance by the industry, I suggest that we do the obvious thing and support this in Swift.

async/await design for Swift

The general design of async/await drops right into Swift, but a few tweaks make it fit into the rest of Swift more consistently. We suggest adding async as a function modifier akin to the existing throws function modifier. Functions (and function types) can be declared as async, and this indicates that the function is a coroutine. Coroutines are functions that may return normally with a value, or may suspend themselves and internally return a continuation.

This approach allows the completion handler to be absorbed into the language. For example, before you might write:

func loadWebResource(_ path: String, completionBlock: @escaping (_ result: Resource) -> Void) { ... }
func decodeImage(_ r1: Resource, _ r2: Resource, completionBlock: @escaping (_ result: Image) -> Void) { ... }
func dewarpAndCleanupImage(_ i: Image, completionBlock: @escaping (_ result: Image) -> Void) { ... }

func processImageData1(completionBlock: @escaping (_ result: Image) -> Void) {
    loadWebResource("dataprofile.txt") { dataResource in
        loadWebResource("imagedata.dat") { imageResource in
            decodeImage(dataResource, imageResource) { imageTmp in
                dewarpAndCleanupImage(imageTmp) { imageResult in
                    completionBlock(imageResult)
                }
            }
        }
    }
}

whereas now you can write:

func loadWebResource(_ path: String) async -> Resource
func decodeImage(_ r1: Resource, _ r2: Resource) async -> Image
func dewarpAndCleanupImage(_ i : Image) async -> Image

func processImageData1() async -> Image {
    let dataResource  = await loadWebResource("dataprofile.txt")
    let imageResource = await loadWebResource("imagedata.dat")
    let imageTmp      = await decodeImage(dataResource, imageResource)
    let imageResult   = await dewarpAndCleanupImage(imageTmp)
    return imageResult
}

await is a keyword that works like the existing try keyword: it is a noop at runtime, but indicates to a maintainer of the code that non-local control flow can happen at that point. Besides the addition of the await keyword, the async/await model allows you to write obvious and clean imperative code, and the compiler handles the generation of state machines and callback handlers for you.
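To give a rough feel for what "generation of state machines and callback handlers" could mean, here is a hand-written sketch of the kind of code a programmer would otherwise write by hand. Everything in it is an assumption for illustration: loadNumber and addNumbers are hypothetical stand-ins for callback-based APIs (they call back immediately to keep the example self-contained), and SumMachine is not a description of how any compiler actually lowers async functions.

```swift
// Hypothetical callback-based steps standing in for asynchronous APIs.
func loadNumber(_ n: Int, completion: (Int) -> Void) { completion(n) }
func addNumbers(_ a: Int, _ b: Int, completion: (Int) -> Void) { completion(a + b) }

// Hand-written state machine: each case records how far the computation
// has progressed and the values that are live at that suspension point.
final class SumMachine {
    private enum State {
        case start
        case loadedFirst(Int)
        case loadedBoth(Int, Int)
        case finished
    }

    private var state = State.start
    private let completion: (Int) -> Void

    init(completion: @escaping (Int) -> Void) {
        self.completion = completion
    }

    // Runs the next step; each callback stores its result and resumes.
    func resume() {
        switch state {
        case .start:
            loadNumber(2) { a in
                self.state = .loadedFirst(a)
                self.resume()
            }
        case .loadedFirst(let a):
            loadNumber(3) { b in
                self.state = .loadedBoth(a, b)
                self.resume()
            }
        case .loadedBoth(let a, let b):
            addNumbers(a, b) { sum in
                self.state = .finished
                self.completion(sum)
            }
        case .finished:
            break
        }
    }
}

var machineResult = 0
let machine = SumMachine(completion: { machineResult = $0 })
machine.resume()
```

The straight-line equivalent is three awaited calls and a return; the bookkeeping above is exactly the part that the language feature is meant to absorb.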

Overall, adding this will dramatically improve the experience of working with completion handlers, and provides a natural model to compose futures and other APIs on top of. More details are contained in the full proposal.

New asynchronous APIs

The introduction of async/await into the language is a great opportunity to introduce more asynchronous APIs to Cocoa and perhaps even entire new framework extensions (like a revised asynchronous file I/O API). The Server APIs Project is also actively working to define new Swift APIs, many of which are intrinsically asynchronous.

Part 2: Actors: Eliminating shared mutable state

Given the ability to define and use asynchronous APIs with expressive "imperative style" control flow, we now look to give developers a way to carve up their application into multiple concurrent tasks. We propose adopting the model of actors: Actors naturally represent real-world concepts like "a document", "a device", "a network request", and are particularly well suited to event driven architectures like UI applications, servers, device drivers, etc.

So what is an actor? As a Swift programmer, it is easiest to think of an actor as a combination of a DispatchQueue, the data that queue protects, and messages that can be run on that queue. Because they are embodied by an (internal) queue abstraction, you communicate with Actors asynchronously, and actors guarantee that the data they protect is only touched by the code running on that queue. This provides an "island of serialization in a sea of concurrency".
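That mental model can be approximated today with plain GCD. The sketch below is an assumption-laden illustration rather than the proposed design: QueueBackedList, its queue label, and its method names are hypothetical, and unlike a real actor nothing stops other code from bypassing the convention.

```swift
import Dispatch

// A GCD approximation of an actor: a private serial queue, the state that
// queue protects, and "messages" that are enqueued asynchronously.
final class QueueBackedList {
    private let queue = DispatchQueue(label: "example.QueueBackedList")
    private var theList: [String] = []

    // Fire-and-forget message: returns immediately, mutates on the queue.
    func add(entry: String) {
        queue.async { self.theList.append(entry) }
    }

    // Message with a reply, delivered through a completion handler.
    func getNumberOfEntries(completion: @escaping (Int) -> Void) {
        queue.async { completion(self.theList.count) }
    }
}

// Usage: many concurrent senders, one serialized owner of the data.
let model = QueueBackedList()
let senders = DispatchGroup()
for i in 0..<100 {
    DispatchQueue.global().async(group: senders) {
        model.add(entry: "entry \(i)")
    }
}
senders.wait()  // every add message has now been enqueued

let finished = DispatchSemaphore(value: 0)
var count = 0
model.getNumberOfEntries { n in
    count = n
    finished.signal()
}
finished.wait()
```

Because the queue is serial and processes blocks in order, the count request runs after all 100 appends, even though the senders ran concurrently.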

It is straight-forward to adapt legacy software to an actor interface, and it is possible to progressively adopt actors in a system that is already built on top of GCD or other concurrency primitives.

Actor Model Theory

Actors have a deep theoretical basis and have been explored by academia since the 1970s - the Wikipedia page on actors and the c2 wiki page are good places to start reading if you'd like to dive into some of the theoretical fundamentals that back the model. A challenge of this work (for Swift's purposes) is that academia assumes a pure actor model ("everything is an actor"), and assumes a model of communication so limited that it may not be acceptable for Swift. I'll provide a broad stroke summary of the advantages of this pure model, then talk about how to address the problems.

As Wikipedia says:

In response to a message that it receives, an actor can: make local decisions, create more actors, send more messages, and determine how to respond to the next message received. Actors may modify private state, but can only affect each other through messages (avoiding the need for any locks).

Actors are cheap to construct and you communicate with an actor using efficient unidirectional asynchronous message sends ("posting a message in a mailbox"). Because these messages are unidirectional, there is no waiting, and thus deadlocks are impossible. In the academic model, all data sent in these messages is deep copied, which means that there is no data sharing possible between actors. Because actors cannot touch each other's state (and have no access to global state), there is no need for any synchronization constructs, eliminating all of the problems with shared mutable state.

To make this work pragmatically in the context of Swift, we need to solve several problems:

  • we need a strong computational foundation for all the computation within a task. Good news: this is already done in Swift 1...4!
  • unidirectional async message sends are great, but inconvenient for some things. We want a model that allows messages to return a value (even if we encourage them not to), which requires a way to wait for that value. This is the point of adding async/await.
  • we need to make message sends efficient: relying on a deep copy of each argument is not acceptable. Fortunately - and not accidentally - we already have Copy-On-Write (🐮) value types and move semantics on the way as a basis to build from. The trick is dealing with reference types, which are discussed below.
  • we need to figure out what to do about global mutable state, which already exists in Swift. One option is considered below.
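The difference between value types and reference types in the third point can be seen in a few lines of ordinary Swift. This is only a sketch of the language behavior that message sends would build on; the Box class is a hypothetical example type.

```swift
// Value type: assigning an Array yields a logically independent value.
// Storage is shared until one side mutates (copy-on-write).
let original = ["alpha", "beta"]
var sent = original
sent.append("gamma")          // only `sent` changes

// Reference type (hypothetical Box class): both names refer to one object,
// so a mutation through either name is visible through the other.
final class Box {
    var items: [String] = []
}
let shared = Box()
let alias = shared
alias.items.append("gamma")   // `shared.items` changes too
```

Passing the array in a message cannot create shared mutable state, while passing the Box can, which is why reference types need separate treatment.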

Example actor design for Swift

There are several possible ways to manifest the idea of actors into Swift. For the purposes of this manifesto, I'll describe them as a new type in Swift because it is the least confusing way to explain the ideas and this isn't a formal proposal. I'll note right here up front that this is only one possible design: the right approach may be for actors to be a special kind of class, a model described below.

With this design approach, you'd define an actor with the actor keyword. An actor can have any number of data members declared as instance members, can have normal methods, and extensions work with them as you'd expect. Actors are reference types and have an identity which can be passed around as a value. Actors can conform to protocols and otherwise dovetail with existing Swift features as you'd expect.

We need a simple running example, so let's imagine we're building the data model for an app that has a tableview with a list of strings. The app has UI to add and manipulate the list. It might look something like this:

  actor TableModel {
    let mainActor : TheMainActor
    var theList : [String] = [] {
      didSet {
        mainActor.updateTableView(theList)
      }
    }
    
    init(mainActor: TheMainActor) { self.mainActor = mainActor }

    // This checks to see if all the entries in the list are capitalized:
    // if so, it capitalizes the string before returning it to encourage
    // capitalization consistency in the list.
    func prettify(_ x : String) -> String {
      // Details omitted: it inspects theList, adjusting the
      // string before returning it if necessary.
    }

    actor func add(entry: String) {
      theList.append(prettify(entry))
    }
  }

This illustrates the key points of an actor model:

  • The actor defines the state local to it as instance data; in this case the reference to mainActor and theList are the data in the actor.
  • Actors can send messages to any other actor they have a reference to, using traditional dot syntax.
  • Normal (non-actor) methods can be defined on the actor for convenience, and they have full access to the state within their self actor.
  • actor methods are the messages that actors accept. Marking a method as actor imposes certain restrictions upon it, described below.
  • It isn't shown in the example, but new instances of the actor are created by using the initializer just like any other type: let dataModel = TableModel(mainActor: mainActor).
  • Also not shown in the example, but actor methods are implicitly async, so they can freely call async methods and await their results.

It has been found in other actor systems that an actor abstraction like this encourages the "right" abstractions in applications, and maps well to the conceptual way that programmers think about their data. For example, given this data model it is easy to create multiple instances of this actor, one for each document in an MDI application.

This is a straight-forward implementation of the actor model in Swift and is enough to achieve the basic advantages of the model. However, it is important to note that there are a number of limitations being imposed here that are not obvious, including:

  • An actor method cannot return a value, throw an error, or have an inout parameter.
  • All of the parameters must produce independent values when copied (see below).
  • Local state and non-actor methods may only be accessed by methods defined lexically on the actor or in an extension to it (whether they are marked actor or otherwise).

Extending the model through await

The first limitation (that actor methods cannot return values) is easy to address as we've already discussed. Say the app developer needs a quick way to get the number of entries in the list, a way that is visible to other actors they have running around. We should simply allow them to define:

  extension TableModel {
    actor func getNumberOfEntries() -> Int {
      return theList.count
    }
  }

This allows them to await the result from other actors:

  print(await dataModel.getNumberOfEntries())

This dovetails perfectly with the rest of the async/await model. It is unrelated to this manifesto, but we'll observe that a more idiomatic way to define that specific example would be as an actor var. Swift currently doesn't allow property accessors to throw or be async. When this limitation is relaxed, it would be straight-forward to allow actor vars to provide the more natural API.

Note that this extension makes the model far more usable in cases like this, but erodes the "deadlock free" guarantee of the actor model. An await on an actor method suspends the current task, and since you can get circular waits, you can end up with deadlock. This is because only one message is processed by the actor at a time. The simplest case occurs if an actor waits on itself directly (possibly through a chain of references):

  extension TableModel {
    actor func f() {
       ...
       let x = await self.getNumberOfEntries()   // trivial deadlock.
       ...
    }
  }

The trivial case like this can also be trivially diagnosed by the compiler. The complex case would ideally be diagnosed at runtime with a trap, depending on the runtime implementation model.
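To make the runtime diagnosis concrete, here is a minimal sketch of how a GCD-backed implementation might trap on the trivial self-wait case. It assumes each actor owns a serial queue, which is only one possible runtime model; QueueBackedActor, isCurrent, and blockingRequest are hypothetical names used for illustration, not part of this proposal.

```swift
import Dispatch

// Hypothetical stand-in for an actor backed by a serial GCD queue.
final class QueueBackedActor {
    private let queue = DispatchQueue(label: "sketch.actor")
    // Tag stored on the queue so code can ask "am I running on this actor?"
    private let key = DispatchSpecificKey<ObjectIdentifier>()

    init() {
        queue.setSpecific(key: key, value: ObjectIdentifier(self))
    }

    /// True when the caller is already executing on this actor's queue.
    var isCurrent: Bool {
        return DispatchQueue.getSpecific(key: key) == ObjectIdentifier(self)
    }

    /// Blocking request. Waiting on our own queue could never finish,
    /// so trap immediately instead of hanging.
    func blockingRequest<T>(_ body: () -> T) -> T {
        precondition(!isCurrent, "actor is waiting on itself: guaranteed deadlock")
        return queue.sync(execute: body)
    }
}

let model = QueueBackedActor()
let outside = model.isCurrent                           // caller is not on the actor
let inside = model.blockingRequest { model.isCurrent }  // body runs on the actor's queue
```

This only catches the direct case. A cycle that runs through several actors would need the runtime to track the whole chain of waiters, which this sketch does not attempt.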

The solution for this is to encourage people to use Void-returning actor methods that "fire and forget". There are several reasons to believe that these will be the most common: the async/await model described syntactically encourages people not to use it (by requiring marking, etc), many of the common applications of actors are event-driven applications (which are inherently one way), the eventual design of UI and other system frameworks can encourage the right patterns from app developers, and of course documentation can describe best practices.

About that main thread

The example above shows mainActor being passed in, following theoretically pure actor hygiene. However, the main thread in UIKit and AppKit is already global state, so we might as well admit that and make code everywhere nicer. As such, it makes sense for AppKit and UIKit to define and vend a public global constant actor reference, e.g. something like this:

public actor MainActor {  // Bikeshed: could be named "actor UI {}"
   private init() {}      // You can't make another one of these.
   // Helpful public stuff could be put here to make app developers happy. :-)
}
public let mainActor = MainActor()

This would allow app developers to put their extensions on MainActor, making their code more explicit and clear about what needs to be run on the main thread. If we got really crazy, someday Swift should allow data members to be defined in extensions on classes, and App developers would then be able to put their state that must be manipulated on the main thread directly on the MainActor.

Data isolation

The way that actors eliminate shared mutable state and explicit synchronization is through deep copying all of the data that is passed to an actor in a message send, and preventing direct access to actor state without going through these message sends. This all composes nicely, but can quickly introduce inefficiencies in practice because of all the data copying that happens.

Swift is well positioned to deal with this for a number of reasons. First, its strong focus on value semantics means that copying of these values is a core operation understood and known by Swift programmers everywhere. Second, the use of Copy-On-Write (🐮) as an implementation approach fits perfectly with this model. Note how, in the example above, the TableModel actor sends a copy of the theList array back to the UI thread so it can update itself. In Swift, this is a super efficient O(1) operation that does some ARC stuff: it doesn't actually copy or touch the elements of the array.
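The O(1) copy can be observed directly. The sketch below compares the address of the element storage before and after a write; storageAddress is a hypothetical helper, and peeking at the buffer address relies on an implementation detail of Array, so treat it as an illustration rather than something to depend on.

```swift
// Hypothetical helper: address of an array's element storage, for illustration only.
func storageAddress(_ array: [Int]) -> Int {
    return array.withUnsafeBufferPointer { Int(bitPattern: $0.baseAddress) }
}

let theList = [1, 2, 3]
var copy = theList   // O(1): both values share one buffer
let sharedBeforeWrite = storageAddress(theList) == storageAddress(copy)

copy.append(4)       // the first write is what triggers the real copy
let sharedAfterWrite = storageAddress(theList) == storageAddress(copy)
```

The original array is untouched by the append, which is exactly the independence a message send between actors needs.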

The third piece, which is still in development, will come as a result of the work on adding ownership semantics to Swift. When this is available, advanced programmers will have the ability to move complex values between actors, which is typically also a super-efficient O(1) operation.

This leaves us with three open issues: 1) how do we know whether something has proper value semantics, 2) what do we do about reference types (classes and closures), and 3) what do we do about global state. All three of these issues should be explored in detail, because there are many different possible answers to these. I will explore a simple model below in order to provide an existence proof for a design, but I do not claim that it is the best model we can find.

Does a type provide proper value semantics?

This is something that many Swift programmers have wanted to be able to know the answer to, for example when defining generic algorithms that are only correct in the face of proper value semantics. There have been numerous proposals for how to determine this, and I will not attempt to summarize them; instead I'll outline a simple proposal just to provide an existence proof for an answer:

  • Start by defining a simple marker protocol (the name of which is intentionally silly to reduce early bikeshedding) with a single requirement: protocol ValueSemantical { func valueSemanticCopy() -> Self }
  • Conform all of the applicable standard library types to ValueSemantical. For example, Array conforms when its elements conform - note that an array of reference types doesn't always provide the semantics we need.
  • Teach the compiler to synthesize conformance for structs and enums whose members are all ValueSemantical, just like we do for Codable.
  • The compiler just checks for conformance to the ValueSemantical protocol and rejects any arguments and return values that do not conform.
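The steps above can be sketched in today's Swift. The struct conformance is written by hand here because the compiler synthesis is only proposed, and the generic send function is a hypothetical stand-in for the check the compiler would perform at an actor message send.

```swift
protocol ValueSemantical {
    func valueSemanticCopy() -> Self
}

// Plain value types: a copy is just the value itself.
extension Int: ValueSemantical {
    func valueSemanticCopy() -> Int { return self }
}
extension String: ValueSemantical {
    func valueSemanticCopy() -> String { return self }
}

// Array conforms only when its elements conform.
extension Array: ValueSemantical where Element: ValueSemantical {
    func valueSemanticCopy() -> [Element] {
        return map { $0.valueSemanticCopy() }
    }
}

// Hand-written stand-in for what the compiler would synthesize.
struct Entry: ValueSemantical {
    var title: String
    var tags: [String]
    func valueSemanticCopy() -> Entry {
        return Entry(title: title.valueSemanticCopy(),
                     tags: tags.valueSemanticCopy())
    }
}

// Hypothetical stand-in for a message send: only conforming types are accepted.
func send<T: ValueSemantical>(_ value: T) -> T {
    return value.valueSemanticCopy()
}

let received = send([Entry(title: "first", tags: ["draft"])])
```

Passing a non-conforming type to send is rejected at compile time, which is the behavior the last bullet describes.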

To reiterate, the name ValueSemantical really isn't the right name for this: things like UnsafePointer, for example, shouldn't conform. Enumerating the possible options and evaluating the naming tradeoffs between them is a project for another day though.

It is important to realize that this design does not guarantee memory safety. Someone could implement the protocol in the wrong way (thus lying about satisfying the requirements) and shared mutable state could occur. In the author's opinion, this is the right tradeoff: solving this would require introducing onerous type system mechanics (e.g. something like the capabilities system in the Pony language). Swift already provides a model where memory safe APIs (e.g. Array) are implemented in terms of memory unsafe primitives (e.g. UnsafePointer); the approach described here is directly analogous.

Alternate Design: Another approach is to eliminate the requirement from the protocol: just use the protocol as a marker, which is applied to types that already have the right behavior. When it is necessary to customize the copy operation (e.g. for a reference type), the solution would be to box values of that type in a struct that provides the right value semantics. This would make it more awkward to conform, but this design eliminates having "another kind of copy" operation, and encourages more types to provide value semantics.
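A minimal sketch of this alternate design follows, assuming a requirement-free marker protocol. LegacySettings and SettingsBox are hypothetical names; the box uses the standard library's isKnownUniquelyReferenced to copy the wrapped object only when another box shares it.

```swift
protocol ValueSemantical {}   // marker only: no copy requirement

// A reference type that does not have value semantics on its own.
final class LegacySettings {
    var values: [String: Int]
    init(values: [String: Int]) { self.values = values }
}

// Box that gives LegacySettings value semantics through copy-on-write.
struct SettingsBox: ValueSemantical {
    private var storage: LegacySettings

    init(values: [String: Int]) {
        storage = LegacySettings(values: values)
    }

    subscript(key: String) -> Int? {
        get { return storage.values[key] }
        set {
            // Copy the underlying object only if another box shares it.
            if !isKnownUniquelyReferenced(&storage) {
                storage = LegacySettings(values: storage.values)
            }
            storage.values[key] = newValue
        }
    }
}

let a = SettingsBox(values: ["volume": 1])
var b = a          // cheap: both boxes share one LegacySettings
b["volume"] = 2    // the write copies first, so a is unaffected
```

The extra wrapper type is the awkwardness mentioned above; in exchange there is only one kind of copy, the ordinary Swift assignment.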

Reference types: Classes

The solution to this is simple: classes need to conform to ValueSemantical (and implement the requirement) properly, or else they cannot be passed as a parameter or result of an actor method. In the author's opinion, giving classes proper value semantics will not be that big of a deal in practice for a number of reasons:

  • The default (non-conformance) is the right default: the only classes that conform will be ones that a human thought about.
  • Retroactive conformance allows app developers to handle cases not addressed by the framework engineers.
  • Cocoa has a number of classes (e.g. the entire UI frameworks) that are only usable on the main thread. By definition, these won't get passed around.
  • A number of classes in Cocoa are already semantically immutable, making it trivial and cheap for them to conform.
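As a rough illustration of the last point and of what "a human thought about it" means, here is a sketch of two class conformances. Document and Fingerprint are hypothetical types; both are final so that returning the class type satisfies the Self requirement.

```swift
protocol ValueSemantical {
    func valueSemanticCopy() -> Self
}

// Mutable class: conforming means writing a deliberate deep copy.
final class Document: ValueSemantical {
    var lines: [String]
    init(lines: [String]) { self.lines = lines }
    func valueSemanticCopy() -> Document {
        return Document(lines: lines)
    }
}

// Semantically immutable class: sharing the instance is already safe,
// so conformance is trivial and cheap.
final class Fingerprint: ValueSemantical {
    let digest: String
    init(digest: String) { self.digest = digest }
    func valueSemanticCopy() -> Fingerprint { return self }
}

let original = Document(lines: ["draft"])
let sent = original.valueSemanticCopy()
sent.lines.append("edited elsewhere")   // does not affect original

let print1 = Fingerprint(digest: "abc123")
let print2 = print1.valueSemanticCopy() // same instance, no copy needed
```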

Beyond that, when you start working with an actor system, it is an inherent part of the application design that you don't allocate and pass around big object graphs: you allocate them in the actor you intend to manipulate them with. This is something that has been found true in Scala/Akka for example.

Reference types: Closures and Functions

It is not safe to pass an arbitrary value with function type across an actor message, because it could close over arbitrary actor-local data. If that data is closed over by-reference, then the recipient actor would have arbitrary access to data in the sending actor's state. That said, there is at least one important exception that we should carve out: it is safe to pass a closure literal when it is known that it only closes over data by copy: using the same ValueSemantical copy semantics described above.

This happens to be an extremely useful carveout, because it permits some interesting "callback" abstractions to be naturally expressed without tight coupling between actors. Here is a silly example:

    otherActor.doSomething { self.incrementCount($0) }

In this case OtherActor doesn't have to know about incrementCount which is defined on the self actor, reducing coupling between the actors.
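Swift's existing capture lists already distinguish the two cases. The sketch below is ordinary Swift rather than proposed actor syntax, and captureDemo is a hypothetical name; it shows that a capture list copies the value when the closure is created, while a plain capture shares the variable.

```swift
func captureDemo() -> (byReference: Int, byCopy: Int) {
    var count = 1

    // Captures the variable itself: later changes are visible inside.
    let byReference = { count + 1 }

    // The capture list copies the current value when the closure is created.
    let byCopy = { [count] in count + 1 }

    count = 100
    return (byReference(), byCopy())
}

let result = captureDemo()
// result.byReference sees the mutation, result.byCopy does not.
```

Only closures of the second kind would be safe to hand to another actor under the carveout described above.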

Global mutable state

Since we're friends, I'll be straight with you: there are no great answers here. Swift and C already support global mutable state, so the best we can do is discourage the use of it. We cannot automatically detect a problem because actors need to be able to transitively use random code that isn't defined on the actor. For example:

func calculate(thing : Int) -> Int { ... }

actor Foo {
  actor func exampleOperation() {
     let x = calculate(thing: 42)
     ...
  }
}

There is no practical way to know whether 'calculate' is thread-safe or not. The only solution is to scatter tons of annotations everywhere, including in headers for C code. I think that would be a non-starter.

In practice, this isn't as bad as it sounds, because the most common operations that people use (e.g. print) are already internally synchronizing, largely because people are already writing multithreaded code. While it would be nice to magically solve this long standing problem with legacy systems, I think it is better to just completely ignore it and tell developers not to define or use global variables (global lets are safe).

All hope is not lost though: Perhaps we could consider deprecating global vars from Swift to further nudge people away from them. Also, any accesses to unsafe global mutable state from an actor context can and should be warned about. Taking some steps like this should eliminate the most obvious bugs.

Scalable Runtime

Thus far, we've dodged the question about how the actor runtime should be implemented. This is intentional because I'm not a runtime expert! From my perspective, building on top of GCD is great if it can work for us, because it is proven and using it reduces risk from the concurrency design. I also think that GCD is a reasonable baseline to start from: it provides the right semantics, it has good low-level performance, and it has advanced features like Quality of Service support which are just as useful for actors as they are for anything else. It would be easy to provide access to these advanced features by giving every actor a gimmeYourQueue() method.
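To give a feel for what "building on top of GCD" could mean, here is a minimal hand-written emulation of an actor as a class that owns a serial queue. This is an assumption about one possible lowering, not the proposed runtime; TableModelSketch and its reply-closure API are hypothetical, and a completion handler stands in for await.

```swift
import Dispatch

// Hypothetical emulation: one serial queue per actor instance.
final class TableModelSketch {
    private let queue: DispatchQueue
    private var theList: [String] = []   // only touched on `queue`

    init(qos: DispatchQoS = .default) {
        queue = DispatchQueue(label: "sketch.TableModel", qos: qos)
    }

    // Fire-and-forget message: enqueue and return immediately.
    func add(entry: String) {
        queue.async { self.theList.append(entry) }
    }

    // Value-returning message: the reply closure stands in for `await`.
    func getNumberOfEntries(reply: @escaping (Int) -> Void) {
        queue.async { reply(self.theList.count) }
    }

    // The tongue-in-cheek escape hatch mentioned above.
    func gimmeYourQueue() -> DispatchQueue { return queue }
}

let tableModel = TableModelSketch(qos: .userInitiated)
tableModel.add(entry: "first")
tableModel.add(entry: "second")

var numberOfEntries = -1
tableModel.getNumberOfEntries { numberOfEntries = $0 }

// Block until everything enqueued so far has run (for demonstration only).
tableModel.gimmeYourQueue().sync {}
```

Because the queue is serial, messages are processed one at a time in the order they were sent, which is the semantics the actor model needs.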

Here are some potential issues using GCD which we will need to figure out:

Kernel Thread Explosions

Our goal is to allow actors to be used as a core unit of abstraction within a program, which means that we want programmers to be able to create as many of them as they want, without running into performance problems. If scalability problems come up, you end up having to aggregate logically distinct stuff together to reduce the number of actors, which leads to complexity and loses some of the advantages of data isolation. The model as proposed should scale exceptionally well, but depends on the runtime to make this happen in practice.

GCD is already quite scalable, but one concern is that it can be subject to kernel thread explosions, which occur when a GCD task blocks in a way that the kernel and runtime cannot reason about. In response, the GCD runtime allocates new kernel threads, each of which gets a stack... and these stacks can fragment the heap. This is problematic in the case of a server workload that wants to instantiate hundreds of thousands of actors - at least one for every incoming network connection.

Provably solving thread explosions is probably impossible/impractical in any runtime given the need to interoperate with C code and legacy systems that aren't built in pure Swift. That said, perfection isn't necessary: we just need a path that moves towards it, and provides programmers a way to "get their job done" when an uncooperative framework or API is hit in practice. I'd suggest a three step approach to resolving this:

  • Make existing frameworks incrementally "async safe" over time. Ensure that new APIs are done right, and make sure that no existing APIs ever go from “async safe” to “async unsafe”.
  • Provide a mechanism that developers can use to address problematic APIs that they encounter in practice. It should be something akin to “wrap your calls in a closure and pass it to a special GCD function”, or something else of similar complexity.
  • Continue to improve perf and debugger tools to help identify problematic cases that occur in practice.

This approach of focusing on problematic APIs that developers hit in practice should work particularly well for server workloads, which are the ones most likely to need a large number of actors at a single time. Legacy server libraries are also much more likely to be async friendly than arbitrary other C code.

Actor Shutdown

There are also questions about how actors are shut down. The conceptually ideal model is that actors are implicitly released when their reference count drops to zero and when the last enqueued message is completed. This will probably require some amount of runtime integration.

Bounded Queue Depths

Another potential concern is that GCD queues have unbounded depth: if you have a producer/consumer situation, a fast producer can outpace the consumer and continuously grow the queue of work. It would be interesting to investigate options for providing bounded queues that throttle or block the producer in this sort of situation. Another option is to make this purely an API problem, encouraging the use of reactive streams and other abstractions that provide back pressure.
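One way to picture the "block the producer" option is a mailbox that pairs a serial queue with a counting semaphore. This is a sketch under the assumption that blocking the sender is acceptable; BoundedMailbox and Consumer are hypothetical names.

```swift
import Dispatch

// Hypothetical bounded mailbox: at most `capacity` messages may be pending.
final class BoundedMailbox {
    private let queue = DispatchQueue(label: "sketch.mailbox")
    private let freeSlots: DispatchSemaphore

    init(capacity: Int) {
        freeSlots = DispatchSemaphore(value: capacity)
    }

    func send(_ message: @escaping () -> Void) {
        freeSlots.wait()              // producer blocks while the mailbox is full
        queue.async {
            message()
            self.freeSlots.signal()   // slot is free once the message is handled
        }
    }

    // Wait for everything enqueued so far (for demonstration only).
    func drain() { queue.sync {} }
}

final class Consumer { var handled = 0 }   // only touched on the mailbox queue

let consumer = Consumer()
let mailbox = BoundedMailbox(capacity: 4)
for _ in 0..<100 {
    mailbox.send { consumer.handled += 1 }
}
mailbox.drain()
```

Note the tension with the previous section: a producer blocked in wait() ties up a thread, so a real design would more likely suspend the sender or surface back pressure through the API.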

Alternative Design: Actors as classes

The design above is simple and self consistent, but may not be the right model, because actors have a ton of conceptual overlap with classes. Observe:

  • Actors have reference semantics, just like classes.
  • Actors form a graph, which means that we need to be able to have weak/unowned references to them.
  • Subclassing of actors makes just as much sense as subclassing of classes, and would work the same way.
  • Some people incorrectly think that Swift hates classes: this is an opportunity to restore some of their former glory.

However, actors are not simple classes. Here are some differences:

  • Only actors can have actor methods on them. These methods have additional requirements put on them in order to provide the safety in the programming model we seek.
  • An "actor class" deriving from a "non-actor base class" would have to be illegal, because the base class could escape self or escape local state references in an unsafe way.

One important pivot-point in discussion is whether subclassing of actors is desirable. If so, modeling them as a special kind of class would be a very nice simplifying assumption, because a lot of complexity comes in with that (including all the initialization rules etc). If not, then defining them as a new kind of type is defensible, because they'd be very simple and being a separate type would more easily explain the additional rules imposed on them.

Syntactically, if we decided to make them classes, it makes sense for this to be a modifier on the class definition itself, since actorhood fundamentally alters the contract of the class, e.g.:

actor class DataModel : SomeBaseActor { ... }

Alternatively, since you can't derive from non-actor classes anyway, we could just make the base class be Actor:

class DataModel : Actor { ... }

Further extensions

The design sketch above is the minimal but important step forward to build concurrency abstractions into the language, but really filling out the model will almost certainly require a few other common abstractions. For example:

  • Reactive streams are a common way to handle communication between async actors, and help provide solutions to backpressure. Dart's stream design is one example.

  • Relatedly, it makes sense to extend the for/in loop to asynchronous sequences - likely through the introduction of a new AsyncSequence protocol. FWIW, this is likely to be added to C# 8.0.

  • A first class Future type is commonly requested. I expect the importance of it to be far less than in languages that don't have (or who started without) async/await, but it is still a very useful abstraction for handling cases where you want to kick off simple overlapping computations within a function.

Intra-actor concurrency

Another advanced concept that could be considered is allowing someone to define a "multithreaded actor", which provides a standard actor API, but where synchronization and scheduling of tasks is handled by the actor itself, using traditional synchronization abstractions instead of a GCD queue. Adding this would mean that there is shared mutable state within the actor, but that isolation between actors is still preserved. This is interesting to consider for a number of reasons:

  • This allows the programming model to be consistent (where an "instance of an actor represents a thing") even when the thing can be implemented with internal concurrency. For example, consider an abstraction for a network card/stack: it may want to do its own internal scheduling and prioritizing of many different active pieces of work according to its own policies, but provide a simple-to-use actor API on top of that. The fact that the actor can handle multiple concurrent requests is an implementation detail the clients shouldn’t have to be rewritten to understand.

  • Making this non-default would provide proper progressive disclosure of complexity.

  • You’d still get improved safety and isolation of the system as a whole, even if individual actors are “optimized” in this way.

  • When incrementally migrating code to the actor model, this would make it much easier to provide actor wrappers for existing concurrent subsystems built on shared mutable state (e.g. a database whose APIs are threadsafe).

  • Something like this would also probably be the right abstraction for imported RPC services that allow for multiple concurrent synchronous requests.

  • This abstraction would be unsafe from the memory safety perspective, but this is widely precedented in Swift. Many safe abstractions are built on top of memory unsafe primitives - consider how Array is built on UnsafePointer - and this is an important part of the pragmatism and "get stuff done" nature of the Swift programming model.

That said, this is definitely a power-user feature, and we should understand, build, and get experience using the basic system before considering adding something like this.

Part 3: Reliability through fault isolation

Swift has many aspects of its design that encourage programmer errors (aka software bugs :-) to be caught at compile time: a static type system, optionals, encouraging covered switch cases, etc. However, some errors may only be caught at runtime, including things like out-of-bound array accesses, integer overflows, and force-unwraps of nil.

As described in the Swift Error Handling Rationale, there is a tradeoff that must be struck: it doesn't make sense to force programmers to write logic to handle every conceivable edge case: even discounting the boilerplate that would generate, that logic is likely to itself be poorly tested and therefore full of bugs. We must carefully weigh and trade off complex issues in order to get a balanced design. These tradeoffs are what led to Swift's approach that does force programmers to think about and write code to handle all potentially-nil pointer references, but not to have to think about integer overflow on every arithmetic operation. The new challenge is that integer overflow still must be detected and handled somehow, and the programmer hasn't written any recovery code.

Swift handles these with a fail fast philosophy: it is preferable to detect and report a programmer error as quickly as possible, rather than "blunder on" with the hope that the error won't matter. Combined with rigorous testing (and perhaps static analysis technology in the future), the goal is to make bugs shallow, and provide good stack traces and other information when they occur. This encourages them to be found and fixed quickly early in the development cycle. However, when the app ships, this philosophy is only great if all the bugs were actually found, because an undetected problem causes the app to suddenly terminate itself.

Sudden termination of a process is hugely problematic if it jeopardizes user data, or - in the case of a server app - if there are hundreds of clients currently connected to the server at the time. While it is impossible in general to do perfect resolution of an arbitrary programmer error, there is prior art for how to handle common problems gracefully. In the case of Cocoa, for example, if an NSException propagates up to the top of the runloop, it is useful to try to save any modified documents to a side location to avoid losing data. This isn't guaranteed to work in every case, but when it does, the user is very happy that they haven't lost their progress. Similarly, if a server crashes handling one of its clients' requests, a reasonable recovery scheme is to finish handling the other established connections in the current process, but push off new connection requests to a restarted instance of the server process.

The introduction of actors is a great opportunity to improve this situation, because actors provide an interesting granularity level between the "whole process" and "an individual class" where programmers think about the invariants they are maintaining. Indeed, there is a bunch of prior art in making reliable actor systems, and again, Erlang is one of the leaders (for a great discussion, see Joe Armstrong's PhD thesis). We'll start by sketching the basic model, then talk about a potential design approach.

Actor Reliability Model

The basic concept here is that an actor that fails has violated its own local invariants, but that the invariants in other actors still hold: this is because we've defined away shared mutable state. This gives us the option of killing the individual actor that broke its invariants instead of taking down the entire process. Given the definition of the basic actor model with unidirectional async message sends, it is possible to have the runtime just drop any new messages sent to the actor, and the rest of the system can continue without even knowing that the actor crashed.

While this is a simple approach, there are two problems:

  • Actor methods that return a value could be in the process of being awaited, but if the actor has crashed those awaits will never complete.
  • Dropping messages may itself cause deadlock because of higher-level communication invariants that are broken. For example, consider this actor, which waits for 10 messages before passing on the message:
  actor Merge10Notifications {
    var counter : Int = 0
    let otherActor = ...  // set up by the init.
    actor func notify() {
      counter += 1
      if counter >= 10 {
        otherActor.notify()
      }
    }
  }

If one of the 10 actors feeding notifications into this one crashes, then the program will wait forever to get that 10th notification. Because of this, someone designing a "reliable" actor needs to think about more issues, and work slightly harder to achieve that reliability.

Opting into reliability

Given that a reliable actor requires more thought than building a simple actor, it is reasonable to look for opt-in models that provide progressive disclosure of complexity. The first thing you need is a way to opt in. As with actor syntax in general, there are two broad options: first-class actor syntax or a class declaration modifier, i.e., one of:

  reliable actor Notifier { ... }
  reliable actor class Notifier { ... }

When one opts an actor into caring about reliability, a new requirement is imposed on all actor methods that return a value: they are now required to be declared throws as well. This forces clients of the actor to be prepared for a failure when/if the actor crashes.

Implicitly dropping messages is still a problem. I'm not familiar with the approaches taken in other systems, but I imagine two potential solutions:

  1. Provide a standard library API to register failure handlers for actors, allowing higher level reasoning about how to process and respond to those failures. An actor's init() could then use this API to register its failure handler with the system.
  2. Force all actor methods to throw, with the semantics that they only throw if the actor has crashed. This forces clients of the reliable actor to handle a potential crash, and do so on the granularity of all messages sent to that actor.

Between the two, the first approach is more appealing to me, because it allows factoring out the common failure logic in one place, rather than having every caller have to write (hard to test) logic to handle the failure in a fine grained way. For example, a document actor could register a failure handler that attempts to save its data in a side location if it ever crashes.
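To compare the pieces side by side, here is a single-threaded simulation of a reliable document actor. Everything in it is hypothetical (ActorFailure, DocumentActorSketch, registerFailureHandler, simulateCrash); no such standard library API exists, and a real runtime would trigger the handlers itself when a fail-fast trap occurs.

```swift
enum ActorFailure: Error { case crashed }

// Hypothetical, single-threaded simulation of a "reliable" document actor.
final class DocumentActorSketch {
    private var text = ""
    private var hasCrashed = false
    private var failureHandlers: [(String) -> Void] = []

    // Registered failure handler: recovery logic lives in one place.
    func registerFailureHandler(_ handler: @escaping (String) -> Void) {
        failureHandlers.append(handler)
    }

    // One-way message: silently dropped once the actor has crashed.
    func append(_ more: String) {
        guard !hasCrashed else { return }
        text += more
    }

    // Value-returning message: must throw so that waiters are not stuck forever.
    func currentLength() throws -> Int {
        guard !hasCrashed else { throw ActorFailure.crashed }
        return text.count
    }

    // Stands in for the runtime noticing that the actor trapped.
    func simulateCrash() {
        hasCrashed = true
        let lastKnownText = text
        failureHandlers.forEach { $0(lastKnownText) }
    }
}

let document = DocumentActorSketch()
var sideLocation: String? = nil            // stands in for a recovery file
document.registerFailureHandler { sideLocation = $0 }

document.append("hello")
let lengthBeforeCrash = try? document.currentLength()
document.simulateCrash()
document.append(" world")                  // dropped: the actor is gone
let lengthAfterCrash = try? document.currentLength()
```

The caller that only sends one-way messages never has to mention the failure, while the caller that waits for a value is forced to handle it.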

That said, both approaches are feasible and should be explored in more detail.

Alternate design: An alternate approach is to make all actors be "reliable" actors, by making the additional constraints a simple part of the actor model. This reduces the number of choices a Swift programmer gets-to/has-to make. If the async/await model ends up making async imply throwing, then this is probably the right direction, because the await on a value returning method would be implicitly a try marker as well.

Reliability runtime model

Besides the high level semantic model that the programmer faces, there are also questions about what the runtime model is. When an actor crashes:

  • What state is its memory left in?
  • How well can the process clean up from the failure?
  • Do we attempt to release memory and other resources (like file descriptors) managed by that actor?

There are multiple possible designs, but I advocate for a design where no cleanup is performed: if an actor crashes, the runtime propagates that error to other actors and runs any recovery handlers (as described in the previous section), but it should not attempt to further clean up the resources owned by the actor.

There are a number of reasons for this, but the most important is that the failed actor just violated its own consistency with whatever invalid operation it attempted to perform. At this point, it may have started a transaction but not finished it, or may be in any other sort of inconsistent or undefined state. Given the high likelihood for internal inconsistency, it is probable that the high-level invariants of various classes aren't intact, which means it isn't safe to run the deinit-ializers for the classes.

Beyond the semantic problems we face, there are also practical complexity and efficiency issues at stake: it takes code and metadata to be able to unwind the actor's stack and release active resources. This code and metadata takes space in the application, and it also takes time at compile time to generate it. As such, the choice to provide a model that attempted to recover from these sorts of failures would mean burning significant code size and compile time for something that isn't supposed to happen.

A final (and admittedly weak) reason for this approach is that a "too clean" cleanup runs the risk that programmers will start treating fail-fast conditions as a soft error that doesn't need to be handled with super-urgency. We really do want these bugs to be found and fixed in order to achieve the high reliability software systems that we seek.

Part 4: Improving system architecture

As described in the motivation section, a single application process runs in the context of a larger system: one that often involves multiple processes (e.g. an app and an XPC daemon) communicating through IPC, clients and servers communicating through networks, and servers communicating with each other in "the cloud" (using JSON, protobufs, GRPC, etc...). The points of similarity across all of these are that they mostly consist of independent tasks that communicate with each other by sending structured data using asynchronous message sends, and that they cannot practically share mutable state. This is starting to sound familiar.

That said, there are differences as well, and attempting to paper over them (as was done in the older Objective-C "Distributed Objects" system) leads to serious problems:

  • Clients and servers are often written by different entities, which means that APIs must be able to evolve independently. Swift is already great at this.
  • Networks introduce new failure modes that the original API almost certainly did not anticipate. This is covered by "reliable actors" described above.
  • Data in messages must be known-to-be Codable.
  • Latency is much higher to remote systems, which can impact API design because too-fine-grained APIs perform poorly.

In order to align with the goals of Swift, we cannot sweep these issues under the rug: we want to make the development process fast, but "getting something up and running" isn't the goal: it really needs to work - even in the failure cases.

Design sketch for interprocess and distributed compute

The actor model is a well-known solution in this space, and has been deployed successfully in less-mainstream languages like Erlang. Bringing the ideas to Swift just requires that we make sure it fits cleanly into the existing design, taking advantage of the characteristics of Swift and ensuring that it stays true to the principles that guide it.

One of these principles is the concept of progressive disclosure of complexity: a Swift developer shouldn't have to worry about IPC or distributed compute if they don't care about it. This means that actors should opt-in through a new declaration modifier, aligning with the ultimate design of the actor model itself, i.e., one of:

  distributed actor MyDistributedCache { ... }
  distributed actor class MyDistributedCache { ... }

Because it has done this, the actor is now subject to two additional requirements.

  • The actor must fulfill the requirements of a reliable actor, since a distributed actor is a further refinement of a reliable actor. This means that all value returning actor methods must throw, for example.
  • Arguments and results of actor methods must conform to Codable.
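The Codable requirement is what lets a message cross a process or network boundary. The sketch below round-trips a hypothetical CacheRequest through JSON with Foundation's JSONEncoder and JSONDecoder; the encodeMessage and decodeMessage helpers are assumptions standing in for whatever the distributed actor runtime would do, and JSON is just one possible wire format.

```swift
import Foundation

// Hypothetical argument type for a distributed actor method.
struct CacheRequest: Codable, Equatable {
    var key: String
    var maxAgeSeconds: Int
}

// Sending side: turn the argument into bytes for the wire.
func encodeMessage<T: Encodable>(_ message: T) throws -> Data {
    return try JSONEncoder().encode(message)
}

// Receiving side: rebuild an independent value from those bytes.
func decodeMessage<T: Decodable>(_ type: T.Type, from data: Data) throws -> T {
    return try JSONDecoder().decode(type, from: data)
}

let request = CacheRequest(key: "user/42", maxAgeSeconds: 60)
let wireBytes = try! encodeMessage(request)
let received = try! decodeMessage(CacheRequest.self, from: wireBytes)
```

Both helpers throw, which lines up with the first requirement: a remote call can fail in ways a local one cannot, so the caller has to be prepared for it.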

In addition, the author of the actor should consider whether the actor methods make sense in a distributed setting, given the increased latency that may be faced. Using coarse grain APIs could be a significant performance win.

With this done, the developer can write their actor like normal: no change of language or tools, no change of APIs, no massive new conceptual shifts. This is true regardless of whether you're talking to a cloud service endpoint over JSON or an optimized API using protobufs and/or GRPC. There are very few cracks that appear in the model, and the ones that do have pretty obvious reasons: code that mutates global state won't have that visible across the entire application architecture, files created in the file system will work in an IPC context, but not a distributed one, etc.

The app developer can now put their actor in a package, share it between their app and their service. The major change in code is at the allocation site of MyDistributedCache, which will now need to use an API to create the actor in another process instead of calling its initializer directly. If you want to start using a standard cloud API, you should be able to import a package that vends that API as an actor interface, allowing you to completely eliminate your code that slings around JSON blobs.

New APIs required

Most of the hard work in getting this to work is on the framework side. For example, it would be interesting to start building things like:

  • New APIs need to be built to start actors in interesting places: IPC contexts, cloud providers, etc. These APIs should be consistent with each other.
  • The underlying runtime needs to be built, which handles the serialization, handshaking, distributed reference counting of actors, etc.
  • To optimize IPC communications with shared memory (mmaps), introduce a new protocol that refines ValueSemantical. Heavyweight types can then opt into using it where it makes sense.
  • A DSL that describes cloud APIs should be built (or an existing one adopted) to autogenerate the boilerplate necessary to vend an actor API for a cloud service.

In any case, there is a bunch of work to do here, and it will take multiple years to prototype, build, iterate, and perfect it. It will be a beautiful day when we get here though.

Part 5: The crazy and brilliant future

Looking even farther down the road, there are even more opportunities to eliminate accidental complexity by removing arbitrary differences in our language, tools, and APIs. You can find these by looking for places with asynchronous communications patterns, message sending and event-driven models, and places where shared mutable state doesn't work well.

For example, GPU compute and DSP accelerators share all of these characteristics: the CPU talks to the GPU through asynchronous commands (e.g. sent over DMA requests and interrupts). It could make sense to use a subset of Swift code (with new APIs for GPU specific operations like texture fetches) for GPU compute tasks.

Another place to look is event-driven applications like interrupt handlers in embedded systems, or asynchronous signals in Unix. If a Swift script wants to sign up for notifications about SIGWINCH, for example, it should be easy to do this by registering your actor and implementing the right method.

Going further, a model like this begs for re-evaluation of some long-held debates in the software community, such as the divide between microkernels and monolithic kernels. Microkernels are generally considered to be academically better (e.g. due to memory isolation of different pieces, independent development of drivers from the kernel core, etc), but monolithic kernels tend to be more pragmatic (e.g. more efficient). The proposed model allows some really interesting hybrid approaches, and allows subsystems to be moved "in process" of the main kernel when efficiency is needed, or pushed "out of process" when they are untrusted or when reliability is paramount, all without rewriting tons of code to achieve it. Swift's focus on stable APIs and API resilience also encourages and enables a split between the core kernel and driver development.

In any case, there is a lot of opportunity to make the software world better, but it is also a long path to carefully design and build each piece in a deliberate and intentional way. Let's take one step at a time, ensuring that each is as good as we can make it.

Learning from other concurrency designs

When designing a concurrency system for Swift, we should look at the designs of other languages to learn from them and ensure we have the best possible system. There are thousands of different programming languages, but most have very small communities, which makes it hard to draw practical lessons out from those communities. Here we look at a few different systems, focusing on how their concurrency design works, ignoring syntactic and other unrelated aspects of their design.

Pony

Perhaps the most relevant active research language is the Pony programming language. It is actor-based and combines actors with other techniques to provide a type-safe, memory-safe, deadlock-free, and datarace-free programming model. The biggest semantic difference between the Pony design and the Swift design is that Pony invests a lot of design complexity into providing reference capabilities, which impose a steep learning curve. In contrast, the model proposed here builds on Swift's mature system of value semantics. If transferring object graphs between actors (in a guaranteed memory safe way) becomes important in the future, we can investigate expanding the Swift Ownership Model to cover more of these use-cases.

Akka Actors in Scala

Akka is a framework written in the Scala programming language, whose mission is to "Build powerful reactive, concurrent, and distributed applications more easily". The key to this is their well developed Akka actor system, which is the principal abstraction that developers use to realize these goals (and it, in turn, was heavily influenced by Erlang). One of the great things about Akka is that it is mature and widely used by a lot of different organizations and people. This means we can learn from its design, from the design patterns the community has explored, and from experience reports describing how well it works in practice.

The Akka design shares a lot of similarities with the design proposed here, because it is an implementation of the same actor model. It is built on futures and asynchronous message sends, each actor is a unit of concurrency, there are well-known patterns for when and how actors should communicate, and Akka supports easy distributed computation (which they call "location transparency").

One difference between Akka and the model described here is that Akka is a library feature, not a language feature. This means that it can't provide additional type system and safety features that the model we describe does. For example, it is possible to accidentally share mutable state which leads to bugs and erosion of the model. Their message loops are also manually written loops with pattern matching, instead of being automatically dispatched to actor methods - this leads to somewhat more boilerplate. Akka actor messages are untyped (marshalled through an Any), which can lead to surprising bugs and difficulty reasoning about what the API of an actor is (though the Akka Typed research project is exploring ways to fix this). Beyond that though, the two models are very comparable - and, no, this is not an accident.
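To make the contrast concrete, here is a small sketch written entirely in Swift (it is not Akka code). `UntypedCounter` imitates the hand-written receive style, with messages arriving as `Any` and matched by pattern; it is a plain class that only shows the dispatch shape and makes no attempt at thread safety. `TypedCounter` uses the `actor` syntax from Swift 5.5 and later, where each message is a method. Both type names are hypothetical.

```swift
// Untyped style: every message arrives as Any and is matched by hand.
final class UntypedCounter {
  private(set) var total = 0
  private(set) var unhandled = 0

  func receive(_ message: Any) {
    switch message {
    case let amount as Int:
      total += amount
    case let command as String where command == "reset":
      total = 0
    default:
      // A mistake such as sending a Double is only noticed at run time.
      unhandled += 1
    }
  }
}

// Typed style: each message is a method, so the compiler checks both the
// message name and its argument types.
actor TypedCounter {
  private(set) var total = 0

  func add(_ amount: Int) { total += amount }
  func reset() { total = 0 }
}
```

With the typed version, `await counter.add(2.5)` is rejected at compile time, whereas `receive(2.5)` compiles and silently falls into the `default` case.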

Keeping these differences in mind, we can learn a lot about how well the model works in practice by reading the numerous blog posts and other documents available online.

Further, it is likely that some members of the Swift community have encountered this model; it would be great if they shared their experiences, both positive and negative.

Go

The Go programming language supports a first-class approach to writing concurrent programs based on goroutines and (bidirectional) channels. This model has been very popular in the Go community and directly reflects many of the core values of the Go language, including simplicity and preference for programming with low levels of abstraction. I have no evidence that this is the case, but I speculate that this model was influenced by the domains that Go thrives in: the Go model of channels and communicating independent goroutines almost directly reflects how servers communicate over network connections (including core operations like select).

The proposed Swift design is at a higher level of abstraction than the Go model, but directly reflects one of the most common patterns seen in Go: a goroutine whose body is an infinite loop over a channel, decoding messages sent to the channel and acting on them. Perhaps the simplest example is this Go code (adapted from this blog post):

func printer(c chan string) {
  for {
    msg := <- c
    fmt.Println(msg)
  }
}

... is basically analogous to this proposed Swift code:

actor Printer {
  actor func print(message: String) {
    // Qualify the call so it reaches the standard library's print,
    // not this method.
    Swift.print(message)
  }
}

The Swift design is more declarative than the Go code, but doesn't show many advantages or disadvantages in something this small. However, with more realistic examples, the advantages of the higher-level declarative approach become apparent. For example, it is common for goroutines to listen on multiple channels, one for each message they respond to. This example (borrowed from this blog post) is fairly typical:

// Worker represents the worker that executes the job
type Worker struct {
  WorkerPool  chan chan Job
  JobChannel  chan Job
  quit        chan bool
}

func NewWorker(workerPool chan chan Job) Worker {
  return Worker{
    WorkerPool: workerPool, // keep the pool passed in by the caller
    JobChannel: make(chan Job),
    quit:       make(chan bool)}
}

func (w Worker) Start() {
  go func() {
    for {
      select {
      case job := <-w.JobChannel:
        // ...
      case <-w.quit:
        // ...
      }
    }
  }()
}

// Stop signals the worker to stop listening for work requests.
func (w Worker) Stop() {
  go func() {
    w.quit <- true
  }()
}

This sort of thing is much more naturally expressed in our proposed model:

actor Worker {
  // `do` is a reserved word in Swift, so it is escaped with backticks.
  actor func `do`(job: Job) {
    // ...
  }

  actor func stop() {
    // ...
  }
}
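For comparison, here is a runnable approximation using the `actor` syntax that ships in Swift 5.5 and later, which differs from the `actor func` spelling proposed here. The `Job` type, the `completed` log, and the `isStopped` flag are hypothetical placeholders, and the job method is named `perform` to sidestep the reserved word `do`.

```swift
// Hypothetical job type; a real one would carry an actual payload.
struct Job: Sendable {
  let id: Int
}

actor Worker {
  private(set) var completed: [Int] = []
  private(set) var isStopped = false

  // Plays the role of the JobChannel case in the Go select loop.
  func perform(_ job: Job) {
    guard !isStopped else { return }  // ignore work after a stop request
    completed.append(job.id)
  }

  // Plays the role of the quit channel.
  func stop() {
    isStopped = true
  }
}
```

In shipping Swift, synchronous actor methods like these run one at a time per instance, so a `stop()` message is only handled after any `perform(_:)` call already in progress has finished, much as the Go `select` loop handles one case per iteration.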

That said, there are advantages and other tradeoffs to the Go model as well. Go builds on CSP, which allows more ad hoc structures of communication. For example, because goroutines can listen to multiple channels it is occasionally easier to set up some (advanced) communication patterns. Synchronous messages to a channel can only be completely sent if there is something listening and waiting for them, which can lead to performance advantages (and some disadvantages). Go doesn't attempt to provide any sort of memory safety or data isolation, so goroutines have the usual assortment of mutexes and other APIs to use, and are subject to standard bugs like deadlocks and data races. Races can even break memory safety.

I think that the most important thing the Swift community can learn from Go's concurrency model is the huge benefit that comes from a highly scalable runtime model. It is common to have hundreds of thousands or even a million goroutines running around in a server. The ability to stop worrying about "running out of threads" is huge, and is one of the key decisions that contributed to the rise of Go in the cloud.

The other lesson is that (while it is important to have a "best default" solution to reach for in the world of concurrency) we shouldn't overly restrict the patterns that developers are allowed to express. This is a key reason why the async/await design is independent of futures or any other abstraction. A channel library in Swift will be as efficient as the one in Go, and if shared mutable state and channels are the best solution to some specific problem, then we should embrace that fact, not hide from it. That said, I expect these cases to be very rare :-)
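As one possible illustration of channel-style code on top of async/await, the sketch below uses `AsyncStream` from the Swift standard library as a one-way channel. It assumes Swift 5.9 or later for `AsyncStream.makeStream(of:)`. The function names are hypothetical, and unlike an unbuffered Go channel the stream buffers sent values, so sends never wait for a receiver.

```swift
// Consumer: the analogue of a goroutine looping over a channel. The loop
// ends when the sending side finishes the stream.
func collect(from channel: AsyncStream<String>) async -> [String] {
  var seen: [String] = []
  for await message in channel {
    print(message)
    seen.append(message)
  }
  return seen
}

// Producer: creates the channel, starts the consumer in its own task,
// sends two messages, then closes the channel.
func runChannelDemo() async -> [String] {
  let (channel, sender) = AsyncStream.makeStream(of: String.self)
  let consumer = Task { await collect(from: channel) }
  sender.yield("hello")
  sender.yield("world")
  sender.finish()  // comparable to close(c) in Go
  return await consumer.value
}
```

Nothing here depends on futures or actors: the consumer is ordinary async code, which is the kind of independence between async/await and higher-level abstractions described above.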

Rust

Rust's approach to concurrency builds on the strengths of its ownership system to allow library-based concurrency patterns to be built on top. Rust supports message passing (through channels), but also supports locks and other typical abstractions for shared mutable state. Rust's approaches are well suited for systems programmers, who are the primary target audience of Rust.

On the positive side, the Rust design provides a lot of flexibility, a wide range of different concurrency primitives to choose from, and familiar abstractions for C++ programmers.

On the downside, their ownership model has a higher learning curve than the design described here, their abstractions are typically very low level (great for systems programmers, but not as helpful for higher levels), and they don't provide much guidance for programmers about which abstractions to choose, how to structure an application, etc. Rust also doesn't provide an obvious model to scale into distributed applications.

That said, improving synchronization for Swift systems programmers will be a goal once the basics of the Swift Ownership Model come together. When that happens, it makes sense to take another look at the Rust abstractions to see which would make sense to bring over to Swift.

@lattner Another document that could be helpful to take into account possible implementation approach https://www.microsoft.com/en-us/research/wp-content/uploads/2017/05/asynceffects-msr-tr-2017-21.pdf

andlabs commented Nov 7, 2017

"The introduction of async/await into the language is a great opportunity to introduce more asynchronous APIs to Cocoa and perhaps even entire new framework extensions (like a revised asynchronous file I/O API)." I'm not sure what new APIs you would need, since we already have I/O via runloop event sources AND dispatch queue objects, both of which already encapsulate the ideas here... (Or in other words, I'm not sure why they would need to be in Cocoa, and not merely Swift-specific.)

Under "This sort of thing is much more naturally expressed in our proposal model:", is it guaranteed that for a given instance of Worker, that its do() and stop() messages will NOT be executed concurrently? Because the Go code that snippet tries to rewrite has that guarantee: the stop signal is sent by another goroutine (which I should also mention are like green threads and are multiplexed onto a limited number of OS threads), and will only be received by the "actor" (Start()) when it finishes processing any work already in progress. You can still have multiple instances that run independently of each other, of course.

In my experience, only providing async-await APIs defies the idea that "The other lesson is that (while it is important to have a "best default" solution to reach for in the world of concurrency) we shouldn't overly restrict the patterns that developers are allowed to express.", but that's just me (and my battle scars from my previous job). (Another view on the same thought.) Over the past week or so, I have come to the realization that a) thinking of things as "synchronous" or "asynchronous" is limiting and that we should probably think of things as a matrix of (on-demand, passive) and (blocking, non-blocking); b) channels (data-based concurrency) and dispatch queues (work-based concurrency) can be isomorphic. I need to write about these sometime...
