My notebook

2016-10-25 - Async fibers talking points

  • Demonstrate example first:
    • Initial resource loading is done in parallel. Completely safe AND super easy to use.
    • Multiple behaviors run in parallel each frame. API isn't final, just proof of concept.
  • Go low-level, work our way up through the API.
  • Start with fibers:
    • Basic overview of fibers:
      • Similar to threads. Each has own stack, can be suspended and resumed.
      • For threads, the system manages all threads and suspends and resumes them to give them all time on the CPU. Usually there's a priority system and other ways to specify how to divide that time, but the OS is the only one that can suspend/resume threads.
      • Fibers are run on threads, but you can specify which fiber is active on which thread at any given time.
      • Fibers can be suspended from one thread and resumed on another.
    • Advantages of fibers:
      • Switching fibers on a thread doesn't require a full context switch, and so can be substantially cheaper than when the OS switches threads.
      • We control when the fibers switch, so we can suspend a fiber if it's waiting on other work to complete on another thread. We never need to block.
    • Disadvantages of fibers:
      • Can't delete fibers on Windows, so they can't terminate. Fortunately, Rust's diverging functions are perfect for this.
      • Fibers have strange interactions with Send. It's possible to create a value of a !Send type within a fiber, suspend that fiber, then resume it on another thread, violating the !Send requirements for the type. There's no way, as far as I know, to convince the compiler to prevent this from happening.
      • We'll get into the issues around Send unsoundness, and why it's not as big a deal as it seems, later in the talk.
    • Important implementation detail: Resuming a fiber consumes it and returns the fiber that was suspended.
      • This is important for safety. Trying to resume a fiber on one thread that's already running on another is a recipe for disaster. This API design guarantees at compile time that you'll never try to resume a fiber that's actively running.
      • Rust's type system makes this possible by tracking moves at compile time.
  • Scheduler:
    • Scheduler takes arbitrary work, defined as closures, divides it up among worker threads, and allows work to await the result of other asynchronous work. It's the foundation of a fully asynchronous runtime.
    • 2 core primitives: scheduler::start(), and Async::await().
      • scheduler::start() takes a closure, which can return any Send type, schedules that work to be run on one of the worker threads, and returns an Async<T> that acts as a handle to the eventual result of that work.
      • Async::await() suspends the current fiber until the work represented by the Async<T> finishes, returning the value returned by the closure passed into scheduler::start().
    • Together, one unit of work can spawn several others, allow them to run in parallel, then yield its time on the CPU while it waits for them to complete (a rough sketch of this usage pattern appears after this list).
    • This roughly approximates the behavior of a yield or await keyword in the language, since asynchronous code now looks like synchronous code.
    • Substantially simpler to use than something like futures-rs. Control flow is handled as it would be anywhere else in the language, no need for combinators or anything like that.
    • Other nifty things that Async<T> can do:
      • Async::await() consumes the Async<T> object, ensuring that you can only get the result of some work once.
      • Async<T> has a lifetime parameter. This allows the closure running the work to borrow data on the spawning stack frame while ensuring that the data doesn't go out of scope. Async implements Drop, which awaits the work if it hasn't yet completed, ensuring that borrowed data remains valid.
      • You can get the underlying work ID from an Async. WorkId doesn't have any lifetime or type parameters, it's simply a unique ID that can be used to await the completion of that unit of work. This allows multiple fibers to await the completion of a single unit of work.
      • Async::forget() allows you to ignore the result of a unit of work without suspending the current fiber. It's only implemented for Async<'static> to preserve soundness.
    • What's next:
      • Refine the API. It'll take some hammering to determine the ideal API for the scheduler.
      • Priorities. There's no way with the current API to prioritize some work units over others. Being able to do so is often a requirement for async code.
  • Unsoundness issues:
    • Forgetting an Async<'a> (non-static Async<T>) is unsound, as it relies on destructors being run in order to avoid use-after-free.
      • Basically the same issue as the scoped threads kerfuffle, but might need a different solution to be ergonomic.
      • Low priority. Basically it's a solved problem, and we can always remove the lifetime parameter from Async and get full safety back.
    • Fibers allow smuggling !Send types between threads.
      • Kind of a big deal.
      • Pretty much impossible to avoid (with current Rust). We can use trait bounds to ensure a closure doesn't capture any !Send data, but we can't prevent it from creating such data then calling Async::await(), which could result in the fiber resuming on a different thread.
      • But maybe not a big deal? Key examples of !Send, such as Rc<T> and &RefCell<T>, are only unsafe if they go to multiple threads at once. E.g. the !Send bound derives from a !Sync bound. In these cases the API provided by the scheduler is completely safe.
      • But some system primitives have strong opinions about what thread(s) they can be used from, e.g. locking a mutex on one thread then unlocking it on another will fail to unlock the mutex, causing any future attempts to acquire the lock to deadlock. Other system APIs can fail or exhibit aberrant behavior when used from the wrong thread.
      • Currently, Rust doesn't have any way to talk about this sort of issue, so it can't detect it at compile time the way we'd like. Something like being able to say "when this function is called there can't be any !Send data on the stack" would probably do the trick.
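
To make the scheduler primitives concrete, here's a minimal stand-in for scheduler::start() and Async::await() that models only the observable semantics using std::thread and channels; the real scheduler runs work on fiber-backed worker threads and suspends the awaiting fiber instead of blocking, and the names start() and await_result() here are placeholders for illustration:

    use std::sync::mpsc;
    use std::thread;

    // Stand-in for the Async<T> handle: awaiting consumes it, so the result
    // of a unit of work can only be retrieved once.
    struct Async<T> {
        rx: mpsc::Receiver<T>,
    }

    impl<T> Async<T> {
        fn await_result(self) -> T {
            // The real scheduler suspends the current fiber here instead of
            // blocking the thread; blocking only models the behavior.
            self.rx.recv().unwrap()
        }
    }

    // Stand-in for scheduler::start(): run the closure on a worker and return
    // a handle to its eventual result.
    fn start<T, F>(work: F) -> Async<T>
    where
        T: Send + 'static,
        F: FnOnce() -> T + Send + 'static,
    {
        let (tx, rx) = mpsc::channel();
        thread::spawn(move || {
            let _ = tx.send(work());
        });
        Async { rx }
    }

    fn main() {
        // One unit of work spawns several others, lets them run in parallel,
        // then waits for their results -- async code that reads synchronously.
        let a = start(|| 1 + 1);
        let b = start(|| 2 * 21);
        println!("a = {}, b = {}", a.await_result(), b.await_result());
    }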

2016-10-21

Notes about scheduler

  • Fibers:
    • Being able to consume Fiber when one is resumed is super nice and makes the API way safer (see the toy sketch after this list).
    • Being able to mark functions as diverging (i.e. never returning) is also super powerful and helps enforce correctness (instead of safety).
      • It would be nice if the compiler could detect diverging functions without putting it in the type signature, but that's a minor gripe.
    • Combining those two means we can provide a completely safe fibers API, which I didn't even think was possible initially.
  • By using Async<'a, T> to represent an async operation it's possible to safely let the async closure borrow data and guarantee at compile time that the borrow won't outlive the data.
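
As a toy illustration of the consume-on-resume point, here's a sketch where Fiber is just a marker struct and resume() is a stub rather than a real context switch; it only shows how taking the fiber by value stops you from resuming a fiber that's already running:

    // Toy model only: a real Fiber would wrap an OS fiber handle.
    struct Fiber {
        id: usize,
    }

    impl Fiber {
        // Switching to `self` consumes it by value; when this fiber is later
        // suspended, the fiber that was suspended to run it is handed back.
        fn resume(self) -> Fiber {
            // A real implementation would perform the context switch here.
            Fiber { id: self.id }
        }
    }

    fn main() {
        let fiber = Fiber { id: 0 };
        let previous = fiber.resume();
        // `fiber` was moved, so resuming it again is a compile error:
        // fiber.resume(); // error[E0382]: use of moved value: `fiber`
        println!("got back fiber {}", previous.id);
    }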

2016-09-25

I think the only way to safely move nodes around is to lock them. We can optimize this somewhat by spin-waiting other threads, and I think we can assume at most 2 threads will try to access node data at a time. I expect that the vast majority of access to node data will come from the Transform handle, so there will be no contention for locks. In the purely single-threaded case the lock and memory fences will make performance sub-optimal (and will reduce the amount of optimization the compiler can do), but I think we could provide an escape hatch that allows client code to manually acquire and release the lock once for bulk transformations.

The idea is to use an atomic boolean as a lock, the same way we do with AtomicArray, owned by the TransformInner that is shared by both the handle and the node data. Each read/write, whether it is going through the handle or coming from the node being moved, acquires the lock while reading/writing node data. Now that I think about it, having to acquire a lock just to read node data feels heinous, but I don't have a better solution at the moment.
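
A minimal sketch of that atomic-boolean lock, assuming it would live inside the shared TransformInner; NodeLock and its method names are illustrative, not the real types:

    use std::sync::atomic::{AtomicBool, Ordering};

    // Illustrative lock guarding one node's transform data.
    struct NodeLock {
        locked: AtomicBool,
    }

    impl NodeLock {
        fn new() -> NodeLock {
            NodeLock { locked: AtomicBool::new(false) }
        }

        // Spin-wait until this caller flips the flag from false to true.
        fn acquire(&self) {
            while self
                .locked
                .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
                .is_err()
            {
                std::hint::spin_loop();
            }
        }

        fn release(&self) {
            self.locked.store(false, Ordering::Release);
        }
    }

    fn main() {
        let lock = NodeLock::new();
        lock.acquire();
        // ... read or write node data while the lock is held ...
        lock.release();
    }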

What I'm more concerned about is making sure the scene graph can't deadlock. I want to know if it's possible to implement it in such a way that e.g. "one thread acquires the lock for node A and wants B, another acquires B and wants A" can never happen. Let's start with a simple example from earlier:

0: (A, None) (B, None) (C, None)
1:

But now let's say that two threads concurrently try to perform "make A the child of B" and "make B the child of C". Let's say thread_0 is trying to do the former and thread_1 is trying to do the latter.

thread_0's steps:

  1. Acquire lock on A.
  2. Copy A to row 1.
  3. Update A's parent pointer/index to point to B (note that B might have moved since we started).
  4. Update pointer in A's TransformInner.
  5. Release lock on A. (???)
  6. Swap-remove A from row 0.
    1. Acquire Last's lock.
    2. Copy Last to A's old bucket.
    3. Update pointer in Last's TransformInner.
    4. Update parent pointer for all children. (???)
    5. Release lock on Last. (???)
    6. Pop row 0.

thread_1's steps:

  1. Acquire lock on B.
  2. Copy B to row 1.
  3. Update B's parent pointer/index to point to C (note that C might have moved since we started).
  4. Update pointer in B's TransformInner.
  5. Release lock on B. (???)
  6. Swap-remove B from row 0.
    1. Acquire Last's lock.
    2. Copy Last to B's old bucket.
    3. Update pointer in Last's TransformInner.
    4. Update parent pointer for all children. (???)
    5. Release lock on Last. (???)
    6. Pop row 0.

Note that the steps as listed aren't necessarily in optimal order. A few of the steps are marked with (???), indicating that I'm either not sure how to do that operation or not sure if that operation is even safe to do. When moving a node down to its new row (ugh, I didn't even think of the case where you move a node up, though I don't know if it matters), at what point is it safe to release the lock on that node? Once the node has been moved to its new row and all of its pointers have been fixed up, we're in theory done with that node. But what if another thread wants to come in and move the node again before the rest of the cleanup happens? If we wait to release the lock until all of the work for that node is done, including cleanup that doesn't touch that node's data, then we can be more confident that the node's data is in a correct state before another thread can observe it.

Let's look at thread_0's steps in isolation first and vet them for correctness, then we can worry about deadlocks. First thing to note is that there's one more step that wasn't originally listed: updating any child nodes to point to the correct location. I didn't list it because the example above doesn't give A any children, but in the general case thread_0 will have to check for children when doing the update.

That actually brings us to the general question: how do we move entire hierarchies of nodes? In theory a node hierarchy can be arbitrarily deep, so moving one node up or down may involve moving dozens or hundreds of nodes as well. Once concurrency is thrown in, we have a minor disaster on our hands.

When moving nodes down to a lower row, it probably makes sense to start at the deepest part of the hierarchy and work up to the root. This should minimize the number of times we need to move a node while making the transfer. To see why, let's look at an example case where we go root-down rather than the other way around:

0: (A, None) (B, None)
1: (C, 1)

To make B the child of A, we'd first move B down to row 1:

0: (A, None)
1: (C, ???), (B, 0)

Then we'd move C down to row 2:

0: (A, None)
1: (B, 0)
2: (C, 0)

Note that moving C required that we move B again. If we start by moving C first before B then we only have to move each node once:

0: (A, None) (B, None)
1: (C, 1)

0: (A, None) (B, None)
1:
2: (C, ???)

0: (A, None)
1: (B, 0)
2: (C, 0)

Writing out that diagram just now made me realize that while a parent node is moving, its child node is actually in an invalid state, because the child node is pointing to the wrong parent node. If we were to allow mutations to a node to access the parent node at any time (e.g. to allow transformations in world space) then we would have to acquire locks on all nodes in the sub-tree being moved, but we've opted to disallow such parent/child interaction, making this a non-issue for now.

2016-09-24

Actually, could we solve that with CAS again? It worked for popping because in that case we don't try to mutate the shared buffer, we only copy data out of the buffer and then confirm we copied the right data out, so there's no chance of data races. For pushing we need to ensure that only one thread has write access to the new bucket while the push is happening, so waiting to confirm until after writing means we could have already clobbered another thread's read/write. Incrementing the count before writing will guarantee that a concurrent push doesn't try to write into the same bucket, but it would allow another thread looking at the buffer to observe the partially-copied element.

A pop followed by a push isn't an issue because the pop is only a single atomic operation, so the subsequent push is free to immediately start writing to the newly-empty bucket.

A push followed by a pop is again quite complicated. If we increment-then-write when pushing, then the subsequent pop may try to copy an uninitialized element. If we write-then-increment, then we run back into the possibility of multiple threads trying to concurrently push into the same bucket.

After looking at some resources regarding lock-free data structures, the most useful of which being a paper on lock-free vectors for C++, it seems to me like it's not possible to get a completely lock-free array. At the very least we can avoid locking access to all elements while pushing and popping, but pushes and pops will likely have to be serialized in some way. I think that's acceptable for my purposes because I was already operating under the assumption that pushes and pops would be far less frequent than normal access to the elements of the vec.

I think the easiest starting point would be to use an atomic boolean flag as a "write lock", and have all threads trying to push/pop spin-wait until they acquire the lock.

The new problem is now with the scene graph and being able to concurrently move nodes around, e.g. for changing parent/child relationships. Creating nodes is easy since it only involves pushing a node onto the first row, and we've already handled concurrent pushes. But removing a node requires a swap-remove operation in order to maintain that the remaining elements are contiguous. Looking at our atomic array, swap remove is already somewhat difficult to manage (though using the same locking method may be sufficient). On top of that we also have to consider potentially concurrent access to the nodes while they're being moved. Such access would be hazardous, so managing it properly is important.

Let's take a look at the process of making one node the child of another. Let's say we have the following graph, with nodes represented as (node, parent index):

0: (A, None) (B, None) (C, None)
1:

We want to make A the child of B, which means we must remove it from row 0 and add it to row 1, and not allow any thread to observe invalid data. The first step is to remove A from row 0. In order to avoid holes we need to swap the last element in the array, in this case C, into A's old bucket. The end result should look like this:

0: (C, None) (B, None)
1: (A, 1)

The problem is we have to update the Transform handles for A and C, and we have to do so in a way that avoids data races. I think the first step is actually to copy A to row 1. Once that's done and we've set things up so that other threads won't attempt to observe the old copy, we only have to copy C into A's old bucket and then decrement the count; we never need to swap A into C's old bucket.

So the question now is how to move a node in memory safely.

2016-09-23

Atomic stable Vec:

  • Uses atomic counters internally to make it thread-safe.
  • Doesn't move elements in memory when it reallocates.
  • Safe to push/pop elements concurrently.
  • Safe to swap elements concurrently.

That last one is where things really get difficult. Let's look at each mutating operation and see what it takes to do it concurrently.

Pushing elements is fairly easy, at least when we're only talking about pushes. Simply increment the count before moving the element into the buffer and subsequent pushes will be coherent.

Pops are a bit more complicated because they require synchronization in order to be sound. At the very least, if we're cool with throwing away the popped data then it's fine to atomically decrement the count and leave it at that, no synchronization needed. Unfortunately, we'll almost always want to return the popped element, and in the case of a swap-remove we need to move both the last element and the removed element. In this case we must ensure that another thread doesn't try to push a new element onto the end of the stack while we're copying the popped element off, or we wind up with a data race and partially clobbered data.

The easy solution is to lock the vec while pushing or popping. Since we only need to hold the lock while copying an element out of the buffer, we can spin-lock threads that are waiting to push/pop other elements. We expect that the lock should only be held for a few cycles, so this should be the most efficient of any locking strategy. Also, it may be safe to do multiple pops concurrently, as long as no pushes are attempted until all pops finish. It may also be safe to access other elements of the vec while the lock is held as long as the count is decremented before the copy begins.

Another theoretical solution is to use compare-and-swap semantics to avoid having to lock the buffer at all, though at this point I'm still kicking the idea around and am not 100% sure it even makes sense. The idea is that decrementing the count is atomic and safe to do if you don't try to access the popped element's data in the buffer after doing so (since at any point after decrementing the counter another thread could swoop in and start writing to that memory). So we instead copy our popped element out of the buffer, then make sure the count is still the same before decrementing. If it's not then we're trying to pop the wrong element and have to try again, this time attempting to pop the new last element. This repeats until we successfully pop an element.

The problem is that I don't fully understand all the implications at this point. In no particular order:

  • Does it always make sense to pop the new last element? In some cases the client code may be trying to pop a specific element, so if another thread pops it first the client code might rather just be told that the pop failed instead of trying to pop a different element.
  • What about swap-remove? We have no way of preventing access to the element being removed while it's being overwritten in the buffer, so... I dunno. The whole data structure is super unsafe with regards to parallel mutation since we have no way of enforcing mutually exclusive borrows. Otherwise I think swap-remove should just work if compare-and-swap works for popping.

Let's look at each of the possible mutations that can be made and see what the implications of each strategy would be:

  • Push -> push
  • Pop -> pop
  • Push -> pop
  • Pop -> push

And then also swap-remove is in there, though I don't know if it's functionally different than a swap followed by a pop (and we're not bothering to synchronize swaps; client code has to manage that).

As an aside, I initially listed the requirements of "stable" and "atomic", with a "stable" vec basically being a linked list of fixed-size buffers. But it occurs to me that this could be achieved with something like LinkedList<Array<T>> instead of StableArray<T>, so my focus is now on some reasonable implementation of a lock-free fixed-size array.

Anyway, a pop followed by a pop is trivial to make thread-safe because a pop is simply decrementing an atomic counter. Note that this ignores destructors and copying the popped element out of the array. Ultimately running the destructor is less important than copying the element out of the buffer; we can always mem::forget() the element in the buffer and then drop the copied element to ensure that destructors can be run without impacting concurrency. As such, what matters is that we be able to copy the last element in the buffer onto the stack before decrementing the count. This is where the compare-and-swap semantics come in: we read the current count, copy the last element, then compare-and-swap to decrement the count. If the compare fails then we mem::forget() the copied element and then either try again or return an error (though that decision is one of ergonomics, not safety, and likely both will be provided as different methods).
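
Here's a sketch of that compare-and-swap pop, simplified to a single-threaded demo over a pre-filled buffer of Copy elements so the mem::forget()/ptr::read bookkeeping for non-Copy types can be left out; AtomicArray here is just an illustrative name, and the retry-vs-error policy is hard-coded to retry:

    use std::sync::atomic::{AtomicUsize, Ordering};

    // Illustrative fixed-capacity array with an atomic element count.
    struct AtomicArray {
        count: AtomicUsize,
        buffer: Vec<u32>, // stands in for the fixed-size element buffer
    }

    impl AtomicArray {
        fn pop(&self) -> Option<u32> {
            loop {
                let count = self.count.load(Ordering::Acquire);
                if count == 0 {
                    return None;
                }
                // Copy the (current) last element out of the buffer first...
                let value = self.buffer[count - 1];
                // ...then confirm the count hasn't changed before decrementing.
                // If another thread popped in the meantime, the compare fails
                // and we retry against the new last element.
                if self
                    .count
                    .compare_exchange(count, count - 1, Ordering::AcqRel, Ordering::Relaxed)
                    .is_ok()
                {
                    return Some(value);
                }
            }
        }
    }

    fn main() {
        let array = AtomicArray {
            count: AtomicUsize::new(3),
            buffer: vec![10, 20, 30],
        };
        assert_eq!(array.pop(), Some(30));
        assert_eq!(array.pop(), Some(20));
    }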

A push followed by another push is similarly easy. The actual "push" itself is done by incrementing the atomic counter, so if we increment before copying the element into the buffer, then the subsequent push will not try to clobber the previous in-progress one. The only problem is that incrementing the counter before copying the data in means that it's possible another thread will then try to observe that bucket in some way (i.e. read or write the element) before it has been fully copied into the buffer. Incrementing the counter after doing the copy prevents that issue, but reintroduces the problem of multiple threads trying to push an element into the same buffer at once.

2016-09-22

Transforms form a scene graph. Transforms allow you to get their parent and children. This allows you to do garbage like:

    fn super_unsafe(transform: &mut Transform) {
        let alias = &mut transform.parent().children[0]; // `alias` now aliases `transform`, which is evil.
    }

To avoid evil, we have to change something:

  • Transforms don't allow access to parent/children.
    • This basically makes the whole "scene graph" concept useless to game developers.
  • Transforms don't provide mutable access, they're more like RefCell.
    • This is an ergonomics and performance hazard, since RefCell both adds runtime overhead to every transform and makes your code crashy if you try to double-borrow transforms.
    • Managing this in a parallel context is even harder, as mutex-like blocking when the transform is already borrowed will deadlock if the same transform is borrowed twice in a single thread.

Currently my vote is to take the first approach -- remove the ability to query the scene graph. My (admittedly limited) experience with game development leads me to believe that it's fairly rare to need the ability to make general queries of the scene graph. In most cases, the developer knows a priori what the structure of the scene will be and only needs the ability to mutate a few local nodes in the graph. If that is the case, then having Transform represent ownership of the node and provide mutable access to its data would be more useful.

This does raise the question of how the scene graph is initially constructed. In the simplest case where the graph is manually constructed programmatically:

    fn make_simple_graph() {
        let mut parent = Transform::new();
        let mut child = Transform::with_parent(&mut parent);

        // Now we know the structure of our scene graph and have safe, mutable access to all (both) of
        // the nodes in the graph.
        parent.translate((1.0, 1.0, 1.0));
        child.set_scale((0.5, 10.0, 0.5));
    }

it works well enough because you can get each Transform node as the graph is being constructed. But in most cases the developer will want to automatically construct the scene at startup by using some asset file that describes the layout of the scene. In this case there has to be some way to initially deliver the scene nodes to game code. If this is done by querying the scene graph then that kind of defeats the purpose.

This might be a good place for a deserialization subsystem to come in handy. When you load the scene you also deserialize the objects that are supposed to own the scene nodes, and you create them already owning the nodes. This removes the need to query the scene graph, but requires that serializing game state and serializing the scene graph/level layout be tightly coupled. This is the way things work in Unity (to an extent).

Here's a question: What happens when we translate/rotate/scale a transform or change parent/child relationships? In terms of the actual Transform objects it won't violate safety because it can't break references, right? But behind the scenes we do have to mutate the data held by the scene graph. If we're just locally translating a transform, only its data needs to be mutated. If we want to translate it in world space, it needs to reference its parent's position. Things get more complicated if the parent's world position is out of date (assuming we even allow that to happen).

Scene graph mutation concerns:

  • Changing parent/child relationships involves moving transform data to different rows, which invalidates multiple pointers and is crazy unsafe in a multi-threaded context.
  • Changing a transform's local position requires that we recalculate world position for both it and its children.
  • Changing a transform's world attributes requires that its parent's world attributes be accurate. If not, the world position will be wrong.
  • How do we synchronize access to transform data? The transform ownership model guarantees that local changes will always be safe to do because there will only be one piece of code able to mutably access a node at a time. But world transformations require reading from the parent, and changing parent/child relationships potentially involves moving a whole hierarchy of nodes around in memory, which is a massive data race hazard.

Temporary solutions:

  • Ban world-space transformations. If everything is done in local space then there's no synchronization around translating/rotating/scaling transforms.
  • Read-write lock the entire graph (or at least all affected rows) when changing parent/child relationships. Treat all local writes as reads and wait for all to complete before attempting to move nodes in the scene (see the sketch below).

This still allows for creation/destruction of nodes, full local transformation, and modifying parent/child relationships, all while keeping synchronization at a minimum.
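
Here's a rough sketch of the read-write-lock idea from the second bullet above, using std::sync::RwLock purely to illustrate where the read and write locks would be held; the SceneGraph type and its row layout here are placeholders, and per-node data is assumed to be synchronized separately through its owning Transform handle:

    use std::sync::RwLock;

    // The graph-wide lock only guards the *structure* of the graph (which row
    // each node lives in), not the per-node transform data.
    struct SceneGraph {
        structure: RwLock<Vec<Vec<usize>>>, // rows of node ids, illustrative only
    }

    impl SceneGraph {
        // A local translate/rotate/scale is treated as a *read* of the graph
        // structure: many can run in parallel, and none can overlap with a
        // reparent that is moving nodes between rows.
        fn with_local_write<F: FnOnce()>(&self, write: F) {
            let _guard = self.structure.read().unwrap();
            write();
        }

        // Changing a parent/child relationship takes the write lock, so it
        // waits for all in-flight local writes to drain before moving nodes.
        fn reparent(&self, node: usize, from_row: usize, to_row: usize) {
            let mut rows = self.structure.write().unwrap();
            rows[from_row].retain(|&id| id != node);
            rows[to_row].push(node);
        }
    }

    fn main() {
        let graph = SceneGraph {
            structure: RwLock::new(vec![vec![0, 1], vec![]]),
        };
        graph.with_local_write(|| { /* mutate node 0's transform here */ });
        graph.reparent(1, 0, 1);
    }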
