My notebook

2016-10-25 - Async fibers talking points

  • Demonstrate example first:
    • Initial resource loading is done in parallel. Completely safe AND super easy to use.
    • Multiple behaviors run in parallel each frame. API isn't final, just proof of concept.
  • Go low-level, work our way up through the API.
  • Start with fibers:
    • Basic overview of fibers:
      • Similar to threads. Each has own stack, can be suspended and resumed.
      • For threads, the system manages all threads and suspends and resumes them to give them all time on the CPU. Usually there's a priority system and other ways to specify how to divide that time, but the OS is the only one that can suspend/resume threads.
      • Fibers are run on threads, but you can specify which fiber is active on which thread at any given time.
      • Fibers can be suspended from one thread and resumed on another.
    • Advantages of fibers:
      • Switching fibers on a thread doesn't require a full context switch, and so can be substantially cheaper than when the OS switches threads.
      • We control when the fibers switch, so we can suspend a fiber if it's waiting on other work to complete on another thread. We never need to block.
    • Disadvantages of fibers:
      • Can't delete fibers on Windows, so they can't terminate. Fortunately, Rust's diverging functions are perfect for this.
      • Fibers have strange interactions with Send. It's possible to create a value of a !Send type within a fiber, suspend that fiber, then resume it on another thread, violating the !Send requirements for the type. There's no way, as far as I know, to convince the compiler to prevent this from happening.
      • We'll get into the issues around Send unsoundness, and why it's not as big a deal as it seems, later in the talk.
    • Important implementation detail: Resuming a fiber consumes it and returns the fiber that was suspended.
      • This is important for safety. Trying to resume a fiber on one thread that's already running on another is a recipe for disaster. This API design guarantees at compile time that you'll never try to resume a fiber that's actively running.
      • Rust's type system makes this possible by tracking moves at compile time.
  • Scheduler:
    • Scheduler takes arbitrary work, defined as closures, divides it up among worker threads, and allows work to await the result of other asynchronous work. It's the foundation of a fully asynchronous runtime.
    • 2 core primitives: scheduler::start(), and Async::await().
      • scheduler::start() takes a closure, which can return any Send type, schedules that work to be run on one of the worker threads, and returns an Async<T> that acts as a handle to the eventual result of that work.
      • Async::await() suspends the current fiber until the work represented by the Async<T> finishes, returning the value returned by the closure passed into scheduler::start().
    • Together, one unit of work can spawn several others, allow them to run in parallel, then yield its time on the CPU while it waits for them to complete (a rough sketch of this usage pattern appears after this list).
    • This roughly approximates the behavior of a yield or await keyword in the language, since asynchronous code now looks like synchronous code.
    • Substantially simpler to use than something like futures-rs. Control flow is handled as it would be anywhere else in the language, no need for combinators or anything like that.
    • Other nifty things that Async<T> can do:
      • Async::await() consumes the Async<T> object, ensuring that you can only get the result of some work once.
      • Async<T> has a lifetime parameter. This allows the closure running the work to borrow data on the spawning stack frame while ensuring that the data doesn't go out of scope. Async implements Drop, which awaits the work if it hasn't yet completed, ensuring that borrowed data remains valid.
      • You can get the underlying work ID from an Async. WorkId doesn't have any lifetime or type parameters, it's simply a unique ID that can be used to await the completion of that unit of work. This allows multiple fibers to await the completion of a single unit of work.
      • Async::forget() allows you to ignore the result of a unit of work without suspending the current fiber. It's only implemented for Async<'static> to preserve soundness.
    • What's next:
      • Refine the API. It'll take some hammering to determine the ideal API for the scheduler.
      • Priorities. There's no way with the current API to prioritize some work units over others. Being able to do so is often a requirement for async code.
  • Unsoundness issues:
    • Forgetting an Async<'a> (non-static Async<T>) is unsound, as it relies on destructors being run in order to avoid use-after-free.
      • Basically the same issue as the scoped threads kerfuffle, but might need a different solution to be ergonomic.
      • Low priority. Basically it's a solved problem, and we can always remove the lifetime parameter from Async and get full safety back.
    • Fibers allow smuggling !Send types between threads.
      • Kind of a big deal.
      • Pretty much impossible to avoid (with current Rust). We can use trait bounds to ensure a closure doesn't capture any !Send data, but we can't prevent it from creating such data then calling Async::await(), which could result in the fiber resuming on a different thread.
      • But maybe not a big deal? Key examples of !Send, such as Rc<T> and &RefCell<T>, are only unsafe if they go to multiple threads at once. E.g. the !Send bound derives from a !Sync bound. In these cases the API provided by the scheduler is completely safe.
      • But some system primitives have strong opinions about what thread(s) they can be used from, e.g. locking a mutex on one thread then unlocking it on another will fail to unlock the mutex, causing any future attempts to acquire the lock to deadlock. Other system APIs can fail or exhibit aberrant behavior when used from the wrong thread.
      • Currently, Rust doesn't have any way to talk about this sort of issue, so it can't detect it at compile time the way we'd like. Something like being able to say "when this function is called there can't be any !Send data on the stack" would probably do the trick.
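
To make the scheduler primitives concrete, here's a minimal stand-in for scheduler::start() and Async::await() that models only the observable semantics using std::thread and channels; the real scheduler runs work on fiber-backed worker threads and suspends the awaiting fiber instead of blocking, and the names start() and await_result() here are placeholders for illustration:

    use std::sync::mpsc;
    use std::thread;

    // Stand-in for the Async<T> handle: awaiting consumes it, so the result
    // of a unit of work can only be retrieved once.
    struct Async<T> {
        rx: mpsc::Receiver<T>,
    }

    impl<T> Async<T> {
        fn await_result(self) -> T {
            // The real scheduler suspends the current fiber here instead of
            // blocking the thread; blocking only models the behavior.
            self.rx.recv().unwrap()
        }
    }

    // Stand-in for scheduler::start(): run the closure on a worker and return
    // a handle to its eventual result.
    fn start<T, F>(work: F) -> Async<T>
    where
        T: Send + 'static,
        F: FnOnce() -> T + Send + 'static,
    {
        let (tx, rx) = mpsc::channel();
        thread::spawn(move || {
            let _ = tx.send(work());
        });
        Async { rx }
    }

    fn main() {
        // One unit of work spawns several others, lets them run in parallel,
        // then waits for their results -- async code that reads synchronously.
        let a = start(|| 1 + 1);
        let b = start(|| 2 * 21);
        println!("a = {}, b = {}", a.await_result(), b.await_result());
    }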

2016-10-21

Notes about scheduler

  • Fibers:
    • Being able to consume Fiber when one is resumed is super nice and makes the API way safer (see the toy sketch after this list).
    • Being able to mark functions as diverging (i.e. never returning) is also super powerful and helps enforce correctness (instead of safety).
      • It would be nice if the compiler could detect diverging functions without putting it in the type signature, but that's a minor gripe.
    • Combining those two means we can provide a completely safe fibers API, which I didn't even think was possible initially.
  • By using Async<'a, T> to represent an async operation it's possible to safely let the async closure borrow data and guarantee at compile time that the borrow won't outlive the data.
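
As a toy illustration of the consume-on-resume point, here's a sketch where Fiber is just a marker struct and resume() is a stub rather than a real context switch; it only shows how taking the fiber by value stops you from resuming a fiber that's already running:

    // Toy model only: a real Fiber would wrap an OS fiber handle.
    struct Fiber {
        id: usize,
    }

    impl Fiber {
        // Switching to `self` consumes it by value; when this fiber is later
        // suspended, the fiber that was suspended to run it is handed back.
        fn resume(self) -> Fiber {
            // A real implementation would perform the context switch here.
            Fiber { id: self.id }
        }
    }

    fn main() {
        let fiber = Fiber { id: 0 };
        let previous = fiber.resume();
        // `fiber` was moved, so resuming it again is a compile error:
        // fiber.resume(); // error[E0382]: use of moved value: `fiber`
        println!("got back fiber {}", previous.id);
    }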

2016-09-25

I think the only way to safely move nodes around is to lock them. We can optimize this somewhat by spin-waiting other threads, and I think we can assume at most 2 threads will try to access node data at a time. I expect that the vast majority of access to node data will come from the Transform handle, so there will be no contention for locks. In the purely single-threaded case the lock and memory fences will make performance sub-optimal (and will reduce the amount of optimization the compiler can do), but I think we could provide an escape hatch that allows client code to manually acquire and release the lock once for bulk transformations.

The idea is to use an atomic boolean as a lock, the same way we do with AtomicArray, owned by the TransformInner that is shared by both the handle and the node data. Each read/write, whether it is going through the handle or coming from the node being moved, acquires the lock while reading/writing node data. Now that I think about it, having to acquire a lock just to read node data feels heinous, but I don't have a better solution at the moment.
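
A minimal sketch of that atomic-boolean lock, assuming it would live inside the shared TransformInner; NodeLock and its method names are illustrative, not the real types:

    use std::sync::atomic::{AtomicBool, Ordering};

    // Illustrative lock guarding one node's transform data.
    struct NodeLock {
        locked: AtomicBool,
    }

    impl NodeLock {
        fn new() -> NodeLock {
            NodeLock { locked: AtomicBool::new(false) }
        }

        // Spin-wait until this caller flips the flag from false to true.
        fn acquire(&self) {
            while self
                .locked
                .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
                .is_err()
            {
                std::hint::spin_loop();
            }
        }

        fn release(&self) {
            self.locked.store(false, Ordering::Release);
        }
    }

    fn main() {
        let lock = NodeLock::new();
        lock.acquire();
        // ... read or write node data while the lock is held ...
        lock.release();
    }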

What I'm more concerned about is making sure the scene graph can't deadlock. I want to know if it's possible to implement it in such a way that e.g. "one thread acquires the lock for node A and wants B, another acquires B and wants A" can never happen. Let's start with a simple example from earlier:

0: (A, None) (B, None) (C, None)
1:

But now let's say that two threads concurrently try to perform "make A the child of B" and "make B the child of C". Let's say thread_0 is trying to do the former and thread_1 is trying to do the latter.

thread_0's steps:

  1. Acquire lock on A.
  2. Copy A to row 1.
  3. Update A's parent pointer/index to point to B (note that B might have moved since we started).
  4. Update pointer in A's TransformInner.
  5. Release lock on A. (???)
  6. Swap-remove A from row 0.
    1. Acquire Last's lock.
    2. Copy Last to A's old bucket.
    3. Update pointer in Last's TransformInner.
    4. Update parent pointer for all children. (???)
    5. Release lock on Last. (???)
    6. Pop row 0.

thread_1's steps:

  1. Acquire lock on B.
  2. Copy B to row 1.
  3. Update B's parent pointer/index to point to C (note that C might have moved since we started).
  4. Update pointer in B's TransformInner.
  5. Release lock on B. (???)
  6. Swap-remove B from row 0.
    1. Acquire Last's lock.
    2. Copy Last to B's old bucket.
    3. Update pointer in Last's TransformInner.
    4. Update parent pointer for all children. (???)
    5. Release lock on Last. (???)
    6. Pop row 0.

Note that the steps as listed aren't necessarily in optimal order. A few of the steps are marked with (???), indicating that I'm either not sure how to do that operation or not sure if that operation is even safe to do. When moving a node down to its new row (ugh, I didn't even think of the case where you move a node up, though I don't know if it matters), at what point is it safe to release the lock on that node? Once the node has been moved to its new row and all of its pointers have been fixed up, we're in theory done with that node. But what if another thread wants to come in and move the node again before the rest of the cleanup happens? If we wait to release the lock until all of the work for that node is done, including cleanup that doesn't touch that node's data, then we can be more confident that the node's data is in a correct state before another thread can observe it.

Let's look at thread_0's steps in isolation first and vet them for correctness, then we can worry about deadlocks. First thing to note is that there's one more step that wasn't originally listed: updating any child nodes to point to the correct location. I didn't list it because the example above doesn't give A any children, but in the general case thread_0 will have to check for children when doing the update.

That actually brings us to the general question: how do we move entire hierarchies of nodes? In theory a node hierarchy can be arbitrarily deep, so moving one node up or down may involve moving dozens or hundreds of nodes as well. Once concurrency is thrown in, we have a minor disaster on our hands.

When moving nodes down to a lower row, it probably makes sense to start at the deepest part of the hierarchy and work up to the root. This should minimize the number of times we need to move a node while making the transfer. To see why, let's look at an example case where we go root-down rather than the other way around:

0: (A, None) (B, None)
1: (C, 1)

To make B the child of A, we'd first move B down to row 1:

0: (A, None)
1: (C, ???), (B, 0)

Then we'd move C down to row 2:

0: (A, None)
1: (B, 0)
2: (C, 0)

Note that moving C required that we move B again. If we start by moving C first before B then we only have to move each node once:

0: (A, None) (B, None)
1: (C, 1)

0: (A, None) (B, None)
1:
2: (C, ???)

0: (A, None)
1: (B, 0)
2: (C, 0)

Writing out that diagram just now made me realize that while a parent node is moving, its child node is actually in an invalid state, because the child node is pointing to the wrong parent node. If we were to allow mutations to a node to access the parent node at any time (e.g. to allow transformations in world space) then we would have to acquire locks on all nodes in the sub-tree being moved, but we've opted to disallow such parent/child interaction, making this a non-issue for now.

2016-09-24

Actually, could we solve that with CAS again? It worked for popping because in that case we don't try to mutate the shared buffer, we only copy data out of the buffer and then confirm we copied the right data out, so there's no chance of data races. For pushing we need to ensure that only one thread has write access to the new bucket while the push is happening, so waiting to confirm until after writing means we could have already clobbered another thread's read/write. Incrementing the count before writing will guarantee that a concurrent push doesn't try to write into the same bucket, but it would allow another thread looking at the buffer to observe the partially-copied element.

A pop followed by a push isn't an issue because the pop is only a single atomic operation, so the subsequent push is free to immediately start writing to the newly-empty bucket.

A push followed by a pop is again quite complicated. If we increment-then-write when pushing, then the subsequent pop may try to copy an uninitialized element. If we write-then-increment, then we run back into the possibility of multiple threads trying to concurrently push into the same bucket.

After looking at some resources regarding lock-free data structures, the most useful of which being a paper on lock-free vectors for C++, it seems to me like it's not possible to get a completely lock-free array. At the very least we can avoid locking access to all elements while pushing and popping, but pushes and pops will likely have to be serialized in some way. I think that's acceptable for my purposes because I was already operating under the assumption that pushes and pops would be far less frequent than normal access to the elements of the vec.

I think the easiest starting point would be to use an atomic boolean flag as a "write lock", and have all threads trying to push/pop spin-wait until they acquire the lock.

The new problem is now with the scene graph and being able to concurrently move nodes around, e.g. for changing parent/child relationships. Creating nodes is easy since it only involves pushing a node onto the first row, and we've already handled concurrent pushes. But removing a node requires a swap-remove operation in order to maintain that the remaining elements are contiguous. Looking at our atomic array, swap remove is already somewhat difficult to manage (though using the same locking method may be sufficient). On top of that we also have to consider potentially concurrent access to the nodes while they're being moved. Such access would be hazardous, so managing it properly is important.

Let's take a look at the process of making one node the child of another. Let's say we have the following graph, with nodes represented as (node, parent index):

0: (A, None) (B, None) (C, None)
1:

We want to make A the child of B, which means we must remove it from row 0 and add it to row 1, and not allow any thread to observe invalid data. The first step is to remove A from row 0. In order to avoid holes we need to swap the last element in the array, in this case C, into A's old bucket. The end result should look like this:

0: (C, None) (B, None)
1: (A, 1)

The problem is we have to update the Transform handles for A and C, and we have to do so in a way that avoids data races. I think the first step is actually to copy A to row 1. Once that's done and we've set things up so that other threads won't attempt to observe the old copy, we only have to copy C into A's old bucket and then decrement the count; we never need to swap A into C's old bucket.

So the question now is how to move a node in memory safely.

2016-09-23

Atomic stable Vec:

  • Uses atomic counters internally to make it thread-safe.
  • Doesn't move elements in memory when it reallocates.
  • Safe to push/pop elements concurrently.
  • Safe to swap elements concurrently.

That last one is where things really get difficult. Let's look at each mutating operation and see what it takes to do it concurrently.

Pushing elements is fairly easy, at least when we're only talking about pushes. Simply increment the count before moving the element into the buffer and subsequent pushes will be coherent.

Pops are a bit more complicated because they require synchronization in order to be sound. At the very least, if we're cool with throwing away the popped data then it's fine to atomically decrement the count and leave it at that, no synchronization needed. Unfortunately, we'll almost always want to return the popped element, and in the case of a swap-remove we need to move both the last element and the removed element. In this case we must ensure that another thread doesn't try to push a new element onto the end of the stack while we're copying the popped element off, or we wind up with a data race and partially clobbered data.

The easy solution is to lock the vec while pushing or popping. Since we only need to hold the lock while copying an element out of the buffer, we can spin-lock threads that are waiting to push/pop other elements. We expect that the lock should only be held for a few cycles, so this should be the most efficient of any locking strategy. Also, it may be safe to do multiple pops concurrently, as long as no pushes are attempted until all pops finish. It may also be safe to access other elements of the vec while the lock is held as long as the count is decremented before the copy begins.

Another theoretical solution is to use compare-and-swap semantics to avoid having to lock the buffer at all, though at this point I'm still kicking the idea around and am not 100% sure it even makes sense. The idea is that decrementing the count is atomic and safe to do if you don't try to access the popped element's data in the buffer after doing so (since at any point after decrementing the counter another thread could swoop in and start writing to that memory). So we instead copy our popped element out of the buffer, then make sure the count is still the same before decrementing. If it's not then we're trying to pop the wrong element and have to try again, this time attempting to pop the new last element. This repeats until we successfully pop an element.

The problem is that I don't fully understand all the implications at this point. In no particular order:

  • Does it always make sense to pop the new last element? In some cases the client code may be trying to pop a specific element, so if another thread pops it first the client code might rather just be told that the pop failed instead of trying to pop a different element.
  • What about swap-remove? We have no way of preventing access to the element being removed while it's being overwritten in the buffer, so... I dunno. The whole data structure is super unsafe with regards to parallel mutation since we have no way of enforcing mutually exclusive borrows. Otherwise I think swap-remove should just work if compare-and-swap works for popping.

Let's look at each of the possible mutations that can be made and see what the implications of each strategy would be:

  • Push -> push
  • Pop -> pop
  • Push -> pop
  • Pop -> push

And then also swap-remove is in there, though I don't know if it's functionally different than a swap followed by a pop (and we're not bothering to synchronize swaps; client code has to manage that).

As an aside, I initially listed the requirements of "stable" and "atomic", with a "stable" vec basically being a linked list of fixed-size buffers. But it occurs to me that this could be achieved with something like LinkedList<Array<T>> instead of StableArray<T>, so my focus is now on some reasonable implementation of a lock-free fixed-size array.

Anyway, a pop followed by a pop is trivial to make thread-safe because a pop is simply decrementing an atomic counter. Note that this ignores destructors and copying the popped element out of the array. Ultimately running the destructor is less important than copying the element out of the buffer; we can always mem::forget() the element in the buffer and then drop the copied element to ensure that destructors can be run without impacting concurrency. As such, what matters is that we be able to copy the last element in the buffer onto the stack before decrementing the count. This is where the compare-and-swap semantics come in: we read the current count, copy the last element, then compare-and-swap to decrement the count. If the compare fails then we mem::forget() the copied element and then either try again or return an error (though that decision is one of ergonomics, not safety, and likely both will be provided as different methods).
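
Here's a sketch of that compare-and-swap pop, simplified to a single-threaded demo over a pre-filled buffer of Copy elements so the mem::forget()/ptr::read bookkeeping for non-Copy types can be left out; AtomicArray here is just an illustrative name, and the retry-vs-error policy is hard-coded to retry:

    use std::sync::atomic::{AtomicUsize, Ordering};

    // Illustrative fixed-capacity array with an atomic element count.
    struct AtomicArray {
        count: AtomicUsize,
        buffer: Vec<u32>, // stands in for the fixed-size element buffer
    }

    impl AtomicArray {
        fn pop(&self) -> Option<u32> {
            loop {
                let count = self.count.load(Ordering::Acquire);
                if count == 0 {
                    return None;
                }
                // Copy the (current) last element out of the buffer first...
                let value = self.buffer[count - 1];
                // ...then confirm the count hasn't changed before decrementing.
                // If another thread popped in the meantime, the compare fails
                // and we retry against the new last element.
                if self
                    .count
                    .compare_exchange(count, count - 1, Ordering::AcqRel, Ordering::Relaxed)
                    .is_ok()
                {
                    return Some(value);
                }
            }
        }
    }

    fn main() {
        let array = AtomicArray {
            count: AtomicUsize::new(3),
            buffer: vec![10, 20, 30],
        };
        assert_eq!(array.pop(), Some(30));
        assert_eq!(array.pop(), Some(20));
    }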

A push followed by another push is similarly easy. The actual "push" itself is done by incrementing the atomic counter, so if we increment before copying the element into the buffer, then the subsequent push will not try to clobber the previous in-progress one. The only problem is that incrementing the counter before copying the data in means that it's possible another thread will then try to observe that bucket in some way (i.e. read or write the element) before it has been fully copied into the buffer. Incrementing the counter after doing the copy prevents that issue, but reintroduces the problem of multiple threads trying to push an element into the same buffer at once.

2016-09-22

Transforms form a scene graph. Transforms allow you to get their parent and children. This allows you to do garbage like:

    fn super_unsafe(transform: &mut Transform) {
        let alias = &mut transform.parent().children[0]; // `alias` now aliases `transform`, which is evil.
    }

To avoid evil, we have to change something:

  • Transforms don't allow access to parent/children.
    • This basically makes the whole "scene graph" concept useless to game developers.
  • Transforms don't provide mutable access, they're more like RefCell.
    • This is an ergonomics and performance hazard, since RefCell both adds runtime overhead to every transform and makes your code crashy if you try to double-borrow transforms.
    • Managing this in a parallel context is even harder, as mutex-like blocking when the transform is already borrowed will deadlock if the same transform is borrowed twice in a single thread.

Currently my vote is to take the first approach -- remove the ability to query the scene graph. My (admittedly limited) experience with game development leads me to believe that it's fairly rare to need the ability to make general queries of the scene graph. In most cases, the developer knows a priori what the structure of the scene will be and only needs the ability to mutate a few local nodes in the graph. If that is the case, then having Transform represent ownership of the node and provide mutable access to its data would be more useful.

This does raise the question of how the scene graph is initially constructed. In the simplest case where the graph is manually constructed programmatically:

    fn make_simple_graph() {
        let mut parent = Transform::new();
        let mut child = Transform::with_parent(&mut parent);

        // Now we know the structure of our scene graph and have safe, mutable access to all (both) of
        // the nodes in the graph.
        parent.translate((1.0, 1.0, 1.0));
        child.set_scale((0.5, 10.0, 0.5));
    }

it works well enough because you can get each Transform node as the graph is being constructed. But in most cases the developer will want to automatically construct the scene at startup by using some asset file that describes the layout of the scene. In this case there has to be some way to initially deliver the scene nodes to game code. If this is done by querying the scene graph then that kind of defeats the purpose.

This might be a good place for a deserialization subsystem to come in handy. When you load the scene you also deserialize the objects that are supposed to own the scene nodes, and you create them already owning the nodes. This removes the need to query the scene graph, but requires that serializing game state and serializing the scene graph/level layout be tightly coupled. This is the way things work in Unity (to an extent).

Here's a question: What happens when we translate/rotate/scale a transform or change parent/child relationships? In terms of the actual Transform objects it won't violate safety because it can't break references, right? But behind the scenes we do have to mutate the data held by the scene graph. If we're just locally translating a transform, only its data needs to be mutated. If we want to translate it in world space, it needs to reference its parent's position. Things get more complicated if the parent's world position is out of date (assuming we even allow that to happen).

Scene graph mutation concerns:

  • Changing parent/child relationships involves moving transform data to different rows, which invalidates multiple pointers and is crazy unsafe in a multi-threaded context.
  • Changing a transform's local position requires that we recalculate world position for both it and its children.
  • Changing a transform's world attributes requires that its parent's world attributes be accurate. If not, the world position will be wrong.
  • How do we synchronize access to transform data? The transform ownership model guarantees that local changes will always be safe to do because there will only be one piece of code able to mutably access a node at a time. But world transformations require reading from the parent, and changing parent/child relationships potentially involves moving a whole hierarchy of nodes around in memory, which is a massive data race hazard.

Temporary solutions:

  • Ban world-space transformations. If everything is done in local space then there's no synchronization around translating/rotating/scaling transforms.
  • Read-write lock the entire graph (or at least all affected rows) when changing parent/child relationships. Treat all local writes as reads and wait for all to complete before attempting to move nodes in the scene (see the sketch below).

This still allows for creation/destruction of nodes, full local transformation, and modifying parent/child relationships, all while keeping synchronization at a minimum.
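
Here's a rough sketch of the read-write-lock idea from the second bullet above, using std::sync::RwLock purely to illustrate where the read and write locks would be held; the SceneGraph type and its row layout here are placeholders, and per-node data is assumed to be synchronized separately through its owning Transform handle:

    use std::sync::RwLock;

    // The graph-wide lock only guards the *structure* of the graph (which row
    // each node lives in), not the per-node transform data.
    struct SceneGraph {
        structure: RwLock<Vec<Vec<usize>>>, // rows of node ids, illustrative only
    }

    impl SceneGraph {
        // A local translate/rotate/scale is treated as a *read* of the graph
        // structure: many can run in parallel, and none can overlap with a
        // reparent that is moving nodes between rows.
        fn with_local_write<F: FnOnce()>(&self, write: F) {
            let _guard = self.structure.read().unwrap();
            write();
        }

        // Changing a parent/child relationship takes the write lock, so it
        // waits for all in-flight local writes to drain before moving nodes.
        fn reparent(&self, node: usize, from_row: usize, to_row: usize) {
            let mut rows = self.structure.write().unwrap();
            rows[from_row].retain(|&id| id != node);
            rows[to_row].push(node);
        }
    }

    fn main() {
        let graph = SceneGraph {
            structure: RwLock::new(vec![vec![0, 1], vec![]]),
        };
        graph.with_local_write(|| { /* mutate node 0's transform here */ });
        graph.reparent(1, 0, 1);
    }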
