nlguillemot/tasks.md

## tasks.md

      
    Raw
  

              tasks.md
            
          
    In order to effectively use a multi-core CPU, a decomposition of code into tasks must be made.
By decomposing code into tasks, the tasks can be distributed among CPU cores by a scheduler.
Tasks can have dependencies on other tasks, meaning that the tasks execute in a partial order.
This partial order can be represented as a graph, where each task has predecessors and successors.
A task can only execute when all its predecessors have completed.
To implement this, we can explicitly create a graph of tasks, with edges between tasks to identify predecessor/successor relationships.
The problem with a general task graph system is that it's not productive to work with.
Maybe it's okay for some purposes to have a GUI interface to build a task graph, but that's too tedious for everyday code.
Imagine if you had to explicitly create nodes and edges when implementing a single-threaded function call graph, it would be so tedious!
So, how can a productive task graph interface be built?
Languages like C/C++ were originally designed for function decomposition.
That means you split your code into functions, and call functions from other functions. eg:
void a() {
    printf("a");
    b();
}
void b() {
    printf("b");
}
void main() {
    a();
    b();
}

This is the natural way to write C/C++ code, and this is how single-threaded game engines are written.
In the task-parallel world, we want to enhance function decomposition by adding task decomposition.
In other words, we structure the code as tasks that spawn other tasks.
There are benefits of matching task decomposition to function decomposition:

It matches the natural code structure of C/C++ (tasks spawn tasks vs functions call functions)
It provides a clear upgrade path for single-threaded code (child functions become child tasks)
We can leverage C/C++ programmers' built-up intuition on how to write code.

So, what's the difference between tasks and functions?

Functions are called, then return. Tasks are spawned, then complete.
Tasks are a superset of functions. Functions can be implemented by spawning a task and waiting for completion immediately.
Task spawning is more relaxed than function call.
A spawned task may not necessarily begin until its completion is explicitly waited on.
Note: "enqueued" tasks (vs. "spawned") don't need an explicit wait, but still start asynchronously.
Task completion is more relaxed than function return.
A task may not necessarily complete until its completion is explicitly waited on.

The relaxations of tasks allow them to run in parallel, but this relaxation must be constrained to enforce a partial order on execution.
The requirement of partial order comes in 3 main forms, each more general than the last:

Task Path: 1 predecessor, 1 successor.
Task Tree: N predecessors, 1 successor.
Task Directed Acyclic Graph (DAG): N predecessors, N successors.

The Task Path can be implemented as an ordinary sequence of function calls.
This may also be integrated in the task system as a sequence of continuation tasks, where each task defines its own continuation.
The Task Tree is implemented as a task that spawns a group of child tasks, then waits for them to complete.
This is good for recursive divide-and-conquer algorithms, allowing tasks to be created dynamically based on the distribution of work.
The Task DAG is implemented by spawning tasks only when all their predecessors have completed.
This can be implemented by storing a counter of incomplete predecessors in the successor task.
When a predecessor task completes, it decrements the counter in each successor.
The predecessor that decrements the counter to zero then spawns the successor.
These three forms of task relationships can be implemented using Intel TBB's task system.
Note: TBB's task-parallelism enables multi-core scaling through a work-stealing scheduler.
To better understand this, see:

Task-Based Programming
How Task Scheduling Works
CppCon 2015: Pablo Halpern "Work Stealing"

The Task Path is implemented either as an ordinary sequence of function calls (independent of TBB), or using continuation tasks.

See: Continunation Passing
And as an optimization of continuation: Scheduler Bypass

The Task Tree is implemented by spawning child tasks then waiting for them to finish. It's the poster-child use case of TBB tasks.

See: Simple Example: Fibonacci Numbers

The Task DAG is implemented by reusing the reference count built into TBB tasks as the number of incomplete predecessors.
The predecessor tasks must decrement and check the reference count and possibly spawn each successor, as explained previously.

See: General Acyclic Graphs of Tasks

It seems like we have all the building blocks to build task graphs with ordinary TBB tasks, but the API is a bit too low level.
It may be possible to build a simpler interface on top of TBB, that would also simplify common use cases. I would like this API to look closely like an ordinary call graph, for the reasons stated at the start of this writeup. Designing this API is a topic for another post.
PS: TBB is designed for non-blocking tasks, meaning it is not suitable for performing I/O or waiting for asynchronous operating system events. With Intel TBB's task scheduler, a blocking task results in an idle core in the best case, and a deadlock in the worst case. To handle asynchronous I/O and events, I would recommend using the Windows Threadpool task system, since its integration in the operating system helps it make intelligent scheduling decisions by knowing about the state of I/O operations and events. TBB tasks can be executed asynchronously in response to OS events using enqueued tasks.

See: Thread Pool API
And to better understand the Windows Threadpool's implementation: Inside Windows 8: Pedro Teixeira - Thread pools