How Task Scheduling Works

The scheduler evaluates a task graph. The graph is a directed graph where each node is a task. Each task points to its successor, which is another task that is waiting on it to complete, or NULL. Method task::parent() gives you read-only access to the successor pointer. Each task has a refcount that counts the number of tasks that have it as a successor. The graph evolves over time.
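
As a concrete illustration, the following minimal sketch uses the legacy tbb::task interface described here; the class names ParentTask and ChildTask are hypothetical. The parent allocates one child, so the child's successor pointer (returned by parent()) refers back to the parent, and the parent's reference count is set to the number of children plus one extra reference for the explicit wait.

    #include "tbb/task.h"
    #include <cassert>

    struct ChildTask : tbb::task {
        tbb::task* execute() override { return NULL; }
    };

    struct ParentTask : tbb::task {
        tbb::task* execute() override {
            // Allocate a child; the child records this task as its successor.
            ChildTask& c = *new( allocate_child() ) ChildTask;
            assert( c.parent() == this );   // read-only view of the successor pointer
            set_ref_count(2);               // 1 child + 1 extra reference for the wait below
            spawn_and_wait_for_all( c );    // returns once the child has completed
            return NULL;
        }
    };

    int main() {
        tbb::task::spawn_root_and_wait( *new( tbb::task::allocate_root() ) ParentTask );
    }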

Snapshot of Task Graph for the Fibonacci Example

The figure above shows a snapshot of a task graph for the Fibonacci example where:

  • Tasks A, B, and C spawned child tasks that they are waiting upon. Their refcount values are the number of children in flight plus one. The extra one is part of a convention for explicit waiting that is explained later in this section.

  • Task D is running, but has not yet spawned any children, and so has not needed to set its reference count.

  • Tasks E, F, and G have been spawned, but have not yet started executing.

The scheduler runs tasks in a way that tends to minimize both memory demands and cross-thread communication. The intuition is that a balance must be reached between depth-first and breadth-first execution. Assuming that the tree is finite, depth-first is best for sequential execution for the following reasons:

  • Strike when the cache is hot. The deepest tasks are the most recently created tasks, and therefore are hottest in cache. Also, if they complete, then task C can continue executing, and though it is not the hottest in cache, it is still warmer than the older tasks above it.

  • Minimize space. Executing the shallowest task leads to breadth-first unfolding of the tree, which creates an exponential number of coexisting nodes. In contrast, depth-first execution creates the same number of nodes over time, but only a linear number exist at once, because the other ready tasks (E, F, and G in the picture) are stacked.

Though breadth-first execution has a severe problem with memory consumption, it does maximize parallelism if you have an unlimited number of physical threads. Since physical threads are limited, it is better to use only enough breadth-first execution to keep the available processors busy.
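
The difference is easy to quantify. The following toy program (illustrative only, not TBB code; peakLiveTasks is a hypothetical helper) unfolds a complete binary task tree breadth-first (FIFO) and depth-first (LIFO) and records the peak number of ready tasks that coexist. For a tree of depth 20, the breadth-first peak is about 2^20 (roughly a million) ready tasks, while the depth-first peak stays near the depth, about 21.

    #include <algorithm>
    #include <cstdio>
    #include <deque>

    // Count how many nodes of a complete binary task tree of the given depth
    // coexist when the ready pool is drained FIFO (breadth-first) versus
    // LIFO (depth-first).
    static std::size_t peakLiveTasks(int depth, bool breadthFirst) {
        std::deque<int> ready{0};            // each entry is a node's depth
        std::size_t peak = 1;
        while (!ready.empty()) {
            int d;
            if (breadthFirst) { d = ready.front(); ready.pop_front(); }
            else              { d = ready.back();  ready.pop_back();  }
            if (d < depth) {                 // "spawn" two children
                ready.push_back(d + 1);
                ready.push_back(d + 1);
            }
            peak = std::max(peak, ready.size());
        }
        return peak;
    }

    int main() {
        for (int depth : {10, 20}) {
            std::printf("depth %d: breadth-first peak %zu, depth-first peak %zu\n",
                        depth, peakLiveTasks(depth, true), peakLiveTasks(depth, false));
        }
    }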

The scheduler implements a hybrid of depth-first and breadth-first execution. Each thread has its own deque[8] of tasks that are ready to run. When a thread spawns a task, it pushes it onto the bottom of its deque. The following figure shows a snapshot of a thread's deque that corresponds to the task graph in figure above.

A Thread's Deque

When a thread participates in task graph evaluation, it continually executes a task obtained by the first rule below that applies:

  1. Get the task returned by method execute for the previous task. This rule does not apply if execute returned NULL.

  2. Pop a task from the bottom of its own deque. This rule does not apply if the deque is empty.

  3. Steal a task from the top of another randomly chosen deque. If the chosen deque is empty, the thread tries this rule again until it succeeds.

Rule 1 is discussed in Scheduler Bypass. The overall effect of rule 2 is to execute the youngest task spawned by the thread, which causes depth-first execution until the thread runs out of work. Then rule 3 applies. It steals the oldest task spawned by another thread, which causes temporary breadth-first execution that converts potential parallelism into actual parallelism.
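
As a rough illustration of how the three rules interact, the toy simulation below (not the TBB implementation; the Worker type, runTask, and dispatchOne are hypothetical) keeps one deque per simulated thread, prefers a bypass task returned by the previous execute, then the bottom of the thread's own deque, and finally the top of some other thread's deque. The real scheduler chooses steal victims at random and runs one such loop concurrently on every worker thread.

    #include <cstdio>
    #include <deque>
    #include <vector>

    struct Worker { std::deque<int> deque; };   // each entry is a ready task id

    // Stand-in for task::execute(); returns a bypass task id, or -1 for none.
    static int runTask(int id) { std::printf("executing task %d\n", id); return -1; }

    // Run one task on behalf of worker `self`, applying rules 1-3 in order.
    static bool dispatchOne(std::vector<Worker>& workers, int self, int& bypass) {
        int t = -1;
        if (bypass != -1) {                                // rule 1: task returned by execute
            t = bypass;
            bypass = -1;
        } else if (!workers[self].deque.empty()) {         // rule 2: youngest local task
            t = workers[self].deque.back();
            workers[self].deque.pop_back();
        } else {                                           // rule 3: steal a victim's oldest task
            for (std::size_t v = 0; v < workers.size() && t == -1; ++v) {
                if ((int)v != self && !workers[v].deque.empty()) {
                    t = workers[v].deque.front();
                    workers[v].deque.pop_front();
                }
            }
            if (t == -1) return false;                     // nothing to steal; a real thread retries
        }
        bypass = runTask(t);
        return true;
    }

    int main() {
        std::vector<Worker> workers(2);
        workers[0].deque = {1, 2, 3, 4};   // tasks spawned by thread 0, oldest first
        int bypass0 = -1, bypass1 = -1;
        dispatchOne(workers, 0, bypass0);  // thread 0 pops its youngest task (4)
        dispatchOne(workers, 1, bypass1);  // thread 1 has no local work and steals task 1
    }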

Getting a task is always automatic; it happens as part of task graph evaluation. Putting a task into a deque can be explicit or implicit. A thread always pushes a task onto the bottom of its own deque, never another thread's deque. Only theft can transfer a task spawned by one thread to another thread.

There are three conditions that cause a thread to push a task onto its deque:

  • The task is explicitly spawned by the thread, for example, by method spawn.

  • A task has been marked for re-execution by method task::recycle_to_reexecute.

  • The thread completes the last predecessor (child) of the task, and in doing so implicitly decrements the task's reference count to zero. In that case, the thread implicitly pushes the task onto the bottom of its deque. Completing the last child does not cause the reference count to reach zero if the count includes extra references.

The example in Fibonacci Numbers does not use implicit pushing, because it explicitly waits for its children to complete. It uses set_ref_count(3) for a task having only two children; the extra reference protects the successor from being implicitly pushed. Continuation Passing has a similar example that does employ implicit pushing: it uses set_ref_count(2) for a task having two children, so that the task executes automatically when its children complete.
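
For concreteness, here is a minimal sketch of the continuation-passing pattern just described, written against the legacy tbb::task interface (the class names FibTask and FibContinuation are illustrative). Because the continuation's reference count is exactly the number of children, the thread that completes the last child implicitly pushes the continuation onto the bottom of its own deque; there is no explicit wait anywhere.

    #include "tbb/task.h"

    struct FibContinuation : tbb::task {
        long* const sum;
        long x, y;                       // filled in by the two children
        explicit FibContinuation( long* sum_ ) : sum(sum_) {}
        tbb::task* execute() override {  // runs implicitly when its refcount reaches zero
            *sum = x + y;
            return NULL;
        }
    };

    struct FibTask : tbb::task {
        const long n;
        long* const sum;
        FibTask( long n_, long* sum_ ) : n(n_), sum(sum_) {}
        tbb::task* execute() override {
            if( n < 2 ) {
                *sum = n;
            } else {
                FibContinuation& c = *new( allocate_continuation() ) FibContinuation(sum);
                FibTask& a = *new( c.allocate_child() ) FibTask(n-1, &c.x);
                FibTask& b = *new( c.allocate_child() ) FibTask(n-2, &c.y);
                c.set_ref_count(2);      // exactly two children, no extra reference,
                                         // so completing the last child pushes c implicitly
                spawn( b );
                spawn( a );              // no explicit wait: this task returns immediately
            }
            return NULL;
        }
    };

    int main() {
        long result = 0;
        tbb::task::spawn_root_and_wait(
            *new( tbb::task::allocate_root() ) FibTask(10, &result) );
        // result is now 55
    }

Contrast this with the blocking version in Fibonacci Numbers, where set_ref_count(3) and spawn_and_wait_for_all keep the extra reference that prevents the waiting task from being implicitly pushed.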

To summarize, the task scheduler's fundamental strategy is "breadth-first theft and depth-first work". The breadth-first theft rule raises parallelism sufficiently to keep threads busy. The depth-first work rule keeps each thread operating efficiently once it has sufficient work to do.

[8] Double-ended queue.
