nulstein v2 plog - refresher on task scheduling, when the fairest is theft



(note: this is slide 9 of the nulstein plog)

Last time, we saw that it makes sense for each worker thread to have its own TODO list, so it can pick from it and add to it without bothering its neighbours. We also saw that, because we can't tell how long each individual task will take to run, we need some way to balance the load.

In the diagram from the last slide, we had three queues, each initially containing thirteen tasks. In the absence of any further information, giving each thread the same task count is the best approximation we can make and, as can be seen, it results in something that's not necessarily balanced. It's good enough for the first half, until thread two finishes its list and starves. What needs to happen next is very simple: this thread goes shopping for something to do. It finds a victim and grabs half of its remaining tasks (again approximating, by pretending all tasks take the same time).
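To make the move concrete, here is a deliberately naive sketch of the theft itself (illustrative code, not the actual nulstein source): the queue protects itself with a plain mutex, and a starving thread moves half of a victim's remaining tasks over in one go. Real schedulers tend to use lock-free deques instead, but the shape of the operation is the same.

```cpp
#include <deque>
#include <functional>
#include <mutex>

using Task = std::function<void()>;

struct TaskQueue
{
    std::deque<Task> tasks;
    std::mutex lock;    // a lock-free deque would go here in a real scheduler

    // Thief side: move roughly half of this queue's tasks, oldest first,
    // into the thief's queue. Returns how many tasks were taken.
    size_t StealHalfInto(TaskQueue& thief)
    {
        std::scoped_lock guard(lock, thief.lock);  // locks both, deadlock-free
        size_t n = tasks.size() / 2;
        for (size_t i = 0; i < n; ++i)
        {
            thief.tasks.push_back(std::move(tasks.front()));
            tasks.pop_front();
        }
        return n;
    }
};
```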

To be able to do this, we need to manage the potential for contention we just created: each thread is still the only one that can add to its own queue, but there is now a possibility that two threads pop from the same one. The thing to notice is that the level of contention is much lower than with the single job queue I mentioned last time: only two threads are affected, and the others continue pumping from their own queues, unaware. The victim thread is often unaware too: if we can steal from it, chances are it is busy working while we raid its queue (and if that probability turns out to be too low, it should mean the tasks are cut too small...). After stealing, we have acquired a significant number of tasks and shouldn't have to repeat the process again soon.
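Continuing the sketch above, a worker loop along these lines shows where that limited contention comes from: the owner pops from the back of its own queue and only touches another queue when starving. The PickVictim helper is my own invention for the example; it just eyeballs queue lengths, and that unlocked read is racy, but it's only used as a hint.

```cpp
#include <thread>

// Pick the queue that currently looks fullest. The unlocked size() reads
// are racy, but we only use them as a hint of where the work probably is.
int PickVictim(TaskQueue* queues, int count, int self)
{
    int best = -1;
    size_t most = 0;
    for (int i = 0; i < count; ++i)
        if (i != self && queues[i].tasks.size() > most)
        {
            most = queues[i].tasks.size();
            best = i;
        }
    return best;
}

void WorkerLoop(TaskQueue* queues, int count, int self)
{
    for (;;)    // termination handling omitted
    {
        Task task;
        {
            std::lock_guard<std::mutex> guard(queues[self].lock);
            if (!queues[self].tasks.empty())
            {
                task = std::move(queues[self].tasks.back());  // owner end
                queues[self].tasks.pop_back();
            }
        }
        if (task)
        {
            task();     // may push new tasks onto queues[self]
            continue;
        }
        // Starving: shop around. Only this thread and the victim are
        // affected; every other worker keeps pumping its own queue.
        int victim = PickVictim(queues, count, self);
        if (victim < 0 || queues[victim].StealHalfInto(queues[self]) == 0)
            std::this_thread::yield();    // nothing to steal yet
    }
}
```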

The illustration above is naive in that it pretends we steal from the busiest thread and pick the exact amount of work that ends up balancing everything out neatly. In reality, things are more iterative, but the result is nevertheless something like the above: a reasonably balanced distribution of the work between threads (and if that turns out not to be the case, it should mean some tasks aren't cut small enough and run towards the end). This good balance is really important in this project, as it is what lets us think in terms of sequential phases without worrying too much about contention between phases.

Of course, things aren't that simple. I'm sure you noticed that I first mentioned tasks could be cut too small, and then that they might not be cut small enough; that probably got you worried. In practice, what is detrimental is having tasks that differ too much in size, by an order of magnitude or more. It's the classic rock-stone-sand tale. There are two ways to address it: figure out a way to get the rocks in first, then the stones, then the sand; or break rocks into stones and stones into sand as we go. The latter usually turns out to work great and, because it happens "just in time", the overhead of "grinding" gets spread over all the cores involved. Tuning tasks is mainly about identifying the big ones (which is easy) and focusing on parallelising them (i.e. having them queue more subtasks). One such example in the nulstein demo is the music playback, which started off as a single "synth" task that could take too long and was changed to queue several "synth track" tasks as part of its execution.
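As a hedged illustration of that grinding, here is roughly what the synth change looks like in spirit. The names Frame, SynthTrack and QueueTask are made up for the example (the demo's actual code differs): the point is that the parent task only queues one subtask per track and returns, so the work gets picked up by whichever cores are free.

```cpp
#include <functional>

struct Frame { int trackCount; /* audio buffers, etc. */ };

void SynthTrack(Frame* frame, int track);      // renders a single track
void QueueTask(std::function<void()> task);    // pushes onto the current worker's queue

// The "synth" rock becomes per-track stones: instead of rendering the
// whole mix itself, the task queues one subtask per track and returns.
void SynthTask(Frame* frame)
{
    for (int t = 0; t < frame->trackCount; ++t)
        QueueTask([frame, t] { SynthTrack(frame, t); });
}
```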

There are also many complications inside the scheduler itself: how best to distribute tasks initially, or how to limit contention during the "end game", when only a few threads run the last remaining tasks and idle workers need to avoid hammering the queues with requests for work while still staying ready in case a burst of new tasks arrives. Various strategies can be used to decide whom to steal from and how much to take. The nice thing about all these remaining complications is that they can be hidden inside a black box: the task scheduler.
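For the end game, one plausible strategy (an assumption on my part, not necessarily what the nulstein scheduler does) is to back off exponentially between failed steal attempts, with a cap so that a burst of new tasks still gets picked up quickly:

```cpp
#include <algorithm>
#include <chrono>
#include <thread>

// Called after each failed steal attempt; the caller resets failedSteals
// to zero as soon as a steal succeeds.
void BackOff(int& failedSteals)
{
    if (failedSteals < 10)
        std::this_thread::yield();    // stay hot for the first few misses
    else
    {
        int exp = std::min(failedSteals - 10, 6);
        std::this_thread::sleep_for(
            std::chrono::microseconds(50 << exp));  // capped at 3.2 ms
    }
    ++failedSteals;
}
```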

The purpose of this slide, and the previous one, was to give people unfamiliar with the topic enough of a peek inside this black box that they can have a better idea of what it does and why it is an efficient way to load-balance the system we are building. If you want to know more on the topic, I can wholeheartedly recommend the Intel Threading Building Blocks book as it has been the perfect stepping stone for me.


Next time: Game Entities
Spoiler (slides+source code): here
