nulstein v2 plog - refresher on task scheduling, when the fairest is theft

(note: this is slide 9 of the nulstein plog)

Last time, we saw that it makes sense for each worker thread to have its own TODO list, so it can pick from it and add to it without bothering its neighbours. We also saw that, because we can't tell how long each individual task will take to run, we need some way to balance the load.

In the diagram from last slide, we had three queues, each initially containing thirteen tasks. In the absence of any further information, giving each thread the same task count is the best approximation we can make and, as can be seen, results in something that's not necessarily balanced. It's good enough for the first half, until thread two finishes its list and starves. What needs to happen next is very simple: this thread wants to shop around for something to do. It finds a victim and grabs half of its remaining tasks (again approximating: pretending that all tasks take the same time to run).
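As a toy, count-only illustration of that steal (this is not the actual nulstein scheduler, and `steal_from_busiest` is an invented name): the starving worker picks the queue with the most remaining tasks and takes half of them, pretending every task costs the same.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Count-only sketch: each entry is how many tasks a worker still has.
// The thief targets the busiest queue and takes half, rounded down.
void steal_from_busiest(std::vector<int>& remaining, std::size_t thief) {
    std::size_t victim = std::max_element(remaining.begin(), remaining.end())
                         - remaining.begin();
    int stolen = remaining[victim] / 2;   // half of the victim's backlog
    remaining[victim] -= stolen;
    remaining[thief]  += stolen;
}
```

With queues at 9, 0 and 4 tasks, one steal by the starving worker already brings the counts to 5, 4 and 4 — close to balanced without any global coordination.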

To be able to do this, we need to manage the potential for contention we just created: a thread is still the only one that can add to its own queue, but there is now a possibility that two threads try to pop from the same one. The thing to notice is that the level of contention is much lower than with the single job-queue I mentioned last time: only two threads are affected, while the others continue pumping from their own queues, unaware. The victim thread is often unaware too: if we can steal from it, then chances are it is busy working while we're raiding its queue (and if that probability turns out to be too low, then that should mean the tasks are cut too small...). After stealing, we have acquired a significant number of tasks and shouldn't have to repeat the process any time soon.
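A minimal sketch of such a per-worker queue, under simplifying assumptions (a plain mutex per queue, integer task IDs; the real scheduler can be cleverer than this): the owner pushes and pops at the back of its own queue while a thief takes half from the front, so a lock is only ever contended between one owner and one thief.

```cpp
#include <cstddef>
#include <deque>
#include <mutex>
#include <optional>

// Illustrative per-worker queue (names are not from nulstein).
struct WorkQueue {
    std::deque<int> tasks;
    std::mutex m;

    void push(int t) {                         // owner adds work
        std::lock_guard<std::mutex> g(m);
        tasks.push_back(t);
    }
    std::optional<int> pop() {                 // owner takes from the back
        std::lock_guard<std::mutex> g(m);
        if (tasks.empty()) return std::nullopt;
        int t = tasks.back();
        tasks.pop_back();
        return t;
    }
    std::size_t steal_half(WorkQueue& thief) { // thief takes from the front
        std::scoped_lock g(m, thief.m);        // lock both, deadlock-free
        std::size_t n = tasks.size() / 2;
        for (std::size_t i = 0; i < n; ++i) {
            thief.tasks.push_back(tasks.front());
            tasks.pop_front();
        }
        return n;
    }
};
```

Working opposite ends also means the thief walks away with the oldest tasks, while the owner keeps touching the freshest ones it queued itself — a nice bonus for cache locality.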

The illustration above is naive in that it pretends we steal from the busiest thread and pick the exact amount of work that ends up balancing everything out neatly. In reality, things are more iterative, but the result is nevertheless something like the above: a reasonably balanced distribution of the work between threads (and if it turns out not to be the case, then that should mean some tasks aren't cut small enough and run towards the end). This good balance is really important in this project, as it is what lets us think in terms of sequential phases without worrying too much about contention in between phases.

Of course, things aren't that simple. I'm sure you noticed that I first mentioned tasks could be cut too small, and then that they might not be cut small enough; that probably got you worried. In practice, what is detrimental is having tasks that are too different in size, by an order of magnitude or more. It's the classic rocks-stones-sand tale. There are two ways to address it: figure out a way to get the rocks in first, then the stones, then the sand; or break rocks into stones and stones into sand as we go. The latter usually turns out to work great and, because it happens "just in time", the overhead of "grinding" gets spread over all the cores involved. Tuning tasks is mainly about identifying the big ones (which is easy) and focusing on parallelising them (== having them queue more subtasks). One such example in the nulstein demo is the music playback, which started off as a single "synth" task that could take too long and was changed to queue several "synth track" tasks as part of its execution.
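A hedged sketch of "breaking rocks into stones as we go" — `Range`, `GRAIN` and `run_task` are illustrative names, not nulstein's API. A task covering too large a range queues two halves instead of running whole, so the grinding cost is paid just in time, on whichever core happens to run the task:

```cpp
#include <deque>

struct Range { int begin, end; };
constexpr int GRAIN = 4;   // assumed largest "stone" we run directly

// A rock (range larger than GRAIN) splits into two subtasks that go
// back on the queue; anything small enough is executed on the spot.
void run_task(Range r, std::deque<Range>& queue, int& work_done) {
    if (r.end - r.begin > GRAIN) {
        int mid = r.begin + (r.end - r.begin) / 2;
        queue.push_back({r.begin, mid});
        queue.push_back({mid, r.end});
    } else {
        work_done += r.end - r.begin;
    }
}
```

Fed a single 16-element rock, the loop that drains the queue ends up executing four 4-element stones — and because the halves go back on a stealable queue, other workers can pick them up as soon as they appear.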

There are also many complications inside the scheduler itself. Like how best to distribute tasks initially. Or how to limit contention during the "end game": when only a few threads run the last remaining tasks, idle workers need to avoid hammering the queues with requests for work, while still being at the ready in case a burst of new tasks happens. Various strategies can be used to decide who to steal from, and how much to take. The nice thing about all these remaining complications is that they can be hidden inside a black box: the task scheduler.
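One possible end-game policy (an assumption on my part, not nulstein's actual code): after repeated failed steals, an idle worker backs off exponentially, with the delay capped so it still reacts quickly when a burst of new tasks shows up.

```cpp
#include <algorithm>
#include <chrono>
#include <thread>

// How long to wait after `failed_attempts` consecutive failed steals:
// nothing at first (stay hot), then an exponentially growing, capped sleep.
std::chrono::microseconds backoff_delay(int failed_attempts) {
    if (failed_attempts < 4)
        return std::chrono::microseconds(0);       // just yield, stay hot
    return std::chrono::microseconds(1 << std::min(failed_attempts, 14));
}

void backoff_wait(int failed_attempts) {
    auto d = backoff_delay(failed_attempts);
    if (d.count() == 0) std::this_thread::yield(); // cheap spin early on
    else std::this_thread::sleep_for(d);           // then sleep, capped
}
```

The cap matters: without it, a worker that has been idle for a while could sleep so long that a fresh burst of tasks sits unserved.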

The purpose of this slide, and the previous one, was to give people unfamiliar with the topic enough of a peek inside this black box that they can have a better idea of what it does and why it is an efficient way to load-balance the system we are building. If you want to know more on the topic, I can wholeheartedly recommend the Intel Threading Building Blocks book as it has been the perfect stepping stone for me.

Next time, Game Entities
Spoiler (slides+source code): here



jerome-muffat-meridol (Intel):

Maybe they seem inconsistent because I'm probably not being clear enough: when I talk about identifying big ones, I'm thinking about profiling the game, identifying tasks that turn out to be big and, then, changing their code. The example of the softsynth shows the thought process: "hey, that task often shows up as the last one to run, it takes too long to finish, let's break it down" — and, sure enough, after a bit of coding, the update phase became better balanced.

In a game, you can't predict how long a task is going to take without breaking things down to unmanageable levels. The classic case is the one where you need to cast raytests, like when checking if a guard has anybody in his line of sight. The raytest itself will take a very different time depending on whether you're looking towards the sky or towards the horizon. Then, the consequences of that raytest hitting something may trigger behaviours that can take various amounts of time too. So, yes, you could break all these things down to such a level that each aspect of the problem is separate and executes in known time, and on platforms like the PS3, where things have to be broken down into small parts anyway, it can make sense to do so. I'm exploring a different approach where you break things down enough that they can be load-balanced, but they remain somewhat chunky nevertheless. This makes life easier in a very interactive development process: in a game you want to tune the rules until they are _fun_, and this can mean pretty radical changes happening as late as during beta (which is actually the point where the ratio of breadth of changes to be made to available time can be at its highest...).

I stand by my statement: trying to evaluate how long an individual game behaviour is going to take is akin to me trying to tell how long I'll take to solve a given sudoku grid (without knowing its difficulty level). I can give an upper bound, I can probably give an average, but that won't help make any useful prediction. I wouldn't be surprised if the time to run an individual behaviour could be thought of as what Benoît Mandelbrot calls "wild randomness"... And really, as long as we have several times more game entities than we have physical threads, I don't see this as a big issue.

Dmitry Vyukov:

> We also saw that, because we can't tell how long each individual task will take to run, we need some way to balance the load.
> Tuning tasks is mainly about identifying the big ones (which is easy) and to focus on parallel-ising them (== have them queue more subtasks)

Those statements are inconsistent. If you have no idea how long each individual task will take to run, how are you possibly able to divide big tasks into small tasks?

If you write the code, how can you not tell how long each individual task will take to run in the first place (code that accesses web services aside for a moment)? That knowledge is required to do any sane work-balancing, so you need to do the estimation anyway.
Perhaps you meant that tasks are made somewhat different in size to reduce development complexity (i.e. tasks are of some *natural* granularity for the problem).

