(note: this is slide 22 of the nulstein plog)
The first version of nulstein ran on DX9 and only rendered a bunch of untextured cubes. It was much simpler than this version, and it is interesting to look at its performance precisely because it is so basic.
We can see three types of situations:
- eight threads running simultaneously: this is the whole of the update and the first phase of the draw
- the main thread running alone: this is draw call submission (the bump on the left is when Present happens)
- no thread running at all: this is us waiting for vblank
We can clearly see that threading pays off, even in such a simple case, with no collision, physics or character animation systems... You might wonder why the parallel threads seem to add up to more cycles than what is saved compared to the serial version: the reason is that this was captured on a Core i7, so we really have 'only' four cores, and the benefits of hyperthreading are limited in this case.
We can also see that adding more cores wouldn't buy us much more: the time spent drawing is now the majority of the work. This is known as Amdahl's law, which really comes down to "your scaling is limited by your sequential code".
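To put a number on Amdahl's law, here is a minimal sketch of the formula: if a fraction p of the frame can run in parallel and the rest stays serial, the speedup on n cores is 1 / ((1 - p) + p / n). The function names are made up for illustration, and any value of p you plug in is an assumption, not something measured from the capture above.

```cpp
#include <cassert>

// Amdahl's law: speedup on n cores when a fraction p of the work
// (0 <= p <= 1) can run in parallel and the remainder stays serial.
double amdahl_speedup(double p, int n) {
    assert(p >= 0.0 && p <= 1.0 && n >= 1);
    return 1.0 / ((1.0 - p) + p / n);
}

// Upper bound on the speedup as the core count grows without limit.
// Only defined while some serial work remains (p < 1).
double amdahl_limit(double p) {
    assert(p >= 0.0 && p < 1.0);
    return 1.0 / (1.0 - p);
}
```

For example, if half of the frame were parallel (p = 0.5), four cores would give a 1.6x speedup, eight cores about 1.78x, and no number of cores could ever do better than 2x.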
The interesting thing here is how much time is spent preparing draw calls: the DX runtime and/or the driver are spending a significant amount of time preparing the data and sending it over to the GPU. If I were drawing anything more involved than my curvy-cubes, this would only grow bigger. Some games are indeed "draw call bound", i.e. the GPU could draw more and the CPU could handle more objects, but the communication between the two is the bottleneck (i.e. you need to figure out ways to improve batching/instancing).
It would be cool if we could prepare and send draw calls in parallel!
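As a rough illustration of what batching/instancing buys, the sketch below groups objects by mesh on the CPU, so that each group could be submitted as one instanced draw call instead of one call per object. It uses no graphics API at all: `Object`, `Position`, `Batches` and `build_batches` are hypothetical names for this example, not nulstein code.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Hypothetical scene object: which mesh it uses and where it sits.
struct Object { int mesh_id; float x, y, z; };

// Per-instance data that would be uploaded for an instanced draw.
struct Position { float x, y, z; };

// One entry per mesh: all instances sharing that mesh end up in one batch,
// so the number of draw calls becomes the number of distinct meshes.
using Batches = std::map<int, std::vector<Position>>;

Batches build_batches(const std::vector<Object>& objects) {
    Batches batches;
    for (const Object& o : objects) {
        assert(o.mesh_id >= 0);
        batches[o.mesh_id].push_back(Position{o.x, o.y, o.z});
    }
    return batches;
}
```

With four objects sharing two meshes, this turns four draw calls into two; with thousands of identical cubes, it turns thousands into one.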
Turns out that DX11 might have just what we need...
Next time, I'll introduce DX11 deferred contexts (2015 update: ...which I never found time to write, unfortunately)
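In the meantime, the general shape of "prepare in parallel, submit on one thread" can be sketched in plain C++ without any graphics API. Everything here (`DrawCommand`, `CommandList`, `record_range`, `record_in_parallel`) is a hypothetical stand-in: in D3D11 the corresponding pieces would be deferred contexts from `ID3D11Device::CreateDeferredContext`, command lists closed with `FinishCommandList`, and playback on the immediate context with `ExecuteCommandList`. Take it as a sketch of the pattern, not of how nulstein does it.

```cpp
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for one recorded draw call.
struct DrawCommand { int object_id; };
using CommandList = std::vector<DrawCommand>;

// Each worker records draw calls for objects [begin, end) into its own list,
// so no locking is needed while recording.
void record_range(int begin, int end, CommandList& out) {
    for (int id = begin; id < end; ++id) out.push_back(DrawCommand{id});
}

// Record on `workers` threads, then play the lists back in order on the
// calling thread (the role the immediate context would have).
std::vector<DrawCommand> record_in_parallel(int object_count, int workers) {
    assert(object_count >= 0 && workers >= 1);
    std::vector<CommandList> lists(workers);
    std::vector<std::thread> threads;
    for (int w = 0; w < workers; ++w) {
        int begin = object_count * w / workers;
        int end = object_count * (w + 1) / workers;
        threads.emplace_back(record_range, begin, end, std::ref(lists[w]));
    }
    for (auto& t : threads) t.join();

    std::vector<DrawCommand> submitted;
    for (const auto& list : lists)
        submitted.insert(submitted.end(), list.begin(), list.end());
    return submitted;
}
```

Because every worker owns its list and the lists are played back in worker order, the submitted sequence is the same as the serial one; only the recording cost is spread across cores.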
Spoiler (slides+source code): here