nulstein v2 plog - fully parallel game loop demo

(note: this is slide 6 of the nulstein plog)

I hear that the last slide was a bit of a cliffhanger. Sorry for having kept you waiting so long: I was trying to make a video to attach to this page and I just can't seem to get it right (feels a bit like not being able to spell AC/DC, I must say). I thought I should just move on and ask you to kindly go and download the binaries from the nulstein page instead.

As a matter of fact, I didn't do a much better job of presenting this demo in Cologne... I used a Core i7 980X, six hyperthreaded cores, fitted with an ATI 5870: I should have expected that connecting it to a 1024x768 projector would be akin to bringing a big double-action six-shooter loaded with dumdums to a fun fair stall... The GPU was left with nothing much to do, and it became really difficult to make a clear demo out of this...

Let's try again!

In the binaries archive, there is a bin directory with two subdirectories:

  • Win32.DX11.Release: contains a 32-bit executable that uses demoscene tricks to remain small (using farbrausch's kkrunchy, it packs under 32K, which isn't magnificent but is still cool)

  • x64.DX11.Release: contains 64-bit binaries without tricks, i.e. more like what you'd find in a video game

Pick the executable you want, they do the same thing; there might be a slight advantage to the 32-bit version as it doesn't depend on any DLL and can run straight from the archive.

You will see a bunch of cubes. I'll spare you the counting: there are 4096 of them. There are also 4098 point lights. Two of them are static and wide-ranged to give some basic lighting to the scene; the remaining 4096 have short ranges, are brightly coloured, and each follows a corner of a different cube. They illustrate a classic dependency case: the position of a light cannot be updated until we know the position and orientation of its parent cube. This is the same kind of situation you find in a racing game, where the camera wants to be placed at a given position and orientation relative to a car. These lights also jump from one cube to another on a regular basis, flying on their own for a while. This is meant to illustrate the fact that dependencies in games never last forever and will change.
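To make the cube/light dependency concrete, here is a minimal sketch of a two-phase update. It is not taken from the nulstein source: the `Cube` and `Light` structs, the function names and the single-axis rotation are all assumptions made for illustration, and the loops are written sequentially even though every iteration within a phase is independent and could be handed to a different worker thread.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

// Hypothetical entity layouts, for illustration only.
struct Cube {
    Vec3  position;  // world-space centre
    float yaw;       // rotation around the Y axis, in radians
    float spin;      // angular speed, in radians per second
};

struct Light {
    std::size_t parent;    // index of the cube currently followed
    Vec3        offset;    // cube-local corner the light sticks to
    Vec3        position;  // world-space result, valid after update_lights
};

// Phase 1: cubes depend on nothing else, so iterations are independent.
void update_cubes(std::vector<Cube>& cubes, float dt) {
    for (Cube& c : cubes) c.yaw += c.spin * dt;
}

// Phase 2: runs only once every cube is final for this frame.
// Each light reads its parent cube and writes only to itself.
void update_lights(std::vector<Light>& lights, const std::vector<Cube>& cubes) {
    for (Light& l : lights) {
        assert(l.parent < cubes.size());
        const Cube& c = cubes[l.parent];
        const float s  = std::sin(c.yaw);
        const float co = std::cos(c.yaw);
        // Rotate the local offset around Y, then translate by the cube centre.
        l.position = { c.position.x + co * l.offset.x + s * l.offset.z,
                       c.position.y + l.offset.y,
                       c.position.z - s * l.offset.x + co * l.offset.z };
    }
}
```

A light "jumping" to another cube is then just a change of `parent` between frames: the dependency is data, not something baked into the update order.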

The lights also come from an important choice of rendering method: I wanted to see how DX11 deferred contexts would deal with code that needs several passes. So I thought deferred shading would do just that: cubes would go to a GBuffer, and lights would make pixels visible... The only problem is we now have deferred contexts and deferred shading, which is bound to create confusion. I guess "deferred" appeals to coders, alongside JustInTime, because we keep postponing things and always struggle to explain why it can actually be the right thing to do... I know, I'm digressing again, and we're not making progress.

So, you have spinning cubes and rotating lights. Press 'U' and you'll get invaded by UFOs on top. That's another 4096 objects; they fly forward avoiding cubes and do a 180-degree turn when they come near the end of the "playzone". The original idea was to have them fight each other, but as we'll see a little later, I got into a bit of a fistfight with DX11 before getting the performance I expected, and that ate my time budget. Next time...

And now, for the real question: "what about performance?"

  • run nulstein

  • press F1, this shows the information console. You'll see the frame-rate (averaged over 16 frames and over 128) and also the number of threads in use.

  • press 'V' to disable VSync

  • press 'W' to enable dummy work

  • press the Up arrow until the frame rate (just) starts to drop
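The console shows the frame rate averaged over 16 frames and over 128 frames. How nulstein computes these averages is not described here; one plausible way, sketched below with a hypothetical `FrameAverager` class, is a fixed-size ring buffer of frame durations with a running sum. The short window reacts quickly (useful for spotting the moment the frame rate "just" starts to drop), the long one is steadier.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Rolling average over the last N frame durations (hypothetical helper).
template <std::size_t N>
class FrameAverager {
    static_assert(N > 0, "window must hold at least one frame");
public:
    // Record the duration of the frame that just finished, in seconds.
    void add(double frame_seconds) {
        assert(frame_seconds >= 0.0);
        sum_ += frame_seconds - samples_[next_];  // replace the oldest sample
        samples_[next_] = frame_seconds;
        next_ = (next_ + 1) % N;
        if (count_ < N) ++count_;
    }

    // Frames per second over the window; 0 until at least one frame is recorded.
    double fps() const {
        return (count_ == 0 || sum_ <= 0.0) ? 0.0 : count_ / sum_;
    }

private:
    std::array<double, N> samples_{};  // zero-initialised ring buffer
    double      sum_   = 0.0;
    std::size_t next_  = 0;
    std::size_t count_ = 0;
};
```

Feeding the same frame times to a `FrameAverager<16>` and a `FrameAverager<128>` gives the two readouts side by side.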

What is this dummy work? Each rotating cube has a loop during its update (i.e. the place where CPU cycles are actually useful to the game), computing one cosine per iteration. If your Dummy Work count is 4096, then each cube is doing 4096 cos ops (i.e. 16M cos ops in total). This works like a tare: when the frame rate just starts to drop, you are becoming CPU bound.
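As a rough idea of what such a loop can look like, here is a sketch. The function names are made up and the real nulstein loop may differ; the one design point worth noting is that each cosine feeds the next, and the result is returned, so that an optimising compiler cannot simply throw the loop away.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Burn CPU time: `iterations` dependent cosine evaluations (hypothetical helper).
double dummy_work(std::size_t iterations, double seed) {
    double acc = seed;
    for (std::size_t i = 0; i < iterations; ++i) {
        acc = std::cos(acc);  // each step depends on the previous one
    }
    // Sanity check: after at least one cos, the value lies in [-1, 1].
    assert(iterations == 0 || (acc >= -1.0 && acc <= 1.0));
    return acc;
}

// Total cosine evaluations per frame when every cube runs the loop.
constexpr std::size_t total_cos_ops(std::size_t cubes, std::size_t per_cube) {
    return cubes * per_cube;
}
```

With 4096 cubes and a Dummy Work count of 4096, `total_cos_ops(4096, 4096)` is 16,777,216, the "16M" quoted above.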

You can change the number of physical threads in use with the keys '1'..'0' (the row above the letters, not the numpad), with '0' meaning 'all'.
Change the amount of dummy work with the Up/Down arrows.

On my computer (Core i7 980X, 3.33GHz, ATI 5870, at 1920x1200), I get these Dummy Work counts before the frame rate starts to be affected:

  • 12 threads: around 4000

  • 6 threads: around 3500

  • 4 threads: around 1200

  • 2 threads: around 400

This is a very approximate way to measure how much additional CPU muscle is made available, but it gives an idea of how much more work each entity-behaviour can get done in a single frame.
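To compare such readings, one simple option is to express each Dummy Work count relative to a baseline run. The helper below is only an illustration (the name `headroom_ratio` is made up), and the result is a ratio of spare per-entity CPU budget, not a measured speedup of the whole frame.

```cpp
#include <cassert>
#include <cstddef>

// Ratio between two Dummy Work counts read at the point where the
// frame rate just starts to drop (hypothetical helper).
double headroom_ratio(std::size_t dummy_work, std::size_t baseline) {
    assert(baseline > 0);
    return static_cast<double>(dummy_work) / static_cast<double>(baseline);
}
```

With the approximate numbers above and the 2-thread run as baseline, `headroom_ratio(1200, 400)` is 3, `headroom_ratio(3500, 400)` is 8.75 and `headroom_ratio(4000, 400)` is 10.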

So, what numbers do you get?

Next time, a bit of history before diving into the details.
Spoiler (slides+source code): here


