nulstein v2 plog - rendering overview



(note: this is slide 20 of the nulstein plog)

Once we have an up-to-date world, we move on to rendering it. Let's start by looking at how I attacked the problem for DX9, in the previous version of nulstein. In this context, all draw calls, all render state changes, everything needs to be submitted through the main thread, so we know there has to be a render phase that is purely serial. Still, we can spread out the work of deciding what to draw and preparing it: occlusion culling, LOD selection and setup, matrix palettes, particle systems, etc.

It all starts with Christer Ericson's solution to keeping things in order (Order your graphics draw calls around!) which really is about generating them in no particular order and sorting afterwards. Please go read this article if you haven't already, as there is no point in me paraphrasing all this good wisdom.

The result is a render phase made of three subphases:

  • subphase 1: entities register (key,parm) pairs for what they need to draw

  • subphase 2: sort

  • subphase 3: talk to DX

The first subphase can be a big nested set of loops, depending on how your engine works, and these loops have a property we like: one iteration is totally independent of the next, as an entity doesn't need information from another to decide what it wants to draw and where (this should all have been calculated during the update phase). Also, because we're going to sort the list anyway, we can have one list per thread and add items to it without any need for locks. Same pattern as before: zero contention, and everything happily happens simultaneously.
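To make the "one list per thread" idea concrete, here is a minimal sketch. It is not the nulstein source: `DrawItem`, `registerDraws`, the placeholder key and the static striding over entities are all assumptions made for illustration, and plain `std::thread` stands in for a real task scheduler.

```cpp
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical item: a sort key plus what is needed to call the entity back.
struct DrawItem {
    std::uint64_t key;     // encodes where this draw belongs in the frame
    std::uint32_t entity;  // which entity registered it
    std::uint32_t parm;    // entity-defined parameter handed back at draw time
};

// Subphase 1: every worker walks its share of the entities and appends to
// its own list. No list is shared between workers, so no lock is needed.
std::vector<std::vector<DrawItem>> registerDraws(std::uint32_t entityCount,
                                                 unsigned workerCount) {
    assert(workerCount > 0);
    std::vector<std::vector<DrawItem>> lists(workerCount);
    std::vector<std::thread> workers;
    for (unsigned w = 0; w < workerCount; ++w) {
        workers.emplace_back([&lists, w, workerCount, entityCount] {
            std::vector<DrawItem>& mine = lists[w];
            // Static striding is only a stand-in for real scheduling.
            for (std::uint32_t e = w; e < entityCount; e += workerCount) {
                // Placeholder key; a real entity would encode render target,
                // translucency, material, depth and so on.
                std::uint64_t key = (std::uint64_t(e % 3) << 32) | e;
                mine.push_back({key, e, 0});
            }
        });
    }
    for (std::thread& t : workers) t.join();
    return lists;
}
```

The order in which items land in each list does not matter, since the sort that follows puts everything in place.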

Sorting a bunch of such small keys is an easy job, especially since you don't really want to be sending tens of thousands of draw calls over DX... So, not only can this be spread over available cores, the item count is also small enough that this phase is unlikely to cause performance issues.
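As a rough illustration of the sort subphase, the sketch below gathers the per-thread lists and sorts them by key. The names (`DrawItem`, `mergeAndSort`) are hypothetical, and for brevity this version sorts on a single thread; a parallel sort could be swapped in without changing the idea.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical item, as registered by entities in subphase 1.
struct DrawItem {
    std::uint64_t key;
    std::uint32_t entity;
    std::uint32_t parm;
};

inline bool keyLess(const DrawItem& a, const DrawItem& b) { return a.key < b.key; }

// Subphase 2: concatenate the per-thread lists, then sort by key.
std::vector<DrawItem> mergeAndSort(const std::vector<std::vector<DrawItem>>& perThread) {
    std::size_t total = 0;
    for (const std::vector<DrawItem>& list : perThread) total += list.size();

    std::vector<DrawItem> all;
    all.reserve(total);
    for (const std::vector<DrawItem>& list : perThread)
        all.insert(all.end(), list.begin(), list.end());

    // The key alone decides the submission order.
    std::sort(all.begin(), all.end(), keyLess);
    assert(std::is_sorted(all.begin(), all.end(), keyLess));
    return all;
}
```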

The last phase is serial:

The important thing to note is how the engine can manage most of the pipeline state by simply comparing the current key with the last one sent.
If the render target fields are different, change render target.
Translucency type changes, change states.
Material ids are different? set material up.
Then, when the entity is called back, it really only has to set up and dispatch its DrawPrimitive(s), which makes its job really simple.
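The comparison with the previous key can be sketched as follows. The bit layout (8 bits of render target, 2 of translucency, 16 of material, 38 of depth) and every name here are assumptions for the sake of the example, and counters stand in for the actual DX calls so that the sketch stays self-contained.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical key layout, most significant field first:
// [ target:8 | translucency:2 | material:16 | depth:38 ]
inline std::uint64_t makeKey(unsigned target, unsigned translucency,
                             unsigned material, std::uint64_t depth) {
    return (std::uint64_t(target & 0xFF) << 56) |
           (std::uint64_t(translucency & 0x3) << 54) |
           (std::uint64_t(material & 0xFFFF) << 38) |
           (depth & ((std::uint64_t(1) << 38) - 1));
}
inline unsigned targetOf(std::uint64_t k) { return unsigned(k >> 56) & 0xFF; }
inline unsigned translucencyOf(std::uint64_t k) { return unsigned(k >> 54) & 0x3; }
inline unsigned materialOf(std::uint64_t k) { return unsigned(k >> 38) & 0xFFFF; }

// Counts what the engine would have sent to the device.
struct SubmitStats {
    int targetChanges = 0;
    int translucencyChanges = 0;
    int materialChanges = 0;
    int draws = 0;
};

// Subphase 3 (serial): walk the sorted keys and only touch the state
// whose field differs from the previous key.
SubmitStats submit(const std::vector<std::uint64_t>& sortedKeys) {
    assert(std::is_sorted(sortedKeys.begin(), sortedKeys.end()));
    SubmitStats stats;
    bool first = true;
    std::uint64_t last = 0;
    for (std::uint64_t key : sortedKeys) {
        if (first || targetOf(key) != targetOf(last))
            ++stats.targetChanges;        // would change the render target
        if (first || translucencyOf(key) != translucencyOf(last))
            ++stats.translucencyChanges;  // would change blend states
        if (first || materialOf(key) != materialOf(last))
            ++stats.materialChanges;      // would set the material up
        ++stats.draws;                    // would call the entity back here
        last = key;
        first = false;
    }
    return stats;
}
```

In this sketch two consecutive draws that share target, translucency and material cost no state change at all, which is the whole point of sorting first.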

Things get rendered in order, state changes are minimal and we still give entities the flexibility to run whatever code makes sense for them, which in most cases is only a few generic functions (mesh, light, billboard...)

An extension to this way of doing things is auto-generating instanced draws, and it would make perfect sense in this demo as it features only six models (cube-hi, cube, cube-lo, UFO-hi, UFO, UFO-lo). Add a few bits to the key to identify the template and the engine would just need to look keys ahead and call an alternate rendering routine when a sequence is detected. This would both make instancing relatively transparent to programmers, and keep the benefit of ordering, especially when translucency is involved.
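Looking ahead in the sorted keys could be done along these lines. This is only a sketch of the idea, not something the demo implements: `Batch`, `findBatches` and the `batchMask` parameter (the key bits that must match for two draws to share one instanced call, template bits included) are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// A run of consecutive items that could go out as one (instanced) draw.
struct Batch {
    std::size_t first;  // index of the first item of the run
    std::size_t count;  // number of consecutive items in the run
};

// Groups consecutive keys that agree on every bit selected by batchMask.
// Only neighbours in the sorted order are merged, so ordering is preserved.
std::vector<Batch> findBatches(const std::vector<std::uint64_t>& sortedKeys,
                               std::uint64_t batchMask) {
    assert(std::is_sorted(sortedKeys.begin(), sortedKeys.end()));
    std::vector<Batch> batches;
    std::size_t i = 0;
    while (i < sortedKeys.size()) {
        std::size_t j = i + 1;
        while (j < sortedKeys.size() &&
               ((sortedKeys[j] ^ sortedKeys[i]) & batchMask) == 0) {
            ++j;
        }
        batches.push_back({i, j - i});
        i = j;
    }
    return batches;
}
```

The engine would then call the regular routine for batches with a count of one and the alternate, instanced routine for longer runs. Where depth sits above the template bits in the key, as it would for translucent items, runs only form between draws that are already adjacent in depth order.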

It's all very simple and that's why I like this approach so much. But how well does it perform?

Next time, performance of nulstein 1
Spoiler (slides+source code): here

 

 

 

 

 

 


7 comments

jerome-muffat-meridol (Intel)

Dmitriy,

You're right, I didn't point that out: since the world has been updated, there should be no need to write anything to entities. You can still find uses for an entity writing to itself about what has ended up being drawn. Sometimes it is useful, during update, to know that we're not visible anywhere so far, and a "simplified path" can be taken (or, inversely, that we're very visible). I think we can safely rule that an entity can only write to its "mind" during draw, and that game state is read-only.

The prioritization would be useful, and if there was some way to do the bulky work first (dependencies are just a special case of bulk) and keep the fine-grained stuff for the end, it would make things really nice. Unfortunately, it's not so easy to do because of how TBB works (or my tiny scheduler), but also because dependencies and bulkiness are (tremendously) difficult to predict from frame to frame. So far, things behave OK because the number of entities is large compared to the length of the dependency chains...

Dmitry Vyukov

> did I capture your question right?

Yes.
I get the point regarding dependencies. So the Draw phase basically includes an Update Mind-like sub-phase, that is, entities can read any other entities (all entities are read-only), right?

> maybe we can have some entities start their draw phase while the very last ones are still finishing their update

I think that prioritizing entities during the update phase is worth doing anyway. Consider: you have some number of independent entities, and some number of entities organized into a dependency list (that is, A->B->C->D....). If a scheduler processes all independent entities first, then all dependent entities will be processed on a single thread (there is no available parallelism between them). So I think "dependency sources" should be processed first to the extent possible. And it will help with drawing, because reading during the Draw phase is also a dependency.
However, yes, excessive synchronization may be a problem. Perhaps executing a separate Draw phase (where entities are allowed to read any other entities) may be a better solution than tracking a lot of fine-grained dependencies.

jerome-muffat-meridol (Intel)

marshalsingh24,

I'm not sure what you mean by time loop, but will assume you mean "isn't the time spent in these nested loops going to turn into wasted time if most entities get culled?"

Yes, indeed, this is a potential problem. One thing that is implemented in nulstein is a flag that can be set by entities that know they won't draw, like those that never draw (ex: cameras) or those that are currently hidden (ex: UFOs' initial state). The array that maps entities' ids to pointers holds this flag, so it really is a cheap fetch (ie it won't cause any more cache misses than accessing just the pointer). The inner loop is very fast.

Then, you may want to ask why there isn't a bounding volume alongside this flag: we could then do culling right from the loop, and do a virtual call only when necessary. I have pondered doing this and couldn't decide on which bounding volume to use. In the end, my conclusion on this was that if you are confronted with a bottleneck at culling level, you don't want to just make the culling of an individual entity faster, you want to make it so that many entities get culled at once.
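Roughly, the inner loop looks like the sketch below. This is an illustration rather than the actual nulstein code: `Entity`, `EntitySlot`, `Ufo` and `drawVisible` are made-up names, and the real table may well pack the flag differently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical entity interface: drawing goes through a virtual call.
struct Entity {
    virtual ~Entity() {}
    virtual void draw() = 0;
};

// One slot of the id-to-pointer table. The flag sits right next to the
// pointer, so testing it touches no memory beyond the table itself.
struct EntitySlot {
    Entity* entity;
    bool hidden;  // set by entities that know they won't draw
};

// Example entity that just counts how often it was asked to draw.
struct Ufo : Entity {
    int drawn = 0;
    void draw() override { ++drawn; }
};

// Skips hidden slots without ever dereferencing their entity pointer.
// Returns the number of virtual draw calls actually made.
std::size_t drawVisible(const std::vector<EntitySlot>& table) {
    std::size_t calls = 0;
    for (const EntitySlot& slot : table) {
        if (slot.hidden) continue;  // cheap test, no virtual call
        assert(slot.entity != nullptr);
        slot.entity->draw();
        ++calls;
    }
    return calls;
}
```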

To achieve this, the simplest way is to hide the entities that apply (using the flag mentioned previously), and delegate the draw to another entity which will handle whatever acceleration structure makes sense (say BSP tree): the acceleration entity can then call each individual entity's draw function when it decides it is visible. This is what I meant in slide 10 when I said there is no "special-case" and every module you find in a game needs to become entities (http://software.intel.com/en-us/blogs/2010/09/20/nulstein-v2-plog-parallelizing-at-the-outer-loop/)

And to answer your question directly: yes, it is doable to have the engine collect a list of "to-draw" entities during the Update phase (well, a list per thread, of course), and loop on these instead of looping on the main entities' array. The question then turns into: "what is faster, writing the list to memory or skipping over hidden entities?" I don't know.

jerome-muffat-meridol (Intel)

Dmitriy,

"What about having the entity say what it wants to draw during its update, and have the render phase start at the sort ?", did I capture your question right?

There is a big dependency chain that starts at the camera, and it can be very impractical to decide what to draw without knowing which camera(s) will be looking at you, and from where. There is the obvious question of culling: maybe this entity doesn't get drawn at all. There is also the question of Level-Of-Detail, maybe you want to use a simpler model in the distance, maybe it's simpler materials you want to use. You need to know the camera's parameters to decide what you want to render, and that's for every view where the entity is visible.

Then, why not update the cameras really early and make this possible? Camera might be attached to car, car might be involved in a bit of a mischief with another car (ie be a dependent entity), and here you have it: cars can't draw because they won't know where the camera is until they are finished updating.

But I must admit that I have also split the engine into update/draw because "that's the way it's always been done". Your question prompts me to look into the possibility of starting the draw phase early (ie not exactly what you suggest): maybe we can have some entities start their draw phase while the very last ones are still finishing their update. The non-trivial part is in the task scheduler: we need a way to prioritize cameras, and we need a way to have the scheduler pick up update tasks in priority over draw tasks. Having fewer sync points is always a good aim to have, and I have a feeling there is something worth digging into here.

marshal-singh

Sorry, it's "time" in my above comment....

marshal-singh

I think at some point of tome loop would start proving to be an overhead.

Dmitry Vyukov

Isn't it possible to combine the Draw phase with the Update State phase?
