Software Occlusion Culling Update 2

This update adds two more rasterizer optimizations. They were implemented by Fabian Giesen and have been integrated into the sample. Together, the two optimizations improve frame rate by ~13% and reduce total cull time by ~27%.

 

Coarse depth pretest:

After rasterizing the occluders to the depth buffer, we create a coarse depth buffer. It summarizes each 8x8 block of depth values by the most distant Z value in that block (the smallest z value in this case). Then, for each AABB, we loop over all 8x8 pixel blocks covered by the box. If the nearest Z value of the whole box is behind the farthest Z value of every covered block in the coarse depth buffer, the box cannot possibly be visible. We mark such AABBs as occluded and skip rasterizing and depth testing them.
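The pretest can be sketched in scalar C++ as follows. This is a minimal illustration, not the sample's SSE code. The names `BuildCoarseDepth`, `IsBoxOccludedCoarse`, and `ScreenRect` are hypothetical. The sketch assumes that larger z means nearer (so the most distant value is the minimum), that the buffer dimensions are multiples of 8, and that the box's screen rectangle is already clipped to the buffer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int kBlock = 8; // block edge in pixels, as described in the text

// Build one value per 8x8 block: the most distant depth in that block.
// Assumption: larger z is nearer, so "most distant" is the minimum.
std::vector<float> BuildCoarseDepth(const std::vector<float>& depth,
                                    int width, int height)
{
    assert(width % kBlock == 0 && height % kBlock == 0);
    assert(depth.size() == static_cast<std::size_t>(width) * height);
    const int bw = width / kBlock;
    const int bh = height / kBlock;
    std::vector<float> coarse(static_cast<std::size_t>(bw) * bh);
    for (int by = 0; by < bh; ++by) {
        for (int bx = 0; bx < bw; ++bx) {
            float farthest = depth[(by * kBlock) * width + bx * kBlock];
            for (int y = 0; y < kBlock; ++y)
                for (int x = 0; x < kBlock; ++x)
                    farthest = std::min(
                        farthest,
                        depth[(by * kBlock + y) * width + bx * kBlock + x]);
            coarse[by * bw + bx] = farthest;
        }
    }
    return coarse;
}

// Hypothetical screen-space bounds of a projected AABB (inclusive pixels).
struct ScreenRect { int x0, y0, x1, y1; };

// Returns true only if the box's nearest depth is behind the farthest
// occluder depth in every covered block. A false result means "unknown":
// the box still needs the full rasterize-and-depth-test path.
bool IsBoxOccludedCoarse(const std::vector<float>& coarse, int width,
                         const ScreenRect& r, float boxNearestZ)
{
    const int bw = width / kBlock;
    for (int by = r.y0 / kBlock; by <= r.y1 / kBlock; ++by)
        for (int bx = r.x0 / kBlock; bx <= r.x1 / kBlock; ++bx)
            if (boxNearestZ >= coarse[by * bw + bx])
                return false; // box may be in front of something here
    return true;
}
```

The test is conservative: a block containing even one uncovered pixel keeps the cleared (farthest) depth as its summary, so any box overlapping it falls through to the full test.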

Binners store triangle data:

In the previous version, each bin stored the triangle IDs of the triangles that belonged to it during the triangle binning pass. This caused an extra gather during the rasterization pass, because the triangles had to be looked up by their IDs and gathered. To avoid that extra gather, the triangle bins in this update store the triangle data directly.
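The difference between the two bin layouts can be illustrated with a small scalar sketch. The `Triangle` layout, the `IndexBin` and `DataBin` types, and the `NearestZ...` functions are hypothetical stand-ins, not the sample's actual data structures; the per-triangle work here (finding the nearest z, assuming larger z is nearer) simply stands in for rasterization.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical post-transform triangle: three screen-space vertices.
struct Triangle { float x[3], y[3], z[3]; };

// Previous approach: a bin holds IDs into a shared triangle array.
struct IndexBin { std::vector<std::uint32_t> triIds; };

// Updated approach: a bin holds copies of the triangle data itself.
struct DataBin { std::vector<Triangle> tris; };

void BinById(IndexBin& bin, std::uint32_t id) { bin.triIds.push_back(id); }
void BinByValue(DataBin& bin, const Triangle& t) { bin.tris.push_back(t); }

// Consuming an index bin: every triangle needs a dependent lookup by ID
// (the extra gather) before any work can start.
float NearestZViaIds(const IndexBin& bin, const std::vector<Triangle>& all)
{
    float nearest = 0.0f;
    for (std::uint32_t id : bin.triIds) {
        assert(id < all.size());
        const Triangle& t = all[id]; // indirect, scattered read
        nearest = std::max({nearest, t.z[0], t.z[1], t.z[2]});
    }
    return nearest;
}

// Consuming a data bin: triangles are read linearly, with no lookup.
float NearestZDirect(const DataBin& bin)
{
    float nearest = 0.0f;
    for (const Triangle& t : bin.tris)
        nearest = std::max({nearest, t.z[0], t.z[1], t.z[2]});
    return nearest;
}
```

Storing the data trades larger bins and more writes during binning for contiguous reads during rasterization; whether that pays off depends on the workload, and the measurements in this update suggest it does for this sample.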

Performance:

We measured the performance of the Software Occlusion Culling sample on a 2.3 GHz 4th gen Intel® Core™ processor (Haswell) system with 4 cores / 8 threads and Intel® HD Graphics GT3CW. We set the rasterizer technique to SSE, the occluder size threshold to 1.5, the occludee size threshold to 0.01, and the number of depth test tasks to 20. We enabled frustum culling and multi-tasking and disabled vsync.
The castle scene has 115 occluder models and 48700 occluder triangles. It has 27025 occludee models (occluders are treated as occludees) and ~1.9 million occludee triangles.

The time taken to rasterize the occluders to the depth buffer on the CPU was ~0.79 milliseconds, and the time taken to depth test the occludees was ~0.46 milliseconds. The total time spent on software occlusion culling was ~1.25 milliseconds.

| SSE | No Optimizations | Multithreading + Frustum Culling | Multithreading + Frustum Culling + Depth Test culling |
| --- | --- | --- | --- |
| Frame rate (fps) | 25 | 60 | 190 |
| Frame time (ms) | 40 | 16.67 | 5.26 |
| # of draw calls | 23279 | 7360 | 1831 |
For details on compiler optimizations, see our Optimization Notice.