Software Occlusion Culling Update 2

Note: This version is obsolete but is being retained for historical purposes. Please check out the latest version: Software Occlusion Culling

This update adds two more rasterizer optimizations. They were implemented by Fabian Giesen and have been integrated into the sample. Together, the two optimizations improve the frame rate by ~13% and reduce the total cull time by ~27%.


Coarse depth pretest:

After rasterizing the occluders to the depth buffer, we create a coarse depth buffer. To create it, we summarize each 8x8 block of depth values by its most distant Z value (the smallest z value in this case). Then, for each occludee AABB, we loop over all 8x8 pixel blocks covered by the box. If the nearest Z value of the whole box is behind the farthest Z value of every block it covers in the coarse depth buffer, then the box cannot be visible. We mark such AABBs as occluded and skip rasterizing and depth testing them.
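The pretest described above can be sketched in a few lines of C++. This is only an illustration under stated assumptions, not the sample's actual code: the names `CoarseDepth`, `ScreenBox`, `buildCoarseDepth`, and `isDefinitelyOccluded` are hypothetical, the depth buffer is assumed to be a flat array of floats whose dimensions are multiples of 8, and the box is assumed to be already projected and clamped to the screen. The sketch is scalar, whereas the sample uses SSE.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int kBlockSize = 8;  // each coarse entry summarizes an 8x8 pixel block

// Hypothetical container for the coarse buffer: one float per 8x8 block.
struct CoarseDepth {
    int blocksX = 0;
    int blocksY = 0;
    std::vector<float> farthestZ;
};

// Summarize a full-resolution depth buffer. Convention taken from the text:
// a smaller z is farther away, so the farthest value in a block is its minimum.
CoarseDepth buildCoarseDepth(const std::vector<float>& depth, int width, int height) {
    assert(width > 0 && height > 0);
    assert(width % kBlockSize == 0 && height % kBlockSize == 0);
    assert(depth.size() == static_cast<std::size_t>(width) * height);

    CoarseDepth coarse;
    coarse.blocksX = width / kBlockSize;
    coarse.blocksY = height / kBlockSize;
    coarse.farthestZ.resize(static_cast<std::size_t>(coarse.blocksX) * coarse.blocksY);

    for (int by = 0; by < coarse.blocksY; ++by) {
        for (int bx = 0; bx < coarse.blocksX; ++bx) {
            float minZ = depth[static_cast<std::size_t>(by * kBlockSize) * width + bx * kBlockSize];
            for (int y = 0; y < kBlockSize; ++y) {
                for (int x = 0; x < kBlockSize; ++x) {
                    const std::size_t i =
                        static_cast<std::size_t>(by * kBlockSize + y) * width + (bx * kBlockSize + x);
                    minZ = std::min(minZ, depth[i]);
                }
            }
            coarse.farthestZ[static_cast<std::size_t>(by) * coarse.blocksX + bx] = minZ;
        }
    }
    return coarse;
}

// Hypothetical screen-space summary of an occludee AABB.
struct ScreenBox {
    int minX, minY, maxX, maxY;  // inclusive pixel bounds, assumed clamped to the screen
    float nearestZ;              // largest z of the box under this convention
};

// Returns true only when the box is behind every coarse block it covers.
// A false result means "possibly visible": the full depth test is still needed.
bool isDefinitelyOccluded(const CoarseDepth& coarse, const ScreenBox& box) {
    assert(box.minX >= 0 && box.minY >= 0);
    assert(box.minX <= box.maxX && box.minY <= box.maxY);
    assert(box.maxX < coarse.blocksX * kBlockSize && box.maxY < coarse.blocksY * kBlockSize);

    for (int by = box.minY / kBlockSize; by <= box.maxY / kBlockSize; ++by) {
        for (int bx = box.minX / kBlockSize; bx <= box.maxX / kBlockSize; ++bx) {
            const float blockFarthest =
                coarse.farthestZ[static_cast<std::size_t>(by) * coarse.blocksX + bx];
            if (box.nearestZ >= blockFarthest) {
                return false;  // box may poke in front of something in this block
            }
        }
    }
    return true;
}
```

The test is conservative by construction: because each block keeps only its farthest depth, a box that fails the pretest is not necessarily visible and simply falls through to the regular rasterize-and-test path.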

Binners store triangle data:

In the previous version, during the triangle binning pass, each bin stored the IDs of the triangles that belonged to it. This caused an extra gather during the rasterization pass, because each triangle had to be looked up by its ID. To avoid that gather, in this update the bins store the triangle data directly.
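To make the difference between the two bin layouts concrete, here is a minimal C++ sketch. It is an assumption-laden illustration rather than the sample's code: the `Triangle`, `IdBin`, and `DataBin` types, the one-dimensional tiling by x coordinate, and the `nearestZ...` functions (which stand in for the rasterization pass) are all hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical screen-space triangle: three vertices with x, y and depth.
struct Triangle {
    float x[3];
    float y[3];
    float z[3];
};

// Previous layout: a bin holds triangle IDs that index a shared array.
struct IdBin {
    std::vector<std::uint32_t> triangleIds;
};

// Updated layout: a bin holds the triangle data itself.
struct DataBin {
    std::vector<Triangle> triangles;
};

// Copy each triangle into every tile its x-range overlaps. For brevity the
// tiles here are vertical strips of equal width (an assumption of this sketch).
std::vector<DataBin> binTriangles(const std::vector<Triangle>& tris, int numTiles, float tileWidth) {
    assert(numTiles > 0 && tileWidth > 0.0f);
    std::vector<DataBin> bins(numTiles);
    for (const Triangle& t : tris) {
        const float minX = std::min(t.x[0], std::min(t.x[1], t.x[2]));
        const float maxX = std::max(t.x[0], std::max(t.x[1], t.x[2]));
        const int first = std::max(0, std::min(numTiles - 1, static_cast<int>(minX / tileWidth)));
        const int last = std::max(0, std::min(numTiles - 1, static_cast<int>(maxX / tileWidth)));
        for (int tile = first; tile <= last; ++tile) {
            bins[tile].triangles.push_back(t);  // data is duplicated per overlapped tile
        }
    }
    return bins;
}

// Largest (nearest) depth of one triangle; larger z is nearer in this sketch.
float nearestZ(const Triangle& t) {
    return std::max(t.z[0], std::max(t.z[1], t.z[2]));
}

// Stand-in for rasterizing an ID bin: every triangle needs an indexed lookup.
float nearestZViaIds(const IdBin& bin, const std::vector<Triangle>& allTriangles) {
    float nearest = 0.0f;
    for (std::uint32_t id : bin.triangleIds) {
        assert(id < allTriangles.size());
        nearest = std::max(nearest, nearestZ(allTriangles[id]));  // the extra gather
    }
    return nearest;
}

// Stand-in for rasterizing a data bin: a linear walk over contiguous memory.
float nearestZDirect(const DataBin& bin) {
    float nearest = 0.0f;
    for (const Triangle& t : bin.triangles) {
        nearest = std::max(nearest, nearestZ(t));
    }
    return nearest;
}
```

Both consumers compute the same result; the difference is the memory access pattern. The trade-off in a layout like this one is that a triangle overlapping several tiles is copied into each of their bins, so the binning pass writes more data in exchange for contiguous reads during rasterization.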


The performance of the Software Occlusion Culling sample was measured on a 2.3 GHz 4th gen Intel® Core™ processor (Haswell) system with 4 cores / 8 threads and Intel® HD Graphics GT3CW. We set the rasterizer technique to SSE, the occluder size threshold to 1.5, the occludee size threshold to 0.01, and the number of depth test tasks to 20. We enabled frustum culling and multi-tasking and disabled vsync.
The castle scene has 115 occluder models and 48700 occluder triangles. It has 27025 occludee models (occluders are treated as occludees) and ~1.9 million occludee triangles.

The time taken to rasterize the occluders to the depth buffer on the CPU was ~0.79 milliseconds, and the time taken to depth test the occludees was ~0.46 milliseconds. The total time spent on software occlusion culling was therefore ~1.25 milliseconds.

Configuration                                            Frame rate (fps)   Frame time (ms)   # of draw calls
No Optimizations                                         25                 40                23279
Multithreading + Frustum Culling                         60                 16.67             7360
Multithreading + Frustum Culling + Depth Test Culling    190                5.26              1831


Mark D.: The archive with the sample is corrupted. Could you please reupload it?

