Software Occlusion Culling Update 2

Published: 09/06/2013   Last Updated: 09/06/2013

Note: This version is obsolete but is being retained for historical purposes. Please check out the latest version: Software Occlusion Culling

This update adds two more rasterizer optimizations. They were implemented by Fabian Giesen and have been integrated into the sample. Together, the two optimizations improve the frame rate by ~13% and reduce the total cull time by ~27%.


Coarse depth pretest:

After rasterizing the occluders to the depth buffer, we create a coarse depth buffer. To create it, we summarize each 8x8 block of depth values by its most distant Z value (the smallest z value in this case). Then, for each AABB, we loop over all 8x8 pixel blocks covered by the box. If the nearest Z value of the whole box is behind the farthest Z value of every covered block in the coarse (summary) depth buffer, the box cannot possibly be visible. We mark such AABBs as occluded and skip rasterizing and depth testing them.
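The pretest described above can be sketched in scalar C++ as follows. This is a minimal illustration and not the sample's SSE implementation: the names `CoarseDepth`, `BuildCoarseDepth`, `ScreenRect`, and `IsBoxOccludedCoarse` are hypothetical, the depth buffer is assumed to be a row-major array of floats whose width and height are multiples of 8, and the box's screen rectangle is assumed to be already clamped to the buffer. As in the text, a smaller z value means farther from the camera.

```cpp
#include <algorithm>
#include <vector>

constexpr int kBlockSize = 8;  // each coarse entry summarizes an 8x8 pixel block

// Hypothetical coarse buffer: one value per 8x8 block.
struct CoarseDepth {
    int blocksX = 0;
    int blocksY = 0;
    std::vector<float> farthestZ;  // smallest z in the block = most distant depth
};

// Assumption: width and height are multiples of kBlockSize.
CoarseDepth BuildCoarseDepth(const std::vector<float>& depth, int width, int height) {
    CoarseDepth coarse;
    coarse.blocksX = width / kBlockSize;
    coarse.blocksY = height / kBlockSize;
    coarse.farthestZ.resize(coarse.blocksX * coarse.blocksY);

    for (int by = 0; by < coarse.blocksY; ++by) {
        for (int bx = 0; bx < coarse.blocksX; ++bx) {
            float farthest = depth[(by * kBlockSize) * width + bx * kBlockSize];
            for (int y = 0; y < kBlockSize; ++y) {
                for (int x = 0; x < kBlockSize; ++x) {
                    int px = bx * kBlockSize + x;
                    int py = by * kBlockSize + y;
                    farthest = std::min(farthest, depth[py * width + px]);
                }
            }
            coarse.farthestZ[by * coarse.blocksX + bx] = farthest;
        }
    }
    return coarse;
}

// Inclusive pixel bounds of the AABB on screen (assumed clamped to the buffer).
struct ScreenRect {
    int minX, minY, maxX, maxY;
};

// Returns true only when the box is certainly hidden. A false result means
// "unknown": the box still needs the full rasterize-and-depth-test path.
bool IsBoxOccludedCoarse(const CoarseDepth& coarse, const ScreenRect& rect, float boxNearestZ) {
    for (int by = rect.minY / kBlockSize; by <= rect.maxY / kBlockSize; ++by) {
        for (int bx = rect.minX / kBlockSize; bx <= rect.maxX / kBlockSize; ++bx) {
            // The box's nearest point is not behind this block's farthest depth,
            // so the block cannot prove the box is hidden.
            if (boxNearestZ >= coarse.farthestZ[by * coarse.blocksX + bx]) {
                return false;
            }
        }
    }
    return true;
}
```

The test is conservative by construction: it can only reject boxes, so any box it does not reject falls through to the existing per-pixel depth test.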

Binners store triangle data:

In the previous version, during the triangle binning pass, each bin stored only the IDs of the triangles that belonged to it. This caused an extra gather during the rasterization pass, because each triangle had to be looked up by its ID. To avoid that gather, the bins in this update store the triangle data directly.
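The difference between the two bin layouts can be illustrated with a simplified sketch. Everything here is hypothetical: `TriangleData`, `IdBin`, `DataBin`, and `ForEachTriangle` are stand-in names, the triangle layout is invented for the example, and `std::vector` is used for brevity where the sample's actual bins may be laid out differently.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical per-triangle payload needed by the rasterizer.
struct TriangleData {
    float x[3];
    float y[3];
    float z[3];
};

// Previous scheme: a bin holds only IDs into a shared triangle array.
struct IdBin {
    std::vector<std::uint32_t> triangleIds;
};

// Updated scheme: a bin holds copies of the triangle data itself.
struct DataBin {
    std::vector<TriangleData> triangles;
};

// Rasterization stand-in for ID bins: every triangle costs an indirect
// lookup (a gather) into the shared array.
template <class Visit>
void ForEachTriangle(const IdBin& bin, const std::vector<TriangleData>& allTriangles, Visit visit) {
    for (std::uint32_t id : bin.triangleIds) {
        visit(allTriangles[id]);
    }
}

// Rasterization stand-in for data bins: triangles are read sequentially
// from the bin's own storage, with no lookup by ID.
template <class Visit>
void ForEachTriangle(const DataBin& bin, Visit visit) {
    for (const TriangleData& tri : bin.triangles) {
        visit(tri);
    }
}
```

The trade-off in this sketch is that a triangle overlapping several bins is copied into each of them, so binning writes more data in exchange for contiguous reads during rasterization.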


The performance of the Software Occlusion Culling sample was measured on a 2.3 GHz 4th gen Intel® Core™ processor (Haswell) system with 4 cores / 8 threads and Intel® HD Graphics GT3CW. We set the rasterizer technique to SSE, the occluder size threshold to 1.5, the occludee size threshold to 0.01, and the number of depth test tasks to 20. We enabled frustum culling and multi-tasking and disabled vsync.  
The castle scene has 115 occluder models and 48700 occluder triangles. It has 27025 occludee models (occluders are treated as occludees) and ~1.9 million occludee triangles.

The time taken to rasterize the occluders to the depth buffer on the CPU was ~0.79 milliseconds, and the time taken to depth test the occludees was ~0.46 milliseconds. The total time spent on software occlusion culling was ~1.25 milliseconds.

| | No Optimizations | Multithreading + Frustum Culling | Multithreading + Frustum Culling + Depth Test Culling |
|---|---|---|---|
| Frame rate (fps) | 25 | 60 | 190 |
| Frame time (ms) | 40 | 16.67 | 5.26 |
| # of draw calls | 23279 | 7360 | 1831 |

