Onloaded Shadows: Moving Shadow Map Generation from the GPU to the CPU

Download Onloaded Shadows: Moving Shadow Map Generation from the GPU to the CPU [PDF 525KB]
Visit Onloaded Shadows Page (source, binaries, videos)

Introduction

In many games the sun moves very slowly, so shadow maps for static objects do not have to be generated every frame. Instead, they can be generated asynchronously from frame rendering, only a few times per second or even less often. Although the workload could be split across multiple frames on the GPU, that work would still be synchronous with rendering. The CPU can perform this workload asynchronously with Microsoft's* Windows Advanced Rasterization Platform* (WARP*) ¹ software rasterizer. Onloaded Shadows uses WARP to generate shadows asynchronously, so the copy of shadow data from the CPU to the GPU is the only synchronous work needed. Onloaded Shadows distributes the overhead of that copy across several frames so that it has as little impact as possible.


Figure 1: Screenshot of the application with Onloaded Shadows technique.

This sample uses WARP to rasterize the shadow map on the CPU. By default, WARP uses all available cores on a system, which stalls the main thread due to contention. The only other setting that the WARP device supports is running on a single core. Because stalls on the main thread are unacceptable, WARP is run on a single core. As a result, Onloaded Shadows uses only two threads.

Shadow Map Algorithm

The original Cascaded Shadow Maps algorithm isn't suitable for this sample because it's not view-invariant. A significant advantage of Cascaded Shadow Maps is that it only renders a shadow map for the areas directly intersected by the view frustum. Because the view frustum may rotate and move very quickly, the frustum can intersect areas not covered by the cascades before a new shadow map is generated by the onloaded pass. The solution used in Onloaded Shadows is to center the cascades on the view camera, which yields lower quality shadow maps but still keeps some of the advantages of the cascaded shadow map technique. This method also requires the view camera to move slowly relative to the scene; otherwise, the camera may see areas only covered by a lower quality cascade up close before a new shadow map can be generated. Exponential variance shadow maps (EVSM) are implemented for visual fidelity.²
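The idea of centering the cascades on the view camera can be sketched as follows. This is a minimal illustration, not the sample's code: `CascadeBounds`, `centerCascadesOnCamera`, `selectCascade`, and the half-extent values are hypothetical, and the camera position is assumed to be already transformed into the light's view plane.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Square region in the light's view plane covered by one cascade.
struct CascadeBounds {
    float minX, maxX, minY, maxY;
};

// Hypothetical helper: build nested cascade regions centered on the camera.
// halfExtents must be ordered from the nearest (smallest, highest quality)
// cascade to the farthest (largest, lowest quality) cascade.
std::vector<CascadeBounds> centerCascadesOnCamera(float camX, float camY,
                                                  const std::vector<float>& halfExtents) {
    std::vector<CascadeBounds> cascades;
    cascades.reserve(halfExtents.size());
    for (std::size_t i = 0; i < halfExtents.size(); ++i) {
        assert(halfExtents[i] > 0.0f);
        assert(i == 0 || halfExtents[i] > halfExtents[i - 1]);
        const float h = halfExtents[i];
        cascades.push_back({camX - h, camX + h, camY - h, camY + h});
    }
    return cascades;
}

// Returns the index of the highest-quality cascade containing the point,
// or -1 if no cascade covers it.
int selectCascade(const std::vector<CascadeBounds>& cascades, float x, float y) {
    for (std::size_t i = 0; i < cascades.size(); ++i) {
        const CascadeBounds& c = cascades[i];
        if (x >= c.minX && x <= c.maxX && y >= c.minY && y <= c.maxY) {
            return static_cast<int>(i);
        }
    }
    return -1;
}
```

Because the regions depend only on the camera position and not on its orientation, rotating the view does not invalidate the shadow map. Moving the camera does, which is why the technique assumes the camera moves slowly relative to the scene.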


Figure 2: Screenshot of the sample from the light's view with cascades visualized.

The nearest cascade is generated every frame entirely on the GPU to allow dynamic shadows for nearby objects. The sample allows none, some, or all of the cascades to be run every frame on the GPU, depending on how close to the viewer dynamic shadows are needed. Each cascade run this way adds significant overhead. As demonstrated in the sample, processing the nearest cascade every frame allows the shadows of dynamic objects to appear when they are near the camera.

Technique Overview

The main thread renders the scene using shadow map data stored on the GPU, while the WARP thread generates the shadow map asynchronously. The WARP thread copies the shadow map to a staging buffer and maps it to a subresource, because although all resources are on the CPU, the WARP device can't copy resources directly from a buffer that is a render target. The GPU then updates its shadow buffer with the mapped subresource. The new camera data is utilized once a copy is complete, and then the WARP thread is signaled to once again begin shadow map generation.

Alternatively, asynchronous shadows can be faked by generating shadows synchronously on the GPU once every set number of frames. A GPU technique that produces the same results can then be compared with the Onloaded Shadows technique. Performance is compared directly by measuring the frame time spike that occurs during either the subresource copy (for Onloaded Shadows) or the synchronous shadow processing (for the GPU technique).

However, a significant frame time spike occurring every few seconds would cause a noticeable stall and would be a disadvantage for any game or product that uses this technique. Therefore, the work done during the synchronous frame is broken down into pieces small enough to cause as little impact as possible. This is called the Distributed Stall optimization. For the Onloaded Shadows technique, the synchronous copy can easily be subdivided down to a single byte. For the GPU technique, because the work is not homogeneous, breaking apart the shadow processing work is significantly more complicated. For this sample, the work was divided per cascade, and further divided between the original shadow draw and the various post processing passes on the shadow buffer. Because the updates are performed across many frames, two shadow buffers are used on the GPU: one is copied to while the other's data is used for rendering, and the roles of the two buffers are swapped once the update completes. If only a single shadow buffer were used, noticeable artifacts would appear while the copy is in progress.
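The Distributed Stall copy and the two-buffer swap can be modeled in a few lines. The sketch below is an assumption-laden illustration rather than the sample's implementation: plain byte vectors stand in for the mapped staging subresource and the two GPU shadow buffers, and `DistributedShadowCopy`, `begin`, `tick`, and `bytesPerFrame` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <utility>
#include <vector>

// Hypothetical model: copies a new shadow map into a back buffer a few
// bytes per frame, then swaps it with the buffer used for rendering.
class DistributedShadowCopy {
public:
    DistributedShadowCopy(std::size_t bufferBytes, std::size_t bytesPerFrame)
        : front_(bufferBytes), back_(bufferBytes), bytesPerFrame_(bytesPerFrame) {
        assert(bufferBytes > 0 && bytesPerFrame > 0);
    }

    // Start copying a freshly generated shadow map. The caller must keep
    // `staging` alive and unmodified until tick() returns true.
    void begin(const std::vector<std::uint8_t>& staging) {
        assert(staging.size() == back_.size());
        source_ = &staging;
        offset_ = 0;
    }

    // Call once per rendered frame. Copies at most bytesPerFrame bytes and
    // returns true on the frame where the copy finishes and the buffers swap.
    bool tick() {
        if (source_ == nullptr) return false;
        const std::size_t n = std::min(bytesPerFrame_, back_.size() - offset_);
        std::memcpy(back_.data() + offset_, source_->data() + offset_, n);
        offset_ += n;
        if (offset_ < back_.size()) return false;
        std::swap(front_, back_);  // the new data becomes visible all at once
        source_ = nullptr;
        return true;
    }

    // Buffer the renderer samples from; never partially updated.
    const std::vector<std::uint8_t>& visible() const { return front_; }

private:
    std::vector<std::uint8_t> front_;
    std::vector<std::uint8_t> back_;
    std::size_t bytesPerFrame_;
    const std::vector<std::uint8_t>* source_ = nullptr;
    std::size_t offset_ = 0;
};
```

In this model the per-frame cost is bounded by `bytesPerFrame`, and the frame on which `tick()` returns true is the natural point to apply the new camera data and signal the WARP thread to start the next shadow map.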

Performance

The default viewport at 1024x768 resolution was tested for performance. In all scenarios, a 2000ms asynchronous update time was used, with the differing variables being the buffer sizes, the number of cascades, and the number of cascades running synchronously every frame on the GPU. For example, 1408x4+1 means that 4 cascades at 1408x1408 resolution were used, and 1 cascade was processed synchronously every frame on the GPU.
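To make the scenario notation concrete, the following sketch parses a label such as 1408x4+1. `ShadowConfig`, `parseConfigLabel`, and `onloadedTexels` are hypothetical names, and the sketch assumes that the cascades processed on the GPU every frame are counted among the total, so the remainder are the ones generated by the onloaded pass.

```cpp
#include <cassert>
#include <cstdio>

// One test scenario, e.g. "1408x4+1".
struct ShadowConfig {
    int resolution;   // width and height of each square cascade buffer
    int cascades;     // total number of cascades
    int gpuCascades;  // cascades processed synchronously on the GPU each frame
};

// Parses "<resolution>x<cascades>+<gpuCascades>". Returns false on bad input.
bool parseConfigLabel(const char* label, ShadowConfig& out) {
    int res = 0, casc = 0, gpu = 0;
    if (std::sscanf(label, "%dx%d+%d", &res, &casc, &gpu) != 3) return false;
    if (res <= 0 || casc <= 0 || gpu < 0 || gpu > casc) return false;
    out = {res, casc, gpu};
    return true;
}

// Texels in the cascades left to the asynchronous pass (an assumption: the
// per-frame GPU cascades are excluded from the copy).
long long onloadedTexels(const ShadowConfig& c) {
    return 1LL * c.resolution * c.resolution * (c.cascades - c.gpuCascades);
}
```

Under that assumption, 1408x4+1 leaves three 1408x1408 cascades, or 5,947,392 texels, for the asynchronous pass. The number of bytes copied depends on the shadow buffer format, which is not stated here.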

Four machines were tested to compare the Onloaded Shadows technique with the fake asynchronous GPU technique. The machines labeled 'SNB GT1' and 'SNB GT2' were Intel® microarchitecture code name Sandy Bridge-based machines, with 2.2 GHz processors and 4 GB of RAM. The machine labeled 'FX 770M' had an NVIDIA Quadro* FX 770M and dual 2.8 GHz processors with 4 GB of RAM. The machine labeled 'HD 5870' had an ATI Radeon* HD 5870 discrete graphics card and dual 3.20 GHz processors with 3 GB of RAM. All machines ran Microsoft Windows* 7.

The data collected is the frame time spike caused by the stall when the synchronous work is done. In the case of the Distributed Stall, the frame time spike is spread out across a number of frames, so the reported value is the maximum of these spikes. The reported values do not include the standard frame time without the spikes; this baseline is the same for all data points, so it is subtracted from every data point. This is why the reported frame time can approach zero.
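The measurement described above amounts to taking the worst frame during the synchronous work and subtracting the baseline frame time. A minimal sketch, with the hypothetical name `frameSpikeMs` and a clamp to zero added as an assumption to absorb measurement noise:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Reported spike: the slowest frame observed while the synchronous work
// (copy or GPU shadow pass) was running, minus the normal frame time.
double frameSpikeMs(const std::vector<double>& frameTimesMs, double baselineMs) {
    assert(!frameTimesMs.empty());
    const double worst = *std::max_element(frameTimesMs.begin(), frameTimesMs.end());
    // Clamp so that noise below the baseline is reported as no spike.
    return std::max(0.0, worst - baselineMs);
}
```

For example, with illustrative frame times of 16.0, 18.5, and 17.0 ms against a 16.0 ms baseline, the reported spike would be 2.5 ms.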

On the Intel® microarchitecture code name Sandy Bridge-based machines, the primary target of the Onloaded Shadows algorithm, the frame spike was 2 to 4 times lower than with the GPU technique. With the Distributed Stall optimization, the GPU technique's frame spike was significantly lower but still noticeable, while with the Onloaded Shadows technique the overhead was effectively eliminated. On Intel's processor graphics, Onloaded Shadows is the fastest technique for handling asynchronously calculated shadows, with or without the Distributed Stall optimization.

On the NVIDIA Quadro* FX 770M, the Onloaded Shadows technique incurs more overhead than the GPU technique. This is expected as the data has to be transferred from the CPU to a discrete graphics card, which is significantly slower than when they are on the same die. However, when using the distributed stall technique, Onloaded Shadows is faster in every scenario, and once again approaches negligible frame times.

On the ATI Radeon* HD 5870, the GPU technique is much faster than Onloaded Shadows in every respect. The GPU is much faster at processing the shadow maps, and the overhead of copying data from the CPU to a discrete graphics card remains. This demonstrates that onloading is not a viable technique for high-end discrete graphics cards.

Onloaded Shadows has another advantage that is not apparent from the data above: its cost is much more consistent than that of the GPU technique. Because the synchronous work is simply a copy operation, the overhead of that copy does not change as long as the buffer sizes and cascade counts stay the same. With the GPU technique, however, the GPU draws to a shadow buffer and applies various post processing passes to that buffer, so its speed depends heavily on the camera position and the current scene complexity. Because these can vary widely over the course of a video game, the frame time spike can also vary widely and is much harder to distribute evenly with the Distributed Stall optimization.

Should the architecture of WARP change at any point in the future to allow multithreading without stalling the main thread, this sample could take advantage of that functionality. Testing WARP with all of the threads enabled, ignoring the stalls, on an Intel® Core™ i7 processor suggests that shadow map generation on the CPU could potentially become about 3 times faster, going from roughly 600 ms to 200 ms.

Future Work

There are various additions that could improve Onloaded Shadows. First of all, WARP is not a software rasterizer designed to be as fast as possible. Rather, it was meant to emulate the behavior of graphics hardware as much as possible. This yields less than optimal results, and a properly optimized software rasterizer could be used instead of WARP for performance benefits.

Another potential issue is fast-moving view frustums. When the view frustum moves fast enough, the view may see a low-quality cascade up close for a short time. Frustum path prediction can solve this for certain scenarios such as precomputed fly-throughs.

Instead of generating the shadows as soon as possible, shadow generation can be done selectively. Shadows can be regenerated only when the light has moved a certain amount or when the camera has moved enough to warrant it. This frees CPU time for other work.
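One way such a policy could look is sketched below. It is not part of the sample: `Vec3`, `UpdatePolicy`, `shouldRegenerate`, and both thresholds are hypothetical, and the light directions are assumed to be unit length so that their dot product is the cosine of the angle between them.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 {
    float x, y, z;
};

inline float dot(const Vec3& a, const Vec3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Hypothetical thresholds for triggering a new shadow map.
struct UpdatePolicy {
    float minLightCos;    // cosine of the largest tolerated light rotation
    float maxCameraMove;  // largest tolerated camera displacement
};

// Returns true if the light has rotated or the camera has moved far enough
// since the last shadow map was generated. Light directions must be unit
// vectors.
bool shouldRegenerate(const Vec3& lightNow, const Vec3& lightAtLastUpdate,
                      const Vec3& camNow, const Vec3& camAtLastUpdate,
                      const UpdatePolicy& policy) {
    const bool lightMoved = dot(lightNow, lightAtLastUpdate) < policy.minLightCos;
    const Vec3 d{camNow.x - camAtLastUpdate.x,
                 camNow.y - camAtLastUpdate.y,
                 camNow.z - camAtLastUpdate.z};
    // Compare squared lengths to avoid a square root.
    const bool cameraMoved = dot(d, d) > policy.maxCameraMove * policy.maxCameraMove;
    return lightMoved || cameraMoved;
}
```

The main thread could evaluate this check before signaling the WARP thread, so that the CPU core stays idle while neither the sun nor the camera has moved meaningfully.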

Finally, the Onloading technique could be expanded to other areas where graphics work does not have to be done per frame and instead only has to be done every few seconds. Potential techniques along these lines include Onloaded Environment Maps, Onloaded Lightmaps, and Onloaded Global Illumination.

Conclusion

Shadow map generation can have a significant effect on the speed at which a GPU is able to render a scene. In certain scenarios, shadow map generation can be done asynchronously instead of every frame to enhance performance. Onloaded Shadows uses WARP to run the shadow map generation asynchronously on the CPU. Testing shows that onloading shadow map generation onto the CPU produces significantly smaller frame time spikes than the comparable GPU technique on machines with Intel integrated graphics. This technique is also viable for certain kinds of discrete graphics cards when properly optimized.

About the Authors

Zane Mankowski is a software engineer intern in the Intel Visual Computing Software Division. He is currently pursuing a Bachelor's degree in Computer Science at Rochester Institute of Technology.

Steve Smith, Doug Binks, and Jeff Andrews also contributed to this sample.

References / Resources

¹Glaister, Andy. Windows Advanced Rasterization Platform (WARP) In-Depth Guide. MSDN. 11/08

²Tuft, David. Cascaded Shadow Maps. MSDN. 06/10
