This is part 2 of a tutorial to help developers improve the performance of their games in Unreal Engine* 4 (UE4). In this tutorial, we go over a collection of tools to use within and outside of the engine, as well as some best practices for the editor and scripting, to help increase the frame rate and stability of a project.
Deferred is the standard rendering method used by UE4. While it typically looks the best, there are important performance implications to understand, especially for VR games and lower end hardware. Switching to Forward Rendering may be beneficial in these cases.
For more detail on the effect of using Forward Rendering, see the Epic documentation.
If we look at the Reflection scene from Epic’s Marketplace, we can see some of the visual differences between Deferred and Forward Rendering.
Figure 13: Reflection scene with Deferred Rendering
Figure 14: Reflection scene with Forward Rendering
While Forward Rendering comes with a loss of visual fidelity from reflections, lighting, and shadows, the remainder of the scene remains visually unchanged and the performance increase may be worth the trade-off.
If we look at a frame capture of the scene using Deferred Rendering in the Intel GPA Frame Analyzer tool, we see that the scene is running at 103.6 ms (9 fps), with a large share of the frame time taken by lighting and reflections.
Figure 15: Capture of the Reflection scene using Deferred Rendering on Intel® HD Graphics 530
When we look at the Forward Rendering capture, we see that the scene’s frame time has improved from 103.6 ms to 44.0 ms, a reduction of roughly 58 percent (about 2.35 times faster). Most of the remaining time is taken up by the base pass and post processing, both of which can be optimized further.
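The frame times quoted for the two captures can be turned into frame rates and a relative gain with a little arithmetic. The sketch below is only a worked calculation, not engine code, and the function names are made up for illustration.

```python
def frames_per_second(frame_ms: float) -> float:
    """Convert a frame time in milliseconds to frames per second."""
    return 1000.0 / frame_ms


def frame_time_reduction_percent(before_ms: float, after_ms: float) -> float:
    """Percentage of frame time saved when going from before_ms to after_ms."""
    return (before_ms - after_ms) / before_ms * 100.0


def speedup_factor(before_ms: float, after_ms: float) -> float:
    """How many times faster a frame is produced after the change."""
    return before_ms / after_ms


deferred_ms = 103.6  # Deferred Rendering capture
forward_ms = 44.0    # Forward Rendering capture

print(f"Deferred: {frames_per_second(deferred_ms):.1f} fps")
print(f"Forward:  {frames_per_second(forward_ms):.1f} fps")
print(f"Frame time reduced by "
      f"{frame_time_reduction_percent(deferred_ms, forward_ms):.1f} percent")
print(f"Speedup: {speedup_factor(deferred_ms, forward_ms):.2f}x")
```

With these inputs the deferred capture works out to just under 10 fps and the forward capture to just under 23 fps.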
Figure 16: Capture of the Reflection scene using Forward Rendering on Intel® HD Graphics 530
Static Meshes within UE4 can have thousands, even hundreds of thousands of triangles in their mesh to show all the smallest details a 3D artist could want to put into their work. However, when a player is far away from that model they won’t see any of that detail, even though the engine is still rendering all those triangles. To solve this problem and optimize our game we can use Levels of Detail (LOD) to have that detail up close, while also showing a less intensive model at a distance.
In a standard pipeline, LODs are created by the 3D modeler during the creation of that model. While this method allows for the most control over the final appearance, UE4 now includes a great tool for generating LODs.
To auto-generate Static Mesh LODs, go into that model’s details tab. On the LOD Settings panel, select the Number of LODs you would like to have.
Figure 17: Creating auto-generated levels of detail.
Clicking Apply Changes signals the engine to generate the LODs and number them, with LOD0 as the original model. In the example below, we see that the LOD generation of 5 takes our Static Mesh from 568 triangles to 28, a substantial saving for the GPU.
Figure 18: Triangle and vertex count, and the screen size setting for each level of detail.
When we place our LOD mesh in the scene, we can see the mesh change the further away it is from the camera.
Figure 19: Visual demonstration of level of detail based on screen size.
Another feature of LODs is that each one can have its own material, allowing us to further reduce the cost of our Static Mesh.
Figure 20: Material instances applied to each level of detail.
For example, the use of normal maps has become standard in the industry. However, in VR there is a problem; normal maps aren’t ideal up close as the player can see that it’s just a flat surface.
A way to solve this issue is with LODs. By having the LOD0 Static Mesh detailed to the point where bolts and screws are modeled on, the player gets a more immersive experience when examining it up close. Because all the details are modeled on, the cost of applying a normal map can be avoided on this level. When the player is further away from the mesh and it switches LODs, a normal map can then be swapped in while also reducing the detail on the model. As the player gets even further away and the mesh gets smaller, the normal map can again be removed, as it becomes too small to see.
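The idea of switching both geometry and material by distance can be sketched as a lookup on screen size. This is a minimal standalone illustration, not how the engine implements it: the `Lod` class, the screen-size thresholds, and the middle triangle count are all hypothetical, and only the 568 and 28 triangle figures come from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lod:
    index: int
    screen_size: float     # used once the mesh covers no more than this fraction of the screen
    triangles: int
    uses_normal_map: bool  # whether this level's material samples a normal map


# Ordered from most to least detailed; thresholds are illustrative only.
LODS = [
    Lod(0, 1.00, 568, False),  # up close: detail is modeled, no normal map needed
    Lod(1, 0.50, 284, True),   # mid range: fewer triangles, normal map fakes the detail
    Lod(2, 0.10, 28, False),   # far away: too small on screen for a normal map to matter
]


def select_lod(lods, screen_size):
    """Pick the least detailed LOD whose threshold still covers screen_size."""
    chosen = lods[0]
    for lod in lods:
        if screen_size <= lod.screen_size:
            chosen = lod
    return chosen


for size in (0.8, 0.3, 0.05):
    lod = select_lod(LODS, size)
    print(size, "-> LOD", lod.index, lod.triangles, "tris, normal map:", lod.uses_normal_map)
```

Keeping the material choice next to the triangle budget makes it easy to see which levels pay for a normal map and which do not.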
Every time anything is brought into the scene it corresponds to an additional draw call to the graphics hardware. When this is a static mesh in a level, it applies to every copy of that mesh. One way to optimize this, if the same static mesh is repeated several times in a level, is to instance the static meshes to reduce the number of draw calls made.
For example, here we have two spheres of 200 octahedron meshes; one set in green, and the other in blue.
Figure 21: Sphere of static and instanced static meshes.
The green set of meshes are all standard static meshes, meaning that each has its own collection of draw calls.
Figure 22: Draw calls from 200 static mesh spheres in scene (Max 569).
The blue set of meshes is a single instanced static mesh, meaning that all of its copies share a single collection of draw calls.
Figure 23: Draw calls from 200 instanced static mesh spheres in scene (Max 143).
Looking at the GPU Visualizer for both, the Base Pass duration for the green (static) sphere is 4.30 ms and the blue (instanced) sphere renders in 3.11 ms; a duration optimization of ~27 percent in this scene.
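Why instancing cuts draw calls can be seen in a toy model. The sketch below assumes, as a deliberate simplification, one draw call per separately placed mesh and one per unique mesh and material pair when instanced. Real captures, such as the 569 and 143 maximums in the figures, include other passes and scene objects, so the actual counts differ. The function names and the `placements` data are hypothetical.

```python
def draw_calls_separate(placements):
    """Every placed copy is submitted to the GPU on its own."""
    return len(placements)


def draw_calls_instanced(placements):
    """Copies sharing a mesh and material are submitted together."""
    return len(set(placements))


# 200 copies of the same octahedron mesh with the same material.
placements = [("octahedron", "blue")] * 200

print("separate: ", draw_calls_separate(placements))
print("instanced:", draw_calls_instanced(placements))
```

The saving only appears when the copies really are identical; a second material or mesh adds another batch.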
One thing to know about instanced static meshes is that if any part of the mesh is rendered, the whole of the collection is rendered. This wastes potential throughput if any part is drawn off camera. It’s recommended to keep a single set of instanced meshes in a smaller area; for example, a pile of stone or trash bags, a stack of boxes, and distant modular buildings.
Figure 24: Instanced Mesh Sphere still rendering when mostly out of sight.
If collections of static meshes that have LODs are used, consider a Hierarchical Instanced Static Mesh.
Figure 25: Sphere of Hierarchical Instanced Meshes with Level Of Detail.
Like a standard instanced mesh, hierarchical instances reduce the number of draw calls made by the meshes, but the hierarchical instance also uses the LOD information of its meshes.
Figure 26: Up close to that sphere of Hierarchical Instanced Meshes with Level Of Detail.
In UE4, occlusion culling is a system where objects not visible to the player are not rendered. This helps to reduce the performance requirements of a game as you don’t have to draw every object in every level for every frame.
Figure 27: Spread of Octahedrons.
To see the occluded objects with their green bounding boxes, you can enter r.VisualizeOccludedPrimitives 1 (0 to turn off) into the editor’s console.
Figure 28: Viewing the bounds of occluded meshes with r.VisualizeOccludedPrimitives 1
Whether or not a mesh is drawn is determined by its bounding box. Because of this, some objects are drawn even though they are not visible to the player, because their bounding box is visible to the camera.
Figure 29: Viewing bounds in the mesh’s details window.
If a mesh needs to be rendered before a player sees it (for additional streaming time, or to let an idle animation render before being seen, for example), the size of the bounding box can be increased under Static Mesh Settings > Positive Bounds Extension and Negative Bounds Extension in the mesh’s settings window.
Figure 30: Setting the scale of the mesh’s bounds.
As the bounding box of a complex mesh or shape always extends to the edges of that mesh, empty space inside the bounds will cause the mesh to be rendered more often. It is important to think about how mesh bounding boxes will affect the performance of the scene.
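The effect of extending a mesh’s bounds can be sketched with axis-aligned boxes. This is a simplified standalone model, not engine code: the `Box` class is hypothetical, the camera’s visible region is approximated as a box rather than a frustum, and it assumes the positive extension grows the maximum corner while the negative extension grows the minimum corner.

```python
from dataclasses import dataclass
from typing import Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class Box:
    lo: Vec3  # minimum corner
    hi: Vec3  # maximum corner

    def extended(self, positive: Vec3, negative: Vec3) -> "Box":
        """Grow the box: 'positive' pushes hi outward, 'negative' pushes lo outward."""
        lo = tuple(a - n for a, n in zip(self.lo, negative))
        hi = tuple(b + p for b, p in zip(self.hi, positive))
        return Box(lo, hi)

    def overlaps(self, other: "Box") -> bool:
        """True when the two boxes intersect on every axis."""
        return all(
            a_lo <= b_hi and b_lo <= a_hi
            for a_lo, a_hi, b_lo, b_hi in zip(self.lo, self.hi, other.lo, other.hi)
        )


visible_region = Box((0.0, 0.0, 0.0), (100.0, 100.0, 100.0))
mesh_bounds = Box((110.0, 0.0, 0.0), (130.0, 20.0, 20.0))

# Just outside the visible region, so the mesh would be skipped.
print(mesh_bounds.overlaps(visible_region))

# Extending the bounds by 20 units on the negative X side brings it into view early.
early = mesh_bounds.extended(positive=(0.0, 0.0, 0.0), negative=(20.0, 0.0, 0.0))
print(early.overlaps(visible_region))
```

The same test explains the cost of oversized bounds: the larger the box, the more often it overlaps what the camera can see, whether or not any geometry is actually visible.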
For a thought experiment on 3D model design and importing into UE4, let’s think about how a set piece, a colosseum-style arena, could be made.
Imagine we have a player standing in the center of our arena floor, looking around our massive colosseum, about to face down his opponents. When the player is rotating the camera around, the direction and angle of the camera will define what the game engine is rendering. Since this area is a set piece for our game it is highly detailed, but to save on draw calls we need to make it out of solid pieces. First, we are going to discard the idea of the arena being one solid piece. In this case, the number of triangles that have to be drawn equals the entire arena because it’s all drawn as a single object, in view or not. How can the model be improved to bring it into the game?
It depends. There are a few things that will affect our decision. First is how the slices can be cut, and second is how those slices will affect their bounding boxes for occlusion culling. For this example, let’s say the player is using a camera angle of 90 degrees, to make the visuals easier.
If we look at a pizza-style cut, we can create eight identical slices to be wheeled around a zero point to make our whole arena. While this method is simple, it is far from efficient for occlusion, as there are a lot of overlapping bounding boxes. If the player is standing in the center and looking around, their camera will always cross three or four bounds, resulting in half the arena being drawn most of the time. In the worst case, with a player standing back to the inner wall and looking across the arena, all eight pieces will be rendered, granting no optimization.
Next, if we take the tic-tac-toe cut, we create nine slices. This method is not quite orthodox, but has the advantage that there are no overlapping bounding boxes. As with the pizza cut, a player standing in the center of the arena will always cross three or four bounds. However, in the worst case of the player standing up against the inner wall, they will be rendering six of the nine pieces, giving an optimization over the pizza cut.
As a final example, let’s make an apple core cut (a single center piece and eight wall slices). This method is the most common approach to this thought experiment and, with little overlap, a good way to build out the model. When the player is standing in the center they will be crossing five or six bounds, but unlike the other two cuts, the worst case for this cut is also five or six pieces rendered out of nine.
Figure 31: Thought experiment showing how a large model can be cut up, and how that affects bounding boxes and their overlap.
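The difference in bounding box overlap between the pizza cut and the tic-tac-toe cut can be checked with a small 2D calculation. The sketch below is a top-down simplification under stated assumptions: the arena is a circle of hypothetical radius 100, the eight pizza slices are aligned to the axes with their tips at the center, and every helper name is made up for illustration.

```python
import math
from itertools import combinations

EPS = 1e-9  # tolerance so boxes that only touch along an edge do not count


def wedge_bounds(start_deg, end_deg, radius):
    """Axis-aligned bounds (xmin, ymin, xmax, ymax) of a slice with its tip at the origin."""
    angles = [start_deg, end_deg]
    # The arc reaches furthest along any axis it crosses, so include those angles.
    angles += [a for a in range(0, 361, 90) if start_deg < a < end_deg]
    xs = [0.0] + [radius * math.cos(math.radians(a)) for a in angles]
    ys = [0.0] + [radius * math.sin(math.radians(a)) for a in angles]
    return (min(xs), min(ys), max(xs), max(ys))


def grid_bounds(radius):
    """Bounds of the nine cells of a 3x3 grid covering the arena."""
    step = 2.0 * radius / 3.0
    boxes = []
    for row in range(3):
        for col in range(3):
            x0 = -radius + col * step
            y0 = -radius + row * step
            boxes.append((x0, y0, x0 + step, y0 + step))
    return boxes


def overlap_area(a, b):
    """Area shared by two boxes, or 0.0 if they are apart or merely touching."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return w * h if w > EPS and h > EPS else 0.0


def overlapping_pairs(boxes):
    """Number of box pairs that share a non-zero area."""
    return sum(1 for a, b in combinations(boxes, 2) if overlap_area(a, b) > 0.0)


RADIUS = 100.0
pizza = [wedge_bounds(45 * k, 45 * (k + 1), RADIUS) for k in range(8)]
grid = grid_bounds(RADIUS)

print("pizza cut overlapping pairs:", overlapping_pairs(pizza))
print("tic-tac-toe overlapping pairs:", overlapping_pairs(grid))
```

With this layout the pizza cut reports four overlapping pairs (the two slices in each quadrant) and the grid reports none. Slices that are not aligned to the axes can overlap more of their neighbors.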
Dynamic Shadow Cascades bring a high level of detail to your game, but they can be expensive and require a powerful gaming PC to run without a loss of frame rate.
Fortunately, as the name suggests, these shadows are created dynamically every frame, so they can be adjusted in game to let players tune them to their preferences.
Cost of Dynamic Shadow Cascades using Intel® HD Graphics 530
The level of Dynamic Shadow Cascades can be dynamically controlled in several ways: