Virtual reality (VR) is becoming increasingly popular as hardware advances following Moore’s Law make this new kind of experience technically feasible. While VR gives users a highly immersive experience, it also puts a significantly greater computing workload on both the CPU and GPU than traditional applications do, because of its dual-screen rendering, low latency, high resolution, and high frame rate requirements. As a result, performance issues are especially critical in VR applications: a non-optimized VR experience with an insufficient frame rate and high latency can cause nausea. In this article, we introduce a general methodology to profile, analyze, and tackle bottlenecks and hotspots in a PC-based VR application, regardless of the underlying engine or VR runtime. We use a PC VR game from Tencent* called Pangu* as an example to showcase the analysis flow.
Before digging into the details of the analysis, we want to explain why the CPU plays an important role in VR and how it affects VR performance. Figure 1 shows the rendering pipeline in conventional games, where the CPU and GPU work in parallel in order to maximize hardware utilization. However, this scheme cannot be applied to VR, because VR requires a low and stable rendering latency that the conventional pipeline cannot provide.
Take Figure 1 as an example. The rendering latency of Frame N+2 is much longer than normal because the GPU has to finish the workload of Frame N+1 before it starts working on Frame N+2, which adds significant latency to Frame N+2. In addition, the rendering latency varies across Frame N, Frame N+1, and Frame N+2 because of different execution circumstances, which is also undesirable in VR because it can induce simulation sickness.
Figure 1: The rendering pipeline in conventional games.
As a result, the rendering pipeline in VR is changed to the one shown in Figure 2 in order to achieve the shortest possible latency for each frame. In Figure 2, the CPU/GPU parallelism is intentionally broken, trading efficiency for a low and stable rendering latency for each frame. In this case, the CPU can become a bottleneck in VR because the GPU has to wait for the CPU to finish its pre-rendering jobs (draw call preparation, initialization of dynamic shadowing, occlusion culling, and so on). Optimizing the CPU work helps reduce the GPU bubbles and improves performance.
Figure 2: The rendering pipeline in VR games.
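The difference between the two pipelines can be illustrated with a toy timing model. The sketch below is only an illustration under simplifying assumptions: the function names and the per-frame CPU and GPU costs are hypothetical, the pipelined model lets the CPU run ahead without any limit on queued frames, and the serialized model starts each frame only after the previous one has finished on the GPU.

```python
def pipelined_latencies(cpu_ms, gpu_ms):
    """Per-frame latency when the CPU runs ahead of the GPU (Figure 1 style).

    Assumption: the CPU starts the next frame as soon as it finishes the
    current one, and the GPU processes frames in order.
    """
    cpu_free = 0.0  # time at which the CPU can start the next frame
    gpu_free = 0.0  # time at which the GPU can start the next frame
    latencies = []
    for c, g in zip(cpu_ms, gpu_ms):
        cpu_start = cpu_free
        cpu_end = cpu_start + c
        gpu_start = max(cpu_end, gpu_free)  # wait for CPU work and a free GPU
        gpu_end = gpu_start + g
        latencies.append(gpu_end - cpu_start)
        cpu_free = cpu_end
        gpu_free = gpu_end
    return latencies


def serialized_latencies(cpu_ms, gpu_ms):
    """Per-frame latency when CPU and GPU work is serialized (Figure 2 style)."""
    return [c + g for c, g in zip(cpu_ms, gpu_ms)]


# Hypothetical costs: 5 ms of CPU work and 10 ms of GPU work per frame.
cpu = [5.0, 5.0, 5.0]
gpu = [10.0, 10.0, 10.0]
print(pipelined_latencies(cpu, gpu))   # [15.0, 20.0, 25.0]
print(serialized_latencies(cpu, gpu))  # [15.0, 15.0, 15.0]
```

In this toy model the pipelined latency grows from frame to frame because each frame waits behind the GPU work of the previous one, while the serialized latency stays constant at the cost of idle GPU time during the CPU phase.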
Pangu* is a PC-based VR title from Tencent*. It is a DirectX* 11 FPS VR game developed with Unreal Engine* 4 that supports both Oculus Rift* and HTC Vive*. We worked with Tencent* to improve the performance and user experience of the game in order to achieve a best-in-class gaming experience on Intel® Core™ i7 processors. During the development work outlined in this article, the frame rate improved from 36.4 frames per second (fps) on Oculus Rift* DK2 (1920x1080) in early testing to 71.4 fps on HTC Vive* (2160x1200) at the time of writing. Here are the engines and VR runtimes used at the start and end of the development work:
Different VR runtimes were used during development because Pangu was initially developed on Oculus Rift DK2, at a time when neither Oculus Rift CV1 nor HTC Vive had been released. Pangu was then migrated to HTC Vive once the device was officially released. We evaluated the change of VR runtime and found that it did not make a significant difference in performance, since both the Oculus and SteamVR runtimes use the VR rendering pipeline shown in Figure 2, and rendering performance is mainly determined by the game engine in this situation. Figure 5 and Figure 14 confirm this: both runtimes insert GPU work (for the distortion pass) after the GPU rendering of each frame, and it consumes only a small proportion of time compared with the rendering itself.
The screenshots below show the game before and after the optimization work. Note that the number of draw calls was reduced by about 5X after optimization, and the average GPU execution time per frame was reduced from 15.1 ms to 9.6 ms in order to meet the 90 fps requirement on HTC Vive*, as seen in Figures 12 and 13:
Figure 3: Screenshots of the game before (left) and after (right) optimization.
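To see why a GPU execution time of 9.6 ms fits the 90 fps requirement while 15.1 ms does not, it helps to convert the target frame rate into a per-frame time budget. The following is a minimal sketch of that arithmetic; the function names are our own and are not part of any tool used in this article.

```python
def frame_budget_ms(target_fps):
    """Time available per frame, in milliseconds, at the target frame rate."""
    return 1000.0 / target_fps


def fits_budget(gpu_ms, target_fps):
    """True if the GPU time per frame fits within the frame budget."""
    return gpu_ms <= frame_budget_ms(target_fps)


print(round(frame_budget_ms(90), 2))  # 11.11
print(fits_budget(15.1, 90))          # False: before optimization
print(fits_budget(9.6, 90))           # True: after optimization
```

Note that fitting the GPU budget is necessary but not sufficient: any time the GPU spends waiting for the CPU also counts against the same frame budget.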
The specifications of the test platform:
In order to better understand the potential performance issues of Pangu*, we first collected the basic performance metrics of the game, shown in Table 1. All the data in this table were collected using various tools, including GPU-Z, TypePerf, and Unreal Frontend. Comparing the data with the system at idle, several observations can be made:
|Metric||System Idle||Pangu* on Oculus Rift* DK2 (before optimization)|
|GPU Core Clock (MHz)||135||1337.6|
|GPU Memory Clock (MHz)||162||1749.6|
|GPU Memory Used (MB)||184||1727.71|
|GPU Load (%)||0||49.64|
|Average Frame Rate (fps)||N/A||36.4|
|Draw Calls (/frame)||0||4437|
|Processor(_Total)\Processor Time (%)||1.04 (5.73/0.93/0.49/0.29/0.7/0.37/0.24/0.2)||13.58 (30.20/10.54/26.72/3.76/12.72/8.16/12.27/4.29)|
|Processor Information(_Total)\Processor Frequency (MHz)||800||2700|
Table 1: Basic performance metrics of the game before optimization.
In the following sections, we use GPUView and Windows Performance Analyzer (WPA) from the Windows Assessment and Deployment Kit (ADK) to profile and analyze the bottlenecks in the VR workload.
GPUView is a tool that can be used to investigate the performance interaction between graphics applications, CPU threads, the graphics driver, the Windows graphics kernel, and related components. Its timeline view can also show whether an application is CPU bound or GPU bound. WPA, on the other hand, is an analysis tool that creates graphs and data tables of Event Tracing for Windows (ETW) events. It has a flexible UI that can be pivoted to view call stacks, CPU hotspots, context switches, and so on, and it can be used to explore the root cause of performance issues. Both GPUView and WPA can analyze the event trace log (ETL) file captured by Windows Performance Recorder (WPR). WPR can be run from the user interface (UI) or from the command line, and it has built-in profiles that select the events to be recorded.
For a VR application, it is best to first determine whether the application is bound by the CPU, the GPU, or both. We can then focus our optimization efforts on the most critical performance bottlenecks and achieve as much performance gain as possible with minimum effort.
Figure 4 shows the timeline view of Pangu* in GPUView before optimization, including the GPU work queue, the CPU context queues, and the CPU threads. Several conclusions can be drawn from the chart:
Figure 4: A timeline view of Pangu* in GPUView.
Preliminary recommendations for improving the frame rate and GPU utilization:
In order to take a deeper look into the bottleneck, we can use WPA to explore the same ETL file analyzed with GPUView. WPA can also be used to identify CPU hotspots in terms of CPU utilization or context switches; readers who are interested in this topic can refer to  for more details. Here we introduce the main methodology for CPU bottleneck analysis and optimization.
Look at a single frame of the VR workload that has performance issues. Since the present packet is submitted to the GPU once per frame after rendering, the timing between two succeeding present packets is the period of a single frame, as shown in Figure 5 (26.78 ms, which is equivalent to 37.34 fps).
Figure 5: A timeline view of Pangu* in GPUView for a single frame. Note the CPU threads that lead to GPU bubble.
Note that there are GPU bubbles in the GPU work queue (for example, 7.37 ms at the beginning of a frame). They are caused by CPU-bound threads in the VR workload, as marked by the red rectangle: CPU tasks such as draw call preparation and culling must finish before GPU commands can be submitted for rendering.
If we use WPA to look at the CPU-bound periods shown in GPUView, we can find the key CPU hotspots that keep the GPU from executing. Figures 6–11 show the utilization and the call stacks of the CPU threads in WPA over the same time period shown in GPUView.
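The frame period and the share of a frame lost to GPU bubbles can be computed from timestamps read off the timeline. The sketch below assumes the timestamps have already been extracted by hand or by some other means; the function names and the GPU busy intervals are hypothetical, and only the 26.78 ms frame period and the 7.37 ms bubble come from Figure 5.

```python
def frame_periods_ms(present_times_ms):
    """Periods between consecutive present packets, in milliseconds."""
    return [b - a for a, b in zip(present_times_ms, present_times_ms[1:])]


def fps_from_period(period_ms):
    """Frame rate equivalent to a given frame period."""
    return 1000.0 / period_ms


def gpu_idle_gaps_ms(frame_start, frame_end, busy_intervals):
    """Idle gaps (bubbles) in the GPU queue within one frame.

    busy_intervals is a list of (start, end) times during which the GPU
    was executing work, all inside [frame_start, frame_end].
    """
    gaps = []
    cursor = frame_start
    for start, end in sorted(busy_intervals):
        if start > cursor:
            gaps.append(start - cursor)
        cursor = max(cursor, end)
    if frame_end > cursor:
        gaps.append(frame_end - cursor)
    return gaps


# Two present packets 26.78 ms apart, as in Figure 5.
period = frame_periods_ms([0.0, 26.78])[0]
print(round(fps_from_period(period), 2))  # 37.34

# Hypothetical busy intervals: the GPU starts only after a 7.37 ms bubble.
gaps = gpu_idle_gaps_ms(0.0, 26.78, [(7.37, 20.0), (21.0, 26.78)])
print(gaps)                               # [7.37, 1.0]
print(round(100 * gaps[0] / period, 1))   # 27.5 (percent of the frame)
```

Under these numbers, the 7.37 ms bubble at the start of the frame alone accounts for more than a quarter of the frame period.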
Figure 6: A timeline view of Pangu* in WPA with the same period as Figure 5.
Let’s look at the bottleneck of each CPU thread.
Figure 7: The call stack of the render thread T1864.
As seen in the call stack, the top three bottlenecks in the render thread are
These bottlenecks are caused by too many draw calls, state changes, and shadow map rendering in the render thread. Some suggestions to optimize the render thread performance:
Figure 8: The call stack of the game thread T8292.
For the game thread, the top three bottlenecks are
These bottlenecks can be optimized by reducing the number of view ports and the CPU-side overhead of parallel animation evaluation. Use single-threaded processing instead if only a small number of animation nodes are used, and examine how mouse control is handled on the CPU side.
Task threads (T8288, T4672, T8308):
Figure 9: The call stack of the task thread T8288.
Figure 10: The call stack of the task thread T4672.
Figure 11: The call stack of the task thread T8308.
For the task threads, bottlenecks are mostly located in physics-related simulations such as cloth simulation, animation evaluation, and particle system update.
Table 2 shows a summary of the CPU hotspots (percent of clockticks) during GPU bubble periods.
|Thread||Hotspot||Clockticks||Thread total|
|Render thread||Base pass rendering for static meshes||13.1%||22.1%|
|Render thread||Initialization of dynamic shadows||4.5%|| |
|Render thread||Compute view visibility||4.5%|| |
|Game thread||Set up pre-requirements for parallel processing of animation evaluation||7.7%||16.7%|
|Game thread||Redraw view ports||4.5%|| |
|Game thread||Process Mouse Move Event||4.5%|| |
Table 2: CPU hotspots during GPU bubble periods before optimization.
After some of these optimizations were implemented, including Level of Detail (LOD), instanced stereo rendering, dynamic shadow removal, deferred CPU tasks, and optimized physics, the frame rate increased from 36.4 fps on Oculus Rift* DK2 (1920x1080) to 71.4 fps on HTC Vive* (2160x1200). The GPU utilization also increased from 54.7 percent to 74.3 percent because of fewer CPU bottlenecks.
Figures 12 and 13 show the GPU utilization of Pangu* before and after optimization, respectively, as seen from the GPU work queue.
Figure 12: The GPU utilization of Pangu* before optimization.
Figure 13: The GPU utilization of Pangu* after optimization.
Figure 14: A timeline view of Pangu* in GPUView after optimization.
Figure 14 shows the Pangu* VR workload in GPUView after optimization. The CPU bottleneck period decreased from 7.37 ms to 2.62 ms, which was achieved by the following optimizations:
Figure 15 shows the call stack of the CPU render thread in the CPU bottleneck period, as marked by the red rectangle in Figure 14.
Figure 15: The call stack of the render thread T10404.
Table 3 shows a summary of the CPU hotspots (percent of clockticks) during GPU bubble periods after optimization. Note that many of the hotspots and threads were removed from the CPU bottleneck as compared to Table 2.
|Thread||Hotspot||Clockticks||Thread total|
|Render thread||Base pass rendering for static meshes||44.3%||52.2%|
Table 3: CPU hotspots during GPU bubble periods after optimization.
More optimizations, such as actor merging or using fewer materials, can be done to optimize the static mesh rendering in the render thread and further improve the frame rate. If CPU tasks were fully optimized, the processing time of a single frame could be further reduced by 2.62 ms (the period of CPU bottleneck in a single frame) to 11.38 ms, which is equivalent to 87.8 fps on average.
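The projection above follows directly from the measured frame rate and the remaining CPU bottleneck period. Here is a minimal sketch of the arithmetic; the function name is our own, and the model simply assumes that the whole bottleneck period could be removed from every frame.

```python
def projected_fps(current_fps, removable_ms):
    """Frame rate if removable_ms could be cut from every frame."""
    frame_ms = 1000.0 / current_fps
    if removable_ms >= frame_ms:
        raise ValueError("cannot remove more time than the frame takes")
    return 1000.0 / (frame_ms - removable_ms)


current_frame_ms = 1000.0 / 71.4
print(round(current_frame_ms, 1))            # 14.0 ms per frame today
print(round(current_frame_ms - 2.62, 1))     # 11.4 ms after removing the bottleneck
print(round(projected_fps(71.4, 2.62), 1))   # 87.8 fps
```

Even in this best case the projected frame rate stays slightly below 90 fps, so some GPU-side savings would still be needed to reach the target.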
Table 4 shows the performance metrics before and after the optimization.
|Metric||System Idle||Pangu* on Oculus Rift* DK2 (before optimization)||Pangu* on HTC Vive* (after optimization)|
|GPU Core Clock (MHz)||135||1337.6||1316.8|
|GPU Memory Clock (MHz)||162||1749.6||1749.6|
|GPU Memory Used (MB)||184||1727.71||2253.03|
|GPU Load (%)||0||49.64||78.29|
|Average Frame Rate (fps)||N/A||36.4||71.4|
|Draw Calls (/frame)||0||4437||845|
|Processor(_Total)\Processor Time (%)||1.04 (5.73/0.93/0.49/0.29/0.7/0.37/0.24/0.2)||13.58 (30.20/10.54/26.72/3.76/12.72/8.16/12.27/4.29)||31.37 (46.63/27.72/33.34/18.42/39.77/19.04/46.29/19.76)|
|Processor Information(_Total)\Processor Frequency (MHz)||800||2700||2700|
Table 4: Basic performance metrics of the game before and after optimization.
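The headline changes in Table 4 are easier to compare as ratios. The sketch below derives them from the frame rates, draw call counts, and display resolutions quoted in this article; the dictionary layout and function name are our own.

```python
before = {"fps": 36.4, "draw_calls": 4437, "width": 1920, "height": 1080}
after = {"fps": 71.4, "draw_calls": 845, "width": 2160, "height": 1200}


def improvement_ratios(before, after):
    """Ratios between the before and after measurements."""
    pixels_before = before["width"] * before["height"]
    pixels_after = after["width"] * after["height"]
    return {
        "fps_speedup": after["fps"] / before["fps"],
        "draw_call_reduction": before["draw_calls"] / after["draw_calls"],
        "pixel_ratio": pixels_after / pixels_before,
    }


for name, value in improvement_ratios(before, after).items():
    print(name, round(value, 2))
# fps_speedup 1.96
# draw_call_reduction 5.25
# pixel_ratio 1.25
```

In other words, the frame rate nearly doubled with roughly one fifth of the draw calls, even though the HTC Vive* display has 25 percent more pixels than the Oculus Rift* DK2.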
In this article, we worked closely with Tencent* to profile and optimize the Pangu* VR workload on premier HMDs, with the goal of achieving 90 fps on Intel® Core™ i7 processors. After some of our recommendations were implemented, the frame rate increased from 36.4 fps on Oculus Rift* DK2 (1920x1080) to 71.4 fps on HTC Vive* (2160x1200), and the average GPU utilization increased from 54.7 percent to 74.3 percent because of fewer CPU bottlenecks. The CPU-bound period in a single frame was also reduced from 7.37 ms to 2.62 ms. Additional optimizations, such as actor merging and texture atlasing, could further improve performance.
Profiling and analyzing a VR application with various tools gives insight into the behavior and bottlenecks of the application. This is essential to VR performance optimization, since performance metrics alone might not reflect the real bottlenecks. The methodology and tools discussed in this article can be used to analyze VR applications developed with different game engines and VR runtimes, and to determine whether a workload is bound by the CPU, the GPU, or both. Sometimes the CPU has a larger impact on VR performance than the GPU because of draw call preparation, physics simulation, lighting, or shadowing. After analyzing various VR workloads with performance issues, we found that many of them were CPU bound, which implies that CPU optimization can improve the GPU utilization, performance, and user experience of these applications.
Finn Wong is a senior application engineer in the Intel Software and Solutions Group (SSG), Developer Relations Division (DRD), Advanced Graphics Enabling Team (AGE Team). He joined Intel in 2012 and has been actively enabling third-party media, graphics, and perceptual computing applications for the company’s PC products since then. Before joining Intel, Finn had seven years of experience in video coding, digital image processing, computer vision, algorithms, and performance optimization, and published several academic papers. Finn holds a bachelor's degree in electrical engineering and a master's degree in communication engineering, both from National Taiwan University.