This guide introduces virtual reality (VR) developers to rudimentary analysis and optimization techniques they can apply to their projects. It is not intended as a comprehensive reference or replacement for existing guides but rather a quick-start tutorial on VR application performance analysis and optimization, with a sample of engineer-vetted recipes for achieving results quickly.
Use it as a primer on:
While currently focused on Windows* tools and methods, many of the techniques apply to tuning Linux* and macOS* VR applications. This guide will be updated frequently to reflect our evolving knowledge in the rapidly changing world of VR development. We welcome your suggestions and feedback.
The increasing complexity of VR applications requires sophisticated performance analysis. No single tool will render a complete picture, so you must use multiple tools and methods to gain a comprehensive understanding of an application’s performance characteristics. We chose to cover the tools we frequently use and have found to be the most useful in various VR software developer kits (SDKs) and development frameworks. However, many more tools are available for the type of analysis we cover in this guide.
Figure 1. Continuum of analysis.
Figure 1 illustrates the various performance-analysis levels across the software stack. The least time is spent on high-level analysis at the system level; the most time is spent on low-level analysis at the microarchitecture, or uArch, level. In other words, the time spent grows with the complexity of the data collected and the effort required to analyze it. Note that any tool may lead you to the cause of a bottleneck, and it is not always necessary to work through all the analysis levels.
At the system level, general metrics are needed to properly aim and drive optimization decisions. With tools such as Windows TypePerf (see the VR Application Tuning Recipes section), you can quickly characterize the application load on a system, which immediately helps identify obvious issues. Table 1 highlights a few metrics and rules of thumb to follow.
If you’re concerned about power consumption, various tools measure the power used throughout the workload. Intel® Power Gadget monitors the energy model-specific registers (MSR) reported by the CPU, computes the watts of power consumed by the CPU, and then reports the measurements in a CSV file. The tool also reports CPU utilization, frequency, and temperature, as well as GPU metrics for parts from Intel. (To determine where your application draws power, line up your workload markers with the timestamps reported in the CSV file.)
After you have analyzed the system and identified possible bottlenecks, capture the OS stack information to further examine threads, modules, and functions (should symbols be available). Windows Performance Analyzer (WPA), a tool developed by Microsoft, post-processes samples captured by Windows Performance Recorder (WPR). WPA displays the captured data in graphs and tables through a graphical user interface, and presents data such as system activity, computation, storage, memory usage, and complete call stacks (if symbols are available).
For the CPU, when call stacks aren’t deep enough to identify and fix bottlenecks, use Intel® VTune™ Amplifier to identify serial and parallel bottlenecks. Analysis from the VTune Amplifier helps you study algorithm choices, and understand where and how your application can benefit from available hardware resources. VTune Amplifier also provides a view into the assembly code by disassembling the target application. The tool presents hardware performance-counter data in graphs and tables, and can cross-link into source code (if symbols are available) and assembly. Using VTune Amplifier can also help reveal deep microarchitectural issues that might be limiting your application’s performance when scaled across threads or run as part of a larger stack.
As for the GPU, it’s sometimes necessary to understand the performance of the GPU at a low level to identify bottlenecks. Use the Intel® Graphics Performance Analyzers (Intel® GPA) to study the GPU and determine if your application is GPU-bound or CPU-bound. Going deeper, Intel GPA can also identify hotspots in the graphics pipeline and provide draw-call analysis at the frame level, allowing you to study a particular frame and identify areas to optimize.
TypePerf, built into Windows, can be executed from the command prompt. It samples at 1Hz, and provides enough granularity to identify potential issues with minimal overhead. It can also be run throughout the duration of most workloads.
A typical command looks like this:
typeperf -cf typeperfinput.txt -o workload_perfmon.csv
The -cf switch takes a text file that lists all the desired metrics. See the Appendix for the list we use for data collection. The -o switch names the output file. Users typically open the resulting CSV file in a spreadsheet application to graph its values and identify system issues.
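As an alternative to a spreadsheet, the CSV file can be summarized with a short script. The following is a minimal sketch, assuming the usual typeperf layout: a header row, a timestamp in the first column, and one counter per remaining column. The function name `summarize_typeperf` and the sample rows are illustrative and not part of any tool.

```python
import csv
import io
from statistics import mean


def summarize_typeperf(csv_text):
    """Return {counter_name: (average, maximum)} for each counter column.

    Assumes the first column is a timestamp and every other column holds
    one counter, which is how typeperf normally lays out its CSV output.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, samples = rows[0], rows[1:]
    summary = {}
    for col in range(1, len(header)):
        values = []
        for row in samples:
            try:
                values.append(float(row[col]))
            except (ValueError, IndexError):
                # Blank or missing cells (common in the first sample of
                # rate counters) are skipped rather than treated as zero.
                continue
        if values:
            summary[header[col]] = (mean(values), max(values))
    return summary


# Illustrative data only; real files carry full counter paths in the header.
SAMPLE = '''"Time","Processor(_Total)\\% Processor Time","System\\Context Switches/sec"
"01/01/2018 10:00:00.000","58.0"," "
"01/01/2018 10:00:01.000","62.0","19000.0"
"01/01/2018 10:00:02.000","63.0","20000.0"
'''

if __name__ == "__main__":
    for name, (avg, peak) in summarize_typeperf(SAMPLE).items():
        print(f"{name}: avg={avg:.2f} max={peak:.2f}")
```

To use it on a real capture, read the output file into a string and pass it in. Depending on the system locale, the file encoding may need to be specified when opening it.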
|Processor(_Total)\% Processor Time||60.88|
|Processor(_Total)\% User Time||38.17|
|Processor(_Total)\% Privileged Time||22.71|
|Processor(_Total)\% Interrupt Time||0.30|
|System\Context Switches/sec||19559.96||Context switches are high|
|System\System Calls/sec||1015040.67||System calls are extremely high!|
|Processor(_Total)\Interrupts/sec||19417.73||Interrupts are high at 20k/s|
|Memory\Demand Zero Faults/sec||21291.03|
Figure 2. Typical chart of TypePerf metrics
Interpreting this data quickly takes time to master. Table 2 below contains our guidelines for examining the core set of TypePerf metrics.
Table 2. Core TypePerf metrics
|Context Switches||Whenever a logical core switches from executing one thread to another||Less than 10k/s per active running thread|
|System Calls||Calls to the operating system service||Less than 50k/s per active running thread|
|Page Faults||When a thread refers to a page that is not in the OS current working set||Less than 50k/s per active thread (if majority are soft faults)|
|Hardware Interrupts||Devices interrupt the processor when they have completed a task or require attention||Less than 6-7k|
|Processor Queue||Threads in the queue ready to be executed||More than 1 indicates a bottleneck|
|Average Disk Queue||Average number of both read and write requests that were queued for the selected disk during the sample interval||More than 1 means you are partially gated on disk I/O; more than 2 means you are completely gated by disk I/O|
|Processor Frequency||Self-explanatory||No knowledge of turbo state (P-state)|
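The guidelines in Table 2 can be applied mechanically once the counters have been averaged. The sketch below is one possible way to do that, not a definitive rule set. The metric keys are shortened, hypothetical labels rather than full counter paths, the interrupt limit uses the upper end of the 6-7k guideline, and the number of active threads has to be supplied by the caller.

```python
# Rules of thumb from Table 2. Per-thread limits are multiplied by the number
# of active threads, which the caller must supply.
PER_THREAD_LIMITS = {
    "Context Switches/sec": 10_000,
    "System Calls/sec": 50_000,
    "Page Faults/sec": 50_000,   # guideline assumes mostly soft faults
}
ABSOLUTE_LIMITS = {
    "Interrupts/sec": 7_000,     # upper end of the 6-7k guideline
    "Processor Queue Length": 1,
    "Avg. Disk Queue Length": 1,
}


def flag_metrics(averages, active_threads=1):
    """Return the sorted names of metrics whose average exceeds its guideline."""
    flagged = []
    for name, value in averages.items():
        if name in PER_THREAD_LIMITS:
            limit = PER_THREAD_LIMITS[name] * active_threads
        elif name in ABSOLUTE_LIMITS:
            limit = ABSOLUTE_LIMITS[name]
        else:
            continue  # no guideline for this metric
        if value > limit:
            flagged.append(name)
    return sorted(flagged)


# Values taken from the sample capture shown earlier in this section.
SAMPLE_AVERAGES = {
    "Context Switches/sec": 19559.96,
    "System Calls/sec": 1015040.67,
    "Interrupts/sec": 19417.73,
}

if __name__ == "__main__":
    print(flag_metrics(SAMPLE_AVERAGES, active_threads=1))
```

With a single active thread, all three sample metrics exceed their guidelines. With four active threads, the context-switch rate falls within its scaled limit while system calls and interrupts remain flagged.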
WPA (figure 3) makes use of Windows Performance Recorder (WPR), a tool based on event tracing for Windows (ETW). WPR records system events that can be analyzed using WPA. WPR records the data during testing, and WPA displays the captured data (CPU, GPU, etc.). WPA can be obtained free of charge through the Windows Assessment and Deployment Kit (Windows ADK).
Figure 3. Windows* performance analyzer displays CPU and GPU usage data
WPR can also be run from the command line through the log.cmd file, installed in the gpuview folder within the Windows Performance Toolkit folder. Call log.cmd from the command prompt to start collection, and call it again to stop. After collection stops, a set of traces is created in the gpuview folder; merged.etl is the file of interest and can be opened in WPA for analysis. To install WPR and WPA, get the Windows* Assessment and Deployment Kit (Windows ADK).
VTune Amplifier 2018 (figure 4) allows you to find serial and parallel code bottlenecks, and speed execution. Use this tool to analyze algorithm choices, and understand where and how your application can benefit from available hardware resources. Download a trial version of VTune Amplifier.
Figure 4. Intel® VTune™ Amplifier 2018 lets you identify and analyze code bottlenecks to optimize performance on modern CPUs
Use PresentMon to trace ETW events related to swap chain presentation on Windows. It can capture and analyze key performance metrics for graphics applications (for example, CPU and display frame durations and latencies). It works across all graphics APIs and supports UWP applications.
PresentMon is an open-source tool developed by Intel and available on GitHub*.
Monitor power usage on Intel® Core™ processors with Intel Power Gadget (figure 5). The tool runs on Windows and Mac OS X* and includes an application, driver, and libraries to monitor and estimate real-time processor package power information.
Figure 5. Intel® Power Gadget uses the integrated energy counters in the processor to monitor an application’s power consumption
To generate a CSV log file with Intel Power Gadget, click the “Start Log” button. A red “Rec” flashes to indicate that logging has started. At the end of your workload, click “Stop Log” to complete the data capture. By default, the log is saved in your Documents folder.
Download Intel Power Gadget.
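Once a log has been captured, the power samples still need to be lined up with the phases of the workload. The sketch below shows one way to do that, assuming the log has already been reduced to pairs of elapsed seconds and watts and that the test harness recorded its own phase markers. The function name, the marker format, and the sample numbers are all hypothetical; no Power Gadget column names are assumed.

```python
from statistics import mean


def average_power_by_phase(samples, markers):
    """Average the power samples that fall inside each workload phase.

    samples: list of (elapsed_seconds, watts) pairs taken from the log.
    markers: list of (phase_name, start_seconds, end_seconds) recorded by
             the test harness; a phase covers start <= t < end.
    """
    result = {}
    for name, start, end in markers:
        watts = [w for t, w in samples if start <= t < end]
        if watts:  # phases with no samples are left out of the result
            result[name] = mean(watts)
    return result


# Made-up numbers for illustration only.
SAMPLES = [(0.0, 5.0), (1.0, 7.0), (2.0, 15.0), (3.0, 17.0)]
MARKERS = [("menu", 0.0, 2.0), ("gameplay", 2.0, 4.0)]

if __name__ == "__main__":
    for phase, watts in average_power_by_phase(SAMPLES, MARKERS).items():
        print(f"{phase}: {watts:.1f} W average")
```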
Detect bottlenecks at the frame level and apply real-time experiments on frames with Intel Graphics Performance Analyzers (Intel GPA, figure 6). Download Intel Graphics Performance Analyzers.
Figure 6. Intel® Graphics Performance Analyzers lets you detect and mitigate bottlenecks at the frame level
When optimizing VR applications, use the key metrics in the "Key performance metrics" section to gauge initial performance and track the impact of changes at each level.
Frame rate is defined as the number of frames rendered each second, commonly expressed in frames per second (fps). The higher the fps, the better.
Late Stage Reprojection (LSR) applies to Windows Mixed Reality (WMR) applications only. Capture LSR to determine whether an image can be re-projected before being rendered to the headset. This helps prevent motion sickness. The higher the LSR, the better.
Frame time is the total time needed to render a frame. For a target frame rate of 90 fps, the work must be completed in 11.1 ms; for 60 fps, in 16.7 ms. The lower the frame time, the better.
To determine the budget of time required for a certain fps, use the calculation:
1000 / (Target fps) = X ms
For example: 1000 / 90 fps = 11.1 ms
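The same calculation can be written as a small helper, which is handy when a test script needs the budget for several target frame rates. This is a minimal sketch; the function name `frame_budget_ms` is illustrative.

```python
def frame_budget_ms(target_fps):
    """Time available to render one frame, in milliseconds."""
    if target_fps <= 0:
        raise ValueError("target_fps must be positive")
    return 1000.0 / target_fps


if __name__ == "__main__":
    for fps in (60, 90):
        print(f"{fps} fps -> {frame_budget_ms(fps):.1f} ms per frame")
```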
CPU and GPU utilization is the total work handled by the processing units over a given period of time.
A firm understanding of the key performance metrics of a VR application, together with the maximum achievable performance of your target hardware, determines your performance ceiling. Calculate this ceiling for various metrics to put the actual measured performance in context. For example, if your frame rates and Late Stage Reprojection (LSR) values reach or exceed 98 percent of the performance ceiling set by the target platform and hardware specification, your application performs well. Put another way, with a maximum achievable refresh rate of 90 Hz, you would aim for an average of at least 88.2 (98 percent of 90) for both frame rate and LSR. The term Late Stage Reprojection was coined by Microsoft and is used for Windows Mixed Reality applications. Oculus* and Vive* applications have their own reprojection nomenclature and solutions. For the purposes of this guide, we refer to LSR for WMR applications to describe ideal performance.
Many of the major VR platforms offer different levels of performance (see table 3).
Microsoft, for example, defines two tiers of Mixed Reality PCs: Windows Mixed Reality Ultra PCs (WMR Ultra PCs) and Windows Mixed Reality PCs (WMR Mainstream PCs) – the key differences being their minimum hardware requirements and the maximum achievable headset frame rate. WMR Mainstream PCs support 60Hz and WMR Ultra PCs support 90Hz.
For other minimum hardware requirements, see the Oculus* Rift support page and the list of recommended specs for HTC Vive*.
|WMR Mainstream PCs||WMR Ultra PCs||Oculus Rift||HTC Vive & Vive* Pro|
Once you’re familiar with the key performance metrics and general indicators that your application runs as intended, start testing based on minimum specifications. Your goal is to deliver a positive user experience on each target platform.
Benchmarking and tuning will determine if there are major performance issues with your VR application. Figure 7 shows the general flow of the analysis and tuning process. The tools employed are shown at the top of the diagram.
Figure 7. Flowchart of VR application analysis and tuning
Use PresentMon to reveal how the application behaves, and then determine the need for further optimizations. Generate graphs with the PresentMon data (figure 7) to understand application behavior throughout your test. The graphical representation of this data lets you identify any instantaneous buffering and stuttering issues, as well as performance spikes. Keep in mind that taking an average of these metrics may not reveal specific issues and how they impact the user experience.
Notice that while the average frame rate is above 55 fps (figure 8), there are frequent spikes. The large spikes represent a significant number of frames dropping during that period, which translates into stuttering and a poor user experience.
Figure 8. Graph generated based on collected PresentMon data
Figure 9. Graph generation in Excel*
Determining performance issues becomes simple when looking at the App fps graph (figure 10). If the application performs in an ideal manner, the fps average will meet or exceed 98 percent of the max achievable fps, and the fps standard deviation will be low.
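The two checks described above (average fps against 98 percent of the maximum, and a low standard deviation) can also be computed directly from per-frame times. The sketch below assumes the frame times in milliseconds have already been extracted from the PresentMon capture; the column that holds them varies by PresentMon version, so it is not named here. The function name and the returned keys are illustrative.

```python
from statistics import pstdev


def summarize_frames(frame_times_ms, max_fps, ceiling_ratio=0.98):
    """Summarize a run from per-frame times in milliseconds."""
    if not frame_times_ms:
        raise ValueError("no frames to analyze")
    per_frame_fps = [1000.0 / t for t in frame_times_ms]
    # Average fps is frames divided by elapsed time, not the mean of the
    # per-frame rates, which would overweight the fast frames.
    avg_fps = 1000.0 * len(frame_times_ms) / sum(frame_times_ms)
    return {
        "avg_fps": avg_fps,
        "fps_stdev": pstdev(per_frame_fps),
        "worst_frame_ms": max(frame_times_ms),
        "meets_target": avg_fps >= ceiling_ratio * max_fps,
    }


if __name__ == "__main__":
    steady = [16.5] * 60                  # made-up, stable run
    spiky = [16.5] * 54 + [100.0] * 6     # made-up run with long frames
    for label, run in (("steady", steady), ("spiky", spiky)):
        s = summarize_frames(run, max_fps=60)
        print(f"{label}: avg={s['avg_fps']:.1f} fps, "
              f"stdev={s['fps_stdev']:.1f}, worst={s['worst_frame_ms']:.1f} ms, "
              f"meets target={s['meets_target']}")
```

Reporting the worst frame time alongside the average is a deliberate choice: a handful of long frames can hide behind a healthy-looking average.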
Figure 10. An example of an ideal application where the max achievable fps is 60 fps with a stable 10s average
While the average fps could be in the high 50s, you may experience intermittent frame drops. Figure 11 shows:
(a) frames dropping every 90 seconds due to a dynamic quality setting change.
(b) frames dropping and then stabilizing after a few seconds, illustrating likely buffering issues.
Figure 11a. Frames dropping every 90 seconds
Figure 11b. Buffering issues that smooth over time
If your application misses the fps target on a consistent basis, use Intel GPA to investigate.
Capture frames using Intel GPA to continue analyzing your VR application. Identify the largest ergs. This helps reveal:
In VR applications, ergs may appear twice because they're rendered once per eye. With single-pass stereo rendering, each erg appears only once. We recommend using single-pass stereo rendering to lower CPU and GPU utilization. Read more about single-pass stereo rendering from Unity*.
Figure 12 shows a common scenario in which a media player menu is being rendered behind the video, which the user will never see. GPA is able to show the draw calls and the render target, confirming not only that the player was being drawn, but also that the user never sees it in the final output.
Figure 12. A media-player menu being drawn behind video
Figures 13a, 13b, and 13c reveal how Intel GPA can detect a static RenderScale that's been set too high for the target hardware. We noticed in this instance that the RenderScale had been set to 1.3 and then downscaled to the target size. Changing the RenderScale back to 1.0 showed a significant performance increase.
In fact, on lower-end hardware, dynamically lowering the RenderScale to 0.7 or slightly higher can result in minimal quality degradation (depending on the use-case) while being able to maintain a stable fps, which is more important in VR. Watch this video to learn more about fighting VR sickness.
Figure 13a. Resolution goes from 1664x1664 to 1280x1280 between render targets, which indicated that the RenderScale must have been higher than 1.0
PresentMon data shows the increased performance after reverting the RenderScale to 1.0.
Figure 13b. With a RenderScale of 1.3, frame-rates average 48.9 fps
Figure 13c. After adjusting RenderScale to 1.0, frame-rates average 58.6 fps
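The numbers in figures 13a through 13c can be checked with a few lines of arithmetic. The sketch below derives the per-axis scale from the two render-target sizes, the resulting increase in shaded pixels, and the relative fps gain from the measured averages. The helper name `render_scale` is illustrative.

```python
def render_scale(rendered_px, target_px):
    """Per-axis scale factor between the rendered and target resolution."""
    return rendered_px / target_px


scale = render_scale(1664, 1280)   # per-axis RenderScale
pixel_ratio = scale ** 2           # pixels shaded relative to a scale of 1.0
fps_gain = (58.6 - 48.9) / 48.9    # measured change after reverting to 1.0

if __name__ == "__main__":
    print(f"RenderScale: {scale:.2f}")
    print(f"Pixels shaded vs. 1.0: {pixel_ratio:.2f}x")
    print(f"Average fps gain: {fps_gain:.1%}")
```

Because the scale applies to both axes, a RenderScale of 1.3 shades roughly 1.69 times as many pixels as 1.0, which is why a seemingly small setting has a large cost.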
Using Intel GPA, look at the largest ergs to examine the draw call and understand what it's doing. Then, look for the bottleneck—for example, does the shader stall due to large textures? Finally, experiment on the erg to see if the change improves performance. Learn more about how to pinpoint performance bottlenecks within a frame in this guide to Get Started with Intel GPA.
If optimizations at the application level using GPA do not meet your requirements, a system-level exam with Windows Performance Recorder may help uncover issues.
Record data during testing with WPR and analyze it with WPA. WPA shows CPU and GPU usage along with a graphical representation of events. WPA and WPR, along with GPUView and Xperf, are part of the Windows Performance Toolkit, which is available as part of the Windows ADK.
Be sure to select the GPU activity option under Resource Analysis in the WPR GUI as this is not selected by default.
Figure 14 shows a WPA trace captured when an app named WWAHost.exe (highlighted in yellow) was running. The CPU Usage (sampled) field shows % weight for the selected executable at 10.56 percent. You'll find GPU usage under the Video twirl-down menu in the Graph explorer on the left.
Figure 14. WPA lets you analyze CPU and GPU usage
As performance engineers, we help other developers optimize their applications. The following list highlights some of the techniques most relevant to optimizing VR applications.
Look for potential culprits. The portal and other unnecessary elements may be rendering in the background, and unused textures may be loading, both of which hurt performance. Use Graphics Frame Analyzer to examine the textures being fetched, the draw-call batching, and any unnecessary clear calls. Also determine whether essential details are being rendered to the screen and which draw calls have the biggest impact.
In the Frame Analyzer, change both the X and Y metrics to GPU duration, as shown in figure 15. This reveals which draw calls take the longest to render. Experiment to pinpoint the pipeline bottleneck.
Figure 15. Change the X and Y dropdown to show GPU duration to reveal which ergs are taking the most time.
Figure 16 shows a compositor pushing GPU utilization to almost 100 percent. This indicates a GPU-bound application. In such scenarios, check CPU usage. If it’s not at capacity, try offloading the GPU workload to the CPU.
Figure 16. When you see GPU utilization at nearly 100 percent (highlighted in orange), it means your application is GPU-bound
Many factors affect VR application startup time, and numerous large asset files take an enormous toll. Even with high-performance hardware, it might take 30 to 60 seconds or more to launch a VR app. Let's explore a startup-time problem scenario, its root cause, and a workaround.
This problem can show up anywhere in a VR application, but it's most painful during startup. Consider figure 17, an example we collected with WPR and viewed in WPA.
Figure 17. Serialized network operations can add significant latency to a VR application’s startup time
In this example, a nine-second window with an idle CPU and GPU dominates the startup time. This red flag means the bottleneck is something other than the processors. The developers had no idea anything was out of the ordinary—they assumed there was no way to reduce startup time.
This illustrates an important rule: a well-tuned application should always be CPU- or GPU-limited. The only exception is an application that achieves the desired user experience before hitting either bottleneck.
Once we identified the behavior with WPA, the root cause of the nine-second gap was easy to find. Analysis of the source code revealed the culprit was an update operation. The intention was to run it on its own thread, but it was being run on the UI thread. The workaround took 10 minutes and three lines of code to correct. We spawned the thread to offload the update operation and unblock the UI thread.
That work resulted in the startup profile in figure 18.
Figure 18. Eliminating the source of the startup bottleneck reduced the startup time by 11 seconds
Note that the gap in CPU usage is gone. The startup time dropped from about 26 seconds to less than 16 seconds. We achieved this by performing the update operation asynchronously.
Always parallelize network operations on separate threads to ensure that they won't block the main or UI thread during startup unless required. Sometimes the best optimization opportunities are hiding in plain sight.
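The pattern behind the fix can be sketched in a few lines. This is an illustration of the general technique rather than the code from the application we analyzed: the update check is a stand-in that simply sleeps, and the function names and timings are hypothetical.

```python
import threading
import time


def check_for_updates(result):
    """Stand-in for a blocking network call (hypothetical)."""
    time.sleep(0.2)  # simulates network latency
    result["update_available"] = False


def start_application():
    """Run startup work while the update check proceeds on its own thread."""
    result = {}
    worker = threading.Thread(target=check_for_updates, args=(result,), daemon=True)
    worker.start()            # returns immediately; the UI thread is not blocked

    started = time.perf_counter()
    time.sleep(0.05)          # stand-in for loading assets and showing the UI
    ui_ready_after = time.perf_counter() - started

    worker.join()             # only wait when the result is actually needed
    return ui_ready_after, result


if __name__ == "__main__":
    ui_ready_after, result = start_application()
    print(f"UI ready after {ui_ready_after:.2f} s, update result: {result}")
```

The key point is that the main thread reaches its "ready" state without waiting for the network call, and only joins the worker at the point where the result is required.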
Problem: The application is showing high CPU usage for video playback.
Solution: Use the Performance tab in Task Manager to confirm that hardware decode is in use (see figure 19).
Figure 19. No hardware decode (above left); hardware decode on (above right)
If hardware acceleration is unavailable in your editor, check the settings for an option. In Unity, it is recommended you use an H.264 source. If there’s no hardware decode with your H.264 video in Unity, use this sample to integrate it into your application.
This guide is a starting point that introduces VR application developers to a few basic methods, tools, and techniques for optimizing VR applications. Every application is different, and the recipes and solutions included here might not suit all applications in all cases. Depending on your situation, try analyzing the app with the different tools and methodologies described in the sections above. Links to all the tools mentioned in this guide are listed in the appendix, which serves as a reference for those who wish to go deeper into specific optimization topics.
|HDR||Disable. RGBA 8 or 10|
|Textures||Low or Medium|
|Single-Pass Stereo Rendering||Enable|
|Render Scale||0.7 - 1.0|
If the vertex shaders are complex, with math functions such as pow, exp, log, cos, sin, and tan, consider using lookup textures as an alternative to the math calculations where possible. See this guide to post-processing with User LUT.
If there is a sampler bottleneck, the texture sampler is starving EUs due to slow retrieval.
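The lookup-texture suggestion above trades arithmetic for a precomputed table. In a shader the table would be a texture sampled on the GPU; the Python sketch below only illustrates the same idea on the CPU, using a precomputed sine table with linear interpolation. The table size of 256 and the function name `sin_lookup` are arbitrary choices for illustration.

```python
import math

TABLE_SIZE = 256
# Precompute one period of sin(x); the extra entry simplifies interpolation.
SIN_TABLE = [math.sin(2.0 * math.pi * i / TABLE_SIZE) for i in range(TABLE_SIZE + 1)]


def sin_lookup(x):
    """Approximate sin(x) with a table lookup and linear interpolation."""
    position = (x / (2.0 * math.pi)) % 1.0 * TABLE_SIZE
    # Clamp the index so that a position rounded up to TABLE_SIZE stays in range.
    index = min(int(position), TABLE_SIZE - 1)
    fraction = position - index
    return SIN_TABLE[index] + fraction * (SIN_TABLE[index + 1] - SIN_TABLE[index])


if __name__ == "__main__":
    worst = max(abs(sin_lookup(i * 0.01) - math.sin(i * 0.01))
                for i in range(-1000, 1000))
    print(f"worst absolute error over the sampled range: {worst:.2e}")
```

A larger table lowers the error at the cost of memory, which mirrors the trade-off between lookup-texture resolution and sampler pressure on the GPU.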