Published: 05/24/2019   Last Updated: 05/24/2019
This document presents developer guidance and optimization methods for the graphics hardware architecture of Intel® Processor Graphics Gen11. It provides developers with best practices for harnessing the architecture's capabilities most effectively and achieving peak performance. The document also provides specific API guidance for using the latest graphics APIs on Intel Processor Graphics Gen11.
The intended audience of this guide is developers who seek to optimize their interactive 3D rendering applications for Gen11. It is assumed that the developer has a fundamental understanding of the graphics API pipelines for Microsoft DirectX* 12, Vulkan*, and/or Metal 2*. Gen11 also supports the DirectX 11 and OpenGL* graphics APIs; however, applications that use the newer, lower-level APIs such as DirectX 12, Vulkan, and Metal 2 benefit from better performance and lower CPU overhead, as well as from new graphics architecture features that are only available through these APIs.
Gen11 offers improved performance and efficiency over Gen9, and new features such as coarse pixel shading, tile-based rendering, and new display controller features. In addition, Gen11 offers the following improvements over previous generations:
For a more in-depth overview of the Gen11 architecture and new features, see the Intel® Processor Graphics Gen11 Architecture guide.
Gen11 implements a tile-based rendering solution known as position only shading tile-based rendering (PTBR). The motivation for tile-based rendering is to reduce memory bandwidth by efficiently managing multiple render passes to the data in each tile. To support tile-based rendering, Gen11 adds a parallel geometry pipeline that acts as a tile binning engine. It runs ahead of the render pipeline and performs a per-tile visibility binning pre-pass; the render pipeline then loops over the geometry for each tile and consumes the visibility stream generated for that tile. PTBR uses the L3 cache to keep per-tile data on die, reducing external memory bandwidth. For more information, refer to the architecture guide, or talk with an Intel application engineer to see whether your workload will benefit.
Coarse pixel shading, also known in DirectX 12 as variable-rate shading, gives the programmer the ability to vary the shading rate independent from the render target resolution and rasterization rate. Among other use cases, this feature allows developers to reduce the number of pixel shader invocations with content that has slowly varying shading parameters, or for pixels that may be blurred later in the rendering pipeline. The feature enables developers to direct shader operations to the pixels that matter most in their content. This can provide a better visual solution than rendering at a lower resolution and then upscaling, since we preserve the depth and stencil at full pixel rate. Gen11 hardware supports DirectX 12 variable-rate shading (VRS) Tier 1.
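As a hedged illustration (not taken from the guide), the sketch below shows how an application might check for Tier 1 support and request a coarser per-draw shading rate through the DirectX 12 API; the function name and the 2x2 rate choice are illustrative.

```cpp
#include <d3d12.h>

// Check for variable-rate shading Tier 1 and, if present, request 2x2 coarse shading
// for the subsequent draws on this command list.
void SetCoarseShadingIfSupported(ID3D12Device* device,
                                 ID3D12GraphicsCommandList5* commandList)
{
    D3D12_FEATURE_DATA_D3D12_OPTIONS6 options6 = {};
    if (SUCCEEDED(device->CheckFeatureSupport(D3D12_FEATURE_D3D12_OPTIONS6,
                                              &options6, sizeof(options6))) &&
        options6.VariableShadingRateTier >= D3D12_VARIABLE_SHADING_RATE_TIER_1)
    {
        // Shade once per 2x2 pixel block; depth and stencil still run at full pixel rate.
        // Passing nullptr leaves the shading-rate combiners at their defaults.
        commandList->RSSetShadingRate(D3D12_SHADING_RATE_2X2, nullptr);
    }
}
```

Because Tier 1 applies the rate per draw, a typical strategy is to lower the rate only for draws with slowly varying shading or whose output will later be blurred.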
Gen11 adds improved support for high dynamic range (HDR) displays. To make use of this capability, refer to the Microsoft document, High Dynamic Range and Wide Color Gamut Overview.
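As a minimal sketch (assuming a DXGI flip-model swap chain with a 10-bit back buffer and a display that reports HDR support), an application can switch the swap chain to the HDR10 color space as shown below; see the Microsoft documentation referenced above for the full requirements around tone mapping and HDR metadata.

```cpp
#include <dxgi1_5.h>

// Switch the swap chain to the HDR10 (ST.2084 / BT.2020) color space if the current
// output supports presenting it.
void EnableHdr10(IDXGISwapChain3* swapChain)
{
    const DXGI_COLOR_SPACE_TYPE hdr10 = DXGI_COLOR_SPACE_RGB_FULL_G2084_NONE_P2020;

    UINT support = 0;
    if (SUCCEEDED(swapChain->CheckColorSpaceSupport(hdr10, &support)) &&
        (support & DXGI_SWAP_CHAIN_COLOR_SPACE_SUPPORT_FLAG_PRESENT))
    {
        // Tell DXGI that the back buffer contents are HDR10 encoded.
        swapChain->SetColorSpace1(hdr10);
    }
}
```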
Adaptive Sync is the VESA standard for variable refresh rate displays. This display controller and display feature enables a better experience for the user by reducing tearing and stuttering. Adaptive Sync may also reduce overall system power consumption. Basic requirements for Adaptive Sync are:
The game or 3D application must ensure that its rendering swap chain implements asynchronous buffer flips. On displays that support Adaptive Sync, this results in smooth interactive rendering, with the display refresh dynamically synchronized with the asynchronous swap chain flips. If application and platform conditions are met, the Gen11 driver enables Adaptive Sync by default. There is also an option to disable it using the Intel graphics control panel.
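A minimal sketch of such an asynchronous-flip swap chain with DXGI follows, assuming a DirectX 12 device; the helper name and buffer count are illustrative. Presenting with a sync interval of zero plus the tearing flag is what allows flips to occur asynchronously.

```cpp
#include <d3d12.h>
#include <dxgi1_5.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Create a flip-model swap chain and record whether tearing (asynchronous flips on
// Adaptive Sync displays) is available on this system.
ComPtr<IDXGISwapChain1> CreateAsyncFlipSwapChain(IDXGIFactory5* factory,
                                                 ID3D12CommandQueue* queue,
                                                 HWND hwnd, bool& allowTearing)
{
    BOOL tearing = FALSE;
    allowTearing = SUCCEEDED(factory->CheckFeatureSupport(
                       DXGI_FEATURE_PRESENT_ALLOW_TEARING, &tearing, sizeof(tearing)))
                   && tearing;

    DXGI_SWAP_CHAIN_DESC1 desc = {};
    desc.BufferCount = 3;
    desc.Format = DXGI_FORMAT_R8G8B8A8_UNORM;
    desc.BufferUsage = DXGI_USAGE_RENDER_TARGET_OUTPUT;
    desc.SwapEffect = DXGI_SWAP_EFFECT_FLIP_DISCARD;   // flip model is required
    desc.SampleDesc.Count = 1;
    desc.Flags = allowTearing ? DXGI_SWAP_CHAIN_FLAG_ALLOW_TEARING : 0;

    ComPtr<IDXGISwapChain1> swapChain;
    factory->CreateSwapChainForHwnd(queue, hwnd, &desc, nullptr, nullptr, &swapChain);
    return swapChain;
}

// Per frame: a sync interval of 0 plus the tearing flag lets flips happen as soon as
// rendering finishes; an Adaptive Sync display synchronizes its refresh to these flips.
// swapChain->Present(0, allowTearing ? DXGI_PRESENT_ALLOW_TEARING : 0);
```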
Intel provides two major tools to help improve application performance on both CPUs and graphics processing units (GPUs): VTune™ Amplifier and Intel® Graphics Performance Analyzers (Intel® GPA), both of which are free to download. Other tools such as RenderDoc*, Microsoft PIX*, and Windows* Performance Analyzer may work on Intel platforms and can also provide valuable insight into the performance of your application. See the documentation at these links for more information on these tools.
Intel GPA includes powerful, agile tools that enable game developers to use the full performance potential of their gaming platform, including, though not limited to, Intel® Core™ processors and Intel Processor Graphics. Intel GPA tools visualize performance data from your application, enabling you to understand system-level and individual frame performance issues. These tools also let you perform what-if experiments to estimate potential performance gains from optimizations.
The Intel GPA System Analyzer is a real-time tool that displays CPU, graphics API, and GPU performance metrics. System Analyzer can help you quickly identify key performance opportunities and whether your workload is CPU or GPU bottlenecked—allowing you to focus optimization efforts on elements that have the most performance impact on your application. With the tool, you can use state override experiments to conduct a fast, high-level, iterative analysis of your game, all without changing a single line of code. System Analyzer is supported for DirectX and OpenGL applications.
Figure 1. System Analyzer shows real-time status for graphics applications and at a system level
With System Analyzer you can:
Figure 2. The head-up display overlay shows real-time metrics
For more information see the System Analyzer Getting Started Guide.
The Intel GPA Graphics Frame Analyzer is a powerful, intuitive, single-frame and multiframe (DirectX 11, DirectX 12, and Vulkan) analysis and optimization tool for major graphics API workloads. It provides deep frame performance analysis down to the draw call level, including shaders, render states, pixel history, and textures. You can conduct what-if experiments to see how changes iteratively impact performance and visuals, without having to recompile your source code.
Figure 3. Intel® GPA Graphics Frame Analyzer
With Graphics Frame Analyzer you can:
The Intel GPA Graphics Trace Analyzer lets you see where your application is spending time across the CPU and GPU. This helps to ensure that your software takes full advantage of the processing power available from today’s Intel® platforms.
Figure 4. Intel® GPA Graphics Trace Analyzer
Graphics Trace Analyzer provides offline analysis of CPU and GPU metrics and workloads with a timeline view for analysis of tasks, threads, major graphics APIs, and GPU-accelerated media applications in context.
With Graphics Trace Analyzer, you can:
For additional information and to see up-to-date details about Intel GPA, visit the product page.
VTune Amplifier not only helps find CPU and GPU bottlenecks, it can also help optimize the work being done by the CPU when tuning for performance. For tuning DirectX applications, VTune Amplifier can detect slow frames and DirectX events. VTune Amplifier also supports customization of tracing events through its Frame and Event APIs. For more information on how to set up and use VTune Amplifier, see the extensive documentation in the Intel® VTune™ Amplifier 2019 User Guide.
While the scope of this guide is limited to performance optimization on Gen11, the following overview covers key VTune Amplifier features that are helpful when tuning workloads that are more graphical in nature, such as gaming applications. See the VTune™ Amplifier User Guide for more extensive documentation and training, including help with CPU-related bottlenecks and how to use the features highlighted below.
VTune Amplifier can detect slow frames and DirectX events. Using this capability, you can identify a slow frame and filter down to the events within it.
The Frame and Event APIs in VTune Amplifier allow you to customize the tracing events that VTune Amplifier profiles. The Frame API does this on a per-frame basis; for instance, in a game it is best to surround the body of the game loop with frame begin and end calls so that each frame is profiled separately. More information and examples for the VTune Amplifier Frame API can be found in the Frame API documentation.
The Event API allows custom demarcation of events in your software for profiling. For instance, to help optimize tasks in a game, you could use the Event API to track the individual tasks needed to compute a frame. More information and usage examples for the VTune Amplifier Event API can be found in the Event API documentation.
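As an illustration of both APIs, the following is a minimal sketch using the ITT notification calls that VTune Amplifier consumes; the domain and task names are illustrative, and the application must link against the ittnotify library that ships with VTune Amplifier.

```cpp
#include <ittnotify.h>

// One domain for the game, plus a named task that marks the simulation work per frame.
static __itt_domain*        g_domain  = __itt_domain_create("MyGame.Rendering");
static __itt_string_handle* g_simTask = __itt_string_handle_create("SimulateWorld");

void GameLoopIteration()
{
    __itt_frame_begin_v3(g_domain, nullptr);          // Frame API: mark frame start

    __itt_task_begin(g_domain, __itt_null, __itt_null, g_simTask);   // Event/task API
    // ... simulation work for this frame ...
    __itt_task_end(g_domain);

    // ... render and present ...

    __itt_frame_end_v3(g_domain, nullptr);            // Frame API: mark frame end
}
```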
Modern graphics APIs like DirectX 12, Metal, and Vulkan give developers more control over lower level choices that were once handled in driver implementations. Although each API is different, there are general recommendations for application developers that are API independent.
When configuring pipeline states, consider the following:
Modern Graphics APIs give you more control over resource binding, such as with DirectX Root Signatures and Vulkan Pipeline Layout. Using these requires particular attention to maximize performance. When designing an application strategy for resource binding, employ the following guidance:
The following application guidelines ensure the efficient use of bandwidth with render targets:
When dealing with resources that have both read and write access in a shader, such as unordered access views (UAVs) and shader storage buffer objects (SSBOs), consider the following:
To get the best performance when performing multisample anti-aliasing, the following are recommended:
We do recommend using an optimized compute shader post-processing anti-aliasing technique such as Conservative Morphological Anti-Aliasing 2.0.
Each resource barrier generally results in a cache flush or GPU stall operation, affecting performance. Given that, the following guidelines are recommended:
When working with command queues and buffers, the following are recommended:
For the best performance on clear, copy, and update operations, follow these guidelines:
Ensure that vertex and geometry shader functions operate optimally by the following guidelines:
To ensure the most efficient use of Gen11 PTBR hardware, follow these guidelines for bandwidth limited passes:
When writing shaders, look for these opportunities to optimize:
Extended math and sampling operations have a higher weight and may be worth branching around (see Table 1 for issue rates).
Instruction | Single Precision (ops/EU/clk) | Theoretical Cycle Count |
---|---|---|
FMAD | 8 | 1 |
FMUL | 8 | 1 |
FADD | 8 | 1 |
MIN,MAX | 8 | 1 |
CMP | 8 | 1 |
INV | 2 | 4 |
SQRT | 2 | 4 |
RSQRT | 2 | 4 |
LOG | 2 | 4 |
EXP | 2 | 4 |
POW | 1 | 8 |
IDIV | 1 – 6 | 1.33 – 8 |
TRIG | 2 | 4 |
FDIV | 1 | 8 |
Table 1. Gen11 EU Instruction Issue Rates
To get the best performance out of textures and texture operations, please consider the following items:
When defining shader constants, the following guidelines can help to achieve better performance:
Each thread on an execution unit has its own set of registers for storing values. The more work that can be done with register-to-register operations, the lower the memory penalties. However, if there are more temporary variables than available registers, some of those variables must be stored in memory, where reads and writes carry a latency cost. Avoiding this spillover can help improve performance.
When writing shaders, the following guidelines should be considered to help reduce spillover and improve performance:
When developing compute shaders, the following guidelines can help to achieve optimal performance when selecting thread group sizes:
When developing compute shaders that use SLM, consider the following:
Gen11 supports the use of wave intrinsics for both 3D and compute workloads. These can be used to write more efficient reductions and to keep data in registers, reducing reliance on global or local memory for communication across lanes. They allow threads within a thread group to share information without using barriers, and enable other cross-lane operations for threads in the same wave. When working with wave intrinsics on Gen11, consider the following:
When presenting frames, it is best to use full-screen presentation modes, when possible. Windowed and other modes require an extra context switch.
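For example, with DXGI an application can request full-screen exclusive presentation as sketched below; this is illustrative only, and error handling and display-mode selection are omitted.

```cpp
#include <dxgi.h>

// Switch the swap chain to full-screen exclusive mode so presents avoid the extra
// composition/context switch that windowed modes require.
void EnterFullscreen(IDXGISwapChain* swapChain)
{
    // Passing nullptr lets DXGI pick the output the window currently occupies.
    swapChain->SetFullscreenState(TRUE, nullptr);
    // Resize the back buffers afterward so they match the full-screen mode.
}
```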
Mobile and ultra-mobile computing are ubiquitous. On these platforms, power is shared between CPU and GPU, so optimizing for CPU can frequently result in GPU performance gains. As a result, battery life, device temperature, and power-limited performance have become significant issues. As manufacturing processes continue to shrink and improve, we see improved performance per-watt characteristics of CPUs and processor graphics. However, there are many ways that software can reduce power use on mobile devices, as well as improve power efficiency. In the following sections, you will find insights and recommendations illustrating how to best recognize these performance gains.
Processors execute in different power states, known as P-states and C-states. C-states are essentially idle states that minimize power draw by progressively shutting down more and more of the processor. P-states are performance states where the processor consumes progressively more power and runs faster at a higher frequency.
These power states define how much time the processor is sleeping and how it distributes available power when active. Power states can change very quickly, so sleep states are relevant to most applications that do not consume all the power available, including real-time applications.
When you optimize applications, try to save power in two different ways:
You can determine the power state behavior of your application by measuring how much time it spends in each state. Since each state consumes a different amount of power, you will get a picture over time of your app’s overall power use.
Begin by measuring your app's baseline power usage in multiple cases and at different loads:
The worst-case load may not occur where you expect it to. We have seen very high frame rates (1000 frames per second (fps)) during cut-scene video playback in certain apps, a situation that can cause the GPU and CPU to use unnecessary power. As you study your application, try a few of these tips:
Use the data gained through these methods to reduce or consolidate wakeups, thus remaining in a lower power state longer.
As you study power at near idle, watch for very high frame rates.
If your app has high frame rates at near idle power (during cut scenes, menus, or other low-GPU intensive parts), remember that these parts of your app will look fine if you lock the present interval to a 60 Hz display refresh rate (or clamp your frame rate lower, to 30 fps).
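One minimal way to do this with DXGI is to raise the sync interval passed to Present for low-intensity scenes, as sketched below; the helper function and the 60 Hz assumption are illustrative.

```cpp
#include <dxgi.h>

// Clamp the presentation rate with the DXGI sync interval so menus and cut scenes do
// not run at uncapped frame rates.
void PresentClamped(IDXGISwapChain* swapChain, bool lowIntensityScene)
{
    // SyncInterval 1 waits for one vertical blank (60 fps on a 60 Hz panel);
    // SyncInterval 2 waits for two, clamping low-intensity scenes to 30 fps.
    const UINT syncInterval = lowIntensityScene ? 2 : 1;
    swapChain->Present(syncInterval, 0);
}
```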
Watch for these behaviors in menus, loading screens, and other low-GPU intensive parts of games, and scale accordingly to minimize power consumption. This can also improve CPU intensive loading times, by allowing turbo boost to kick in, when necessary.
While in active states, the processor and the operating system jointly decide frequencies for various parts of the system (CPUs, GPU, and memory ring, in particular). The current generation of Intel Core processors adds more interaction between the operating system and the processor(s) to respond more efficiently and quickly to changes in power demand, a process referred to as Intel® Speed Shift Technology.
The system balances the frequencies based on activity and increases frequency (and thus consumed power) where it is needed most. As a result, a mostly active workload may have its GPU and CPU balance frequencies based on power consumption.
Reducing the amount of work done on the CPU can free up power for the GPU and vice versa. This can result in better overall performance, even when the other side was the primary performance bottleneck.
Tools such as Intel® Power Gadget can also help you see the frequencies of each clock domain in real time. You can monitor the frequencies of different subsystems on target devices by running this tool.
You can tell that your app’s power distribution is getting balanced when the primary performance bottleneck is not running at full frequency but power consumption is reaching the maximum limits available.
There are times when the user explicitly requests trading performance for battery life, and there are things you can do to more effectively meet these demands. There are also patterns in application usage that always consume extra power for little return, patterns that you can more effectively address to handle overall power usage. In the next sections you will see some issues to watch for when trying to reduce overall power consumption.
It was once necessary to poll for power settings and profile changes (for example, with GetSystemPowerStatus()), but since Windows Vista, Windows has supported asynchronous power notification APIs.
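A minimal sketch of registering for one such notification is shown below; the chosen GUID (AC/DC power source) and the handling code illustrate the asynchronous pattern rather than a complete power-management scheme.

```cpp
#include <windows.h>

// Register for asynchronous power-source notifications instead of polling
// GetSystemPowerStatus(). The window receives WM_POWERBROADCAST /
// PBT_POWERSETTINGCHANGE when the machine moves between AC and battery power.
HPOWERNOTIFY RegisterForPowerSourceChanges(HWND hwnd)
{
    return RegisterPowerSettingNotification(hwnd, &GUID_ACDC_POWER_SOURCE,
                                            DEVICE_NOTIFY_WINDOW_HANDLE);
}

// In the window procedure (illustrative):
// case WM_POWERBROADCAST:
//     if (wParam == PBT_POWERSETTINGCHANGE) {
//         auto* setting = reinterpret_cast<POWERBROADCAST_SETTING*>(lParam);
//         // setting->Data[0]: 0 = AC power, 1 = battery; scale workload accordingly.
//     }
//     break;
```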
If you run as slowly as you can while still remaining responsive, you can save power and extend battery life.
There are several other related points to watch:
Balanced threading offers performance benefits, but you need to consider how it operates alongside the GPU, as imbalanced threading can also result in lower performance and reduced power efficiency. Avoid affinitizing threads so that the operating system can schedule threads directly. If you must, provide hints using SetIdealProcessor().
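As a small illustrative sketch (the helper name is hypothetical), a hint looks like this:

```cpp
#include <windows.h>

// Hint a preferred core with SetIdealProcessor() rather than hard-affinitizing with
// SetThreadAffinityMask(), leaving the scheduler free to move the thread when that
// helps power and thermal balancing.
void HintWorkerCore(HANDLE workerThread, DWORD preferredCore)
{
    // The scheduler treats this as a preference, not a requirement.
    SetIdealProcessor(workerThread, preferredCore);
}
```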
Using SIMD instructions, either through the Intel® SPMD Program Compiler (Intel® ISPC) or intrinsics, can provide a significant power and performance boost, and the improvements can be even bigger with the latest instruction sets.
However, on Intel Core processors, executing SIMD instructions requires a voltage increase to power the SIMD architecture block. To avoid an increase in power, Intel Core processors run at a lower frequency, which can decrease performance for a mostly scalar workload that contains only a few SIMD instructions. For this reason, sporadic SIMD usage should be avoided.
The latest graphics APIs (DirectX 12, Vulkan, and Metal 2) can dramatically reduce CPU overhead, resulting in lower CPU power consumption at a fixed frame rate (33 fps), as shown on the left side of Figure 5. When unconstrained by frame rate, the total power consumption is unchanged, but there is a significant performance boost due to increased GPU utilization. See the Asteroids* and DirectX* 12 white paper for full details.
Figure 5. Asteroids* demo—power versus frame rate
Intel regularly releases code samples covering a variety of topics to the developer community. For the most up-to-date samples and links, see the following resources:
Following are descriptions and links to samples that may also be of interest to developers targeting current Intel® systems.
Dynamic resolution rendering (DRR) is an algorithm that aims to increase and smooth game performance by keeping the displayed render target at a fixed resolution while dynamically varying the resolution that drives the engine's shading.
One of the primary issues inhibiting adoption of DRR is the modification it requires to post-processing pipelines. With the introduction of DirectX 12 and placed resources, an updated implementation of the algorithm removes the need for most, if not all, post-processing pipeline modifications, at the cost of increasing memory requirements with an additional dynamic-resolution render target buffer. A sketch of allocating such a target as a placed resource follows.
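The sketch below shows one way to create the extra render target as a placed resource in an explicitly created heap, assuming DirectX 12; the sizes, format, and names are illustrative, and real code must keep the heap alive for the lifetime of the placed resource and handle errors.

```cpp
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Create the additional dynamic-resolution render target as a placed resource inside an
// explicitly created heap. The caller must keep heapOut alive as long as the resource.
ComPtr<ID3D12Resource> CreateDynamicResolutionTarget(ID3D12Device* device,
                                                     UINT maxWidth, UINT maxHeight,
                                                     ComPtr<ID3D12Heap>& heapOut)
{
    // Describe the target at its maximum (native) resolution; at runtime the engine can
    // render into a smaller viewport within it as the dynamic resolution drops.
    D3D12_RESOURCE_DESC desc = {};
    desc.Dimension = D3D12_RESOURCE_DIMENSION_TEXTURE2D;
    desc.Width = maxWidth;
    desc.Height = maxHeight;
    desc.DepthOrArraySize = 1;
    desc.MipLevels = 1;
    desc.Format = DXGI_FORMAT_R8G8B8A8_UNORM;
    desc.SampleDesc.Count = 1;
    desc.Flags = D3D12_RESOURCE_FLAG_ALLOW_RENDER_TARGET;

    // Ask the device for the required size and alignment, create a heap of that size,
    // and place the resource at offset zero within it.
    D3D12_RESOURCE_ALLOCATION_INFO info = device->GetResourceAllocationInfo(0, 1, &desc);

    D3D12_HEAP_DESC heapDesc = {};
    heapDesc.SizeInBytes = info.SizeInBytes;
    heapDesc.Alignment = info.Alignment;
    heapDesc.Properties.Type = D3D12_HEAP_TYPE_DEFAULT;
    heapDesc.Flags = D3D12_HEAP_FLAG_ALLOW_ONLY_RT_DS_TEXTURES;
    device->CreateHeap(&heapDesc, IID_PPV_ARGS(&heapOut));

    ComPtr<ID3D12Resource> target;
    device->CreatePlacedResource(heapOut.Get(), 0, &desc,
                                 D3D12_RESOURCE_STATE_RENDER_TARGET, nullptr,
                                 IID_PPV_ARGS(&target));
    return target;
}
```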
While some graphics optimizations focus on reducing geometric level of detail, checkerboard rendering (CBR) reduces the amount of shading done that is imperceptible. The technique produces full resolution pixels that are compatible with modern post processing techniques, and can be implemented for both forward and deferred rendering. More information, implementation details, and sample code can be found in the white paper Checkerboard Rendering for Real-Time Upscaling on Intel Integrated Graphics.
Conservative Morphological Anti-Aliasing 2.0 (CMAA-2) is an update to the image-based conservative morphological anti-aliasing algorithm. This implementation provides improvements to the anti-aliasing quality and performance of previous implementations. For more information, refer to the Conservative Morphological Anti-Aliasing 2.0 white paper.
Screen space ambient occlusion (SSAO) is a popular effect used in real-time rendering to produce small-scale ambient effects and contact shadow effects. It is used by many modern game engines, typically using 5 to 10 percent of the frame GPU time. Although a number of public implementations already exist, not all are open source or freely available, or provide the level of performance scaling required for both low-power mobile and desktop devices. This is where Adaptive Screen Space Ambient Occlusion (ASSAO) fills needed gaps. ASSAO is specially designed to scale from low-power devices and scenarios up to high-end desktops at high resolutions, all under one implementation with a uniform look, settings, and quality that is equal to the industry standard. For more information, refer to the white paper Adaptive Screen Space Ambient Occlusion.
The GPU Detect sample demonstrates how to get the vendor and ID from the GPU. For Intel Processor Graphics, the sample also demonstrates a default graphics quality preset (low, medium, or high), support for DirectX 9 and DirectX 11 extensions, and the recommended method for querying the amount of video memory. If supported by the hardware and driver, it also shows the recommended method for querying the minimum and maximum frequencies.
The sample uses a configuration file that lists many Intel Processor Graphics devices by vendor ID and device ID, along with a suggested graphics quality level for each device. To maximize performance, test some representative devices with your application and decide which quality level is right for each. Be careful about relying only on the device ID, as a platform's performance also depends heavily on the available power, which the device manufacturer may set lower than the optimal thermal design point.
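The following sketch illustrates the kind of DXGI adapter query the sample builds on; it is not the sample's code, and the quality-preset mapping from the configuration file is omitted.

```cpp
#include <dxgi.h>
#include <cstdio>

// Enumerate adapters and print vendor/device IDs plus memory sizes. 0x8086 is Intel's
// PCI vendor ID; on integrated GPUs most usable memory is shared system memory.
void PrintAdapters(IDXGIFactory1* factory)
{
    IDXGIAdapter1* adapter = nullptr;
    for (UINT i = 0; factory->EnumAdapters1(i, &adapter) != DXGI_ERROR_NOT_FOUND; ++i)
    {
        DXGI_ADAPTER_DESC1 desc = {};
        adapter->GetDesc1(&desc);

        std::printf("Adapter %u: vendor 0x%04X, device 0x%04X, dedicated %zu MB, shared %zu MB%s\n",
                    i, desc.VendorId, desc.DeviceId,
                    desc.DedicatedVideoMemory / (1024 * 1024),
                    desc.SharedSystemMemory / (1024 * 1024),
                    desc.VendorId == 0x8086 ? " (Intel)" : "");
        adapter->Release();
    }
}
```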
The Fast ISPC Texture Compressor sample performs high-quality BC7, BC6H, ETC1, and ASTC compression on the CPU using the Intel ISPC to exploit SIMD instruction sets.
Direct3D* Website – DirectX 12 and other DirectX resources
Vulkan – Khronos site with additional resources
Metal 2 – Apple’s developer site for Metal 2
Notices
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804