This whitepaper introduces the web and Chrome* graphics rendering pipeline, discusses how to exploit the advantages of Intel® architecture, and describes the work we have done to solve the texture upload problem along with the benefits we found in doing so. It is intended for a broad, enthusiastic technical audience.
Intel's Open Source Technology Center (OTC) has been working together with Google so users can transparently benefit from this optimization on Intel architecture in the next versions of Chrome OS*. Savvy users and developers can already try it by passing the following flags when launching Chrome: --enable-native-gpu-memory-buffers --enable-zero-copy
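As a minimal sketch, the two flags above can be assembled into a launch command from a script. The binary name `google-chrome` and the helper `build_chrome_command` are assumptions made for illustration; only the two flags come from the text.

```python
import shlex

# The two flags named in the text above.
ZERO_COPY_FLAGS = [
    "--enable-native-gpu-memory-buffers",
    "--enable-zero-copy",
]


def build_chrome_command(binary="google-chrome", extra_args=()):
    """Return the argument list for launching Chrome with the zero-copy flags.

    `binary` is a hypothetical default; substitute the path of your own
    Chrome or Chromium executable.
    """
    return [binary, *ZERO_COPY_FLAGS, *extra_args]


command = build_chrome_command()
# Print the command as a single shell-quoted line.
print(shlex.join(command))
```

The resulting list can be handed to `subprocess.Popen(command)` to start the browser.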
Motivation: using CPU and GPU for web graphics
When talking about modern graphics interfaces on the Web Platform, special care has to be taken with computational resources. Machines that used to be considered robust, with multiple CPU cores, a dedicated graphics processor (GPU), and high memory bandwidth, are not necessarily powerful enough to sustain the recommended display refresh rate of 60 frames per second, which is considered enough for our eyes to perceive a "smooth" interface.
For example, an interactive and complex web application that is constantly rendering needs to:
- paint the content that has changed
- draw by compositing the new content with the other elements to be displayed on the screen
There is another experimental rendering path in Chrome OS to paint through offloading to GPU, but we will not discuss that in this article.
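The 60 frames per second target mentioned above translates into a fixed time budget that painting and drawing have to share. The following is a small sketch of that arithmetic; the function names and the example timings are hypothetical, and treating paint and draw as strictly sequential is a simplification.

```python
TARGET_FPS = 60  # refresh rate quoted in the text


def frame_budget_ms(fps):
    """Time available to produce one frame, in milliseconds."""
    return 1000.0 / fps


def fits_in_budget(paint_ms, draw_ms, fps=TARGET_FPS):
    """True if painting plus compositing finish within one frame."""
    return paint_ms + draw_ms <= frame_budget_ms(fps)


print(round(frame_budget_ms(TARGET_FPS), 2))  # 16.67

# Hypothetical timings, for illustration only.
print(fits_in_budget(paint_ms=8.0, draw_ms=6.0))   # True
print(fits_in_budget(paint_ms=12.0, draw_ms=6.0))  # False
```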
Uploading textures into the hardware
Roughly speaking, a texture is the bitmap image that the GPU applies to a 3D model, and a texture upload is the process of moving that bitmap from CPU memory to GPU memory.
As a basic rule, the more textures move around, the more they hurt the system’s memory usage and overall performance. However, on Intel architecture with integrated Processor Graphics we can avoid this upload penalty because the GPU shares the same physical memory as the CPU. When used correctly, this type of architecture can bring significant benefits to graphics applications because it simplifies the texture upload mechanism; in this document we use the term cross-domain memory for this architecture feature. In the context of a browser, this means that Chrome can paint web content using a CPU memory-mapped buffer and get the content into the GPU memory space without generating copies of textures across the two domains (CPU and GPU).
Chrome has a complex multiprocess architecture, and it is a challenge to move the same graphics buffer all the way from the Renderer process to the Compositor and GPU processes and into the video driver memory without producing any copies. We therefore also introduce the term cross-process memory for memory buffers that can live across processes. We then define a zero-copy upload as one in which the graphics buffer has both the cross-domain memory and cross-process memory features and no copies are performed as textures are transferred around.
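The idea of cross-process memory can be illustrated with a loose analogy in Python's standard `multiprocessing.shared_memory` module. This is not how Chrome implements it, and it does not model the CPU/GPU (cross-domain) side at all; the tile dimensions are hypothetical, and both handles are opened in a single process only to keep the sketch short.

```python
from multiprocessing import shared_memory

WIDTH, HEIGHT, BYTES_PER_PIXEL = 4, 2, 4  # tiny hypothetical RGBA tile
SIZE = WIDTH * HEIGHT * BYTES_PER_PIXEL

# "Allocating" side: create a named buffer that other processes can open.
producer = shared_memory.SharedMemory(create=True, size=SIZE)
try:
    # "Consuming" side: attach to the same buffer by name. In a real setup
    # this would happen in a different process.
    consumer = shared_memory.SharedMemory(name=producer.name)
    try:
        # Paint through one handle...
        producer.buf[:SIZE] = bytes([0xFF]) * SIZE
        # ...and the pixels are visible through the other handle without
        # any transfer step (bytes() here only takes a copy for inspection).
        shared_pixels = bytes(consumer.buf[:SIZE])

        # A copy, by contrast, is a snapshot that misses later paints.
        snapshot = bytes(producer.buf[:SIZE])
        producer.buf[0] = 0x00
        first_byte_shared = consumer.buf[0]
        first_byte_copy = snapshot[0]
    finally:
        consumer.close()
finally:
    producer.close()
    producer.unlink()

print(first_byte_shared, first_byte_copy)  # 0 255
```

The shared handle observes the later write, while the copied snapshot does not; every such snapshot is also an extra buffer that has to be allocated and filled.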
Chrome OS solution: VGEM and the hardware-backed GpuMemoryBuffer
Recently we have been building an infrastructure to perform zero-copy texture uploads in Chrome OS with Intel architecture. The Virtual Graphics Execution Manager (VGEM) kernel subsystem is at the bottom of this infrastructure.
VGEM implements a special type of Linux* DRM file descriptor with fewer privileges than regular DRM file descriptors. Mode-setting and a few other display control configurations are not possible with VGEM; instead, it simply allows a nonprivileged user process to map a previously allocated graphics buffer. The buffer allocation step happens in another, privileged process through system calls to the DRM submodule Graphics Execution Manager (GEM). The privileged process can later safely export the buffer through the PRIME subsystem so it can be mapped by the nonprivileged process.
The way the allocation and VGEM work together is a good fit for the sandbox architecture of Chrome OS, where the Renderer process (for example, browser tabs) has limited access to the rest of the system. This makes VGEM the de facto cross-process graphics buffer sharing system needed for uploading textures.
The Linux i915 driver, through the userspace MiniGBM library, is responsible for providing GPU-allocated buffers to the GPU process, which is a privileged process in Chrome (together with the Browser process). More importantly, i915 also solves the cross-domain part of the problem: it keeps the CPU coherent with the GPU by internally managing the cache hierarchy and flushing it to memory when needed. This coherency management requires complex logic and knowledge of the underlying hardware. Thankfully, this is all transparent to Chrome.
Right above VGEM, Chrome exposes this cross-process, cross-domain buffer infrastructure through the GpuMemoryBuffer interface. Browser tabs and other Chrome clients use GpuMemoryBuffer to map the buffer into the CPU, so their painter (Skia) can have direct access to the GPU memory that holds textures.
We can summarize the Chrome buffer lifecycle for texture upload through the native GpuMemoryBuffer system in the following steps:
- The Browser process is the broker and is therefore responsible for opening the VGEM special file descriptor.
- Allocations per se happen via the GPU process in the following manner:
  - The Renderer process asks the Browser process to allocate a buffer.
  - The Browser process talks to the GPU process.
  - The GPU process creates the buffer and exports the handle (via PRIME) to the Renderer process.
- The Renderer process maps the buffer and is able to paint into it.
- Internally, the i915 driver takes care of flushing and maintaining the CPU ↔ GPU cache coherency.
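The steps above can be sketched as a toy model using ordinary files and `mmap`. This is an analogy under stated assumptions, not the Chrome, VGEM, or i915 implementation: a temporary file stands in for a GEM buffer object, a duplicated file descriptor stands in for a PRIME-exported handle, all three "processes" are plain classes in one process, and the cache coherency step handled by the driver is not modeled. All class and method names are hypothetical.

```python
import mmap
import os
import tempfile

BUFFER_SIZE = mmap.PAGESIZE  # one page stands in for a graphics buffer


class GpuProcess:
    """Stand-in for the privileged process that allocates buffers."""

    def __init__(self):
        self._backing = None

    def create_buffer(self, size):
        # A temporary file stands in for a GEM buffer object.
        self._backing = tempfile.TemporaryFile()
        self._backing.truncate(size)
        # Duplicating the descriptor stands in for exporting a handle.
        return os.dup(self._backing.fileno())

    def sample_texture(self, size):
        # Map the same pages the renderer painted; no upload step.
        with mmap.mmap(self._backing.fileno(), size) as texture:
            return texture[:size]  # copied out only for inspection

    def destroy_buffer(self):
        self._backing.close()


class BrowserProcess:
    """Stand-in for the broker between the Renderer and the GPU process."""

    def __init__(self, gpu):
        self._gpu = gpu

    def allocate_for_renderer(self, size):
        return self._gpu.create_buffer(size)


class RendererProcess:
    """Stand-in for the sandboxed process that paints content."""

    def paint(self, handle, size, color):
        # Map the exported handle and paint directly into the buffer.
        with mmap.mmap(handle, size) as pixels:
            pixels[:size] = bytes([color]) * size
        os.close(handle)


gpu = GpuProcess()
browser = BrowserProcess(gpu)
renderer = RendererProcess()

handle = browser.allocate_for_renderer(BUFFER_SIZE)  # request via the broker
renderer.paint(handle, BUFFER_SIZE, color=0x7F)      # map and paint
texture = gpu.sample_texture(BUFFER_SIZE)            # GPU side sees the paint
gpu.destroy_buffer()
```

The point of the sketch is that the painted bytes reach the "GPU" side through a shared mapping of one buffer, with the allocator and the painter holding different handles to it.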
Although the whole mechanism looks somewhat complex when put together, the hardware-backed GpuMemoryBuffer is what makes zero-copy texture uploads possible in Chrome OS. It is important to note that a regular shared memory-backed GpuMemoryBuffer is used instead as a software “fallback” when native support is missing. Next, let’s take a look at the results through a before and after perspective.
Before and after
Software fallback texture upload, one-copy: The Renderer process paints content on shared memory, and then the GPU process uploads the shared memory to a regular texture. After that, glTexImage2D copies the shared memory to GPU memory inside the GL driver. One copy is needed in this case. This is the Chrome OS default method for texturing.
Native texture upload, zero-copy: The native GpuMemoryBuffer implementation enables the Renderer process to paint content on an imported GPU buffer via VGEM. The GPU process uses the texture bound to the GPU-backed buffer. In this configuration there are no copies in the Compositor, in the GPU process for staging buffers, or in the driver.
Native zero-copy on Intel architecture was beneficial in every performance and memory consumption measurement that we performed. Alternatively, when using the native GpuMemoryBuffer implemented here, one can also enable the native one-copy texture upload path. Although this is slightly slower than the zero-copy path, it brings better overall performance and memory usage, among other advantages, compared to the default Chrome OS texturing method (shared memory).
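To get a feel for what a single copy per frame can cost, the following back-of-the-envelope sketch computes the bytes moved for a full-screen buffer. It is an illustration, not a measurement: it assumes 32-bit RGBA pixels, assumes the whole screen is repainted every frame, and uses the 2560x1700 resolution of one of the test machines described in the results section. The function names are hypothetical.

```python
BYTES_PER_PIXEL = 4  # assumption: 32-bit RGBA pixels


def buffer_bytes(width, height, bytes_per_pixel=BYTES_PER_PIXEL):
    """Size of one full-screen buffer in bytes."""
    return width * height * bytes_per_pixel


def bytes_copied_per_second(width, height, copies_per_frame, fps=60):
    """Bytes moved per second if every frame repaints the whole screen."""
    return buffer_bytes(width, height) * copies_per_frame * fps


full_screen = buffer_bytes(2560, 1700)
one_copy = bytes_copied_per_second(2560, 1700, copies_per_frame=1)
zero_copy = bytes_copied_per_second(2560, 1700, copies_per_frame=0)

print(full_screen)  # 17408000
print(one_copy)     # 1044480000
print(zero_copy)    # 0
```

Under these assumptions, the one-copy path moves roughly 1 GB per second in the worst case, on top of holding a second buffer of about 17 MB, while the zero-copy path moves nothing extra.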
Results on performance and memory usage
Note: Results have been measured using internal Intel analysis and provided to you for informational purposes. Any differences in your system hardware, software or configuration may affect your actual performance.
Using the Telemetry tool to measure the changes, we see great benefits in frame time for extreme texture upload cases (smoothness.tough_texture_upload_cases). On an Intel® Core™ i7-5500U with a 2560x1700 monitor, native zero-copy is 38.3% faster than the fallback. On a Celeron® processor 3205U with a 1366x768 monitor, native zero-copy is 30% faster, and on a Celeron® processor 2955U with a 1366x768 monitor it is 16.1% faster.
In terms of memory consumption, native zero-copy is quite favourable as well. Memory consumption (PSS) in the GPU process is about 65% lower with native zero-copy compared to software fallback. In the Renderer process the memory consumed is about 20% lower with native zero-copy.
We have also traced the events internally in Chrome to diagnose the differences between the two configurations. In the diagram below, software fallback shows that both the Renderer process and the GPU process are busy, because the Renderer process is painting content into the shared memory while the GPU process is uploading it to a texture. In the same diagram, native zero-copy shows that the GPU process is idle because the texture upload step is not needed in that configuration. This results in power savings, and it also explains why rendering-intensive web applications, such as those using WebGL, and content with heavy animations might feel smoother for the user.
Selected changes for zero-copy using native GpuMemoryBuffer
- drm/vgem: import virtual GEM:
- drm/i915: Use CPU mapping for userspace dma-buf mmap():
- ozone: Implement zero/one-copy texture for Ozone GBM:
About the authors
Tiago Vignatti is a programmer in the Graphics and Media team and has been working with industry-wide open source graphics for almost ten years now. Tiago has influenced the development of Intel platforms for Linux and Mesa graphics, the X11 and Wayland systems, and also the Chromium* project.
Dongseong Hwang works for the Web Technologies team, driving the development of core web technologies. Besides being the main graphics specialist in the Crosswalk runtime project, Dongseong is a Blink* and Chromium committer.
This work was supported by Intel's Open Source Technology Center.