Native One-copy Texture Uploads for Chrome* OS on Intel® Architecture Enabled by Default

Native one-copy texture upload patches have been merged in upstream Chromium*, and they are enabled by default for Chrome* OS on Intel® architecture. Based on the initial patchsets, the feature is enabled on all Chrome OS devices based on fifth-generation Intel® Core™ i3, i5, and i7 processors. In Chrome M50 (Feb 26, 2016 for canary release, and Apr 19, 2016 for stable release), all Chromebooks* based on fifth-generation Intel Core processors will have native one-copy texture uploads enabled via an automatic update. Intel's Open Source Technology Center (OTC) has been working with Google to bring the benefits of this optimization on Intel architecture to users.

Once these patches are enabled by default on fifth-generation Intel Core processor-based Chrome OS devices, Intel plans to upstream support for all old and new Intel Core processor-based devices (from first generation Intel Core processors through sixth generation Intel Core processors and beyond) as well as all generations of Intel® Atom™ processor-based devices. Check chrome://gpu to see if this optimization is enabled on your device, as shown in the following picture.

Devices that have native texture uploads enabled appear with Hardware accelerated status for the Native GpuMemoryBuffers item, as shown here. The Tile Update Mode item indicates whether One-copy texture upload or Zero-copy texture upload is used for rasterization in Chromium Compositor. One-copy is the default setting and is explained later in this article.

If you want to try the Zero-copy option, go to chrome://flags and enable Zero-copy rasterizer, as shown in the screenshot below:

Even if you set this flag to Enabled, if the Native GpuMemoryBuffers item shows Software only, Chromium uses the software fallback of zero-copy texture uploads. This is very similar to the software fallback of one-copy texture uploads. In fact, both options upload textures through glTexImage2D.

This article explains the technical details of what one-copy texture upload is, and also explains why one-copy is the default rather than zero-copy.

Glossary

This article follows the terminology from the Chromium glossary. Some key terms are defined below:

Skia: An open source 2D graphics library that serves as the graphics engine for Chromium, Android, and Firefox.
Blink: Web browser engine forked from WebKit.
bitmap: A buffer of pixel values in system memory.
texture: A bitmap meant to be applied to a 3D model on the GPU.
painting: The phase of rendering where Blink makes calls into the Skia API to produce a SkPicture instance that records all Skia operations.
rasterization: The phase of rendering where the bitmaps backing up each layer are filled. The SkPicture is played back on a bitmap.
compositing: The phase of rendering that combines layers’ textures into a final screen image. This is done by the GPU.
raster pool: Pool of threads that executes rasterization tasks.

What are zero-copy texture uploads?

We recommend reading our previous article to gain a greater understanding of the concepts summarized here.

The Chromium raster pool rasterizes web content onto a bitmap and then uploads the bitmap to a texture by calling glTexImage2D. This results in a CPU to GPU copy, which zero-copy aims to eliminate.

A while ago Chromium introduced GpuMemoryBuffer to eliminate CPU to GPU copies. On the implementation side, GpuMemoryBuffer is a Linux DMA buffer on Chrome OS, IOSurface on Mac OS X, and a gralloc buffer on Android. The software fallback uses POSIX shared memory.

Intel® Processor Graphics Architecture has long pioneered sharing DRAM physical memory with the CPU. Because the GPU and CPU use the same DRAM, buffers can be transferred between them with zero copies. Zero-copy means that no buffer copy is necessary because the physical memory is shared. Some Intel processors use a shared last-level cache (LLC) to further augment the performance of this memory sharing.

From Chromium’s Compositor perspective, the raster pool rasterizes web content directly on the DMA buffer, and the GPU process uses it as a texture.

Why one-copy, and not zero-copy, is enabled by default

One-copy texture upload uses a DMA buffer as a staging buffer. First, the Chromium raster pool rasterizes web content onto a DMA buffer. Then the GPU process binds the DMA buffer as a texture and copies it to a regular texture by calling glCopyTexImage2D. This is why this process is named one-copy texture upload.

Zero-copy might sound faster, but Chromium uses one-copy texture uploads by default for two reasons:

  • To use tiled storage texture for compositing.
  • To enable partial raster.

Zero-copy produces a linear storage texture as the final texture, which can unfortunately hurt memory bandwidth, as seen in the next section.

Partial raster is a Chromium optimization that rasterizes only the regions of web content that have changed. As explained in the next section, zero-copy cannot take advantage of this partial raster optimization.

Software fallback and native one-copy texture upload are fundamentally different even though both require one texture copy. The key difference is that software fallback causes a CPU to GPU copy, while native one-copy texture upload causes a GPU to GPU copy.
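The three upload paths discussed so far can be summarized as a small data structure. This is only an illustrative sketch: the `UploadPath` class and its field names are hypothetical and are not Chromium types. The entries restate what this article says about where the raster pool writes, which copy is needed, and what the compositor finally samples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UploadPath:
    name: str
    raster_target: str   # where the raster pool writes pixels
    copy: str            # copy needed to produce the final texture
    final_texture: str   # what the compositor samples from


PATHS = [
    UploadPath("software fallback", "shared-memory bitmap",
               "CPU to GPU (glTexImage2D)", "regular texture"),
    UploadPath("native one-copy", "DMA buffer",
               "GPU to GPU (glCopyTexImage2D)", "regular texture"),
    UploadPath("native zero-copy", "DMA buffer",
               "none", "the DMA buffer itself"),
]

for path in PATHS:
    print(f"{path.name:18} raster -> {path.raster_target:21} "
          f"copy: {path.copy:30} final: {path.final_texture}")

# Only the software fallback keeps the CPU busy with the copy.
cpu_copy_paths = [p.name for p in PATHS if p.copy.startswith("CPU")]
```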

Texture upload/copy is a critical path in the whole Chromium rendering pipeline. Minor improvements can massively impact overall performance. Native one-copy makes the GPU handle the texture copy. Because the CPU is freed up, the CPU can then take more time to handle other tasks (executing JavaScript, decoding images and video, handling user interactions, etc.). You can see these improvements in the Benchmark section below.

Tiled storage texture with one-copy

Most modern GPUs use tiled storage texture by default. If you create a new texture by calling glTexImage2D, most GL drivers then allocate tiled storage for the texture.

Linear surface layout is the most intuitive way to store texture content. The linear format is best suited for one-dimensional, row-sequential access patterns. However, in textures, access to vertically adjacent neighbors is as important as access to horizontally adjacent neighbors. If vertically adjacent elements fall within different memory pages, the GPU core has to wait for another memory page to be read instead of reading from the GPU cache. These GPU cache misses badly impact performance. From Chromium’s perspective, GPU cache misses happen when Chromium composites layers.

This is why most modern GPUs use a tiled storage texture. For example, Intel® HD Graphics uses a 128 bytes (32 pixels) x 32 rows tile that has a fixed 4 KB size and is aligned to physical DRAM page boundaries. This tiled layout significantly reduces the GPU cache misses that the GPU core suffers when it executes a fragment shader.
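To make the page-locality argument concrete, here is a minimal sketch that maps pixel coordinates to byte offsets in a linear layout and in a simplified tiled layout, then counts how many 4 KB pages are touched when reading one vertical column of pixels. It assumes 4 bytes per pixel, a surface width that is a multiple of the tile width, and plain row-by-row storage inside each tile. Real hardware tiling formats are more involved, and all function names here are hypothetical.

```python
BYTES_PER_PIXEL = 4      # assumption: 32-bit pixels
TILE_WIDTH_BYTES = 128   # 32 pixels per tile row, as described above
TILE_HEIGHT_ROWS = 32
TILE_SIZE = TILE_WIDTH_BYTES * TILE_HEIGHT_ROWS  # 4096 bytes
PAGE_SIZE = 4096


def linear_offset(x, y, width_px):
    """Byte offset of pixel (x, y) in a row-major linear surface."""
    return (y * width_px + x) * BYTES_PER_PIXEL


def tiled_offset(x, y, width_px):
    """Byte offset of pixel (x, y) in a simplified tiled surface.

    Assumes the surface width is a multiple of the tile width and that
    bytes inside a tile are stored row by row.
    """
    x_bytes = x * BYTES_PER_PIXEL
    tiles_per_row = (width_px * BYTES_PER_PIXEL) // TILE_WIDTH_BYTES
    tile_index = (y // TILE_HEIGHT_ROWS) * tiles_per_row + x_bytes // TILE_WIDTH_BYTES
    within_tile = (y % TILE_HEIGHT_ROWS) * TILE_WIDTH_BYTES + x_bytes % TILE_WIDTH_BYTES
    return tile_index * TILE_SIZE + within_tile


def pages_touched(offset_fn, width_px, x, rows):
    """Number of distinct 4 KB pages hit when reading `rows` pixels in column x."""
    return len({offset_fn(x, y, width_px) // PAGE_SIZE for y in range(rows)})


WIDTH = 1024  # pixels, so one linear row is exactly 4096 bytes
linear_pages = pages_touched(linear_offset, WIDTH, x=10, rows=32)
tiled_pages = pages_touched(tiled_offset, WIDTH, x=10, rows=32)
print(linear_pages, tiled_pages)
```

Under these assumptions, reading 32 vertically adjacent pixels touches 32 different pages in the linear layout but only one page in the tiled layout.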

This low-level GPU mechanism relates to rendering in Chromium because zero-copy texture upload produces a linear storage texture. GpuMemoryBuffer allocates a linear layout DMA buffer because Skia (the raster painting library) accesses this memory in a raster pool. Skia doesn’t have any idea about the tiling layout. Skia rasterizes web content on the linear layout DMA buffer, which GpuMemoryBuffer allocates. Zero-copy uses the DMA buffer as the final texture. One-copy uses a regular texture as the final texture because it copies the DMA buffer to a regular texture. That is why zero-copy produces the linear texture while one-copy produces the tiled texture.

One-copy requires the texture copy upon rasterization, while zero-copy has more GPU cache misses upon compositing. For static web pages, Chromium only rasterizes web content at the first frame and then composites it multiple times. In this scenario, GPU cache misses with zero-copy can negatively impact performance. In contrast, dynamic web pages might require rasterization for every frame. In that case, zero-copy can be much faster than one-copy because the texture copy is much more resource-intensive than GPU cache misses. Refer to the Benchmark section below.
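The trade-off above is a break-even calculation: one-copy pays a copy cost each time a layer is rasterized, while zero-copy pays a sampling penalty each time it is composited. The sketch below models that under stated assumptions. The function name and the two cost values are illustrative placeholders, not measurements from this article.

```python
def total_cost_ms(raster_count, composite_count, copy_ms, linear_penalty_ms, mode):
    """Extra GPU cost of an upload mode over a run of frames.

    copy_ms: cost of one staging-to-texture copy (one-copy only).
    linear_penalty_ms: extra compositing cost per frame when sampling
        a linear texture (zero-copy only).
    Both values are illustrative placeholders, not measurements.
    """
    if mode == "one-copy":
        return raster_count * copy_ms
    if mode == "zero-copy":
        return composite_count * linear_penalty_ms
    raise ValueError(f"unknown mode: {mode}")


FRAMES = 600
COPY_MS = 2.0            # hypothetical
LINEAR_PENALTY_MS = 0.5  # hypothetical

# Static page: rasterize once, composite every frame.
static_one = total_cost_ms(1, FRAMES, COPY_MS, LINEAR_PENALTY_MS, "one-copy")
static_zero = total_cost_ms(1, FRAMES, COPY_MS, LINEAR_PENALTY_MS, "zero-copy")

# Dynamic page: rasterize and composite every frame.
dynamic_one = total_cost_ms(FRAMES, FRAMES, COPY_MS, LINEAR_PENALTY_MS, "one-copy")
dynamic_zero = total_cost_ms(FRAMES, FRAMES, COPY_MS, LINEAR_PENALTY_MS, "zero-copy")

print(static_one, static_zero, dynamic_one, dynamic_zero)
```

With these placeholder numbers, one-copy is cheaper for the static page and zero-copy is cheaper for the dynamic page, which is the qualitative pattern the article describes.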

The Chromium community assumes that rasterizing every frame is rare, and therefore enables one-copy by default.

Chromium compositor and double buffering on each layer

To explain what partial raster is all about and why one-copy works with partial raster, we need to understand how Chromium renders web content on the screen.

Like most modern browsers, such as Firefox*, Safari*, and Edge*, Chromium uses something called Accelerated Rendering, which uses layering as the key concept. Layers can minimize redundant rasterization. Chromium categorizes DOM element groups to rasterize together. Each group is one layer. Each layer consists of a number of tiles as well as geometry information, filter information, etc. Each tile is a texture along with some additional information. Chromium composites the layers every frame so users can see the web content updated on the screen.

Before proceeding, let’s take a look at how Multithreaded Rasterization works. Rasterization is one of the most resource-intensive tasks in the browser, so Chromium executes rasterization jobs in a raster thread pool. A raster pool increases throughput but doesn’t help with latency. Chromium attempts to draw 60 frames per second. Therefore, each frame ought to have a 16.7 ms interval. However, the rasterization latency is often longer than 16.7 ms, so let’s discuss how Chromium handles that.

Chromium manages two types of layer trees: an active tree and a pending tree. The active tree is used to composite all layers on screen, while the pending tree is used to prepare the next frame by doing rasterization tasks and layer synchronization with Blink. The active tree and the pending tree swap the layer subtree when the subtree is ready. If the pending tree is not fully constructed in the 16.7 ms time slot, Chromium can show the final pixels on the screen using the active tree before the next display synchronization time (vsync). This is how Chromium can keep drawing 60 frames per second.
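The 16.7 ms figure follows directly from the 60 frames-per-second target. The short sketch below works through that arithmetic and shows how many vsync intervals pass while a raster job is still running. The function name and the sample latencies are hypothetical.

```python
import math

REFRESH_HZ = 60
FRAME_BUDGET_MS = 1000 / REFRESH_HZ  # about 16.7 ms


def vsyncs_needed(raster_latency_ms):
    """Number of vsync intervals that elapse before a raster job finishes."""
    return math.ceil(raster_latency_ms / FRAME_BUDGET_MS)


fast = vsyncs_needed(10)   # fits in a single frame
slow = vsyncs_needed(40)   # spans several frames
print(round(FRAME_BUDGET_MS, 1), fast, slow)
```

A job that spans several intervals does not stall the screen, because the active tree keeps being composited in the meantime.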

We can put all of these things together in the following timeline. The pending tree makes the raster pool fill the texture with content (write the texture). The active tree makes the GPU process composite all textures on the screen (read the texture). As the picture below shows, writing the texture and reading the texture don’t happen at the same time.

In fact, the pending tree gets constructed while, at the same time, the active tree composites the textures. As the pending tree and the active tree manage their own textures separately, this method works well and takes full advantage of multicore CPUs.

This kind of double buffering is applied to each layer. While the pending texture is updated, the active texture is presented.
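The per-layer double buffering can be sketched as a toy model. The `Layer` class below is hypothetical and far simpler than Chromium's layer trees. It only illustrates that the raster pool writes the pending texture, the compositor reads the active texture, and the two are swapped once the pending one is ready.

```python
class Layer:
    """Toy model of one layer with a pending and an active texture."""

    def __init__(self, initial_content):
        self.active = initial_content   # read by the compositor
        self.pending = None             # written by the raster pool

    def rasterize(self, new_content):
        # The raster pool only ever writes the pending texture.
        self.pending = new_content

    def activate(self):
        # Swap once the pending texture is fully rasterized.
        if self.pending is not None:
            self.active, self.pending = self.pending, None

    def composite(self):
        # The compositor only ever reads the active texture.
        return self.active


layer = Layer("frame 0")
shown = [layer.composite()]       # vsync 1: nothing new is ready
layer.rasterize("frame 1")        # raster takes longer than one frame
shown.append(layer.composite())   # vsync 2: still shows the old content
layer.activate()                  # pending texture is ready, swap
shown.append(layer.composite())   # vsync 3: new content appears
print(shown)
```

Because reads and writes never target the same texture, the compositor can keep presenting frames while rasterization is still in progress.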

Partial raster with one-copy

So what is partial raster? You can see layer borders and rasterization areas using the Chromium developer tools. Check the Enable paint flashing and Show layer borders options in the Rendering tab. In the picture below, the cyan lines show the tile borders, and the green box shows where rasterization happens.

When we type characters into the input element, Blink figures out that only the input element needs to be updated. In this case, the rasterization area is smaller than the size of the tiles. However, the raster pool must rasterize the whole tile area because Chromium maintains two sets of textures: the pending and active textures. While the active texture contains the exact pixels outside of the green box, the pending texture doesn’t have any pixel information. So the raster pool has to re-rasterize the redundant pixels outside the green box.

We explained that one-copy texture upload uses a DMA buffer as a staging buffer. One-copy manages a fixed number of staging buffers as an LRU cache. The raster pool often reuses the cached buffer that was rasterized for the previous frame. Because the cached staging buffer still holds the previous frame's pixels, the raster pool only needs to rasterize the green area. This is what is known as partial raster in Chromium.
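The staging-buffer reuse described above can be sketched with a small LRU cache. This is a minimal illustration, not Chromium's implementation: the class name, the tile size, and the cache capacity are all assumptions. It only counts how many pixels must be rasterized on a cache hit versus a cache miss.

```python
from collections import OrderedDict

TILE_W, TILE_H = 256, 256  # assumption: illustrative tile size


class StagingBufferCache:
    """Fixed-size LRU cache of staging buffers, keyed by tile id."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffers = OrderedDict()  # tile_id -> previously rasterized content

    def raster(self, tile_id, dirty_rect):
        """Return the number of pixels that must be rasterized for this tile.

        dirty_rect is (x, y, width, height) within the tile.
        """
        if tile_id in self.buffers:
            # Cache hit: the buffer still holds last frame's pixels,
            # so only the dirty rectangle needs to be redrawn.
            self.buffers.move_to_end(tile_id)
            _, _, w, h = dirty_rect
            return w * h
        # Cache miss: no previous pixels are available, redraw the whole tile.
        self.buffers[tile_id] = "rasterized"
        if len(self.buffers) > self.capacity:
            self.buffers.popitem(last=False)  # evict least recently used
        return TILE_W * TILE_H


cache = StagingBufferCache(capacity=2)
first = cache.raster("tile-A", (10, 10, 100, 20))   # miss: full tile
second = cache.raster("tile-A", (10, 10, 100, 20))  # hit: dirty rect only
cache.raster("tile-B", (0, 0, 1, 1))
cache.raster("tile-C", (0, 0, 1, 1))                # evicts tile-A
third = cache.raster("tile-A", (10, 10, 100, 20))   # miss again
print(first, second, third)
```

In this sketch a hit rasterizes 2,000 pixels instead of the full 65,536-pixel tile, and an evicted tile falls back to a full raster.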

Chromium does not support partial raster for zero-copy, because with zero-copy there is no staging buffer cache or anything similar. Because rasterization is much more resource-intensive than texture copying, Chromium developers have decided to make one-copy the default for the time being.

However, Chromium is an ever-changing project, and this decision might change at any time.
You can check if partial raster is enabled in chrome://gpu, as shown here:

Benchmark

Disclaimer: Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com.

Note: These results have been measured using Intel’s internal analysis and provided here for informational purposes. Any differences in your system hardware, software, or configuration might affect your actual performance.

Chromium has its own performance test suites. There are two important test cases: the dynamic draw case and the static draw case. The dynamic draw case mimics a web application, like a game, where the frames are constantly updated due to animation. The dynamic draw case rasterizes layers and composites layers every frame. The static draw case rasterizes layers only at the first frame and composites layers every frame. Zero-copy performs well in the dynamic draw case because texture copying is much more resource-intensive than GPU cache misses. On the other hand, one-copy is optimal for the static draw case because the texture copy is required only at the first frame, whereas linear storage textures cause GPU cache misses on every frame.

We tested both cases on a fifth-generation Intel Core i5 processor with a 2560x1700 display, a fourth-generation Intel® Celeron® processor with a 1366x768 display, a second-generation Intel Core i7 processor with a 1920x1080 display, and the Intel Atom processor E3800 family with a 1366x768 display. The performance tests create layers of different sizes depending on the display resolution, so high-end devices can have a slower frame rate than low-end devices. We measure frames per second (FPS) with the Chrome FPS counter, which can be enabled via --show-fps-counter.

Zero-copy has the best performance in the dynamic draw case. Zero-copy is about 67% to 114% faster than software fallback. One-copy is 4% to 31% faster than software fallback.

However, zero-copy is 0% to 27% slower than one-copy and software fallback in the static draw case because of GPU cache misses. One-copy has the same performance as software fallback because both methods produce regular textures as the final texture.

The Chromium community has chosen one-copy texture upload as the default because it has modest performance benefits without any drawbacks. On the other hand, zero-copy texture upload offers significant performance benefits but comes with some drawbacks.

Zero-copy is quite favourable in terms of memory consumption. We measure memory consumption on a rasterization-heavy site with our configurations. Memory consumption in Proportional Set Size (PSS) in the GPU process is about 65% lower with zero-copy compared to software fallback. In the Render process, the memory consumption is about 20% lower with zero-copy. One-copy texture upload consumes slightly more memory than zero-copy in the GPU process because of its staging buffers.

Zero-copy is quite favourable in terms of power consumption as well. The GPU process doesn't have to perform texture upload tasks for zero-copy. This results in power savings of 35%. One-copy makes the GPU process work less and has power savings of 8.5% compared to software fallback. In the static draw case, zero-copy has power savings of 3.7% compared to software fallback.

We measure power consumption using a power analyzer device on the Chromebook Pixel 2015 after turning off the LCD display. The power analyzer itself has an accuracy of +/- 1%. We have to change our Chromium performance test cases to measure power consumption because the original tests are so power-intensive that they demand more energy than the maximum power throttle. We therefore reduce both test cases, dynamic and static, to a single layer.

Conclusion

Intel Processor Graphics Architecture shares DRAM between the GPU and CPU to enable zero-copy buffer transfers. Through collaboration with the Chromium community, our test cases show that zero-copy results in performance boosts of 67% to 114%, memory savings of 55%, and 35% power savings.

Native one-copy texture uploads are enabled on fifth-generation Intel Core processors, and soon additional Intel Core and Intel Atom processors will take advantage of native one-copy as well. Stay tuned!

Meta-issue to enable native GpuMemoryBuffer in Chrome OS on Intel processors
https://code.google.com/p/chromium/issues/detail?id=475633
