OpenCL™: Advantages of the Heterogeneous Approach



Along with the rise of General Purpose computing on Graphics Processing Units (GPGPU), GPUs themselves are evolving rapidly from fixed-function rasterization engines into more general processors. Today, discrete GPUs are typically connected to the CPU via the PCI Express* (PCIe) bus, which significantly limits the data transfer rate between the devices. Explicit boundaries between memory spaces/hierarchies and high-latency synchronization between devices result in a rather coarse-grained level of abstraction. Most OpenCL workloads today target the GPU only, leaving the CPU to do mainly scheduling, file and network I/O, and other “host” types of orchestration. In this approach, the cost of PCIe transfers might be prohibitive if tasks are small and not well amortized by the GPU's execution speed.

In the traditional consumer world of desktops, laptops, and, recently, Ultrabook™ devices, the trend is now moving toward tighter integration with the CPU and GPU sharing the same die and a unified memory hierarchy for both CPU and GPU cores (to some extent). With more power efficient designs, including dynamic voltage and frequency scaling, the heterogeneous approach that utilizes both devices is increasingly appealing.

We analyze how the actual benefits depend on application characteristics like the fraction of parallel work, data dependencies, and requirements for synchronization. Another challenge is the complexity of a runtime system required to support heterogeneous abstractions, for example, the notion of a “shared queue” for automatic task-scheduling between CPU and GPU. A good first step in this direction has already been made: APIs like OpenCL* allow the devices to look identical from a software perspective.

In this article we demonstrate the potential of software optimized for highly integrated system-on-a-chip (SoC) architectures, using OpenCL as the API of choice. We also evaluate the performance potential of a heterogeneous approach that utilizes the CPU and the on-die GPU within the same execution context, and we use industry benchmarks to demonstrate its impact. This article showcases how to use OpenCL to target execution on both the CPU and the Intel® HD Graphics device. Note that success should be measured per case: the potential gain from a heterogeneous strategy depends on the compute behavior of the application’s parallel regions, as well as its general workflow. The Intel® SDK for OpenCL Applications includes heterogeneous coding samples for you to examine further.

About the Intel® SDK for OpenCL™ Applications 

The Intel SDK for OpenCL Applications is a comprehensive software development environment for OpenCL applications. The new 2013 version provides OpenCL 1.2 support for 3rd generation and future 4th generation Intel® Core™ processors on the Windows* 7 and Windows 8 operating systems. In this article we use selected benchmarks together with the SDK tools and samples to analyze our specific findings.

More information about the Intel SDK for OpenCL Applications is available at

Basics of OpenCL Support for Heterogeneous Execution

OpenCL implements a master-slave architecture, where the “host” (an OpenCL-enabled application) submits work to one or more devices. When a kernel is submitted for execution, an n-dimensional index space is defined. The application also defines contexts of execution. Each context includes the list of target devices, associated command queues, and memory accessible to the devices. The application can queue commands for kernel execution, operations on memory objects, and event synchronization.

OpenCL allows the application to submit a task to the CPU or GPU, but that choice of device needs to be specified explicitly. Thus, the load-balancing between devices is implemented at the application level.

Using CPU and GPU Devices in the Same OpenCL Execution Context

OpenCL allows a list of devices to be specified for a context. However, shared context does not imply a “shared queue.” The specification requires creation of a separate queue per device. It is the application's responsibility to submit the task to the appropriate OpenCL device queue.
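The shared-context, queue-per-device pattern can be sketched in a few host API calls. This is a minimal sketch that assumes a single OpenCL 1.2 platform exposing both a CPU and a GPU device (as Intel's implementation does on the processors discussed here); error checking is omitted for brevity.

```c
#include <CL/cl.h>

int main(void) {
    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, NULL);

    /* Request both the CPU and the GPU device from the same platform. */
    cl_device_id devices[2];
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &devices[0], NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &devices[1], NULL);

    /* One shared context: memory objects and kernels created here are
       visible to both devices. */
    cl_context ctx = clCreateContext(NULL, 2, devices, NULL, NULL, NULL);

    /* The specification requires a separate command queue per device;
       there is no "shared queue" in OpenCL 1.2. */
    cl_command_queue cpu_q = clCreateCommandQueue(ctx, devices[0], 0, NULL);
    cl_command_queue gpu_q = clCreateCommandQueue(ctx, devices[1], 0, NULL);

    /* ... enqueue kernels to cpu_q and gpu_q here ... */

    /* Flush both queues first so the devices start working in parallel,
       then block for completion. */
    clFlush(cpu_q);
    clFlush(gpu_q);
    clFinish(cpu_q);
    clFinish(gpu_q);

    clReleaseCommandQueue(gpu_q);
    clReleaseCommandQueue(cpu_q);
    clReleaseContext(ctx);
    return 0;
}
```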

Also, resource sharing and synchronization should follow the OpenCL specification requirements. Objects allocated at the context level are shared between the devices in the context. This allows, for example, using the same kernel object for both the CPU and the GPU. More importantly, both buffers and images created with the regular clCreateBuffer/clCreateImage API calls are shared by default. The runtime shipped with the Intel® SDK for OpenCL™ Applications offers zero-copy behavior when the CL_MEM_USE_HOST_PTR flag is used during creation of memory objects (refer to the Optimization Guide section Mapping Memory Objects). Note that if you use shared memory objects, your application must meet certain memory-alignment requirements.

It is important to minimize the time one device spends waiting for another to complete a task. With a data-parallel approach, a potential remedy is to use a finer granularity of job distribution. For a task-parallel approach, it might make sense to create a fixed-size pool between producers and consumers and let the devices help each other make progress: when the pool is full and the first device is blocked waiting for results from the second, the first device can pick up another task (one that is waiting to be processed by the second device) instead of sitting idle. Both approaches are detailed in the next section. It is also important to flush queues with clFlush before waiting for results; flushing the respective queues triggers the GPU and CPU devices to execute in parallel (refer to the examples in the Synchronization Caveats section of the Optimization Guide).

According to the OpenCL specification, an application must ensure that commands that change the content of a shared memory object have finished execution on one command queue before the memory object is used by another command queue. One way to achieve this is to use events to guard the access. Alternatively, two devices can write data (or output kernel results) to the same buffer simultaneously, using properly aligned, non-overlapping sub-buffers. (See Writing to a Shared Resource.)

Partitioning the Work between the Devices

Using multiple devices requires creating a separate queue for each device. Here we briefly outline potential strategies to partition the work between device command queues.

Assigning work statically, according to a pre-determined relative speed of the devices, might result in lower overall performance, because the actual speed of a device can be affected by OS/driver scheduling decisions and by dynamic frequency scaling. A better strategy is to allocate work dynamically, adapting to the current load and speed of each device.

There are several possible approaches to dynamic scheduling:

  • Coarse-grain partitioning of the work between CPU and GPU devices
    • For naturally independent pieces of data, like video frames or multiple image files, consider distributing them to different devices for processing. This approach minimizes scheduling overhead, but it requires a sufficiently large number of data pieces for good load-balancing. For example, the SiSoftware Sandra* OpenCL benchmark uses this approach. (Refer to the Industry Benchmarks section below.) This approach might also increase the burden on shared system resources, such as the shared last-level cache and memory bandwidth, especially for memory-bound kernels.
    • Another option is “intra-frame” load-balancing, where the data being processed is split among the devices. For example, for an input image, the CPU might process the first half and the GPU the rest. The actual split should be adjusted dynamically, based on how fast the devices complete their tasks. One approach is to keep a performance history over previous “frames.” This is how load-balancing is implemented in Intel's NBody demo. (See the Heterogeneous SDK samples section below.)
  • Fine-grain partitioning
    • This approach partitions the data into smaller parts that the devices request from a pool of remaining work, simulating a “shared queue.” Faster devices request new input more often, resulting in automatic load-balancing. The grain size must be large enough to amortize the associated overhead of additional scheduling and kernel submission. The LuxMark* benchmark follows this approach. (See the Industry Benchmarks section below.)

When deciding how to split the data between devices, the application logic should take the recommended local/global size granularity of each device into account.

It is also possible to have a task-parallel scheduler. This relatively custom approach relies on deep knowledge of task nature and device capabilities. For example, in a multi-kernel pipeline some types of kernels might always be scheduled to run on the GPU. Kernels that use an “image” OpenCL memory object, which utilizes the HW texture capabilities of the GPU, might always be executed on the GPU, with other kernels scheduled on the CPU or both devices. This way, different kernel pipeline stages are assigned to different devices. There are some studies on using machine learning to determine which device is the best fit for different types of kernels, but these are typically either domain-specific or use complex heuristics and/or static kernel analysis.

Basic Frequency/Power Considerations

Device performance can be affected by dynamic frequency scaling. For example, running long kernels on both devices simultaneously might eventually result in one or both devices falling out of turbo frequencies to manage power consumption and heat production. This might result in overall lower performance even compared to a non-heterogeneous, single-device scenario.

Similarly, when working with a GPU device, a high interrupt rate and frequent synchronization with the host can raise the frequency of the CPU and drag down the frequency of the integrated GPU. Using in-order queues can mitigate this because their semantics allow you to specify chains of execution that don’t require host-device round trips for synchronization.

Heterogeneous SDK samples

The important design principles for heterogeneous OpenCL application logic are reflected in the basic Intel SDK for OpenCL sample - HDR Tone Mapping for Post Processing using OpenCL™ - Multi-Device Version.

Intel demonstrated a heterogeneous version of the NBody application at SIGGRAPH 2012. The demo's source code is available on the Cross-Device NBody Simulation Sample page. It is a GUI application that offers CPU, GPU, and CPU+GPU configurations with GUI-selectable automatic or static load-balancing.

Performance Results and Wrap-Up

Experimental Setup

In this section, we analyze the performance of three processors that integrate a CPU and a GPU on the same die. To cover different design trade-offs, the first processor is from the desktop market segment (Intel® Core™ i7-3770K processor), the second is a mobile part with correspondingly lower frequencies (Intel® Core™ i7-3720QM processor), and the third is a low-power mobile part (Intel® Core™ i7-3667U processor). We tested the processors on machines running the 64-bit version of the Windows 7 OS.


Processor                       Intel® Core™ i7-3770K    Intel® Core™ i7-3720QM    Intel® Core™ i7-3667U
Market segment                  Desktop                  Mobile (Laptop)           Mobile (Ultrabook™ device)
CPU Base (Turbo) Freq in MHz    3500 (3900)              2600 (3600)               2000 (3000)
CPU cores/threads               4/8                      4/8                       2/4
Integrated GPU                  Intel® HD Graphics 4000  Intel HD Graphics 4000    Intel HD Graphics 4000
GPU Base (Turbo) Freq in MHz    650 (1150)               650 (1250)                350 (1150)
Thermal Design Power (Watts)    77                       45                        17
Since our focus is on heterogeneous (“platform”) performance, we won’t go into the details of these microarchitectures. What is important for our study is the overall trade-off between CPU, GPU, and CPU+GPU performance for each particular processor.

We used the Intel SDK for OpenCL Applications 2013 Gold release and the corresponding Intel graphics driver.

Industry Benchmarks

We discuss the performance of two industry benchmarks that offer a “platform” mode with tests that run on the CPU and the GPU simultaneously. LuxMark* is an OpenCL ray-tracing benchmark. It relies on fine-grain task partitioning to load-balance between the CPU and GPU devices. The LuxMark 2.0 test we used contains five scenes of different complexity (and, in fact, different ray divergence). We present data for just two of them; results for the rest are similar.

The second benchmark is SiSoftware’s renowned Sandra* (ver. 2012 SP5). In contrast to LuxMark, it offers tests from different domains, ranging from cryptography (which is integer-bound) to floating-point intensive tests. To focus entirely on the scalability of computationally intensive workloads (those that should benefit most from using multiple devices), we excluded a couple of tests from the analysis. For example, we excluded the double-precision test because DP is emulated on the GPU, so the platform score exists only for the emulated case. This is a perfect example of a task better suited to the CPU (in single-device mode), since the CPU offers native support for DP. We also excluded the memory bandwidth tests because they, by nature, do not scale with the number of devices: bandwidth tests are pure API calls for memory transfers within a given device or between a particular device and the host, so no actual “help” from another device is possible.

Difference in Platform Score Breakdown for the Ultrabook, Laptop, and Desktop

To give you a flavor of the performance advantages of the heterogeneous approach, we tested CPU, GPU, and platform (CPU+GPU) performance using the benchmarks we just introduced. While a detailed breakdown for each individual benchmark test follows in the next sections, here is the geometric mean of all tests for each processor we tried:


Figure 1. Platform (i.e., CPU+GPU) performance compared to the CPU-only and GPU-only cases, for the Intel® Core™ i7-3667U processor, Intel® Core™ i7-3720QM processor, and Intel® Core™ i7-3770K processor. Geometric means over the Sandra*/LuxMark* tests.  The green dashed line at “1.0” denotes no speedup compared to the specific single-device case.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

The first important observation is that the platform mode delivers a healthy 1.6-2.8X speedup over the conventional single-device approach. While the particular gains differ for each processor (depending on the relative speed of the CPU and GPU), the heterogeneous approach extracts the most performance from all three platforms. It is therefore more adaptive to the potential variation among end-user processors than hard-coding the usage of a specific device.

To understand the difference in the contributions of the CPU and the integrated GPU to the platform scores on the different processors, let’s compare the two extreme cases: a full-blown desktop processor with a powerful CPU vs. a processor typically found in Ultrabook devices, which has a lower frequency/power budget (refer to the table at the beginning of this section).


Figure 2. Contribution of the CPU and GPU to the overall platform score for the three processors offering a fused CPU+GPU design: Intel® Core™ i7-3667U processor (Ultrabook™ devices, top), Intel® Core™ i7-3720QM processor (mobile laptops, middle), and Intel® Core™ i7-3770K processor (desktop, bottom). “1” on each chart is the resulting platform performance.


Given the large number of cores, high operating frequencies, and large power envelope of the Intel Core i7-3770K processor (desktop) that we tested, its CPU generally outperforms its integrated GPU. Thus, it is the CPU running OpenCL that contributes the most to the overall platform score. For the Intel Core i7-3667U processor (Ultrabook devices), the platform-mode picture is more balanced; it is often the GPU that contributes the most to the overall platform score for these specific OpenCL tests. The Intel Core i7-3720QM processor (mobile laptops) results fall in between those of the Intel Core i7-3667U and the Intel Core i7-3770K processors.

For future study on the heterogeneous approach, we suggest examining different profiles, like “minimize power” vs “maximize performance”, or more involved trade-offs.


Often OpenCL is considered an API for getting performance only out of GPUs, but that is actually not the case. In the integrated designs where the CPU shares not just the same die but also memory hierarchy with the GPU, it is possible to efficiently utilize both devices via simultaneous execution. This brings substantial performance improvements. Also, architecting a heterogeneous application (so it is capable of running on the CPU and GPU simultaneously with load-balancing) allows covering potential differences in device capabilities between different platforms in a more automatic way. Indeed, instead of relying on the performance of just one type of device, proper dynamic load-balancing logic will more heavily load the devices that are faster in the current configuration. The same logic will quickly re-balance between the devices if needed, e.g., if some devices become occupied with other work, the GPU becomes busy with heavy rendering, or the CPU becomes occupied with some file I/O.

One possible path to provide a more robust and easy to use runtime system for heterogeneous execution of applications is to add the notion of multi-device execution support in today’s programming models. An example is a shared queue (for multiple devices on the same platform). This allows developers to save on complex custom infrastructures. Together with promising performance numbers, this would pave the way for rapid adoption of truly heterogeneous programming in the largest market segments like applications for laptops and Ultrabook devices.

For more information, go to

About the author

Maxim Shevtsov is a Software Architect on the OpenCL performance team at Intel. He received his Master's degree in Computer Science in 2003 and, prior to joining Intel in 2005, conducted academic research in computer graphics.



For more complete information about compiler optimizations, see our Optimization Notice.