Do you realize how much performance you are losing out on by not coding for the graphics processing unit (GPU)? Also referred to as “the other side of the chip,” the GPU portion available in many modern-day Intel® processors could be the star of the show for video encoding, image rendering, Fast Fourier Transforms (FFTs), and more. While it has become second nature for software developers to use parallelization techniques such as vectorized libraries, SIMD intrinsics, and so on, many developers do not realize that the GPU is available as a very capable accelerator. In fact, there’s a whole topic of “Heterogeneous Programming” that refers to the process of re-architecting software to efficiently use multiple compute engines with different strengths.
Figure 1 shows a quick history of the development of GPU power within the Intel® Core™ processor.
Figure 1. Development of GPU power on the Intel® Core™ processor.
Code that would run well on the GPU must be specifically written and organized for the GPU. While there are well-established compiler flags available for parallelization for the CPU (-axAVX, -axSSE4.2, -xSSE2, etc.), offloading to the GPU is fundamentally more difficult because it requires a different paradigm than what has been established for CPUs since software development began in the 1940s. Because of this, the CPUs may be reaching performance plateaus, while an entirely separate processor that could be running the algorithms much faster using much less power sits mostly idle.
Because heterogeneous programming requires rethinking algorithms, many developers opt for simply maintaining CPU code with incremental improvements. However, the leap to heterogeneous is very doable if you keep a few heuristics in mind.
If your dreams for what your application can do are limited by CPU performance or power constraints, heterogeneous programming may be for you. Start fine-tuning your application with Intel® VTune™ Amplifier to determine hotspots. However, this is just a first step. Understand what is really required by the algorithm, understand the types of engines you can target, and a more efficient approach will probably emerge.
Intel® Core™ processors are heterogeneous. There are at least three types of engines available on most Intel® processors: the CPU, the execution units (EUs) that provide general-purpose GPU (GPGPU) compute, and fixed function for media (see Figure 2).
Figure 2. Evolution of heterogeneous architecture.
Heterogeneous programming requires software developers to have a very good understanding of what algorithms run the best on each of the above components. Are you doing video encoding, decoding, or downsizing? Take a look at what the VDBox/MFX fixed function can do. If there is a match, let the fixed function take it. For this short list, that is the best performance/power option and offloading will leave resources for other work. For algorithm components not on this list, here are some basic characteristics that are well-suited for EU engines:
Here are a few examples of software and algorithms that are well-suited to run specifically on the EUs:
If you have an existing code-base that consists of algorithms that would be well-suited for the GPU, rather than doing a complete re-write, take a look at ROI – for current algorithm speedup as well as all of the new things you could do by taking full advantage of all of the capabilities of the HW.
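One rough way to size the speedup side of that ROI is Amdahl's law: if a fraction p of the run time is spent in a hotspot, and offloading makes that hotspot s times faster, the whole application speeds up by 1 / ((1 - p) + p / s). The Python sketch below is only an illustration. The function name and the sample numbers are assumptions, not measurements, and the model ignores data-transfer and kernel-launch costs.

```python
def overall_speedup(hotspot_fraction: float, hotspot_speedup: float) -> float:
    """Amdahl's law: whole-application speedup when only the hotspot gets faster."""
    if not 0.0 <= hotspot_fraction <= 1.0:
        raise ValueError("hotspot_fraction must be between 0 and 1")
    if hotspot_speedup <= 0.0:
        raise ValueError("hotspot_speedup must be positive")
    remaining = 1.0 - hotspot_fraction              # share of time left untouched
    offloaded = hotspot_fraction / hotspot_speedup  # share of time after offload
    return 1.0 / (remaining + offloaded)


if __name__ == "__main__":
    # Hypothetical profile: 60% of run time in one hotspot, 10x faster when offloaded.
    print(f"{overall_speedup(0.6, 10.0):.2f}x overall")
```

In this made-up profile, a hotspot that runs 10 times faster yields only a little over 2x overall, so the fraction of time a profiler attributes to the hotspot matters as much as the raw GPU speedup.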
If you are just beginning with the design of your application, start by determining which algorithms fit into a SIMD format according to the types of algorithms listed above. OpenCL™ kernels work best for data-parallel work without branches. There is overhead to launching a kernel, but it may still be possible to offload small calculations by coalescing them into one launch, or it may even be possible to launch once and have the kernel wait for new input (versus relaunching each time something needs to be calculated).
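The coalescing idea can be sketched without any GPU at all. The Python below is a CPU-only stand-in, not OpenCL code: FakeDevice, run_coalesced, and square are hypothetical names invented for this illustration, and a "launch" is simply a counted function call. It packs several small jobs into one flat buffer, performs a single launch over that buffer, and then splits the results back per job.

```python
from typing import Callable, List, Sequence


class FakeDevice:
    """Hypothetical stand-in for an accelerator; it only counts launches."""

    def __init__(self) -> None:
        self.launches = 0

    def launch(self, kernel: Callable[[float], float],
               data: Sequence[float]) -> List[float]:
        # One "kernel launch": the same operation applied to every element.
        self.launches += 1
        return [kernel(x) for x in data]


def run_coalesced(device: FakeDevice,
                  kernel: Callable[[float], float],
                  jobs: Sequence[Sequence[float]]) -> List[List[float]]:
    """Pack many small jobs into one buffer, launch once, split results back."""
    flat = [x for job in jobs for x in job]   # gather all inputs contiguously
    out = device.launch(kernel, flat)         # single launch for all jobs
    results, start = [], 0
    for job in jobs:                          # scatter results back per job
        results.append(out[start:start + len(job)])
        start += len(job)
    return results


def square(x: float) -> float:
    # Branch-free, element-wise work: the shape that suits a data-parallel kernel.
    return x * x


if __name__ == "__main__":
    dev = FakeDevice()
    print(run_coalesced(dev, square, [[1, 2], [3], [4, 5, 6]]), dev.launches)
```

Launching once per job would pay the launch overhead three times in this example; the coalesced version pays it once, at the cost of the extra gather and scatter steps on the host.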
In order to offload your algorithms onto the GPU, you need GPU-aware tools. Intel provides the Intel® SDK for OpenCL™ and the Intel® Media SDK (see Figure 3).
Figure 3. Intel® SDK for OpenCL™ and Intel® Media SDK Interoperability.
The Intel Media SDK is a cross-platform API that developers can use for creating media applications on Windows* and Linux*. The Intel Media SDK comes with a set of code samples that demonstrate how to use the API to build applications that require fast video playback, encode, processing, media format conversion and video conferencing.
The Intel Media SDK libraries are built on top of Microsoft DirectX*, DirectX Video Acceleration (DXVA) APIs, VA-API and platform graphics drivers. The main focus of the SDK is on media pipeline components that are commonly used and often in most need of acceleration, such as:
OpenCL is the open standard for parallel programming of heterogeneous systems, allowing developers to leverage the technology and fully customize their solutions. When you boil it down, the Intel Media SDK only does three things: decode, encode, and a short list of frame-processing operations. OpenCL opens the door to extend these fixed operations in innovative ways.
A good way to get started with OpenCL development is with Intel® SDK for OpenCL Applications. It includes the Intel® Code Builder for OpenCL™ API, which is a software development tool that enables development of OpenCL applications. If you are working with pre-existing code, your project can be converted to an OpenCL project from within the Intel SDK for OpenCL interface.
OpenCL is developed by multiple companies through the Khronos* OpenCL committee.
Many existing applications may be running optimally on CPUs but still not taking advantage of the full range of capabilities Intel has to offer. Today’s generation of computers has a heterogeneous architecture, with a GPU that offers fixed-function video processing as well as general-purpose EUs for programmable media pipelines.
There are many advantages to heterogeneous programming. While it does require rethinking your algorithms, the payoff can be large in terms of efficiency and new capabilities. Looking at things in new ways can open doors. You no longer need to fit within the same constraints. And the process of rethinking often results in more efficient CPU-only code as well.
Software developers have access to two powerful tools that are designed to help them take advantage of the GPU: The Intel Media SDK and the Intel SDK for OpenCL.
Gael Hofemeier is a Senior Software Application Engineer enabling business client and consumer applications. She is an Intel Black Belt and blogger on the Intel® Developer Zone.
OpenCL™ and the OpenCL™ logo are trademarks of Apple Inc. used by permission by Khronos.
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804