OpenCL™ - Programming for CPU Performance

This white paper is the third in a series of whitepapers on OpenCL™ describing how to best utilize underlying Intel hardware architecture using OpenCL. This white paper will go over programming considerations for host-side device orchestration, as well as OpenCL kernels for CPU.

Disclaimer: This article is based on the author's own experience as well as on conversations with the OpenCL team at Intel. It will provide you with insights into performance with the current Intel® OpenCL SDK. Intel may support OpenCL on future devices to bring you more performance on the platform, but no announcement has been made on specific platforms and release dates. Nevertheless, you can use today’s guidelines to scale to the next generation of Intel platforms.

The Intel® OpenCL SDK 1.1 implementation for CPU (Intel® Core™2 Duo or later CPUs) can be retrieved from /en-us/articles/opencl-sdk. It is still evolving alongside the OpenCL specification, so feel free to download OpenCL™ through Intel® Media Server Studio and Intel® INDE and provide feedback to us at the Intel OpenCL SDK Support Forum. At present, Intel OpenCL SDK 1.1 runs on 64-bit Linux* and on Microsoft Windows 7* (with SP1) and Microsoft Windows Vista* operating systems (32-bit and 64-bit).

The inherently heterogeneous nature of OpenCL allows developers to target various devices that might have very different architectures. CPUs are traditionally great for large complex kernels, as they have large out-of-order cores and large caches.

Performance Considerations for Devices

OpenCL programming requires explicit host-side management of queues, contexts, and devices. Thus, to be efficient, the host-side logic needs to incorporate certain architecture knowledge to utilize any given target device in the best way. There are also different strategies available to divide work among multiple devices that also involve work coordination using events and asynchronous callbacks.

Let us first go over a top-level view of an OpenCL program.

Both the OpenCL kernels and the host program have to make sure that the underlying hardware is utilized efficiently. Figure 1 illustrates that communication with discrete graphics devices may go over a PCI-E link, which is about 10x slower than communicating with the CPU through the memory/cache hierarchy. If data is needed back at the main program, programmers need to take the cost of data transfer into consideration when evaluating the performance of algorithms.
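The transfer-cost reasoning above can be made concrete with a small back-of-the-envelope model. The following is a minimal sketch in plain C, not part of the Intel OpenCL SDK: the struct, its field names, and the function names are all hypothetical, and the bandwidth and timing figures are expected to come from your own measurements.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative cost model; every name and number here is an assumption. */
typedef struct {
    double bytes_to_device;    /* input data copied from host to device       */
    double bytes_from_device;  /* results copied back to the host program     */
    double link_bytes_per_sec; /* sustained bandwidth of the link, e.g. PCI-E */
    double kernel_sec;         /* measured kernel execution time on the device */
} offload_estimate;

/* Total time when the data must travel to the device and back. */
double offload_total_sec(const offload_estimate *e)
{
    assert(e != NULL && e->link_bytes_per_sec > 0.0);
    double transfer_sec =
        (e->bytes_to_device + e->bytes_from_device) / e->link_bytes_per_sec;
    return transfer_sec + e->kernel_sec;
}

/* Returns 1 when offloading beats running on the CPU. The CPU side is
 * charged no transfer time because the data is already in host memory. */
int offload_wins(const offload_estimate *e, double cpu_kernel_sec)
{
    return offload_total_sec(e) < cpu_kernel_sec;
}
```

A faster kernel on a remote device only pays off when the time saved exceeds the round-trip transfer time, which is why the comparison is made on the total rather than on kernel time alone.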



High-Level Device Selection Considerations

Algorithms and Alternatives (Intel® IPP, Intel® MKL, Intel® Media SDK)
If an algorithm has a large memory footprint, has a lot of branches, or requires table lookups (dynamic programming), then pick the CPU as the target for that algorithm. OpenCL does not allow recursion, function pointers, variable length arrays, etc., so check the specification to be sure that the algorithm is supported under the OpenCL framework. Algorithms such as constraint solvers usually perform better on CPUs, as these algorithms need conditional statements.

Intel continues to offer developers a choice of proven, innovative parallel programming models. Examples are libraries such as Intel® Threading Building Blocks (Intel® TBB), Intel IPP or Intel MKL (for more details, visit /en-us/articles/intel-tbb/, /en-us/articles/intel-ipp/ or /en-us/articles/intel-mkl/). OpenCL augments these tools with low-level standard API support on Intel platforms.

The OpenCL compiler is a new technology and is not yet at the same level of maturity as Intel tools such as the Intel C/C++ Compilers, Intel IPP and Intel MKL. As a result, OpenCL applications running on the CPU will not always perform as well as applications that use the highly optimized image, cryptography or signal processing functions in Intel IPP or Intel MKL.

For better optimization of Intel hardware for media applications, Intel has also released the Intel® Media SDK for Windows* which is optimized for video processing (scaling, color correction, de-interlacing, cropping and sharpening, etc.) and media encoding/decoding (AVC, MP4, H.264, etc.). Intel Media SDK utilizes Intel® Processor Graphics hardware to expedite video processing and encoding. These capabilities of Intel processor graphics are at present only exposed through the Intel Media SDK. This may be a better alternative if your requirements fall in this category and your machine already has Intel processor graphics. For more details, visit /en-us/articles/media.

Turbo Frequencies
Most modern devices work at turbo frequencies only when there is enough work to be performed. When measuring performance, make sure that you take measurements at both normal and turbo frequencies so that you have a better idea of performance/power ratios. Turbo mode usually kicks in when there is a lot of work queued.

Devices typically deliver higher performance when precision requirements are lower. If your algorithm can work at lower precision, use the -cl-fast-relaxed-math build option when compiling your kernels. For more information, see the Optimization guide at /en-us/articles/opencl-sdk/. Multiply-and-add options (such as -cl-mad-enable) may not work on all device targets (some devices may not support them), so make performance decisions while keeping an eye towards portability.
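Build options are passed to the OpenCL compiler as a single space-separated string (the options argument of clBuildProgram). The sketch below, in plain C, only assembles such a string so that the flags can be switched per target device. The helper name make_build_options and its parameters are hypothetical; the two option spellings are the ones defined by the OpenCL specification.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes the selected options into buf.
 * Returns the string length, or -1 if buf is too small. */
int make_build_options(char *buf, size_t cap, int relaxed_math, int mad_enable)
{
    assert(buf != NULL || cap == 0);
    int n = snprintf(buf, cap, "%s%s%s",
                     relaxed_math ? "-cl-fast-relaxed-math" : "",
                     (relaxed_math && mad_enable) ? " " : "",
                     mad_enable ? "-cl-mad-enable" : "");
    if (n < 0 || (size_t)n >= cap)
        return -1; /* output was truncated or formatting failed */
    return n;
}
```

Keeping the flags behind switches like these makes it easy to build the same kernels with and without relaxed math and compare both accuracy and speed on each device.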

Avoid Naïve Selection Criteria
Device selection decisions should not be based on the number of cores available, as not all cores are equal in capabilities. The amount and type of work at hand may be better selection criteria. Programmers should also consider the data destination (for an audio filter, say, performing an FFT on a remote device may hurt performance because of the cost of transferring data over PCI-E) and the algorithm at hand to select the most appropriate device.





Writing Host Program for Performance

Command Queues
A host program makes several critical design decisions for the devices at hand. For CPUs, it is better to use out-of-order queues to enable running multiple kernels simultaneously. If it makes sense from a software design perspective, you can use multiple command queues within the same application. If work needs to be synchronized between command queues, use events and callbacks to synchronize it. To synchronize command queue execution with pre-existing C code, use the clEnqueueNativeKernel API or user events. Reading and writing data is a lot faster on CPUs. Use mapped buffers backed by properly aligned host pointers to get the best performance. Profile your code using OpenCL profiling capabilities, and measure your performance at every level using Intel Performance Debugging BKMs.

Intel® OpenCL SDK 1.1 allows you to create out-of-order queues. Utilizing out-of-order queues may result in better CPU core utilization for algorithms that involve different concurrent steps.
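For the advice above on mapped buffers backed by aligned host pointers, the host-side allocation might look like the following. This is a minimal sketch in C11 using only the standard library; the helper names are hypothetical, and the alignment value to use is an assumption that should be taken from the SDK optimization guide. In a real program the returned pointer would be handed to clCreateBuffer with CL_MEM_USE_HOST_PTR and accessed through clEnqueueMapBuffer.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: allocates at least `bytes` bytes aligned to
 * `alignment`, which must be a power of two. Release with free(). */
void *alloc_host_buffer(size_t bytes, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    assert(bytes <= SIZE_MAX - alignment); /* rounding must not overflow */

    /* C11 aligned_alloc expects the size to be a multiple of the alignment. */
    size_t rounded = (bytes + alignment - 1) / alignment * alignment;
    if (rounded == 0)
        rounded = alignment;
    return aligned_alloc(alignment, rounded);
}

/* Returns 1 when p sits on an `alignment`-byte boundary. */
int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p % alignment) == 0;
}
```

Checking the pointer with is_aligned before creating the buffer is a cheap way to catch allocations that came from elsewhere in the application and do not meet the alignment requirement.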

Memory Objects
CPUs do not have specialized hardware to handle images. If you are writing simple convolutions and image data types are most natural for your problem, use image objects along with samplers. But make sure to use the simplest interpolation mode that suffices for your needs, e.g. many (interpolating) kernels work fine with nearest-neighbor filtering. For more considerations on using images, please refer to Chapter 4.3, “Image Support”, of the Writing Optimal OpenCL code With Intel OpenCL SDK document[1].

The Intel OpenCL Implicit Vectorization Module tries to create the most optimal code for such cases. As the Intel® OpenCL SDK Compiler matures, performance of such kernels will improve significantly.

Event callbacks and Multiple Threads
Using events, and callbacks based on event completion, helps to produce code that is easy to understand but hard to debug. One specific reason is that callbacks are asynchronous, so ensuring data safety across threads during callbacks is non-trivial. When multiple threads are issuing commands and those commands have completion callbacks, test on single-core and multicore machines under various load conditions (e.g., 90% busy with other work, or with just a single core running the OpenCL application) to weed out hard-to-diagnose thread data safety issues.

Kernel Objects
If you are using threads, multiple command queues and asynchronous event-based callbacks, it is better not to share kernel objects, as then you can set kernel arguments as needed in advance without having to worry about synchronization. Note that clSetKernelArg is not thread safe. This strategy is very helpful when you are doing similar work on multiple datasets (say, processing 300 pictures or several frames of a video in advance).

Profiling code should be used as a tool to calibrate data transfer costs and execution costs, along with other fixed costs such as submitted and start time differences for a given command queue. Profiling code should be stripped out when releasing code, as profiling comes at a cost. Read more at Intel Performance Debugging Intro.
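OpenCL event profiling reports four timestamps per command, in nanoseconds: queued, submit, start, and end (obtained with clGetEventProfilingInfo on a queue created with CL_QUEUE_PROFILING_ENABLE). The sketch below, in plain C, shows one way to split them into the fixed costs and the execution cost discussed above. The struct and function names are hypothetical, and the timestamps are assumed to have already been read from the event.

```c
#include <assert.h>
#include <stdint.h>

/* The four profiling timestamps of one command, in nanoseconds. */
typedef struct {
    uint64_t queued, submit, start, end;
} command_times_ns;

/* Illustrative breakdown; the field names are assumptions. */
typedef struct {
    uint64_t queue_wait_ns;  /* queued -> submit: time spent in the host queue */
    uint64_t launch_wait_ns; /* submit -> start: time waiting on the device    */
    uint64_t execution_ns;   /* start  -> end:   actual execution time         */
} command_costs_ns;

command_costs_ns split_command_costs(command_times_ns t)
{
    /* Timestamps of one command are expected to be in this order. */
    assert(t.queued <= t.submit && t.submit <= t.start && t.start <= t.end);

    command_costs_ns c;
    c.queue_wait_ns  = t.submit - t.queued;
    c.launch_wait_ns = t.start - t.submit;
    c.execution_ns   = t.end - t.start;
    return c;
}
```

Comparing launch_wait_ns against execution_ns for small kernels shows quickly whether the fixed submission overhead dominates, in which case batching more work per command may help.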





Writing OpenCL™ Kernel for Performance

Please refer to the Writing Optimal OpenCL code With Intel OpenCL SDK document[1] for detailed recommendations related to developing OpenCL kernels targeted for CPUs.

Intel® OpenCL SDK 1.1 now includes an Implicit Vectorization Module. This vectorizer works best with 32-bit (float, int) data types; refer to chapter 2.6 of the Writing Optimal OpenCL code With Intel OpenCL SDK document[1]. In turn, the runtime takes care to execute your job in an efficient and balanced way. This means that for simple image-processing kernels, programmers do not need to set the global work size to the number of cores and then loop inside the kernel to operate on an image. Instead, providing a sufficient number of work groups is preferable; refer to chapter 2.7 of the Writing Optimal OpenCL code With Intel OpenCL SDK document[1]. Programmers should just program in a natural way (i.e., set the global work size to the image width/height, etc.) and then, in the kernel, write code using vector data types as scalar types, as if they were writing simple scalar code. The auto-vectorizer often performs AOS (array of structures) to SOA (structure of arrays) translations under the hood to get the best performance from SIMD units.
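To illustrate what an AOS-to-SOA translation means for data layout, here is a minimal sketch in plain C. It is not the vectorizer's implementation, only a hand-written version of the same layout change; the type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* AOS: one structure per pixel, channels interleaved in memory. */
typedef struct { float r, g, b, a; } pixel_aos;

/* SOA: one contiguous array per channel, each holding n values. */
typedef struct { float *r, *g, *b, *a; } pixels_soa;

/* Copies n interleaved pixels into four per-channel arrays. */
void aos_to_soa(const pixel_aos *src, size_t n, pixels_soa dst)
{
    assert(n == 0 || src != NULL);
    for (size_t i = 0; i < n; ++i) {
        dst.r[i] = src[i].r;
        dst.g[i] = src[i].g;
        dst.b[i] = src[i].b;
        dst.a[i] = src[i].a;
    }
}
```

In the SOA layout, the same channel of neighboring pixels is adjacent in memory, so one SIMD load can fetch that channel for several work-items at once.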

Kernels utilizing the vector data types float4, float8, and float16 perform better than kernels written using just scalar floats for the same task. If your algorithm naturally fits vector data types, use these types. If an image has only RGB channels, use float3 data types (a new OpenCL 1.1 feature). float3 data types also utilize SIMD units, but they are not as efficient as float4, which fits naturally into SSE registers.

General recommendations for any device
These are general recommendations which typically help all kernels regardless of target device. These include using vector data types explicitly (though we generally advise using scalar types and then relying on the vectorizer), using built-in functions, avoiding computations in kernels that can be done once, avoiding branching, avoiding handling of edge conditions in kernels, and using the preprocessor for constants.

Device Fission
This is a preview feature available in Intel® OpenCL SDK 1.1. Using this feature (enabled using cl_ext_device_fission), programmers can create subdevices and then create command queues for those subdevices to queue kernels. This way, all resources are not allocated to a single command queue. There are several modes for device fission, such as divide equally, based on counts, or based on affinity domain (create a subdevice for every NUMA node). Please refer to the Intel OpenCL SDK User’s guide for more details.
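When planning an equal split, it helps to work out beforehand how many subdevices a given size yields. The sketch below is plain C arithmetic only, not a call into the extension; the names are hypothetical, and it assumes that compute units left over when the count does not divide evenly are simply not used, which should be confirmed against the cl_ext_device_fission specification and the User’s guide.

```c
#include <assert.h>

/* Illustrative result of dividing a device equally; names are assumptions. */
typedef struct {
    unsigned subdevices;   /* number of subdevices of the requested size */
    unsigned unused_units; /* compute units that fit in no subdevice     */
} equal_split;

/* total_units would come from CL_DEVICE_MAX_COMPUTE_UNITS. */
equal_split split_equally(unsigned total_units, unsigned units_per_subdevice)
{
    assert(units_per_subdevice > 0);
    equal_split s;
    s.subdevices   = total_units / units_per_subdevice;
    s.unused_units = total_units % units_per_subdevice;
    return s;
}
```

For example, an 8-unit device split into subdevices of 3 units would give two subdevices and leave two units idle under this assumption, so sizes that divide the unit count evenly are usually the better choice.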

To use device fission, always create subdevices before creating contexts, and measure your performance to see if the performance of the subdevice with the given number of compute units is acceptable. We will cover this feature in more detail in our next white paper.

Profiling and Debugging
Profile your kernel execution times and submission-to-start delays using OpenCL event profiling. This is very helpful when using out-of-order queues to see how well multiple kernels are being executed. Integration with Intel GPA is also useful for out-of-order queue debugging, as it shows a visual representation of the same information. Debugging is still pretty much print-based, and debugging kernels is still a lot easier on CPUs than it is on GPUs. See the Tools introduction at /en-us/articles/introduction-to-intel-opencl-tools.
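One simple check for out-of-order queues is whether two kernels actually ran at the same time. The sketch below, in plain C, computes the overlap between two execution spans. The names are hypothetical, and the start and end values are assumed to be the CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END timestamps of each kernel's event.

```c
#include <assert.h>
#include <stdint.h>

/* Execution span of one command, in nanoseconds. */
typedef struct {
    uint64_t start_ns, end_ns;
} exec_span;

/* Nanoseconds during which both commands were executing concurrently. */
uint64_t overlap_ns(exec_span a, exec_span b)
{
    assert(a.start_ns <= a.end_ns && b.start_ns <= b.end_ns);
    uint64_t latest_start = a.start_ns > b.start_ns ? a.start_ns : b.start_ns;
    uint64_t earliest_end = a.end_ns < b.end_ns ? a.end_ns : b.end_ns;
    return earliest_end > latest_start ? earliest_end - latest_start : 0;
}
```

An overlap of zero for kernels that have no dependency on each other suggests that they were serialized, which is worth investigating before assuming the out-of-order queue is helping.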






[1] “Writing Optimal OpenCL™ Code with Intel® OpenCL SDK”, located at /sites/default/files/m/d/4/1/d/8/Writing_Optimal_OpenCL_28tm_29_Code_with_Intel_28R_29_OpenCL_SDK.pdf and at <install-dir>\docs\





About the Author

Vinay Awasthi works as an Application Engineer for the Apple* Enabling Team at Intel in Santa Clara. Vinay has a Master’s Degree in Chemical Engineering from the Indian Institute of Technology, Kanpur. Vinay enjoys mountain biking and scuba diving in his free time.








For more complete information about compiler optimizations, see our Optimization Notice.