The Full Checklist for an Optimized OpenCL* Application

Content Update: 23rd April 2012

What is Intel SDK for OpenCL* Applications?

Intel® SDK for OpenCL* Applications is a comprehensive software development environment for OpenCL* visual computing applications on 3rd Generation Intel® Core™ Processor Family-based Platforms.

The Intel® SDK for OpenCL* Applications 2012 now supports the OpenCL* 1.1 full profile on 3rd generation Intel® Core™ processors with Intel® HD Graphics 4000/2500. For the first time, OpenCL* developers using Intel® Architecture can utilize the compute resources of both the Intel® processor and Intel® HD Graphics using a standards-based approach, the OpenCL* standard.

You can download a free copy of Intel® SDK for OpenCL* Applications 2012 at http://www.intel.com/software/opencl.

Why Optimize OpenCL* Applications for Intel Platforms?

OpenCL (Open Computing Language) is a royalty-free standard for general-purpose parallel programming of heterogeneous systems. OpenCL* provides a uniform programming environment for software developers to write efficient, portable code for high-performance compute servers, client computer systems and other parallel systems.

Nevertheless, to make the most of the resources available on the underlying hardware, the most important advice is to use OpenCL* language features and the hardware features, and to optimize the OpenCL* code for the target device.

This article provides a quick look at tips and tricks that can be used to optimize OpenCL* applications for Intel® Architecture. Tips are categorized as: (1) Platform level, (2) Intel® Processors (CPU), (3) Intel® HD Graphics (GPU).

This article is based on the Intel® SDK for OpenCL* Applications OpenCL* Optimization Guide.

OpenCL* Optimization Check-List: Application-level

Use Built-In Interoperability With Graphics and Media APIs

Optimization type : Intel® HD Graphics (GPU)

Best suited for visual computing applications, Intel® SDK for OpenCL* Applications provides efficient (for example, zero-copy) interoperability with other APIs, like Microsoft DirectX* or Intel® Media SDK extensions.

Intel® SDK for OpenCL* Applications code samples showcase various interoperability optimizations at OpenCL* Samples page.

For more information, see the Interoperability with other APIs article.

Mapping Memory Objects (USE_HOST_PTR)

Optimization type : Intel® Processors (CPU) + Intel® HD Graphics (GPU)

As the host code shares physical memory with both the CPU and Intel® Processor Graphics OpenCL* devices, you can do one of the following to avoid unnecessary copies:

1. Request the framework to allocate memory on the host.

2. Allocate properly aligned memory yourself and share the pointer with the framework.
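Option 2 can be sketched in plain C. This is a hedged sketch: the page alignment and size rounding shown here are common recommendations, not values taken verbatim from the SDK; check the Optimization Guide for the exact requirements on your platform, and note the function name is illustrative.

```c
#include <stdlib.h>

/* Sketch, not official SDK code: allocate a host buffer suitable for
 * zero-copy sharing via CL_MEM_USE_HOST_PTR. A page-aligned start
 * address and a size rounded up to a multiple of the alignment are
 * commonly recommended for sharing with the OpenCL* runtime. */
void *alloc_shared_buffer(size_t size) {
    const size_t alignment = 4096;  /* page-aligned start address */
    /* aligned_alloc (C11) requires the size to be a multiple of the
     * alignment, so round the requested size up accordingly. */
    size_t padded = (size + alignment - 1) / alignment * alignment;
    return aligned_alloc(alignment, padded);
}
```

The returned pointer would then be passed to clCreateBuffer together with the CL_MEM_USE_HOST_PTR flag, so the runtime can use the host allocation directly instead of copying it.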

For more information, see the Mapping Memory Objects (USE_HOST_PTR) article.

Using Buffers and Images Appropriately

Optimization type : Intel® Processors (CPU) + Intel® HD Graphics (GPU)

On Intel® CPU and Processor Graphics devices, buffers usually perform better than images: buffers allow more data per read/write operation at much lower latency.

Images are software-emulated on the CPU. So, if your legacy code uses images or depends on image-specific formats, choose the fastest interpolation mode that meets your needs, for example:

· Nearest-neighbor filtering works well for most interpolating kernels

· Linear filtering might decrease CPU device performance. If your algorithm does not require linear data interpolation, consider using buffers instead of images.
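For illustration, nearest-neighbor sampling is simple to do by hand on a plain buffer, which avoids the software-emulated image path on the CPU device entirely. A minimal C sketch (the function name and the clamp-to-edge behavior are illustrative assumptions, not SDK code):

```c
#include <stddef.h>

/* Hypothetical helper: nearest-neighbor sample from a row-major float
 * buffer, emulating what nearest filtering would do on an image. */
float sample_nearest(const float *buf, size_t width, size_t height,
                     float u, float v) {
    /* map normalized coordinates in [0,1) to the nearest texel index */
    size_t x = (size_t)(u * (float)width);
    size_t y = (size_t)(v * (float)height);
    if (x >= width)  x = width  - 1;   /* clamp to edge */
    if (y >= height) y = height - 1;
    return buf[y * width + x];
}
```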

To improve performance on Intel® Processor Graphics, see the Using Buffers and Images Appropriately article.

Minimize Data Copying

Optimization type : Platform level

The application should process data “in-place” and minimize copying of memory objects. Consider the following example: OpenCL* requires that the global work dimensions be exact multiples of the local work-group dimensions. For a typical image processing task, this is most easily thought of as requiring that the work-groups be tiles that exactly cover the frame buffer. If the global size is different from the original image size, you might decide to copy and pad the original image buffer, so the kernel does not need to check every work item to see if it falls outside the image. But this can add several milliseconds of processing time just to create and copy images in and out of OpenCL*.
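The copy-free alternative is to round the global size up to the next work-group multiple and bounds-check inside the kernel. A minimal sketch of the rounding arithmetic (a hypothetical helper, not SDK code):

```c
#include <stddef.h>

/* Round a global work size up to the next multiple of the
 * work-group (local) size, so the padded global size can be
 * enqueued without padding or copying the image itself. */
size_t round_up_global(size_t global, size_t local) {
    return (global + local - 1) / local * local;
}
```

Enqueue with the padded global size and guard the kernel body with a check such as `if (get_global_id(0) >= real_size) return;` instead of creating a padded copy of the image.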

For more information, see the Minimize Data Copying article.

Avoid Needless Synchronization

Optimization type : Platform level

For best performance, try to avoid explicit command synchronization primitives, such as clEnqueueMarker/Barrier. Explicit synchronization commands and event tracking result in cross-module round trips, which decrease performance. The less you use explicit synchronization commands, the better the performance.

For more information, see the Avoid Needless Synchronization article.

Reusing Compilation Results with clCreateProgramWithBinary

Optimization type : Platform level

If compilation time for an OpenCL* program is of concern, consider reusing compilation results. This is typically faster than recreating the program from source, but you should verify it for your specific program and device.
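A minimal sketch of the disk-cache half of this technique, in plain C. The OpenCL* calls are omitted: the bytes would come from clGetProgramInfo with CL_PROGRAM_BINARIES after the first build, and would feed clCreateProgramWithBinary on the next run. Function names here are illustrative.

```c
#include <stdio.h>
#include <stdlib.h>

/* Save a program binary blob to disk; returns 1 on success. */
int save_binary(const char *path, const unsigned char *data, size_t size) {
    FILE *f = fopen(path, "wb");
    if (!f) return 0;
    size_t written = fwrite(data, 1, size, f);
    fclose(f);
    return written == size;
}

/* Load a previously cached binary; caller frees the result. */
unsigned char *load_binary(const char *path, size_t *size_out) {
    FILE *f = fopen(path, "rb");
    if (!f) return NULL;
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    fseek(f, 0, SEEK_SET);
    unsigned char *data = (size > 0) ? malloc((size_t)size) : NULL;
    if (data && fread(data, 1, (size_t)size, f) != (size_t)size) {
        free(data);
        data = NULL;
    }
    fclose(f);
    if (data) *size_out = (size_t)size;
    return data;
}
```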

For more information, see the Reusing Compilation Results with clCreateProgramWithBinary article.

Use Shared Context For Running OpenCL* Code Across Both Intel® CPU and Processor Graphics

Optimization type : Platform level

Intel® SDK for OpenCL* Applications provides a common OpenCL* runtime, which enables you to interface with the Intel® Processor Graphics and CPU devices using a single context. You can create a “shared” context with multiple devices and benefit from sharing of commands, resources, and synchronization across the different devices, handled seamlessly by the OpenCL* runtime.

For more information, see the Using Multiple OpenCL* Devices article.

OpenCL* Optimization Check-List: Kernel-level

Use Floating Point for Calculations

Optimization type : Intel® Processors (CPU) + Intel® HD Graphics (GPU)

Intel® Advanced Vector Extensions (Intel® AVX), if available, accelerate floating-point calculations on the CPU. The Processor Graphics device is also much faster at floating-point add/sub/mul operations, and so on, than at their integer equivalents.

For more information, see the Using Floating Point for Calculations article.

“Gather4” Rule of Thumb: Loading/Storing Data in Greatest Chunks

Optimization type : Intel® Processors (CPU) + Intel® HD Graphics (GPU)

“Saturating” the bandwidth is very important for graphics processors. Loads of byte-sized data types are performed as integer (DWORD) loads, but also trigger extra instructions to pack/unpack the data. Using int4/float4 for buffers saves a lot of compute, even if you unpack the data manually afterward. In other words, avoid using uchar4/char4.
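The manual unpacking mentioned above is just cheap shifts and masks once the data arrives as a wide load. A small C sketch of extracting four bytes from one 32-bit (DWORD) value, assuming little-endian byte order:

```c
#include <stdint.h>

/* Unpack four bytes from a single 32-bit load, mirroring what a
 * kernel would do after reading a buffer as uint/int4 instead of
 * uchar4: one wide load, then shifts and masks (little-endian). */
void unpack4(uint32_t dword, uint8_t out[4]) {
    out[0] = (uint8_t)(dword         & 0xFF);
    out[1] = (uint8_t)((dword >> 8)  & 0xFF);
    out[2] = (uint8_t)((dword >> 16) & 0xFF);
    out[3] = (uint8_t)((dword >> 24) & 0xFF);
}
```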

For more information, see the “Gather4” Rule of Thumb: Loading/Storing Data in Greatest Chunks article.

Applying Shared Local Memory (SLM)

Optimization type : Intel® Processors (CPU) + Intel® HD Graphics (GPU)

Shared Local Memory (SLM), declared with the __local qualifier in OpenCL*, is well-suited for scatter operations that would otherwise go to global memory. So copy small table buffers (or any buffer data reused frequently) to SLM.

Applying SLM can improve Processor Graphics data throughput considerably, but it might slightly reduce CPU performance.
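A minimal OpenCL* kernel sketch of this table-copy pattern (names are illustrative; this is an OpenCL* kernel, so it requires a device and is not runnable standalone):

```c
// Illustrative sketch: copy a small 256-entry lookup table into SLM
// once per work-group, then read it from fast local memory.
__kernel void apply_lut(__global const uchar *src,
                        __global uchar *dst,
                        __global const uchar *lut)  /* 256-entry table */
{
    __local uchar lut_slm[256];
    int lid = get_local_id(0);
    int lsz = get_local_size(0);
    /* cooperatively stage the table into SLM */
    for (int i = lid; i < 256; i += lsz)
        lut_slm[i] = lut[i];
    barrier(CLK_LOCAL_MEM_FENCE);
    int gid = get_global_id(0);
    dst[gid] = lut_slm[src[gid]];
}
```

Each work-group pays the copy cost once, after which every work item reads the table from local memory instead of global memory.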

For more information, see the Applying Shared Local Memory (SLM) article.

Consider native_ Versions of Math Built-Ins

Optimization type : Intel® Processors (CPU) + Intel® HD Graphics (GPU)

OpenCL* offers two basic ways to trade precision for speed:

  • native_* and half_* math built-ins, which have lower precision, but are faster than their un-prefixed variants
  • Compiler optimization options that enable optimizations for floating-point arithmetic for the whole OpenCL* program (for example, the -cl-fast-relaxed-math flag).

In general, while the -cl-fast-relaxed-math flag is a quick way to get potentially large performance gains for kernels with many math operations, it does not permit fine numeric accuracy control. Consider experimenting with native_* equivalents separately for each specific case, keeping track of the resulting accuracy.

For more information, see the Considering native Versions of Math Built-Ins article.

Branching/Loops Considerations

Optimization type : Intel® Processors (CPU) + Intel® HD Graphics (GPU)

You can improve the performance of both CPU and Intel® Processor Graphics devices by converting uniform conditions (conditions that are equal across all work items) into compile-time branches.

The approach, sometimes referred to as an "Uber-Shader" in the pixel shader context, is to have a single kernel that implements all desired behaviors and to let the host logic disable the paths that are not currently required. However, setting constants to branch on at run time wastes device resources, as the data is still calculated before it is thrown away. Consider a preprocessor approach instead, using #ifndef blocks.
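The preprocessor approach can be illustrated with plain C. In OpenCL* you would pass a define such as `-D USE_FAST_PATH` in the options string of clBuildProgram; the macro and function names here are hypothetical.

```c
/* Hypothetical sketch: instead of branching on a constant at run
 * time (the "Uber-Shader" approach), select the path when the
 * program is compiled, so the unused path is never executed. */
int process(int x) {
#ifdef USE_FAST_PATH
    return x << 1;   /* variant compiled in when the macro is defined */
#else
    return x * 2;    /* reference variant */
#endif
}
```

Because the selection happens at compile time, the device never evaluates the disabled path, unlike a run-time branch on a constant.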

For more information, see the Notes on Branching Loops article.

Use Tips and Tricks for OpenCL* Kernel Development on CPU

Running OpenCL* code on the CPU can provide significant performance gains for parallel compute applications when basic considerations are applied. Follow the well-known tips and tricks for OpenCL* kernel development on the Intel® CPU to get the best performance.

For more information, see the Why Optimizing Kernel Code Is Important article.

For more complete information about compiler optimizations, see our Optimization Notice.
