Writing Kernels to Directly Target the Intel® Architecture Processors

Using the OpenCL™ vector data types is a straightforward way to directly utilize the Intel® Architecture vector instruction set (see the "Using Vector Data Types" section). For instance, consider the following OpenCL standard snippet:

float4 a, b;
float4 c = a + b;

After compilation, it resembles the following C snippet in intrinsics:

__m128 a, b;
__m128 c = _mm_add_ps(a, b);

Or in assembly:

movaps xmm0, [a]
addps  xmm0, [b]
movaps [c], xmm0

However, in contrast to the intrinsics code, an OpenCL kernel that uses the float4 data type transparently benefits from Intel® AVX if the compiler promotes float4 to float8. The vectorization module can also pack work-items automatically, although this might be less efficient than manual packing.

If the native size for your kernel requires less than 128 bits and you want to benefit from explicit vectorization, consider packing work-items together manually.

For example, suppose your kernel uses the float2 vector type. It receives (x, y) float coordinates, and shifts them by (dx, dy):

__kernel void shift_by(__global float2* coords, __global float2* deltas)
{
  int tid = get_global_id(0);
  coords[tid] += deltas[tid];
}

To increase the kernel performance, you can manually pack pairs of work-items:

//Assuming the target is an Intel® AVX enabled CPU
__kernel __attribute__((vec_type_hint(float8)))
void shift_by(__global float2* coords, __global float2* deltas)
{
  int tid = get_global_id(0);
  float8 my_coords = (float8)(coords[4 * tid],     coords[4 * tid + 1],
                              coords[4 * tid + 2], coords[4 * tid + 3]);
  float8 my_deltas = (float8)(deltas[4 * tid],     deltas[4 * tid + 1],
                              deltas[4 * tid + 2], deltas[4 * tid + 3]);
  my_coords += my_deltas;
  vstore8(my_coords, tid, (__global float*)coords);
}
Every work-item in this kernel does four times as much work as a work-item in the previous kernel. Consequently, only one fourth the number of invocations is required, reducing the run-time overheads. However, when you use manual packing, you must also change the host code accordingly, reducing the global size by a factor of four.

For vectors of 32-bit data types, such as int4, int8, float4 or float8, use explicit vectorization to improve the performance. Other data types (for example, char3) may cause an automatic upcast of the input data, which has a negative impact on performance.

For the best performance with a given data type, the vector width should match the underlying SIMD width. This value differs between architectures. For example, you can query the recommended vector width with clGetDeviceInfo using the CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT parameter. The query returns a vector width of four for 2nd Generation Intel® Core™ processors, but a vector width of eight for later processor generations. So one viable option is to use int8, so that the vector width fits both architectures. Similarly, for floating-point data types, you can use float8 to cover many potential architectures.

Using scalar data types such as int or float is often the most “scalable” way to help the compiler perform the right vectorization for the specific SIMD architecture.

See Also

Using Vector Data Types

For more complete information about compiler optimizations, see our Optimization Notice.