Using the OpenCL™ vector data types is a straightforward way to use the Intel® Architecture vector instruction set directly (see the "Using Vector Data Types" section). For instance, consider the following OpenCL C snippet:
float4 a, b;
float4 c = a + b;
After compilation, it resembles the following C snippet in intrinsics:
__m128 a, b;
__m128 c = _mm_add_ps(a, b);
In assembly, this corresponds roughly to:
movaps xmm0, [a]
addps xmm0, [b]
movaps [c], xmm0
However, in contrast to the code in intrinsics, an OpenCL kernel that uses the float4 data type transparently benefits from Intel AVX if the compiler promotes float4 to float8. The vectorization module can pack work-items automatically, though this might be less efficient than manual packing.
If the native vector size of your kernel is less than 128 bits and you want to benefit from explicit vectorization, consider packing work-items together manually.
For example, suppose your kernel uses the float2 vector type. It receives (x, y) float coordinates and shifts them by (dx, dy):
__kernel void shift_by(__global float2* coords, __global float2* deltas)
{
    int tid = get_global_id(0);
    coords[tid] += deltas[tid];
}
To increase the kernel performance, you can manually pack four work-items into one:
// Assuming the target is an Intel® AVX-enabled CPU
__kernel void shift_by(__global float2* coords, __global float2* deltas)
{
    int tid = get_global_id(0);
    int base = 4 * tid; // each work-item handles four consecutive float2 elements
    float8 my_coords = (float8)(coords[base], coords[base + 1],
                                coords[base + 2], coords[base + 3]);
    float8 my_deltas = (float8)(deltas[base], deltas[base + 1],
                                deltas[base + 2], deltas[base + 3]);
    my_coords += my_deltas;
    // vstore8 writes at float offset 8 * tid, which is the same four float2 elements
    vstore8(my_coords, tid, (__global float*)coords);
}
Every work-item in this kernel does four times as much work as a work-item in the previous kernel. Consequently, the kernel needs only one fourth as many work-items, which reduces the run-time overhead. However, when you use manual packing, you must also change the host code accordingly by reducing the global size by the same factor.
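The effect of packing on the host-side global size can be sketched in plain C. This is only an illustrative emulation under stated assumptions, not host code from the guide: the names PACK, packed_global_size, shift_by_scalar, and shift_by_packed are hypothetical, the buffers are treated as flat float arrays, and the point count is assumed to be a multiple of four.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical constant: float2 elements handled per packed work-item
   (one float8 holds four float2 values). */
#define PACK 4

/* Global size to pass for the packed kernel.
   Assumption: n_points is a multiple of PACK, so no tail handling is needed. */
size_t packed_global_size(size_t n_points)
{
    assert(n_points % PACK == 0);
    return n_points / PACK;
}

/* Emulates the original kernel: one work-item per (x, y) pair. */
void shift_by_scalar(float *coords, const float *deltas, size_t n_points)
{
    for (size_t tid = 0; tid < n_points; ++tid) {
        coords[2 * tid]     += deltas[2 * tid];
        coords[2 * tid + 1] += deltas[2 * tid + 1];
    }
}

/* Emulates the packed kernel: each work-item updates PACK pairs,
   that is, 8 consecutive floats starting at float offset 8 * tid. */
void shift_by_packed(float *coords, const float *deltas, size_t n_points)
{
    size_t global_size = packed_global_size(n_points);
    for (size_t tid = 0; tid < global_size; ++tid) {
        size_t base = 2 * PACK * tid;
        for (size_t lane = 0; lane < 2 * PACK; ++lane)
            coords[base + lane] += deltas[base + lane];
    }
}
```

With eight points, the packed variant runs two work-items instead of eight and touches every float exactly once. If the point count is not a multiple of four, the host would need to pad the buffers or process the tail separately.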
For vectors of 32-bit data types, such as int4, int8, float4, or float8, use explicit vectorization to improve the performance. Other data types (for example, char3) may cause an automatic upcast of the input data, which has a negative impact on performance.
For the best performance for a given data type, the vector width should match the underlying SIMD width. This value differs between architectures. For example, consider querying the recommended vector width using clGetDeviceInfo with the CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT parameter. You get a vector width of four for 2nd Generation Intel Core™ processors, but a vector width of eight for later processor generations. So one viable option is to use int8, so that the vector width fits both architectures. Similarly, for floating-point data types, you can use float8 data to cover many potential architectures.
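The choice of a width that fits several architectures can be sketched as a small C helper. This is an illustration only: widest_preferred_width is a hypothetical name, and the input array stands in for values that the host would obtain per device from clGetDeviceInfo with CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT (or the corresponding float parameter).

```c
#include <assert.h>
#include <stddef.h>

/* Returns the widest preferred vector width among the target devices.
   preferred[] is assumed to hold per-device values already queried by
   the host; an empty list falls back to 1, meaning scalar code. */
unsigned widest_preferred_width(const unsigned *preferred, size_t count)
{
    assert(preferred != NULL || count == 0);
    unsigned widest = 1;
    for (size_t i = 0; i < count; ++i) {
        if (preferred[i] > widest)
            widest = preferred[i];
    }
    return widest;
}
```

For the two architectures mentioned above, the inputs would be 4 and 8, and the helper picks 8. A device that prefers the narrower width can still execute the wider type as several native-width operations.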
Using scalar data types such as float or int is often the most “scalable” way to help the compiler vectorize the code appropriately for the specific SIMD architecture.