Using Vector Data Types

To maximize CPU vector unit utilization, try to use vector data types in your kernel code. This technique enables you to map vector data types directly to the hardware vector registers. Thus, the data types used should match the width of the underlying SIMD instructions.

Consider the following recommendations:

  • On the 2nd Generation Intel® Core™ Processors and higher with Intel® AVX support, use data types such as float8 or double4, which match the 256-bit register width. This binds the code to the specific register width of the underlying hardware and provides maximum performance on that platform. However, performance on other platforms and supported Intel processors might be less than optimal.
  • You may use wider data types, such as float16, to transparently cover many SIMD hardware register widths. However, using types wider than the underlying hardware is similar to loop unrolling: it might improve performance in some cases, but it also increases register pressure. Still, consider using the uchar16 data type to process four pixels simultaneously when operating on pixels with four eight-bit components (for example, RGBA).
  • When you manually “vectorize” a kernel that uses scalar data types (such as float) to use vector data types (such as float8) instead, remember that each work-item now processes N elements instead of one (N is 8 in the float/float8 example). Reduce the global size by a factor of N accordingly, and make sure the original number of elements is divisible by N.
  • On the 4th Generation Intel® Core™ processors and higher, the int8 data type can improve performance, because Intel® AVX2 extends 256-bit vector operations to integer data.
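
The global-size adjustment described in the list above can be sketched in host-side C. This is a minimal illustration, not part of the OpenCL API: the helper name vectorized_global_size is hypothetical, and the sketch assumes a one-dimensional range in which the element count must be an exact multiple of the vector width.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not an OpenCL API call).
 * Returns the global size to use after a scalar kernel has been rewritten
 * so that each work-item processes vec_width elements.
 * Returns 0 when num_elements is not divisible by vec_width, so the caller
 * can pad the buffers or handle the leftover elements separately.
 */
static size_t vectorized_global_size(size_t num_elements, size_t vec_width)
{
    assert(vec_width > 0); /* a zero vector width is a programming error */

    if (num_elements % vec_width != 0)
        return 0;

    return num_elements / vec_width;
}
```

For example, a scalar kernel that ran over 1024 floats would run over 128 work-items after conversion to float8. The returned value is what you would pass as the global work size to clEnqueueNDRangeKernel.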

Using this coding technique, you plan the vector-level parallelism yourself instead of relying on the implicit vectorization module (see the "Benefitting from Implicit Vectorization" section). This approach is useful in the following scenarios:

  • You are porting code that originally used Intel SSE/AVX/AVX2 instructions.
  • You want to benefit from hand-tuned vectorization of your code.

The following example shows a multiplication kernel that targets the 256-bit vector units of the 2nd Generation Intel Core Processors:

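/* The hint tells the compiler that the kernel works mostly on float8 data. */
__kernel __attribute__((vec_type_hint(float8)))
void edp_mul(__constant float8 *a,
             __constant float8 *b,
             __global float8 *result)
{
  /* Each work-item multiplies one float8, that is, eight floats at once. */
  int id = get_global_id(0);
  result[id] = a[id] * b[id];
}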

In this example, the data passed to the kernel represents buffers of float8. The calculations are performed on eight elements together.
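
To check the kernel output on the host, the same computation can be written as a scalar reference in plain C. This is a sketch under stated assumptions: the function name edp_mul_reference is hypothetical, the buffers are treated as flat float arrays, and each of the num_work_items work-items is assumed to cover eight consecutive floats, as one float8 does.

```c
#include <assert.h>
#include <stddef.h>

#define VEC_WIDTH 8 /* number of float lanes in one float8 */

/*
 * Hypothetical host-side reference for the edp_mul kernel.
 * The outer loop stands in for the work-items; the inner loop stands in
 * for the eight lanes that a single float8 multiplication covers.
 */
static void edp_mul_reference(const float *a, const float *b,
                              float *result, size_t num_work_items)
{
    assert(a != NULL && b != NULL && result != NULL);

    for (size_t id = 0; id < num_work_items; ++id) {
        for (size_t lane = 0; lane < VEC_WIDTH; ++lane) {
            size_t i = id * VEC_WIDTH + lane;
            result[i] = a[i] * b[i];
        }
    }
}
```

Comparing this reference against the buffer read back from the device is one simple way to confirm that the reduced global size still covers every element.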

The attribute placed before the kernel signals to the compiler (or the implementation) that the kernel already has an optimized vectorized form, so the implicit vectorization module does not operate on it. Use vec_type_hint to indicate to the compiler that your kernel already processes data using mostly vector types. For more details on this attribute, see the OpenCL™ 1.2 Specification.

See Also

Benefitting from Implicit Vectorization
The OpenCL™ 1.2 Specification at http://www.khronos.org/registry/cl/specs/opencl-1.2.pdf
