• 2019 Update 4
  • 03/20/2019

Using Vector Data Types

To maximize CPU vector unit utilization, try to use vector data types in your kernel code. This technique enables you to map vector data types directly to the hardware vector registers. Thus, the data types used should match the width of the underlying SIMD instructions.
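As a rough illustration of matching a data type to the SIMD width, the following C sketch computes how many elements of a given size fit into one vector register. The helper name simd_lanes is hypothetical and not part of any OpenCL or Intel API; the register widths you pass in are assumptions you must verify for your own target.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of elements that fit in one SIMD register.
 * register_bits: vector register width in bits (for example, 128 or 256).
 * element_bytes: size of the scalar type, such as sizeof(float). */
size_t simd_lanes(size_t register_bits, size_t element_bytes)
{
    assert(element_bytes > 0);
    return register_bits / (8 * element_bytes);
}
```

For a 256-bit register, simd_lanes(256, sizeof(float)) gives 8 and simd_lanes(256, sizeof(double)) gives 4, which corresponds to choosing float8 or double4.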
Consider the following recommendations:
  • On the 2nd Generation Intel® Core™ Processors and higher with Intel® AVX support, use data types such as float8 or double4, so you bind code to the specific register width of the underlying hardware. This method provides maximum performance on a specific platform. However, performance on other platforms and supported Intel processors might be less than optimal.
  • You may use wider data types, such as float16, to transparently cover many SIMD hardware register widths. However, using types wider than the underlying hardware is similar to loop unrolling. This method might improve performance in some cases, but it also increases register pressure. Still, consider using the uchar16 data type to process four pixels simultaneously when operating on eight-bit-per-component pixels (four components per pixel).
  • When manually “vectorizing” an original kernel that uses scalar data types (like float) to use vector data types (like float8) instead, remember that each work-item now processes N elements (N = 8 in the float/float8 example). Make sure that the original global size is divisible by N, and reduce the global size by a factor of N accordingly.
  • The int8 data type improves performance for the 4th Generation Intel® Core™ processors and higher.
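The global-size adjustment for a manually vectorized kernel can be sketched on the host side as follows. This is a minimal C sketch, not part of the OpenCL API: the function name vectorized_global_size and its convention of returning 0 for a non-divisible element count are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: global work size for a kernel in which each
 * work-item processes vector_width elements (8 for float8).
 * Returns 0 when element_count is not divisible by vector_width, so the
 * caller can pad the buffer or handle the remainder separately. */
size_t vectorized_global_size(size_t element_count, size_t vector_width)
{
    assert(vector_width > 0);
    if (element_count % vector_width != 0) {
        return 0; /* not divisible: do not launch with a truncated size */
    }
    return element_count / vector_width;
}
```

For example, a buffer of 1024 floats processed as float8 needs a global size of 128. You would pass the result as the global work size to clEnqueueNDRangeKernel.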
Using this coding technique, you plan the vector-level parallelism yourself instead of relying on the implicit vectorization module (see the "Benefitting from Implicit Vectorization" section). This approach is useful in the following scenarios:
  • You are porting code that originally used Intel® SSE/AVX/AVX2 instructions.
  • You want to benefit from hand-tuned vectorization of your code.
The following example shows a multiplication kernel that targets the 256-bit vector units of the 2nd Generation Intel Core Processors:
__kernel __attribute__((vec_type_hint(float8)))
void edp_mul(__constant float8 *a,
             __constant float8 *b,
             __global float8 *result)
{
    /* Each work-item multiplies one float8, that is, eight floats. */
    int id = get_global_id(0);
    result[id] = a[id] * b[id];
}
In this example, the data passed to the kernel represents buffers of float8. The calculations are performed on eight elements together.
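To make the mapping between work-items and scalar elements explicit, here is a plain C reference that emulates the float8 multiplication sequentially. It is a sketch for checking results on the host, not the way the OpenCL runtime executes the kernel; the function names and the VEC_WIDTH constant are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum { VEC_WIDTH = 8 }; /* number of float lanes in one float8 */

/* Emulates one work-item: multiplies the eight floats that make up
 * the float8 element at index id. */
void edp_mul_work_item(size_t id, const float *a, const float *b,
                       float *result)
{
    for (size_t lane = 0; lane < VEC_WIDTH; ++lane) {
        size_t i = id * VEC_WIDTH + lane;
        result[i] = a[i] * b[i];
    }
}

/* Runs all work-items one after another. element_count is the number
 * of scalar floats and must be divisible by VEC_WIDTH. */
void edp_mul_reference(size_t element_count, const float *a,
                       const float *b, float *result)
{
    assert(element_count % VEC_WIDTH == 0);
    size_t global_size = element_count / VEC_WIDTH;
    for (size_t id = 0; id < global_size; ++id) {
        edp_mul_work_item(id, a, b, result);
    }
}
```

With 16 input floats, the reference runs two work-items, each covering eight consecutive elements.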
The attribute added before the kernel signals to the compiler, or the implementation, that this kernel has an optimized vectorized form, so the implicit vectorization module does not operate on it. Use vec_type_hint to indicate to the compiler that your kernel already processes data using mostly vector types. For more details on this attribute, see the OpenCL™ 1.2 Specification.
See Also
  • Benefitting from Implicit Vectorization
  • The OpenCL™ 1.2 Specification at http://www.khronos.org/registry/cl/specs/opencl-1.2.pdf
