• 10/30/2018

Benefitting from Implicit Vectorization

Intel® SDK for OpenCL™ Applications includes an implicit vectorization module as part of the program build process. When it is beneficial performance-wise, this module packs several work items and executes them with SIMD instructions. This enables you to benefit from the vector units in the Intel® Architecture processors without writing explicit vector code.
The vectorization module transforms scalar data type operations performed by adjacent work-items into equivalent vector operations. When vector operations already exist in the kernel source code, the module scalarizes them (breaks them down into component operations) and revectorizes them. This improves performance by transforming the memory access pattern of the kernel into a structure of arrays (SOA), which is often more cache-friendly than an array of structures (AOS).
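As an illustration (this kernel is not from the original document), consider a plain scalar kernel in which each work-item processes one element. Because adjacent work-items access adjacent memory locations, the implicit vectorization module can pack several work-items into SIMD lanes and fuse their loads, adds, and stores into vector instructions:

```c
// Sketch of a scalar kernel the implicit vectorizer can transform.
// Work-items i, i+1, i+2, i+3 read contiguous floats, so the module
// can execute them together as, for example, one float4 operation.
__kernel void vec_add(__global const float* a,
                      __global const float* b,
                      __global float* out)
{
    size_t i = get_global_id(0);
    out[i] = a[i] + b[i];   // adjacent work-items map to adjacent SIMD lanes
}
```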
You can find more details in the Intel® OpenCL™ Implicit Vectorization Module overview at http://llvm.org/devmtg/2011-11/Rotem_IntelOpenCLSDKVectorizer.pdf and OpenCL™ Autovectorization in Intel SDK for OpenCL Applications version 1.5.
The implicit vectorization module works best for kernels that operate on elements of four-byte width, such as the float or int data types. You can define the computational width of a kernel using the OpenCL vec_type_hint attribute.
Since the default computation width is four bytes, kernels are vectorized by default. If your kernel operates on a specific vector type, you can specify __attribute__((vec_type_hint(&lt;typen&gt;))) with typen set to any vector type (for example, float3 or char4). This attribute tells the vectorization module to apply only the transformations that are useful for this type.
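For example (a hypothetical kernel, not from the original document), a kernel that already computes on float3 data could declare that width with the attribute, so the module limits itself to transformations that pay off for three-component vectors:

```c
// Hypothetical kernel operating on float3 elements. The vec_type_hint
// attribute declares the kernel's computational width to the
// vectorization module.
__attribute__((vec_type_hint(float3)))
__kernel void normalize_vectors(__global float3* v)
{
    size_t i = get_global_id(0);
    v[i] = normalize(v[i]);  // built-in normalize on a float3 value
}
```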
The performance benefit from the vectorization module might be lower for kernels that include complex control flow.
To benefit from vectorization, you do not need for loops within kernels. For best results, let the kernel deal with a single data element and let the vectorization module take care of the rest. The more straightforward your OpenCL™ code is, the more optimization you get from vectorization.
Writing the kernel in plain scalar code works best for efficient vectorization. This style of coding avoids many of the disadvantages potentially associated with explicit (manual) vectorization, described in the Using Vector Data Types section.
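To make the contrast concrete, here is a sketch (both kernels are illustrative, not from the original document) of the recommended scalar style next to a manually vectorized variant. The scalar version leaves the choice of SIMD width to the module; the float4 version would first be scalarized and then revectorized:

```c
// Preferred: plain scalar code, one element per work-item.
// The implicit vectorization module picks the SIMD width.
__kernel void saxpy_scalar(float alpha,
                           __global const float* x,
                           __global float* y)
{
    size_t i = get_global_id(0);
    y[i] = alpha * x[i] + y[i];
}

// Explicitly (manually) vectorized alternative: float4 per work-item.
// The module scalarizes this form before revectorizing it.
__kernel void saxpy_vec4(float alpha,
                         __global const float4* x,
                         __global float4* y)
{
    size_t i = get_global_id(0);
    y[i] = alpha * x[i] + y[i];
}
```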

Product and Performance Information

1. Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804