• 10/30/2018

Vectorization: SIMD Processing Within a Work Group

Intel® SDK for OpenCL™ Applications includes an automatic vectorization module as part of the OpenCL program build process. Depending on the kernel code, this operation might have some limitations. When it is beneficial performance-wise, the module automatically packs adjacent work-items (from dimension zero of the ND-range) and executes them with SIMD instructions.
When using SIMD instructions, vector registers store a group of data elements of the same data type, such as float or int. The number of data elements that fit in one register depends on the data type width. For example, the Intel® Xeon® processor (formerly known by the code name Skylake) offers a vector register width of 512 bits. Each vector register (zmm) can store sixteen float (or, alternatively, eight double) or sixteen 32-bit integer numbers, and these are the most natural data types to work with on the Intel Xeon processor. Smaller data types are also processed 16 elements at a time, with some conversions.
A work group is the finest granularity for thread-level parallelism. Different threads pick up different work groups. Thus, per-work-group amount of calculations coupled with right work-group size and the resulting number of work groups available for parallel execution are critical factors in achieving good scalability for Intel Xeon processor.
The vectorization module enables you to benefit from vector units without writing explicit vector code. You also do not need for loops within kernels to benefit from vectorization. For best results, process a single data element in the kernel and let the vectorization module take care of the rest. To get more performance gains from vectorization, make your OpenCL code as simple as possible.
The vectorization module works best for kernels that operate on elements of float (double) or int data types. The performance benefit might be lower for kernels that include a complicated control flow.
The vectorization module packs work items for dimension zero of the ND-range. Consider the following code example:
__kernel void foo(…)
for (int i = 0; i < get_local_size(2); i++)
    for (int j = 0; j < get_local_size(1); j++)
        for (int k = 0; k < get_local_size(0); k++)
            Kernel_Body;
After vectorization, the code example of the work group looping over work items appears as follows:
__kernel void foo(…)
for (int i = 0; i < get_local_size(2); i++)
    for (int j = 0; j < get_local_size(1); j++)
        for (int k = 0; k < get_local_size(0); k += SIMD_WIDTH)
            VECTORIZED_Kernel_Body;
Note that dimension zero is the innermost loop and is the one that is vectorized. For more information, refer to the Intel® OpenCL™ Implicit Vectorization Module overview at http://llvm.org/devmtg/2011-11/Rotem_IntelOpenCLSDKVectorizer.pdf and Autovectorization in Intel® SDK for OpenCL™ Applications version 1.5.

Product and Performance Information

1

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804