• 2019 Update 4
  • 03/20/2019
  • Public Content

Tips for Auto-Vectorization

When compiling a kernel, the vectorization module often transforms the kernel's memory access pattern from array of structures (AOS) to structure of arrays (SOA), which is SIMD friendly.
This transformation has a cost, known as the transpose penalty. Organizing the input data as SOA instead of AOS reduces the transpose penalty.
For example, the following code suffers from the transpose penalty:
__kernel void sum(__global float4* input, __global float* output)
{
    int tid = get_global_id(0);
    output[tid] = input[tid].x + input[tid].y + input[tid].z + input[tid].w;
}
The following code, by contrast, does not suffer from the transpose penalty:
__kernel void sum(__global float* inx, __global float* iny, __global float* inz,
                  __global float* inw, __global float* output)
{
    int tid = get_global_id(0);
    output[tid] = inx[tid] + iny[tid] + inz[tid] + inw[tid];
}
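If the application already holds its data as AOS, the host can split it into per-component arrays before passing it to the SOA kernel. The following is a minimal host-side sketch in plain C, not part of the OpenCL API: the `Float4` struct and the `aos_to_soa` and `sum_soa` helpers are hypothetical names introduced for illustration, and `sum_soa` is only a scalar model of what the SOA kernel computes per work-item.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical host-side mirror of the kernel's float4 elements (AOS layout). */
typedef struct {
    float x, y, z, w;
} Float4;

/* Split an AOS buffer into four component arrays (SOA layout).
   Illustrative helper; the name and signature are assumptions. */
static void aos_to_soa(const Float4 *in, size_t n,
                       float *inx, float *iny, float *inz, float *inw)
{
    assert(n == 0 || (in && inx && iny && inz && inw));
    for (size_t i = 0; i < n; ++i) {
        inx[i] = in[i].x;
        iny[i] = in[i].y;
        inz[i] = in[i].z;
        inw[i] = in[i].w;
    }
}

/* Scalar model of the SOA kernel body: one output element per index,
   where tid plays the role of get_global_id(0). */
static void sum_soa(const float *inx, const float *iny, const float *inz,
                    const float *inw, float *output, size_t n)
{
    for (size_t tid = 0; tid < n; ++tid)
        output[tid] = inx[tid] + iny[tid] + inz[tid] + inw[tid];
}
```

In a real application, the four arrays produced by `aos_to_soa` would be copied into four separate device buffers and bound to the `inx`, `iny`, `inz`, and `inw` kernel arguments. Whether the one-time host-side split pays off depends on how often the data is reused, so it is worth measuring.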
Take care when dealing with branches. In particular, avoid loading and storing data inside the branches, as in the following code:
if (…) { // condition
    x = A[i1];  // reading from A
    …           // calculations
    B[i2] = y;  // storing into B
} else {
    q = A[i1];  // reading from A with same index as in first clause
    …           // different calculations
    B[i2] = w;  // storing into B with same index as in first clause
}
The following code avoids loading from and storing to memory within branches:
temp1 = A[i1];  // reading from A in advance
if (…) { // condition
    x = temp1;
    …           // some calculations
    temp2 = y;  // storing into temporary variable
} else {
    q = temp1;
    …           // some calculations
    temp2 = w;  // storing into temporary variable
}
B[i2] = temp2;  // storing to B once
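To make the pattern concrete, here is a small sketch in plain C that fills in the elided parts with placeholder logic. The threshold test and the arithmetic in each clause are assumptions chosen only for illustration, and the function names are hypothetical; the loop index stands in for the work-item ID. Both functions compute the same result, but the second performs exactly one load and one store per element, both outside the branch.

```c
#include <stddef.h>

/* Branchy version: each clause loads from A and stores to B. */
static void scale_branchy(const float *A, float *B, size_t n, float threshold)
{
    for (size_t i = 0; i < n; ++i) {
        if (A[i] > threshold) {
            float x = A[i];      /* load inside the branch */
            B[i] = x * 2.0f;     /* store inside the branch */
        } else {
            float q = A[i];      /* same load repeated in the other clause */
            B[i] = q + 1.0f;     /* same store target in the other clause */
        }
    }
}

/* Hoisted version: the load happens before the branch and the store after it,
   so the branch body contains only arithmetic on temporaries. */
static void scale_hoisted(const float *A, float *B, size_t n, float threshold)
{
    for (size_t i = 0; i < n; ++i) {
        float temp1 = A[i];      /* single load in advance */
        float temp2;
        if (temp1 > threshold) {
            temp2 = temp1 * 2.0f;
        } else {
            temp2 = temp1 + 1.0f;
        }
        B[i] = temp2;            /* single store after the branch */
    }
}
```

With the memory accesses hoisted, a branch of this shape reduces to a choice between two computed values, which is generally an easier form for a vectorizer to handle than memory operations under a condition.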