• 10/30/2018

Efficient Data Layout

The vectorization module transforms scalar data type operations on adjacent work-items into an equivalent vector operation. If vector operations already exist in the kernel source code, the module scalarizes them (breaks them into component operations) and revectorizes them. This transformation improves performance by turning the memory access pattern of the kernel into a structure of arrays (SOA), which is often more cache-friendly than an array of structures (AOS).
This transformation comes with a cost: when the input is laid out as AOS, the data must be transposed into vector lanes. Organizing the input data as SOA avoids this transpose penalty. For example, the following code suffers from transpose overhead:
__kernel void sum(__global float4* input, __global float* output) {
    int tid = get_global_id(0);
    output[tid] = input[tid].x + input[tid].y + input[tid].z + input[tid].w;
}
While the next piece of code does not suffer from this penalty:
__kernel void sum(__global float* inx, __global float* iny,
                  __global float* inz, __global float* inw,
                  __global float* output) {
    int tid = get_global_id(0);
    output[tid] = inx[tid] + iny[tid] + inz[tid] + inw[tid];
}
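As a host-side illustration (plain C; the struct and the `aos_to_soa` helper are hypothetical names, not part of the original source), the AOS-to-SOA conversion that the second kernel's buffers assume could be sketched as:

```c
#include <stddef.h>

/* AOS: one struct per element; the fields of one element are adjacent. */
struct Elem { float x, y, z, w; };

/* SOA: one array per field; the same field of adjacent elements is adjacent.
   Hypothetical helper illustrating the layout change, not a library API. */
static void aos_to_soa(const struct Elem *in, size_t n,
                       float *xs, float *ys, float *zs, float *ws) {
    for (size_t i = 0; i < n; ++i) {
        xs[i] = in[i].x;
        ys[i] = in[i].y;
        zs[i] = in[i].z;
        ws[i] = in[i].w;
    }
}
```

With this layout, adjacent work-items reading `inx[tid]` touch adjacent floats, so the vectorized load is a plain contiguous load rather than a transpose.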
For vectorization to be most efficient, sequential work-items should access sequential memory locations. Otherwise, the data gathering required for SIMD processing can be expensive. For example:
int tid = get_global_id(0);
output[tid] = inx[tid];   // sequential access (with respect to adjacent work-items)
output[tid] = inx[2*tid]; // non-sequential access (triggers data gathering)
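As a back-of-the-envelope illustration (plain C; `span_bytes` is a hypothetical helper, and the 16-lane batch and 64-byte cache line are assumed figures), the memory footprint of one batch of adjacent work-items can be compared for the two access patterns above:

```c
/* Bytes spanned by one SIMD batch of `lanes` adjacent work-items each
   reading float in[stride * tid]: from the first lane's float to one
   past the last lane's float. */
static unsigned span_bytes(unsigned lanes, unsigned stride) {
    return ((lanes - 1) * stride + 1) * (unsigned)sizeof(float);
}
```

With stride 1, a 16-lane batch spans 64 bytes, exactly one 64-byte cache line. With stride 2, the same batch spans 124 bytes, touching two cache lines while using only half of each.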
There is an alternative to AOS that generally preserves the expressiveness of AOS and the efficiency of SOA. It is sometimes referred to as strided (or "stripped") SOA, or AOSSOA. Consider the code example below:
struct AOSSOA {
    float x[16];
    float y[16];
    float z[16];
};
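To show how elements are addressed in this layout (plain C; the `get_x` accessor and `BLOCK` constant are hypothetical, introduced only for illustration), element `i` lives in block `i / 16` at lane `i % 16`, so 16 adjacent elements' `x` values remain contiguous while all fields of one block stay close together:

```c
#define BLOCK 16  /* tile width matching the float[16] members above */

/* AOSSOA layout from the text: each field is tiled in blocks of 16. */
struct AOSSOA {
    float x[BLOCK];
    float y[BLOCK];
    float z[BLOCK];
};

/* Hypothetical accessor: element i maps to block i / BLOCK, lane i % BLOCK. */
static float get_x(const struct AOSSOA *a, unsigned i) {
    return a[i / BLOCK].x[i % BLOCK];
}
```

A vectorized batch of 16 adjacent work-items reading `x` thus performs one contiguous load, as with SOA, while `x`, `y`, and `z` of nearby elements still share locality, as with AOS.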

Product and Performance Information

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804