• 10/30/2018

Efficient Data Layout

The vectorization module transforms scalar data type operations on adjacent work-items into an equivalent vector operation. If vector operations already exist in the kernel source code, the module scalarizes them (breaks them into component operations) and revectorizes them. This transformation improves performance by turning the memory access pattern of the kernel into a structure of arrays (SOA), which is often more cache-friendly than an array of structures (AOS).
This transformation comes at a cost: transposing AOS data at run time carries a penalty. Organizing the input data as SOA up front reduces this transpose penalty. For example, the following kernel suffers from transpose overhead:
__kernel void sum(__global float4* input, __global float* output)
{
    int tid = get_global_id(0);
    output[tid] = input[tid].x + input[tid].y + input[tid].z + input[tid].w;
}
The next piece of code does not suffer from this penalty:
__kernel void sum(__global float* inx, __global float* iny,
                  __global float* inz, __global float* inw,
                  __global float* output)
{
    int tid = get_global_id(0);
    output[tid] = inx[tid] + iny[tid] + inz[tid] + inw[tid];
}
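To feed the second kernel, the host must supply the data already split into per-component arrays. The sketch below shows one hypothetical way to do that conversion on the host; the `float4_aos` type and `aos_to_soa` function are illustrative names, not part of the original text.

```c
#include <stdlib.h>

/* Hypothetical AOS element matching the float4 layout of the first kernel. */
typedef struct { float x, y, z, w; } float4_aos;

/* Split an AOS array into four SOA arrays so that sequential work-items
 * read sequential memory locations. */
static void aos_to_soa(const float4_aos *in, size_t n,
                       float *outx, float *outy, float *outz, float *outw)
{
    for (size_t i = 0; i < n; ++i) {
        outx[i] = in[i].x;
        outy[i] = in[i].y;
        outz[i] = in[i].z;
        outw[i] = in[i].w;
    }
}
```

The four output arrays would then be bound to the `inx`, `iny`, `inz`, and `inw` kernel arguments as separate buffers.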
To make vectorization most efficient, sequential work-items should refer to sequential memory locations. Otherwise, the data gathering required for SIMD processing can be expensive. For example:
int tid = get_global_id(0);
output[tid] = inx[tid];   // sequential access (with respect to adjacent work-items)
output[tid] = inx[2*tid]; // non-sequential access (triggers data gathering)
There is an alternative to AOS that generally preserves the expressiveness of AOS and the efficiency of SOA. It is sometimes referred to as strided (or "striped") SOA, or even AOSSOA. Consider the code example below:
struct AOSSOA {
    float x[16];
    float y[16];
    float z[16];
};
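With this layout, the data is stored as an array of such blocks: each block holds 16 elements in SOA form, so accesses within a block stay sequential while the struct still groups x, y, and z together. A minimal sketch of the index arithmetic, with the `get_x` helper being an illustrative name not from the original text:

```c
#define BLOCK 16  /* SOA width inside each block; matches the [16] arrays above */

/* Strided-SOA layout: an array of AOSSOA blocks. */
struct AOSSOA {
    float x[BLOCK];
    float y[BLOCK];
    float z[BLOCK];
};

/* Map a flat element index to its block and lane, then read the x field.
 * Element i lives in block i / BLOCK at lane i % BLOCK. */
static float get_x(const struct AOSSOA *a, int i)
{
    return a[i / BLOCK].x[i % BLOCK];
}
```

Choosing the block width to match the SIMD width (here 16) keeps the 16 adjacent work-items of a vectorized kernel reading 16 consecutive floats, while the per-block grouping preserves AOS-style locality across components.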

Product and Performance Information


Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.