Efficient Data Layout
The vectorization module transforms scalar data-type operations on adjacent
work-items into an equivalent vector operation. If vector operations already
exist in the kernel source code, the module scalarizes them (breaks them into
component operations) and revectorizes them. This transformation improves
performance by turning the memory access pattern of the kernel into a structure
of arrays (SOA), which is often more cache-friendly than an array of structures
(AOS).
This transformation comes at a certain cost: when the input data is laid out
as AOS, the vectorized kernel must transpose it on the fly. Organizing the
input data in SOA reduces this transpose penalty. For example, the following
code example suffers from transpose overhead:
__kernel void sum(__global float4* input, __global float* output)
{
    int tid = get_global_id(0);
    output[tid] = input[tid].x + input[tid].y + input[tid].z + input[tid].w;
}
The next piece of code does not suffer from this penalty:
__kernel void sum(__global float* inx, __global float* iny,
                  __global float* inz, __global float* inw,
                  __global float* output)
{
    int tid = get_global_id(0);
    output[tid] = inx[tid] + iny[tid] + inz[tid] + inw[tid];
}
For vectorization to be most efficient, sequential work-items should refer to
sequential memory locations. Otherwise, the data gathering required for SIMD
processing can be expensive performance-wise. For example:
int tid = get_global_id(0);
output[tid] = inx[tid];   // sequential access (with respect to adjacent work-items)
output[tid] = inx[2*tid]; // non-sequential access (triggers data gathering)
There is an alternative to AOS that generally preserves the expressiveness
of AOS and the efficiency of SOA. It is sometimes referred to as strided (or
"striped") SOA, or even AOSSOA (array of structures of arrays). Consider the
code example below:
struct AOSSOA
{
    float x[16];
    float y[16];
    float z[16];
};