During kernel compilation, the vectorization module often transforms the kernel's memory access pattern from array of structures (AOS) to structure of arrays (SOA), which is SIMD-friendly.

This transformation comes at a cost: the transpose penalty. Organizing the input data as SOA instead of AOS in the first place reduces this penalty.

For example, the following code suffers from the transpose penalty:

__kernel void sum(__global float4* input, __global float* output)
{
    int tid = get_global_id(0);
    output[tid] = input[tid].x + input[tid].y + input[tid].z + input[tid].w;
}

The following piece of code, in contrast, does not suffer from the transpose penalty:

__kernel void sum(__global float* inx, __global float* iny, __global float* inz,
                  __global float* inw, __global float* output)
{
    int tid = get_global_id(0);
    output[tid] = inx[tid] + iny[tid] + inz[tid] + inw[tid];
}
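
On the host side, the SOA layout simply means keeping one contiguous array per component instead of a single interleaved float4 array. The following host-side sketch de-interleaves an existing AOS array into four SOA arrays; the function name and signature are illustrative assumptions, not part of the original example:

#include <stddef.h>

/* Illustrative sketch: de-interleave an AOS array of xyzw quadruples into
   four SOA arrays on the host, so they can be uploaded as the separate
   inx/iny/inz/inw buffers expected by the kernel above. */
void aos_to_soa(const float *aos, float *x, float *y, float *z, float *w,
                size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        x[i] = aos[4 * i + 0]; /* x component of element i */
        y[i] = aos[4 * i + 1]; /* y component */
        z[i] = aos[4 * i + 2]; /* z component */
        w[i] = aos[4 * i + 3]; /* w component */
    }
}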

Take care when dealing with branches. In particular, avoid data loads and stores within the branch statements:

if (…) {             // condition
    x = A[i1];       // reading from A
    …                // calculations
    B[i2] = y;       // storing into B
} else {
    q = A[i1];       // reading from A with the same index as in the first clause
    …                // different calculations
    B[i2] = w;       // storing into B with the same index as in the first clause
}

The following code avoids loading from and storing to memory within branches:

temp1 = A[i1];       // reading from A in advance
if (…) {             // condition
    x = temp1;
    …                // some calculations
    temp2 = y;       // storing into a temporary variable
} else {
    q = temp1;
    …                // some calculations
    temp2 = w;       // storing into a temporary variable
}
B[i2] = temp2;       // storing to B once
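
Since the fragments above elide the actual calculations, a complete kernel may make the pattern easier to see. The following sketch is illustrative only: the kernel name, the threshold condition, and the arithmetic are assumptions chosen to produce a self-contained example of hoisting the load above the branch and sinking the store below it:

__kernel void branch_hoist(__global const float* A, __global float* B,
                           float threshold)
{
    int tid = get_global_id(0);
    float temp1 = A[tid];        // single load, hoisted above the branch
    float temp2;
    if (temp1 > threshold) {     // condition
        temp2 = temp1 * 2.0f;    // some calculations
    } else {
        temp2 = temp1 + 1.0f;    // different calculations
    }
    B[tid] = temp2;              // single store, after the branch
}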

See Also

Benefitting from Implicit Vectorization

For more complete information about compiler optimizations, see our Optimization Notice.