Avoid Extracting Vector Components

Consider the following kernel:

__constant float4 oneVec = (float4)(1.0f, 1.0f, 1.0f, 1.0f);
__kernel __attribute__((vec_type_hint(float4)))
void inverter2(__global float4* input, __global float4* output)
{
  int tid = get_global_id(0);
  output[tid] = oneVec - input[tid];
  output[tid].w = input[tid].w;  // single-component write: the costly step
  output[tid] = sqrt(output[tid]);
}

In this example of explicit vector code, extraction of the w component is very costly, because the next vector operation forces the same vector to be reloaded from memory. Instead, load a vector once and make all changes, even to a single component, with full-vector operations.

In this specific case, two changes are required:

  1. Set the w component of oneVec to zero, so that the subtraction only changes the sign of the input vector's w component.
  2. Use the bit representation of the float values to flip the sign bit of the w component back.

As a result, the kernel appears as follows:

__constant float4 oneVec = (float4)(1.0f, 1.0f, 1.0f, 0.0f);
__constant int4 signChanger = (int4)(0, 0, 0, 0x80000000);
__kernel __attribute__((vec_type_hint(float4)))
void inverter3(__global float4* input, __global float4* output)
{
  int tid = get_global_id(0);
  output[tid] = oneVec - input[tid];  // w lane becomes -input[tid].w
  output[tid] = as_float4(as_int4(output[tid]) ^ signChanger);  // flip the w sign bit back
  output[tid] = sqrt(output[tid]);
}

At the cost of one additional constant vector, this implementation performs all the required operations on full vectors only. The same computations can also be performed in float8.
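The sign-bit step in inverter3 can be checked on the host with a scalar model of the four lanes. The following is a minimal sketch in plain C, not part of the original kernel: the helper names xor_bits and inverter3_lanes are hypothetical, float is assumed to be a 32-bit IEEE 754 type, and the final sqrt is left out because it does not affect the comparison.

```c
#include <stdint.h>
#include <string.h>

/* XOR the bit pattern of a float with a 32-bit mask. This models one lane of
   as_float4(as_int4(v) ^ signChanger). Assumes a 32-bit IEEE 754 float. */
static float xor_bits(float x, uint32_t mask)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);  /* reinterpret the float as integer bits */
    bits ^= mask;                    /* mask 0x80000000 toggles the sign bit */
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* Hypothetical scalar model of inverter3 before the sqrt:
   lanes x, y, z hold 1 - in; lane w holds (0 - in) with its sign flipped back. */
static void inverter3_lanes(const float in[4], float out[4])
{
    static const float    one_vec[4]      = {1.0f, 1.0f, 1.0f, 0.0f};
    static const uint32_t sign_changer[4] = {0u, 0u, 0u, 0x80000000u};

    for (int i = 0; i < 4; ++i) {
        out[i] = xor_bits(one_vec[i] - in[i], sign_changer[i]);
    }
}
```

Calling inverter3_lanes on a four-element input should leave the w lane equal to the input w value for nonzero inputs, which is what the separate component write in inverter2 produced, while the x, y, and z lanes pass through the XOR unchanged because their mask is zero.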
