Use Branching Accurately
You can improve performance on Intel® Core™ and Intel® Xeon®
processors by converting uniform conditions, which are equal across
all work-items, into compile-time branches.
With this approach, you keep a single kernel that implements
all desired behaviors and let the host logic disable the paths that are
not currently required. However, passing constants to the kernel and branching
on them wastes device resources, because the data is still calculated before
it is discarded. Consider an approach based on preprocessor directives instead:
use #ifdef blocks.
Consider the example where the original kernel uses constants for branching:
    __kernel void foo(__constant int* src, __global int* dst,
                      unsigned char bFullFrame, unsigned char bAlpha)
    {
        …
        if (bFullFrame) // uniform condition (equal for all work-items)
        {
            …
            if (bAlpha) // uniform condition
            {
                …
            }
            else
            {
                …
            }
        }
        else
        {
            …
        }
    }
Now consider the same kernel with compile-time branches
(the “specialization” technique):
    __kernel void foo(__constant int* src, __global int* dst)
    {
        …
    #ifdef bFullFrame
        {
            …
    #ifdef bAlpha
            {
                …
            }
    #else
            {
                …
            }
    #endif
        }
    #else
        {
            …
        }
    #endif
    }
Consider a similar optimization for other constants.
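The specialized kernel only takes a given path when the corresponding macro is defined at kernel build time. As a minimal host-side sketch in C, the options string could be composed as shown here and then passed as the `options` argument of `clBuildProgram`. The helper name `make_build_options` and its signature are hypothetical; only the macro names `bFullFrame` and `bAlpha` come from the kernel above.

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Compose the compiler options that select the kernel variant.
 * Hypothetical helper: the name and signature are assumptions made
 * for illustration. "-D name" predefines the macro for the OpenCL C
 * compiler, so the matching #ifdef block in the kernel is compiled in.
 * Returns 0 on success, -1 if the buffer is too small.
 */
int make_build_options(int full_frame, int alpha, char *buf, size_t size)
{
    int n = snprintf(buf, size, "%s%s",
                     full_frame ? "-D bFullFrame " : "",
                     alpha ? "-D bAlpha " : "");
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/*
 * Usage, assuming an existing cl_program `program` and
 * cl_device_id `device` (not created here):
 *
 *     char opts[64];
 *     if (make_build_options(1, 0, opts, sizeof opts) == 0)
 *         clBuildProgram(program, 1, &device, opts, NULL, NULL);
 */
```

Each combination of macros yields a separate program binary, so a host application would typically build the variants it needs once and reuse them.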
Minimize or, in the best case, avoid branching in short computations
by using the min, max, clamp, and select built-ins instead of if and else
clauses.
Move memory accesses that are common to the then and else blocks outside
of the conditional code.
Consider the original code that uses the if and else clauses:
    if (…) { // condition
        x = A[i1]; // reading from A
        … // calculations
        B[i2] = y; // storing into B
    } else {
        q = A[i1]; // reading from A with same index as in first clause
        … // different calculations
        B[i2] = w; // storing into B with same index as in first clause
    }
Now consider the optimized code that uses temporary variables:
    temp1 = A[i1]; // reading from A in advance
    if (…) { // condition
        x = temp1;
        … // some calculations
        temp2 = y; // storing into temporary variable
    } else {
        q = temp1;
        … // some calculations
        temp2 = w; // storing into temporary variable
    }
    B[i2] = temp2; // storing to B once
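The earlier advice to prefer built-ins over if and else clauses in short computations can be sketched in the same before-and-after style. The code below is plain C, so `min_i`, `max_i`, `clamp_i`, and `select_i` are hypothetical stand-ins that mirror the scalar behavior of the OpenCL C built-ins `min`, `max`, `clamp`, and `select`; in a kernel you would call the built-ins directly. The sketch only illustrates the shape of the rewrite and makes no claim about generated code.

```c
/* Plain-C stand-ins (hypothetical names) for the OpenCL C built-ins.
 * Inside a kernel, call min, max, clamp, and select directly instead. */
int min_i(int a, int b) { return a < b ? a : b; }
int max_i(int a, int b) { return a > b ? a : b; }
int clamp_i(int x, int lo, int hi) { return min_i(max_i(x, lo), hi); }
/* Scalar select(a, b, c): b when c is nonzero, otherwise a. */
int select_i(int a, int b, int c) { return c ? b : a; }

/* Before: a short computation written with if and else clauses. */
int saturate_branchy(int x, int lo, int hi)
{
    if (x < lo) {
        return lo;
    } else if (x > hi) {
        return hi;
    }
    return x;
}

/* After: the same result from one clamp call (assumes lo <= hi). */
int saturate_builtin(int x, int lo, int hi)
{
    return clamp_i(x, lo, hi);
}

/* Before: choosing between two values with if and else. */
int pick_branchy(int x, int threshold, int below, int above)
{
    int r;
    if (x > threshold) {
        r = above;
    } else {
        r = below;
    }
    return r;
}

/* After: the same choice expressed as a select call. */
int pick_builtin(int x, int threshold, int below, int above)
{
    return select_i(below, above, x > threshold);
}
```

Note that the vector forms of `select` in OpenCL C test the most significant bit of each component of the condition rather than a nonzero value, so this sketch covers the scalar case only.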