• 10/30/2018

Use Branching Accurately

You can improve performance on Intel® Core™ and Intel® Xeon® processors by converting uniform conditions, that is, conditions that evaluate to the same value for all work-items, into compile-time branches.
A common approach is to write a single kernel that implements all desired behaviors and let the host logic disable the paths that are not currently required. However, branching on constants passed to the kernel wastes device resources, because the data is still calculated before it is discarded. Consider a preprocessor directive-based approach instead: use #ifdef blocks.
Consider the example where the original kernel uses constants for branching:
    __kernel void foo(__constant int* src,
                      __global int* dst,
                      unsigned char bFullFrame, unsigned char bAlpha)
    {
            …
            if (bFullFrame) // uniform condition (equal for all work-items)
            {
                    …
                    if (bAlpha) // uniform condition
                    {
                            …
                    }
                    else
                    {
                            …
                    }
            }
            else
            {
                    …
            }
    }
Now consider the same kernel, but with use of compile time branches (“specialization” technique):
    __kernel void foo(__constant int* src,
                      __global int* dst)
    {
            …
            #ifdef bFullFrame
            {
                    …
                    #ifdef bAlpha
                    {
                            …
                    }
                    #else
                    {
                            …
                    }
                    #endif
            }
            #else
            {
                    …
            }
            #endif
    }
Also consider similar optimization for other constants.
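With the specialization technique, the host selects a code path by defining macros when it builds the program, rather than by passing flags as kernel arguments. The following is a minimal host-side sketch in plain C, not a definitive implementation. The helper name make_build_options is hypothetical; the macro names bFullFrame and bAlpha come from the kernel above. The resulting string is meant to be passed as the options argument of clBuildProgram, which is shown only in a comment so that the sketch compiles without the OpenCL headers.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: writes the compiler options that define the
 * bFullFrame / bAlpha macros tested by the kernel's #ifdef blocks.
 * Returns the number of characters written (excluding the terminator),
 * or -1 if the buffer is too small. */
static int make_build_options(int full_frame, int alpha,
                              char *buf, size_t size)
{
    assert(buf != NULL && size > 0);

    int n = snprintf(buf, size, "%s%s",
                     full_frame ? "-D bFullFrame " : "",
                     alpha      ? "-D bAlpha "     : "");

    /* snprintf reports the length it wanted to write; treat truncation
     * as an error instead of building with incomplete options. */
    return (n < 0 || (size_t)n >= size) ? -1 : n;
}

/* Intended use on the host (assumes a valid cl_program named program):
 *
 *     char options[64];
 *     if (make_build_options(1, 0, options, sizeof options) >= 0)
 *         clBuildProgram(program, 0, NULL, options, NULL, NULL);
 */
```

Each distinct combination of macros produces a separately compiled program, so a host that switches modes at run time would typically build and cache one program per combination it needs.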
Minimize or, in the best case, avoid branching in short computations by using built-ins such as min, max, clamp, and select instead of if and else clauses.
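As an illustration of replacing a short branch with built-ins, the sketch below saturates a value to a range in two ways. It is written in plain C so that it can be compiled and checked on the host; min_i and max_i are hypothetical stand-ins for the OpenCL C integer built-ins min and max, and in a kernel the second function body would reduce to a single clamp(x, lo, hi) call. The function names are assumptions made for this example.

```c
#include <assert.h>

/* Stand-ins for the OpenCL C integer built-ins min and max.
 * They exist only so that this sketch compiles as plain C. */
static int min_i(int a, int b) { return a < b ? a : b; }
static int max_i(int a, int b) { return a > b ? a : b; }

/* Branching version: the form to avoid in short computations. */
static int saturate_branchy(int x, int lo, int hi)
{
    if (x < lo)
        return lo;
    else if (x > hi)
        return hi;
    else
        return x;
}

/* Built-in style version: in a kernel this is clamp(x, lo, hi),
 * which OpenCL C defines as min(max(x, lo), hi). */
static int saturate_builtin(int x, int lo, int hi)
{
    assert(lo <= hi); /* clamp results are undefined when lo > hi */
    return min_i(max_i(x, lo), hi);
}
```

Both functions return the same value for any x when lo <= hi, so the built-in form can replace the branching form without changing results.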
Move memory accesses that are common to the then and else blocks outside of the conditional code.
Consider the original code that uses the if and else clauses:
    if (…) { // condition
            x = A[i1];  // reading from A
            …           // calculations
            B[i2] = y;  // storing into B
    } else {
            q = A[i1];  // reading from A with same index as in first clause
            …           // different calculations
            B[i2] = w;  // storing into B with same index as in first clause
    }
Now consider the optimized code that uses temporary variables:
    temp1 = A[i1];      // reading from A in advance
    if (…) { // condition
            x = temp1;
            …           // some calculations
            temp2 = y;  // storing into temporary variable
    } else {
            q = temp1;
            …           // some calculations
            temp2 = w;  // storing into temporary variable
    }
    B[i2] = temp2;      // storing to B once
