• 2019 Update 4
  • 03/20/2019
  • Public Content
Contents

Using Loops

The Intel® Graphics device is optimized for code, which does not branch or loop. In the case, when a loop in a kernel is unavoidable, minimize the overhead by unrolling the loop either partially or completely in code, or using macros, and also minimize memory accesses within the loop.
The following example demonstrates partial unrolling of a loop in the example OpenCL™ kernel. Suppose you evaluate a polynomial, and you know that the order of the polynomial is a multiple of 4. Consider the following example:
__kernel void poly(float *in, float *coeffs, float* result, int numcoeffs) { // Un-optimized version int gid = get_global_id(0); result[gid] = 0; for(uint i=0; i<numcoeffs; i++) //numcoeffs is multiple of 4 { result[gid] += pow(in[gid],i)*coeffs[i]; } }
The above code is an indeterminate loop—that is, the compiler does not know how many iterations the
for
loop executes. Furthermore, there are 3 memory accesses within each iteration of the loop, and the loop code must be executed each iteration. You can remove these overheads using partial loop unrolling and private variables, for example:
__kernel void poly(float *in, float *coeffs, float* result, int numcoeffs) { // Optimized version #1 int gid = get_global_id(0); float result_pvt; float in_pvt = in[gid]; result_pvt = 0; for(uint i=0; i<numcoeffs; i+=4) //numcoeffs is multiple of 4 { result_pvt += pow(in_pvt,i)*coeffs[i]; result_pvt += pow(in_pvt,i+1)*coeffs[i+1]; result_pvt += pow(in_pvt,i+2)*coeffs[i+2]; result_pvt += pow(in_pvt,i+3)*coeffs[i+3]; } result[gid] = result_pvt; }
In this optimized version, we divide the number of iterations by 4, and do only one memory access per original iteration. In any case where memory accesses can be replaced by private variables, this provides significant performance benefit. Furthermore, if multiple similar memory accesses are occurring in different kernels, then using shared local memory might provide performance gain. See section “Kernel Memory Access Optimization Summary” for details.
Another way to promote loop unrolling is to use macros to set constant loop iterations. The modified code:
__kernel void poly(float *in, float *coeffs, float* result, int numcoeffs) { // Optimized version #1 int gid = get_global_id(0); float result_pvt; float in_pvt = in[gid]; result_pvt = 0; for(uint i=0; i<NUMCOEFFS; i++) { result_pvt += pow(in_pvt,i)*coeffs[i]; } result[gid] = result_pvt; }
And from the host code, when compiling, use the flag:
-DNUMCOEFFS=16 // where 16 is the number of coefficients
It is possible when the loop iterations are known in advance, but you can also use this optimization to define the number of partial unrolls to use, in the case when you know a common denominator for all loop iterations.
When within a loop, use
uint
data types for iterations, as the Intel® Graphics is optimized for simple arithmetic (increment) on unsigned integers.

Product and Performance Information

1

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804