Using Loops

The Intel® Graphics device is optimized for code, which does not branch or loop. In the case, when a loop in a kernel is unavoidable, minimize the overhead by unrolling the loop either partially or completely in code, or using macros, and also minimize memory accesses within the loop.

The following example demonstrates partial unrolling of a loop in the example OpenCL™ kernel. Suppose you evaluate a polynomial, and you know that the order of the polynomial is a multiple of 4. Consider the following example:

__kernel void 
poly(float *in, float *coeffs, float* result, int numcoeffs)
{
        // Un-optimized version
        int gid = get_global_id(0);
        result[gid] = 0;
        for(uint i=0; i<numcoeffs; i++) //numcoeffs is multiple of 4
        {
                result[gid] += pow(in[gid],i)*coeffs[i];
        }
}

The above code is an indeterminate loop—that is, the compiler does not know how many iterations the for loop executes. Furthermore, there are 3 memory accesses within each iteration of the loop, and the loop code must be executed each iteration. You can remove these overheads using partial loop unrolling and private variables, for example:

__kernel void 
poly(float *in, float *coeffs, float* result, int numcoeffs)
{
        // Optimized version #1
        int gid = get_global_id(0);
        float result_pvt;
        float in_pvt = in[gid];
        result_pvt = 0;
        for(uint i=0; i<numcoeffs; i+=4) //numcoeffs is multiple of 4
        {
                result_pvt += pow(in_pvt,i)*coeffs[i];
                result_pvt += pow(in_pvt,i+1)*coeffs[i+1];
                result_pvt += pow(in_pvt,i+2)*coeffs[i+2];
                result_pvt += pow(in_pvt,i+3)*coeffs[i+3];
        }
        result[gid] = result_pvt;
}

In this optimized version, we divide the number of iterations by 4, and do only one memory access per original iteration. In any case where memory accesses can be replaced by private variables, this provides significant performance benefit. Furthermore, if multiple similar memory accesses are occurring in different kernels, then using shared local memory might provide performance gain. See section “Kernel Memory Access Optimization Summary” for details.

Another way to promote loop unrolling is to use macros to set constant loop iterations. The modified code:

__kernel void 
poly(float *in, float *coeffs, float* result, int numcoeffs)
{
        // Optimized version #1
        int gid = get_global_id(0);
        float result_pvt;
        float in_pvt = in[gid];
        result_pvt = 0;
        for(uint i=0; i<NUMCOEFFS; i++)
        {
                result_pvt += pow(in_pvt,i)*coeffs[i];
        }
        result[gid] = result_pvt;
}

And from the host code, when compiling, use the flag:

-DNUMCOEFFS=16 // where 16 is the number of coefficients

It is possible when the loop iterations are known in advance, but you can also use this optimization to define the number of partial unrolls to use, in the case when you know a common denominator for all loop iterations.

When within a loop, use uint data types for iterations, as the Intel® Graphics is optimized for simple arithmetic (increment) on unsigned integers.

For more complete information about compiler optimizations, see our Optimization Notice.