Avoid Manual Loop Unrolling

Published on September 09, 2012

The Intel® Compiler can typically generate efficient vectorized code if a loop structure is not manually unrolled. It is better to let the compiler do the unrolling, which you can control using "#pragma unroll (n)". Vector alignment, loop collapsing, and interactions with other loop optimizations become much more complex if the compiler has to "undo" the manual unrolling. In all but the simplest cases the compiler cannot do this on its own, so the user has to refactor the loop back into its rolled form to get the best-performing vector code.
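As a minimal sketch of the pragma-based approach, the loop below is written once in its rolled form and the unroll factor is passed as a hint. The function name `scale_in_place` and the factor 4 are illustrative assumptions, not taken from the guide. Compilers that do not recognize the pragma are expected to ignore it, possibly with a warning.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper for illustration: multiply every element of v by a.
// The loop body is written once; unrolling is left to the compiler.
void scale_in_place(double a, std::vector<double>& v) {
    const std::size_t n = v.size();
    assert(n == v.size());  // trivial sanity check; keeps the sketch self-checking
    // Hint (assumed factor): ask the compiler to unroll this loop 4 times.
    // A compiler that does not know this pragma simply skips it.
#pragma unroll(4)
    for (std::size_t i = 0; i < n; ++i) {
        v[i] *= a;
    }
}
```

Changing or removing the unroll factor here touches one line and leaves the loop body, and therefore the vectorizer's view of it, unchanged.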

In addition, manual loop unrolling tends to tune a loop for a particular processor or architecture, making it less optimal for a future port of the application. Generally, it is good advice to write code in the most readable, straightforward manner. This gives the compiler the best chance of optimizing a given loop structure.

Fortran Example where manual unrolling is done in the source:

  m = MOD(N,4)

  if ( m /= 0 ) THEN
    do i = 1 , m
      Dy(i) = Dy(i) + Da*Dx(i)
    end do
    if ( N < 4 ) RETURN
  end if

  mp1 = m + 1
  do i = mp1 , N , 4
    Dy(i) = Dy(i) + Da*Dx(i)
    Dy(i+1) = Dy(i+1) + Da*Dx(i+1)
    Dy(i+2) = Dy(i+2) + Da*Dx(i+2)
    Dy(i+3) = Dy(i+3) + Da*Dx(i+3)
  end do

It is better to express this in the simple form of:

  do i = 1 , N
    Dy(i) = Dy(i) + Da*Dx(i)
  end do

This allows the compiler to generate efficient vector-code for the entire computation and also improves code readability.
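For readers more comfortable with C++, the same pair of loops can be sketched as follows. This is an illustrative translation, not code from the guide: the function names `daxpy_unrolled` and `daxpy_simple` and the use of `std::vector` are assumptions. Both functions perform the same element-wise update, so they should produce the same array.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical C++ rendering of the manually unrolled Fortran loop:
// a remainder loop covers the first n % 4 elements, then the main
// loop advances four elements per trip.
void daxpy_unrolled(double da, const std::vector<double>& dx,
                    std::vector<double>& dy) {
    assert(dx.size() == dy.size());
    const std::size_t n = dy.size();
    const std::size_t m = n % 4;
    for (std::size_t i = 0; i < m; ++i) {
        dy[i] += da * dx[i];
    }
    for (std::size_t i = m; i < n; i += 4) {
        dy[i]     += da * dx[i];
        dy[i + 1] += da * dx[i + 1];
        dy[i + 2] += da * dx[i + 2];
        dy[i + 3] += da * dx[i + 3];
    }
}

// The simple form: one statement per element, no remainder handling.
void daxpy_simple(double da, const std::vector<double>& dx,
                  std::vector<double>& dy) {
    assert(dx.size() == dy.size());
    for (std::size_t i = 0; i < dy.size(); ++i) {
        dy[i] += da * dx[i];
    }
}
```

The simple version has no remainder loop and no index arithmetic to get wrong, which is the readability benefit the text refers to.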

C++ Example where manual unrolling is done in the source:

double accu1 = 0, accu2 = 0, accu3 = 0, accu4 = 0;
double accu5 = 0, accu6 = 0, accu7 = 0, accu8 = 0;

for (i = 0; i < NUM; i += 8) {
    accu1 = src1[i+0]*src2 + accu1;
    accu2 = src1[i+1]*src2 + accu2;
    accu3 = src1[i+2]*src2 + accu3;
    accu4 = src1[i+3]*src2 + accu4;
    accu5 = src1[i+4]*src2 + accu5;
    accu6 = src1[i+5]*src2 + accu6;
    accu7 = src1[i+6]*src2 + accu7;
    accu8 = src1[i+7]*src2 + accu8;
}
accu = accu1 + accu2 + accu3 + accu4 +
       accu5 + accu6 + accu7 + accu8;

It is better to express this in the simple form of:

double accu = 0;

for (i = 0; i < NUM; i++) {
    accu = src1[i]*src2 + accu;
}
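The two reduction forms can be compared in a small self-contained sketch. The function names `scaled_sum_unrolled` and `scaled_sum_simple` and the `std::vector` interface are assumptions made for illustration. Note that the manual version reads eight elements per trip, so it is only valid when the length is a multiple of 8, whereas the simple form works for any length. The two forms add terms in a different order, so for general floating-point data the results may differ in the last bits. Whether the compiler is allowed to reorder the simple reduction when vectorizing it depends on the floating-point model in effect.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical wrapper around the manually unrolled reduction:
// eight partial sums, combined after the loop.
double scaled_sum_unrolled(const std::vector<double>& src1, double src2) {
    // The loop touches src1[i] .. src1[i+7], so the length must divide by 8.
    assert(src1.size() % 8 == 0);
    double accu1 = 0, accu2 = 0, accu3 = 0, accu4 = 0;
    double accu5 = 0, accu6 = 0, accu7 = 0, accu8 = 0;
    for (std::size_t i = 0; i < src1.size(); i += 8) {
        accu1 = src1[i + 0] * src2 + accu1;
        accu2 = src1[i + 1] * src2 + accu2;
        accu3 = src1[i + 2] * src2 + accu3;
        accu4 = src1[i + 3] * src2 + accu4;
        accu5 = src1[i + 4] * src2 + accu5;
        accu6 = src1[i + 5] * src2 + accu6;
        accu7 = src1[i + 6] * src2 + accu7;
        accu8 = src1[i + 7] * src2 + accu8;
    }
    return accu1 + accu2 + accu3 + accu4 +
           accu5 + accu6 + accu7 + accu8;
}

// The simple form: a single accumulator, valid for any length.
double scaled_sum_simple(const std::vector<double>& src1, double src2) {
    double accu = 0;
    for (std::size_t i = 0; i < src1.size(); ++i) {
        accu = src1[i] * src2 + accu;
    }
    return accu;
}
```

With small integer-valued inputs every intermediate sum is exact, so both functions return identical values. That makes such inputs a convenient check when refactoring an unrolled reduction back to its rolled form.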


It is essential that you read this guide from start to finish using the built-in hyperlinks to guide you along a path to a successful port and tuning of your application(s) on Intel® Xeon processors. The paths provided in this guide reflect the steps necessary to get best possible application performance.

