Avoid Manual Loop Unrolling

Compiler Methodology for Intel® MIC Architecture

The Intel® Compiler can typically generate efficient vectorized code when a loop has not been manually unrolled. It is better to let the compiler do the unrolling; you can control the unroll factor with "#pragma unroll (n)". Vector alignment, loop collapsing, and interactions with other loop optimizations become much more complex if the compiler has to "undo" manual unrolling. In all but the simplest cases, you must undo the manual unrolling yourself to get the best-performing vector code.
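
As a minimal sketch of the pragma mentioned above, the following C++ function keeps the loop rolled in the source and only hints an unroll factor to the compiler. The function name `axpy_rolled`, the use of `std::vector`, and the factor 4 are assumptions made for illustration; they are not taken from this guide. Compilers that do not recognize the pragma generally ignore it, possibly with a warning.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the guide): computes y[i] = y[i] + a * x[i].
void axpy_rolled(double a, const std::vector<double>& x, std::vector<double>& y) {
    assert(x.size() == y.size());  // precondition assumed by this sketch
    const std::size_t n = x.size();

    // Request an unroll factor of 4 while the source loop stays rolled.
    // The factor 4 is an arbitrary illustration, not a tuned value.
#pragma unroll(4)
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = y[i] + a * x[i];
    }
}
```

Because the unroll factor lives in a single pragma, it can be changed or removed for a different target without touching the loop body.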

In addition, manual loop unrolling tends to tune a loop for one particular processor or architecture, which can make it a poor fit when the application is later ported to a different one. As a general rule, write code in the most readable, straightforward manner. This gives the compiler the best chance of optimizing a given loop structure.

Fortran Example where manual unrolling is done in the source:

  m = MOD(N,4)

  if ( m /= 0 ) THEN
    do i = 1 , m
      Dy(i) = Dy(i) + Da*Dx(i)
    end do
    if ( N < 4 ) RETURN
  end if

  mp1 = m + 1
  do i = mp1 , N , 4
    Dy(i) = Dy(i) + Da*Dx(i)
    Dy(i+1) = Dy(i+1) + Da*Dx(i+1)
    Dy(i+2) = Dy(i+2) + Da*Dx(i+2)
    Dy(i+3) = Dy(i+3) + Da*Dx(i+3)
  end do

It is better to express this in the simple form of:

   do i = 1 , N
     Dy(i) = Dy(i) + Da*Dx(i)
   end do

This allows the compiler to generate efficient vector-code for the entire computation and also improves code readability.
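
To make the equivalence of the two Fortran forms concrete, here is a rough 0-based C++ rendering of both. The function names and the use of `std::vector` are assumptions for this sketch, not part of the guide. Each element is updated by the same expression in both versions, so they should produce the same array; only the loop structure differs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical C++ rendering of the manually unrolled Fortran loop.
void daxpy_unrolled(double da, const std::vector<double>& dx, std::vector<double>& dy) {
    assert(dx.size() == dy.size());  // precondition assumed by this sketch
    const std::size_t n = dx.size();
    const std::size_t m = n % 4;

    // Remainder iterations first, as in the Fortran source.
    for (std::size_t i = 0; i < m; ++i) {
        dy[i] = dy[i] + da * dx[i];
    }
    // Then blocks of four; (n - m) is a multiple of 4, so i + 3 < n holds.
    for (std::size_t i = m; i < n; i += 4) {
        dy[i]     = dy[i]     + da * dx[i];
        dy[i + 1] = dy[i + 1] + da * dx[i + 1];
        dy[i + 2] = dy[i + 2] + da * dx[i + 2];
        dy[i + 3] = dy[i + 3] + da * dx[i + 3];
    }
}

// Hypothetical C++ rendering of the simple form: one rolled loop.
void daxpy_simple(double da, const std::vector<double>& dx, std::vector<double>& dy) {
    assert(dx.size() == dy.size());
    for (std::size_t i = 0; i < dx.size(); ++i) {
        dy[i] = dy[i] + da * dx[i];
    }
}
```

The simple version needs no remainder handling and no index arithmetic, which is the structure the compiler can analyze most easily.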

C++ Example where manual unrolling is done in the source:

double accu1 = 0, accu2 = 0, accu3 = 0, accu4 = 0;
double accu5 = 0, accu6 = 0, accu7 = 0, accu8 = 0;

for (i = 0; i < NUM; i += 8) {
    accu1 = src1[i+0]*src2 + accu1;
    accu2 = src1[i+1]*src2 + accu2;
    accu3 = src1[i+2]*src2 + accu3;
    accu4 = src1[i+3]*src2 + accu4;
    accu5 = src1[i+4]*src2 + accu5;
    accu6 = src1[i+5]*src2 + accu6;
    accu7 = src1[i+6]*src2 + accu7;
    accu8 = src1[i+7]*src2 + accu8;
}
accu = accu1 + accu2 + accu3 + accu4 +
       accu5 + accu6 + accu7 + accu8;

It is better to express this in the simple form of:

double accu = 0;

for (i = 0; i < NUM; i++) {
    accu = src1[i]*src2 + accu;
}

It is essential that you read this guide from start to finish, using the built-in hyperlinks to guide you along a path to a successful port and tuning of your application(s) on the Intel® Xeon Phi™ coprocessor. The paths provided in this guide reflect the steps necessary to get the best possible application performance.

Back to Vectorization Essentials.
