Intel® Compiler 17.0 New Feature: Code Alignment for Loops

By Yuan Chen

Published: 06/30/2016   Last Updated: 06/30/2016

Intel® Compiler 17.0 provides a new option to control code alignment for loops.

Syntax

Linux OS and OS X:    -falign-loops[=n], -fno-align-loops
Windows OS:    /Qalign-loops[:n], /Qalign-loops-

Arguments
n is the number of bytes for the minimum alignment boundary. It must be a power of 2 between 1 and 4096, such as 1, 2, 4, 8, 16, 32, 64, 128, and so on. n = 1 performs no alignment. If n is omitted, an alignment of 16 bytes is used.
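
The constraint on n can be checked mechanically. The following C sketch only illustrates the rule stated above; the function name is hypothetical and is not part of the compiler or its tools.

```c
#include <stdbool.h>

/* Hypothetical helper, not part of any Intel tool: returns true when n is
 * an acceptable argument for -falign-loops=n as described above, that is,
 * a power of 2 in the range 1..4096. */
bool is_valid_loop_alignment(unsigned n)
{
    /* A power of 2 has exactly one bit set, so n & (n - 1) is zero. */
    return n >= 1 && n <= 4096 && (n & (n - 1)) == 0;
}
```

Under this check, values such as 16 and 4096 are accepted, while 0, 24, and 8192 are rejected.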

Default: -fno-align-loops (Linux OS and OS X), /Qalign-loops- (Windows OS). No special loop alignment is performed.

How does this work? After compiling with this option, you will see “.align” directives [1] inserted before loops in the generated assembly. For example, when compiling with “-falign-loops=64 -qopt-report -g -O2 -xCORE-AVX2”, you will see “.align 64,0x90” generated right before the vectorized loop:

..LN38:
        vmovapd   %ymm0, %ymm9                                  #39.13
        .align    64,0x90
..LN39:
                                # LOE rax rdx rcx rbx rbp rdi r8 r9 r11 r12 esi r10d r14d xmm8 ymm0 ymm1 ymm2 ymm3 ymm4 ymm5 ymm6 ymm7 ymm9
..B1.6:                         # Preds ..B1.6 ..B1.5
                                # Execution count [2.50e+01]
..L11:                                                          #38.7
                # optimization report
                # LOOP WAS UNROLLED BY 8
                # LOOP WAS VECTORIZED
                # VECTORIZATION SPEEDUP COEFFECIENT 3.878906
                # VECTOR TRIP COUNT IS ESTIMATED CONSTANT
                # VECTOR LENGTH 4
                # NORMALIZED VECTORIZATION OVERHEAD 0.343750
                # MAIN VECTOR TYPE: 64-bits floating point
..LN40:
        .loc    1  38  is_stmt 1
..LN41:
        .loc    1  39  is_stmt 1
        vmovupd   (%r8,%r12,8), %ymm10                          #39.31
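
To make the effect of the directive concrete: “.align 64,0x90” asks the assembler to advance the location counter to the next multiple of 64, filling the skipped bytes with 0x90 (the single-byte x86 NOP). The C sketch below computes how many fill bytes that takes for a given code offset. The function name and the sample offset in the note are invented for illustration and do not come from the listing above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of fill bytes needed to move `offset` up to
 * the next multiple of `n`. `n` must be a power of 2. */
size_t align_padding(size_t offset, size_t n)
{
    assert(n != 0 && (n & (n - 1)) == 0);
    /* Distance to the next boundary; 0 when offset is already aligned. */
    return (n - (offset & (n - 1))) & (n - 1);
}
```

For instance, a loop head that would otherwise fall at offset 0x43 needs 61 fill bytes to start at 0x80, while one already at a multiple of 64 needs none.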

Finer Grained Control

Using the compiler option to control loop alignment affects all loops in your program. Sometimes it is better to apply this feature to a specific loop through finer-grained control.

Syntax

C/C++: #pragma code_align(n)

Fortran: !DIR$ CODE_ALIGN [:n]

The argument n is optional and has the same meaning as the argument of the -falign-loops or /Qalign-loops option.

This pragma (directive) must precede the loop to be aligned. If the code is compiled with the /Qalign-loops:m or -falign-loops=m option and a code_align pragma (directive) with argument n precedes a loop, the loop is aligned on a max(m, n) byte boundary.
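
The interaction between the option and the pragma can be summarized in a few lines of C. This is a sketch of the rule stated above, not the compiler's implementation. The function name is hypothetical, and the value 1 is used here to mean "nothing requested", consistent with n = 1 performing no alignment.

```c
/* Hypothetical model of the rule above: m comes from -falign-loops=m (or
 * /Qalign-loops:m), n from the code_align pragma or directive on the loop.
 * Pass 1 for whichever one is absent, since an alignment of 1 is no
 * alignment at all. */
unsigned effective_loop_alignment(unsigned m, unsigned n)
{
    return m > n ? m : n;
}
```

So with -falign-loops=16, a loop marked for 64 bytes is aligned to 64 bytes, and with -falign-loops=64, a loop marked for 16 bytes is still aligned to 64 bytes.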

Example in C:

    for (i = 0; i < rows; i += inc_i) {
#pragma code_align 64
#pragma vector aligned
        for (j = 0; j < cols; j += inc_j) {
            b[i] += a[i][j] * x[j];
        }
    }
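
For experimenting, the fragment above can be wrapped in a small self-contained function. This is a sketch under assumptions: the function name, the fixed ROWS and COLS sizes, and the unit strides are invented here. `#pragma vector aligned` is left out because this sketch makes no promise about data alignment. The pragma is guarded by `__INTEL_COMPILER`, the macro defined by the classic Intel compiler, so that other compilers build the code without unknown-pragma warnings.

```c
#include <stddef.h>

#define ROWS 4   /* illustrative sizes, not from the original example */
#define COLS 8

/* Accumulates the matrix-vector product a * x into b. */
void matvec_accumulate(double b[ROWS], double a[ROWS][COLS],
                       const double x[COLS])
{
    for (size_t i = 0; i < ROWS; i++) {
        /* Request a 64-byte boundary for the inner loop. */
#if defined(__INTEL_COMPILER)
#pragma code_align 64
#endif
        for (size_t j = 0; j < COLS; j++) {
            b[i] += a[i][j] * x[j];
        }
    }
}
```

The pragma changes only where the loop's machine code is placed, so the computed result is the same with or without it.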

Example in Fortran:

!DIR$ CODE_ALIGN :64
    do i = 1, cols, inc_i
        do j = 1, rows, inc_j
            b(i) = b(i) + a(j, i) * x(j)
        enddo
    enddo

Recommended usage

The Loop Stream Detector (LSD) [2] in modern Intel® processors can benefit from loop alignment for small loop bodies. In fact, compiler optimization may automatically generate such align directives for small loops that fit the LSD exactly, as it does for the example above when compiled with the default option -msse2.

For loops much smaller than the LSD size that are also nested within another loop, using the loop alignment option or the finer-grained pragma (directive) may give a visible performance gain.

For example, in a loop like the one below, the inner loop contains fewer than 10 instructions. Adding a loop alignment directive before the outer loop gives better performance:

  do n=1,100
!DIR$ CODE_ALIGN :16
    do i=1,100000,10
      a(:,i) =  b(:,i) + i
    enddo
  enddo

As stated in the Intel® 64 and IA-32 Architectures Software Developer's Manuals [2], we should:

“Use the loop cache functionality opportunistically. For high performance code, loop unrolling is generally preferable for performance even when it overflows the LSD capability.”

Reference

1. Align directive in x86 assembly: http://web.mit.edu/gnu/doc/html/as_7.html

2. Intel® 64 and IA-32 Architectures Software Developer's Manuals
