Use Pragmas with the Intel® C++ Compiler for Linux on 64-Bit Architecture


Challenge

Use pragmas with the Intel® C++ Compiler for Linux* to improve performance on the Intel® Itanium® processor. A pragma is a statement used to suggest that a compiler optimization be performed or to convey information about the code that the compiler might not be able to determine. It is important to realize that a pragma is only a hint or suggestion to the compiler; it may be ignored.


Solution

Insert the correct pragmas and directives in your code to specify software pipelining, loop unrolling, loop distribution, data prefetching, trip counts, and loop-carried dependencies. Typically, a pragma is placed before the statement (e.g., a loop) that you would like to optimize. Pragmas are written using a special syntax. For the Intel C++ Compiler 7.0 for Linux, they take the form "#pragma directive".
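
As a minimal sketch of this placement rule, the C function below puts a pragma on its own line directly before the loop it targets. The function name scale and its parameters are hypothetical, chosen only for illustration; the swp directive is the one described in the list that follows. A compiler that does not recognize a pragma ignores it, so the function computes the same result either way.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: multiply n doubles in place by factor. */
void scale(double *x, size_t n, double factor)
{
    size_t i;

    assert(x != NULL || n == 0);

    /* Hint only: ask for the next loop to be software pipelined. */
#pragma swp
    for (i = 0; i < n; i++) {
        x[i] *= factor;
    }
}
```

Replacing swp with noswp in the same position would make the opposite request for this loop.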

  • Pipelining: To request that a loop be software pipelined, use the "swp" directive (i.e., #pragma swp). You should place the pragma immediately before the loop. Pipelining can reduce the number of cycles to execute a loop by overlapping the execution of multiple iterations. The Itanium 2 processor supports software pipelining in hardware with the use of rotating registers. To request that a loop not be software pipelined, use the "noswp" directive. This may be advantageous for loops that iterate few times, as pipelining introduces overhead in executing the prolog and epilog code sequences.
  • Loop Unrolling and Distribution: Loop unrolling replicates the loop body, adjusting induction variables so that semantics are preserved. This gives the compiler more flexibility in performing instruction scheduling of small loop bodies by increasing instruction-level parallelism (ILP). It also reduces the overhead associated with each iteration, as the unrolled loop will iterate fewer times. Since the Itanium 2 processor has six arithmetic and logical units and four floating-point units, it can be critical to performance that several instructions are available for scheduling. The directive "unroll(u)" can be used to suggest that the compiler unroll the loop body u times. Unrolling a loop, however, can increase register pressure and code size. To indicate that a loop should not be unrolled, use the directive "nounroll", or equivalently, "unroll(0)".

    Sometimes it is beneficial to split one loop into two loops. This is called loop distribution. Each resulting loop contains a subset of the instructions from the initial loop. Thus, distribution can enable software pipelining, reduce register pressure, and improve both instruction and data cache use. To specify where a loop should be split, insert the "distribute point" directive inside the loop body, before the instructions you would like to be placed in the second loop. In this case the compiler will assume that there are no loop-carried dependencies, so it is up to you to make sure the split is safe. If you instead place the directive before the loop, ecc will determine the split point and observe any dependencies to ensure the transformation is valid.
  • Data Prefetching, Trip Counts, and Loop-Carried Dependencies: Data prefetching obtains data before it is needed, overlapping memory accesses with computations that do not require the data. Loops that iterate over large arrays using a regular access pattern can benefit from prefetching. Prefetching can ensure that array accesses are satisfied by the L1 cache, which results in the lowest possible memory latency of one cycle. Placed before a loop, the directive "prefetch [array1, array2]" suggests that array1 and array2 be prefetched within the loop. Thus, if array1[i] is accessed within the loop, the compiler will prefetch 64 bytes of array elements, beginning at array1[i+d], where d is determined by the compiler. This directive must be used in conjunction with the -O3 switch. It also may be useful to specify that an array not be prefetched. This is accomplished with the "noprefetch [array3]" directive. For example, if all accesses to array3 only reference a small slice of the array, prefetching can result in unnecessary data movement and wasted memory cycles.

    Sometimes the compiler is unable to determine the number of times a loop will iterate (i.e., the trip count of the loop). This could occur if the loop continuation condition involves a pointer dereference. Without knowing the trip count, the compiler is unable to perform most loop transformations, and performance is degraded. To assist the compiler, you can specify a trip count of n with the directive "loop count (n)".

    Loop-carried dependencies constrain the order in which loop iterations can be performed. For example, an early iteration of a loop may write a value required by a later iteration. When such a loop-carried dependency exists, the loop cannot be easily parallelized or pipelined. If the compiler is unable to detect that no loop-carried dependencies exist, it must make the conservative assumption that one exists. This will result in ILP being wasted. The directive "ivdep" placed before a loop is a programmer assertion that either there are no loop-carried dependencies or that they may safely be ignored.
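
The directives in the list above might be applied as in the following sketch. All function names, array names, and numeric values (the unroll factor 4, the trip count 1000) are hypothetical placeholders. The prefetch line assumes that the square brackets in the text are notation for the argument list rather than literal characters; check your compiler documentation before relying on that spelling. Every pragma here is a hint, so the functions return the same results whether or not the compiler acts on them.

```c
#include <assert.h>
#include <stddef.h>

/* Unrolling: suggest four copies of a small loop body. */
double dot(const double *a, const double *b, size_t n)
{
    double sum = 0.0;
    size_t i;
#pragma unroll(4)
    for (i = 0; i < n; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}

/* Distribution: suggest a split at the marked point, so the statement
   after the pragma moves to a second loop. The caller is assumed to pass
   non-overlapping arrays, so no dependence crosses the split. */
void scale_and_offset(double *a, double *b, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        a[i] = a[i] * 2.0;
#pragma distribute point
        b[i] = b[i] + 1.0;
    }
}

/* Prefetching and ivdep: dst and src are assumed not to overlap, which
   is what makes the "no loop-carried dependencies" assertion safe. */
void copy_plus_one(double *dst, const double *src, size_t n)
{
    size_t i;
#pragma prefetch src
#pragma ivdep
    for (i = 0; i < n; i++) {
        dst[i] = src[i] + 1.0;
    }
}

/* Trip count: the bound is read through a pointer, so the compiler may
   not be able to determine it; the pragma supplies a value instead. */
long sum_first(const int *v, const size_t *limit)
{
    long total = 0;
    size_t i;

    assert(limit != NULL);
#pragma loop count (1000)
    for (i = 0; i < *limit; i++) {
        total += v[i];
    }
    return total;
}
```

Note that ivdep and an in-loop distribute point are assertions by the programmer: if the arrays passed to copy_plus_one or scale_and_offset did overlap, a compiler that honors the pragmas could generate code that gives wrong results. As the text states, the prefetch directive takes effect only together with the -O3 switch.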

 


Source

Directives and Pragmas and Switches Oh My!

 

