Vectorization with the Intel® Compilers (Part I)

by Aart J.C. Bik


This article has been retired. Please go here for vectorization information.

A C++ vectorization getting-started article for new users, also found here, is a good substitute for this article.

The compiler options [a]x{KNBP} shown below are deprecated and may be removed in a future compiler release. DO NOT USE THESE OPTIONS.

The information below may be used for learning, but please use the up-to-date options listed in the link at the top of this page.

Introduction

Many general-purpose microprocessors today feature multimedia extensions that support SIMD (single-instruction-multiple-data) parallelism on relatively short vectors. By processing multiple data elements in parallel, these extensions provide a convenient way to exploit data parallelism in scientific, engineering, or graphical applications that apply a single operation to all elements in a data set, such as a vector or matrix. Vectorizing compilers that automatically take advantage of these extensions have proven essential for making them easier to use. A general description of compiler methods for converting sequential code into vector or parallel form can be found in the literature [1, 2, 3, 7, 9, 10, 11, 13, 14, 15, 16, 17, 18]. An in-depth presentation of the methods used specifically by the Intel® C++/Fortran compilers to convert sequential code into a form that exploits multimedia extensions, a process called intra-register vectorization, is given in [4, 5, 6].

This two-part article shows how to use the Intel compilers to exploit multimedia extensions effectively with a minimum of engineering effort. The first part focuses on the compiler switches and hints that enable intra-register vectorization. The second part provides important vectorization guidelines.


Vectorization Overview: Compiler Switches

The high-performance Intel® C++/Fortran compilers offer programmers a rich set of machine-independent and machine-specific optimizations to maximize program performance on any of the Intel platforms. This section summarizes compiler switches and hints that are specific to intra-register vectorization for the Intel® MMX™ technology and SSE/SSE2/SSE3 on Windows (refer to the compiler documentation [8] for a complete list of compiler switches and hints supported on Windows and Linux).

The C/C++ compiler is invoked from the command line as

=> icl [switches] source.c

and the Fortran compiler as

=> ifort [switches] source.f

where [switches] denotes a list of optional compiler switches. Table 1 lists compiler switches that are specific to intra-register vectorization.

 

Switch          Semantics
-Q[a]xK         generate code for the Pentium® III processor
-Q[a]xN         generate code for the Pentium® 4 processor
-Q[a]xB         generate code for the Pentium® M processor
-Q[a]xP         generate code for the Pentium® 4 processor with HT technology
-Qvec_reportn   control level of vectorization diagnostics:
                n=0: disable vectorization diagnostics
                n=1: report successfully vectorized code
                n=2: as n=1, plus failure diagnostics for loops
                n=3: as n=2, plus all prohibiting data dependences

Table 1: Compiler Switches for Intra-Register Vectorization

 

Switches -Q[a]x{KNBP} enable code generation in general, and hence intra-register vectorization in particular, for the instruction sets supported by the Pentium® III processor, Pentium® 4 processor, Pentium® M processor, and Pentium® 4 processor with HT technology, respectively. The optional 'a' in the switch (e.g. -QaxP) enables automatic processor dispatch. Under this option, the compiler generates a generic version that runs on any 32-bit Intel processor but, if deemed profitable, also a version that has been optimized for the specified processor. At runtime, the program automatically selects the appropriate optimized version based on the actual processor that is used to run the program. The switch -Qvec_report0 disables all vectorization diagnostics, which is useful if a silent compilation is desired. The switch -Qvec_report1 provides feedback for all code fragments that have been successfully vectorized. Each diagnostic reports the source file with the line and column number of the first statement in the vectorized code fragment. For example, suppose the contents of a source file main.c are as follows, where line numbers have been made explicit with comments.

 

#define N 32                          /* line 1 */
float a[N], b[N], c[N], d[N];         /* line 2 */
                                      /* line 3 */
doit() {                              /* line 4 */
  int i;                              /* line 5 */
  for (i = 0; i < N/2; i++) a[i] = i; /* line 6 */
  for (i = 1; i < N; i++) {           /* line 7 */
    b[i] = b[i-1] + 2;                /* line 8: data dependence cycle */
    c[i] = 1;                         /* line 9 */
  }                                   /* line 10 */
  d[0] = 10;                          /* line 11 */
  d[1] = 10;                          /* line 12 */
  d[2] = 10;                          /* line 13 */
  d[3] = 10;                          /* line 14 */
}                                     /* line 15 */

...

 

Compiling this source file with the following switches on the command line yields the diagnostics shown below.

 

=> icl -QxP -Qvec_report1 main.c
...
main.c(6) : (col. 11) remark: LOOP WAS VECTORIZED.
main.c(7) : (col. 16) remark: LOOP WAS PARTIALLY VECTORIZED.
main.c(11) : (col. 11) remark: BLOCK WAS VECTORIZED.

 

The diagnostics report full vectorization of the loop at line 6, partial vectorization of the loop at line 7 (after loop distribution), and vectorization of the loop that materialized from the straight-line code starting at line 11. The format is compatible with the Microsoft Visual C++ environment [12], where double-clicking on one of the diagnostics in the output window moves the focus of the editor window to the corresponding source file and position. Switches -Qvec_report2 and -Qvec_report3 provide feedback on the loops in a program that were not vectorized, which may be useful while trying to make the program more amenable to intra-register vectorization. Compiling the previous source file as shown below, for example, yields more verbose output.

 

=> icl -QxP -Qvec_report3 main.c
...
main.c(6) : (col. 11) remark: LOOP WAS VECTORIZED.
main.c(7) : (col. 11) remark: vector dependence: proven FLOW dependence between b line 8, and b line 8.
main.c(7) : (col. 11) remark: loop was not vectorized: existence of vector dependence.
main.c(7) : (col. 16) remark: PARTIAL LOOP WAS VECTORIZED.
main.c(11) : (col. 11) remark: BLOCK WAS VECTORIZED.

 

The feedback generated for these switches may become rather verbose, since diagnostics are given for every loop in the program, even for loops that are very unlikely candidates for vectorization. As a result, simply comparing the number of vectorized loops against the number of loops that were not vectorized generally provides a very poor measure of the quality of vectorization. The diagnostics merely serve the purpose of providing feedback that simplifies the task of rewriting important code fragments into a form that can be vectorized. Table 2 lists a number of other useful compiler switches that are not directly related to intra-register vectorization. The switch -Fa generates an assembly file, which can subsequently be inspected to determine the quality of the generated instructions. This process is illustrated below for the source file main.c shown earlier. For brevity, only the vector instructions for the statement at line 9 after distribution of the loop at line 7 are shown. For ease of reference, the Intel compiler annotates each assembly instruction with the line and column number of the original statement in the source file (viz. ;line.col).

 

=> icl -QxP -Fa main.c
...
=> cat main.asm
...
        movaps  xmm0, ...
.B1.2:
        movaps  XMMWORD PTR _c[eax+4], xmm0  ;9.12
        add     eax, 16                      ;7.16
        cmp     eax, 124                     ;7.11
        jb      .B1.2                        ;7.11
...
END

 

Switches -Od through -O3 control different levels of optimizations for code size and code speed. Switch -Qunroll0 disables loop unrolling in general and, hence, unrolling vector loops in particular. Loop unrolling that supports cache line split or prefetching instructions cannot be overridden by this switch. The compiler insertion of prefetching instructions by itself, however, is disabled with the switch -Qprefetch-. Switches -Qip and -Qipo enable interprocedural optimizations within a single source file and amongst multiple source files of a program, respectively. Amongst a wide variety of interprocedural optimizations, these switches enable the interprocedural alignment analysis for formal pointer arguments. The switch -fast enables the best possible generic switch combination to optimize for speed. Switch -Op restricts the compiler to optimizations that preserve floating-point precision, which disables the vectorization of loops that alter the way in which floating-point round-off errors accumulate (such as reductions). Switch -Qrestrict enables the use of the keyword restrict that conveys non-aliasing properties to the compiler. Finally, the switches -Qparallel and -Qopenmp enable the automatic parallelization and explicit parallelization by means of the OpenMP extensions, respectively.
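As an illustration of the -Op restriction, consider the floating-point sum reduction sketched below (assuming an array a of length N as in the earlier example). Vectorization adds partial sums in a different order than the sequential code, thereby altering how round-off errors accumulate, so under -Op this loop remains sequential.

float sum = 0.0f;
for (i = 0; i < N; i++) {
  sum += a[i]; /* reduction: reordering the additions changes round-off accumulation */
}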

Switch       Semantics
-Fa          generate assembly file
-Od          disable optimizations
-O1          enable optimizations for code size
-O2          enable default optimizations for speed
-O3          enable all optimizations for speed
-Qunroll0    disable loop unrolling
-Qprefetch-  disable software prefetching
-Qip         enable interprocedural optimizations within a single source file
-Qipo        enable interprocedural optimizations amongst source files
-fast        enable best switch combination for speed
-Op          restrict floating-point optimizations
-Qrestrict   enable use of the restrict keyword
-Qparallel   enable automatic parallelization
-Qopenmp     enable parallelization with OpenMP extensions

Table 2: Other Useful Compiler Switches

 


Profile-Guided Optimization

In a traditional static compilation model, the compiler must necessarily ground all optimization decisions on only an estimate of important execution characteristics. Branch probabilities, for example, are estimated by assuming that controlling conditions that test equality are less likely to succeed than conditions that test inequality. Relative execution counts are based on static properties such as nesting depth. These estimated execution characteristics are subsequently used to make optimization decisions such as selecting an appropriate instruction layout, procedure inlining, or generating both a sequential and a vector version of a loop. The quality of such decisions can improve substantially if more accurate execution characteristics are available, which becomes possible under profile-guided optimization.

 

Switch       Semantics
-Qprof_gen   generate instrumented program
-Qprof_use   use profile summary to optimize program

Table 3: Compiler Switches for Profile-Guided Optimization

 

Table 3 shows the compiler switches that enable the following three-step compilation session for profile-guided optimization.

  1. Generate an instrumented program with -Qprof_gen.
  2. Apply the instrumented program to representative input sets.
  3. Generate an optimized program with -Qprof_use.

 

While the instrumented program generated in the first step is applied to one or several representative input sets in the second step, important execution characteristics are gathered in a profile summary. Optimization decisions are subsequently based on this profile summary during the third step rather than on only a static estimate of execution characteristics. Profile-guided optimization is most effective when a small, but representative training input data set is used in the second step, and the third step is combined with other optimizations, like interprocedural optimizations or vectorization. A typical command line session to compile the source file program.c into a profile-guided optimized binary for a Pentium® 4 processor with HT technology is shown below.

 

=> icl -Qprof_gen program.c
...
=> program.exe < train_input.txt
...
=> icl -Qprof_use -QxP -Qipo -O3 program.c
...

 

The new binary program.exe that is obtained in this manner can subsequently be applied to many heavy workloads.


Compiler Hints

Table 4 summarizes all the compiler hints that are relevant to automatic intra-register vectorization. The pragmas must be inserted before a loop to convey certain information about this loop to the compiler. Inserting #pragma ivdep before a loop asserts that none of the conservatively assumed data dependences that prohibit vectorization of the loop actually occur. This pragma is useful in cases where data dependence analysis fails to detect a vector loop that is obvious to a programmer with more knowledge of the application domain. If in the loop shown below, for example, variable k is always non-negative, the programmer can use #pragma ivdep to inform the compiler that the assumed flow dependence caused by potentially reading an element of a that was written in a previous iteration can safely be discarded.

 

#pragma ivdep
for (i = 0; i < N; i++) {
  a[i] = a[i+k] + 1;
}

 

If the compiler could not prove data independence statically but was able to vectorize the loop with runtime data dependence testing, this pragma is still useful to avoid the overhead of those tests. The pragma cannot be used, however, to override proven data dependences. If k in the example above is replaced with -1, for instance, the #pragma ivdep is simply ignored by the compiler and the loop remains sequential due to the now proven flow dependence, as the following sketch illustrates.
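
#pragma ivdep /* ignored: the flow dependence below is proven, not assumed */
for (i = 0; i < N; i++) {
  a[i] = a[i-1] + 1; /* a[i-1] was written in the previous iteration */
}

In situations where vectorization of a loop is valid, but the built-in efficiency heuristics of the compiler deem vectorization unprofitable, #pragma vector always can be used to override this decision.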

 

Syntax of Hint                   Semantics
#pragma ivdep                    discard assumed data dependences
#pragma vector always            override efficiency heuristics
#pragma vector nontemporal       enable streaming stores
#pragma vector [un]aligned       assert [un]aligned property
#pragma novector                 disable vectorization
#pragma distribute point         suggest point for loop distribution
#pragma loop count (<int>)       estimate trip count
restrict                         assert exclusive access through pointer
__declspec(align(<int>,<int>))   suggest memory alignment
__assume_aligned(<var>,<int>)    assert alignment property

Table 4: Compiler Hints for Intra-Register Vectorization

 

This pragma suggests vectorization of the loop regardless of the outcome of efficiency heuristics. Validity considerations, however, cannot be overridden with this pragma. In the fragment below, for instance, the pragma has no impact, because the compiler is unable to vectorize the output statement in the loop.

 

#pragma vector always
for (i = 0; i < 100; i++) {
  printf("i=%d", i); /* you wish */
}

 

Conversely, #pragma novector disables vectorization of a loop, which is useful to prevent vectorization of a loop that seems profitable to the compiler but actually exhibits a slowdown at runtime. Inserting #pragma vector nontemporal before a loop instructs the compiler to use streaming stores for all 16-byte aligned memory references. Similarly, with #pragma vector aligned and #pragma vector unaligned, the programmer can instruct the compiler to assume that all memory references in the loop are 16-byte aligned or unaligned, respectively. Both pragmas must be used with care because incorrect usage can result in program faults or performance degradation. A minimal sketch of the aligned variant is shown below, where the arrays x and y (illustrative names) are assumed to be guaranteed 16-byte aligned by the programmer.
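
#pragma vector aligned /* all references below may use aligned loads and stores */
for (i = 0; i < N; i++) {
  x[i] = y[i] + 1;
}

Pragmas may even be inserted before code from which a loop will most likely materialize, as illustrated with the following example.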

 

float a[4];

#pragma vector nontemporal
a[0] = 0;
a[1] = 0;
a[2] = 0;
a[3] = 0;

is translated into

xorps   xmm0, xmm0
movntps a,    xmm0

 

Since there is no guarantee on the materialization of loops, however, the compiler is also free to ignore such pragmas altogether. A suitable point for loop distribution can be suggested to the compiler without actually performing the source code modifications with a #pragma distribute point in the loop body, as illustrated below.

 

for (i = 0; i < N; i++) {
  a[i] = 0;
#pragma distribute point
  b[i] = 0;
}

suggests

for (i = 0; i < N; i++) {
  a[i] = 0;
}
for (i = 0; i < N; i++) {
  b[i] = 0;
}

 

Within one loop, several of these pragmas can be used to suggest multiple loop distribution points to the compiler. An estimated trip count can be conveyed to the compiler as shown below, where the programmer has indicated that the i-loop typically iterates only seven times.

 

#pragma loop count (7)
for (i = 0; i < n; i++) {
  a[i] = 0;
}

 

This estimate is subsequently used by the compiler to determine whether vectorization of the loop seems profitable. If the vector length exceeds the estimated trip count, for example, the loop will remain sequential. The keyword restrict asserts that a pointer variable provides exclusive access to its associated memory region. If the following function add() is only applied to distinct arrays, for example, this information can be conveyed to the compiler as follows.

 

void add(char * restrict p, char * restrict q, int n) {
  int i;
  for (i = 0; i < n; i++) {
    p[i] = q[i] + 1;
  }
}

 

This keyword provides a convenient method to assert the exclusive access property for pointer variables that are used in many different loops, because it avoids the tedious task of inserting a #pragma ivdep before every potential vector loop. In contrast with pragmas, which are typically ignored by other compilers, this language extension requires the compiler switch -Qrestrict when compiling with the Intel compiler, and may cause syntax errors with other compilers. The program can still be kept portable, however, with the following macro mechanism, where a default compilation with any compiler simply discards all the restrict hints (now spelled in capitals), while the hints can be conveyed to the Intel compiler with the switch combination "-D__RESTRICT -Qrestrict" (switch -D defines a macro name).

 

#ifdef __RESTRICT
#define RESTRICT restrict
#else
#define RESTRICT
#endif

void add(char * RESTRICT p, char * RESTRICT q, int n) {
  ...
}
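
For example, assuming the function above resides in a file add.c (an illustrative name), the hints are enabled with the following invocation, while a plain compilation simply discards them.

=> icl -D__RESTRICT -Qrestrict -QxP add.c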

 

The programmer can use the hint

__declspec(align(base, off))   where 0 <= off < base = 2^n

 

in a declaration to suggest allocating the declared entity at an address a that satisfies the equality a mod base = off. Suppose, as an example, that most of the execution time of a program is spent in a loop of the following form.

 

double a[N], b[N];
...
for (i = 0; i < N; i++) a[i+1] = b[i] * 3;

 

Since the compiler most likely selects a 16-byte alignment for both arrays, either an unaligned load or an unaligned store has to be used after vectorization. The programmer can suggest the alternative alignment shown below for the arrays, however, which results in two aligned access patterns after vectorization: if array a starts at an address that is 8 mod 16, then &a[1] falls on a 16-byte boundary, so the stores to a[i+1] become aligned while the loads from b[i] remain aligned as well.

 

__declspec(align(16, 8)) double a[N];
__declspec(align(16, 0)) double b[N]; /* or: align(16) */

 

If program analysis has failed to determine an appropriate alignment for the memory regions that may be associated with a particular pointer variable p, the programmer can use the hint

 

__assume_aligned(p, base);   where base = 2^n

 

to convey that all memory regions that are associated with the pointer variable p are guaranteed to satisfy at least the given alignment. Consider, for instance, the following function.

 

void fill(char *x) {
  int i;
  for (i = 0; i < 1024; i++) x[i] = 1;
}

 

If interprocedural alignment analysis (enabled under -Qip or -Qipo) has failed to derive any useful conclusion on the alignment of the memory regions associated with the formal pointer argument x, the compiler resorts to dynamic loop peeling to enforce an aligned access pattern in the i-loop. This provides a generally effective way to obtain aligned access patterns at the expense of a slight increase in code size and testing. Conceptually, the peeled i-loop takes the form sketched below (a sketch for a 32-bit address space, not actual compiler output).
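
int peel = (16 - ((unsigned) x & 15)) & 15;  /* elements until x+i is 16-byte aligned */
for (i = 0; i < peel; i++) x[i] = 1;         /* scalar prologue */
for (; i < 1024; i++) x[i] = 1;              /* aligned access pattern: vectorizable */

If incoming access patterns are guaranteed 16-byte aligned, however, the programmer can avoid this overhead altogether by conveying the alignment information to the compiler as follows.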

 

void fill(char *x) {
  __assume_aligned(x, 16);
  ...
}

 

Like all other alignment hints, this hint must be used with care, because incorrect usage can result in program faults, as the hypothetical misuse sketched below illustrates (the array name buf is illustrative).
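
__declspec(align(16)) char buf[1025];
...
fill(buf + 1); /* violates __assume_aligned(x, 16): buf+1 is not 16-byte
                  aligned, so the aligned stores in fill() may fault */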




References

[1] J.R. Allen and K. Kennedy. Automatic translation of FORTRAN programs to vector form. ACM Transactions on Programming Languages and Systems, 9:491–542, 1987.

[2] Utpal Banerjee. Loop Transformations for Restructuring Compilers: The Foundations. Kluwer, Boston, 1993. A Book Series on Loop Transformations for Restructuring Compilers.

[3] Utpal Banerjee. Loop Parallelization. Kluwer, Boston, 1994. A Book Series on Loop Transformations for Restructuring Compilers.

[4] Aart J.C. Bik, Milind Girkar, Paul M Grey, and Xinmin Tian. Automatic intra-register vectorization for the Intel® Architecture. International Journal on Parallel Processing, 2001.

[5] Aart J.C. Bik, Milind Girkar, Paul M Grey, and Xinmin Tian. Efficient exploitation of parallelism on Pentium® III and Pentium® 4 Processor-based systems. Intel Technology Journal, February, 2001. See http://www.intel.com/technology/itj/.

[6] Aart J.C. Bik, Milind Girkar, Paul M. Grey, and Xinmin Tian. Experiments with automatic vectorization for the Pentium® 4 Processor. In Proceedings of the 9th Workshop on Compilers for Parallel Computers, Edinburgh, Scotland, UK, June 2001.

[7] Jack J. Dongarra, Iain S. Duff, Danny C. Sorensen, and Henk A. van der Vorst. Solving Linear Systems on Vector and Shared Memory Computers. Society for Industrial and Applied Mathematics, 1991.

[8] Intel Corporation. High-performance Intel C++/Fortran compilers, 1998-2003. More information can be found at http://www.intel.com/software/products/.

[9] David J. Kuck. The Structure of Computers and Computations. John Wiley and Sons, New York, 1978. Volume 1.

[10] Leslie Lamport. The parallel execution of DO loops. Communications of the ACM, pages 83–93, 1974.

[11] John M. Levesque and Joel W. Williamson. A Guidebook to FORTRAN on Supercomputers. Academic Press, San Diego, 1991.

[12] Microsoft. MSDN Visual C++. More information can be found at http://msdn2.microsoft.com/en-us/visualc/default.aspx.

[13] Kenneth W. Neves. Vectorization of scientific software. In J.S. Kowalik, editor, High-Speed Computing, pages 277–291. Springer-Verlag, Berlin, 1984. NATO ASI Series, Volume F7.

[14] David A. Padua and Michael J. Wolfe. Advanced compiler optimizations for supercomputers. Communications of the ACM, 29:1184–1201, 1986.

[15] Constantine D. Polychronopoulos. Parallel Programming and Compilers. Kluwer, Boston, 1988.

[16] Michael J. Wolfe. Optimizing Supercompilers for Supercomputers. Pitman, London, 1989.

[17] Michael J. Wolfe. High Performance Compilers for Parallel Computing. Addison-Wesley, Redwood City, California, 1996.

[18] H. Zima and B. Chapman. Supercompilers for Parallel and Vector Computers. ACM Press, New York, 1990.

