The Importance of Vectorization for Intel® Many Integrated Core Architecture (Intel® MIC Architecture) (Fortran Example)

Reference Link and Download

Intel Vectorization Tools
Code tarball [TAR 16KB]


To get good performance out of the Intel® Many Integrated Core architecture (Intel® MIC architecture) and systems including Intel® Xeon Phi™ coprocessors, applications need to take advantage of the 16-wide SIMD registers as well as the many cores. Ways of doing this range from calling suitably optimized library functions to writing assembly code or calling intrinsic functions that mimic assembly. The former is straightforward but limited to standard library functions; the latter requires a great deal of effort and expertise. The compiler vectorizer provides a middle way that allows effective optimization of a wide range of codes without requiring a large effort or an in-depth understanding of the microarchitecture.

For the Intel® Compiler, vectorization is the unrolling of a loop combined with the generation of packed SIMD instructions. Because the packed instructions operate on more than one data element at a time, the loop can execute more efficiently. The programmer may be able to help the compiler vectorize more loops through a simple programming style and by explicit help through compiler directives.
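As a rough illustration of what this transformation amounts to, the sketch below writes the same loop twice: once in its original scalar form, and once strip-mined by hand into blocks of 16 elements, with each block written as an array-section operation that stands in for one packed SIMD instruction. The program name, array names, and the constant vlen are assumptions made for this illustration; the compiler performs the equivalent transformation internally, including the remainder loop.

```fortran
program unroll_sketch
  implicit none
  integer, parameter :: n = 1000     ! hypothetical problem size
  integer, parameter :: vlen = 16    ! single-precision elements per 512-bit register
  real :: a(n), b(n), c_scalar(n), c_blocked(n)
  integer :: i, nblk

  call random_number(a)
  call random_number(b)

  ! Original scalar loop: one element per iteration.
  do i = 1, n
     c_scalar(i) = a(i) * b(i) + 1.0
  end do

  ! Conceptual vectorized form: each iteration processes vlen elements,
  ! as a single packed multiply and add would.
  nblk = (n / vlen) * vlen
  do i = 1, nblk, vlen
     c_blocked(i:i+vlen-1) = a(i:i+vlen-1) * b(i:i+vlen-1) + 1.0
  end do

  ! Remainder loop for the elements that do not fill a whole vector.
  do i = nblk + 1, n
     c_blocked(i) = a(i) * b(i) + 1.0
  end do

  print *, 'max difference between the two forms: ', &
           maxval(abs(c_scalar - c_blocked))
end program unroll_sketch
```

In practice there is no need to write the blocked form by hand; it is shown only to make clear why a vectorized loop needs fewer iterations than the scalar one.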

This article illustrates the use of the vectorizer and how it can greatly improve the performance of a simple Fortran application that performs a three-dimensional integration. It concludes with some general advice for getting loops to vectorize.


The small application cube_charge calculates the electrostatic potential at a series of points outside a uniformly charged cube. It does this by performing a three-dimensional integral over the cube using the trapezoidal rule. The application is threaded using OpenMP over the number of external points. (It could alternatively be threaded over the outermost integral.) It is computation-intensive, with very little data movement. After initialization, the main computation is offloaded to the coprocessor.
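The structure of such a calculation can be sketched as follows. This is not the code from the tarball: the module, function, and variable names, the unit cube, the grid size, and the positions of the external points are all assumptions chosen for illustration, constant physical factors are omitted, and the offload directives are left out. It shows a trapezoidal-rule integral of the inverse distance over the cube, with an OpenMP loop over the external points.

```fortran
module cube_sketch
  implicit none
contains
  ! Integrand: inverse distance between a source point (x,y,z) in the cube
  ! and the external point (xp,yp,zp).
  pure function inv_dist(x, y, z, xp, yp, zp) result(f)
    real, intent(in) :: x, y, z, xp, yp, zp
    real :: f
    f = 1.0 / sqrt((x-xp)**2 + (y-yp)**2 + (z-zp)**2)
  end function inv_dist

  ! Trapezoidal weight: 1/2 at the two end points, 1 elsewhere.
  pure function wt(i, n) result(w)
    integer, intent(in) :: i, n
    real :: w
    w = 1.0
    if (i == 0 .or. i == n) w = 0.5
  end function wt

  ! Potential at one external point from the unit cube [0,1]^3,
  ! using n intervals in each dimension.
  pure function potential(xp, yp, zp, n) result(v)
    real, intent(in) :: xp, yp, zp
    integer, intent(in) :: n
    real :: v, h, sumx
    integer :: i, j, k
    h = 1.0 / real(n)
    v = 0.0
    do k = 0, n
       do j = 0, n
          sumx = 0.0
          do i = 0, n   ! innermost loop: the candidate for vectorization
             sumx = sumx + wt(i, n) * inv_dist(i*h, j*h, k*h, xp, yp, zp)
          end do
          v = v + wt(j, n) * wt(k, n) * sumx
       end do
    end do
    v = v * h**3
  end function potential
end module cube_sketch

program cube_charge_sketch
  use cube_sketch
  implicit none
  integer, parameter :: npts = 8   ! hypothetical number of external points
  real :: v(npts)
  integer :: ip

  ! Thread over the external points; each point is independent.
  !$omp parallel do
  do ip = 1, npts
     v(ip) = potential(1.5 + 0.5*ip, 0.5, 0.5, 64)
  end do
  !$omp end parallel do

  do ip = 1, npts
     print '(a,f5.2,a,es12.5)', 'x = ', 1.5 + 0.5*ip, '  potential = ', v(ip)
  end do
end program cube_charge_sketch
```

Because the work per point is large and the points are independent, threading over points gives each thread a long stretch of computation with almost no shared data.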

The application is built for and run with offload by default:

ifort -openmp cube_charge.f90 threed_int.f90 twod_int.f90 trap_int.f90 func.f90
./a.out < points.dat

It runs in roughly 1 second per data point on an Intel® Xeon Phi™ coprocessor hosted by an Intel® Core™ i7 system, for a number of points that is 4 times the number of available physical cores. (For comparison, it runs in roughly 2 minutes per point for a single OpenMP thread.) The OpenMP environment, including the number of threads, can be set explicitly via environment variables, e.g.

export MIC_KMP_AFFINITY=balanced


A report of which loops were vectorized, which were not, and the reasons why not may be obtained by specifying the -vec-report switch for the offload compilation:

ifort -openmp -offload-option,mic,compiler,"-vec-report2" \
    cube_charge.f90 threed_int.f90 twod_int.f90 trap_int.f90 func.f90

The report shows no vectorized loops, e.g.

trap_int.f90(33): (col. 5) remark: *MIC* loop was not vectorized:
existence of vector dependence.

The call to FUNC at line 35 of the loop at line 33 of TRAP_INT is preventing vectorization of the innermost loop of the integration.


In a code without offload directives, this could be resolved by building with interprocedural optimization, -ipo, which would allow FUNC to be inlined. However, for the offload compiler, this requires a FORCEINLINE compiler directive in addition:

function func(x,y,z,xp,yp,zp)
!DIR$ ATTRIBUTES FORCEINLINE :: func

With this source change (uncomment line 27 of func.f90) and the -ipo switch specified for the offload compilation, the report shows that the loop now vectorizes:

ifort -openmp -offload-option,mic,compiler,"-vec-report2 -ipo" \
    cube_charge.f90 threed_int.f90 twod_int.f90 trap_int.f90 func.f90

trap_int.f90(33): (col. 5) remark: *MIC* LOOP WAS VECTORIZED.

The program now executes in around 20 seconds.

Inlining the source from FUNC by hand into TRAP_INT would have achieved the same result. If FUNC and TRAP_INT had been in the same source file, the compiler would have succeeded in inlining without help at the -O2 optimization level. Another alternative would be to declare FUNC as a vector function, using the ATTRIBUTES VECTOR directive.
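A minimal sketch of the vector-function alternative is shown below. The body of func is an assumption (the article states only that it involves an inverse square root), as are the module name, the calling loop, and the argument values. The directive is the simplest form of ATTRIBUTES VECTOR, without optional clauses; compilers that do not recognize it treat it as a comment, so the program still builds and runs, just without the vector variant of the function. Placing the function in a module gives the caller an explicit interface, so the attribute is visible where the call is compiled.

```fortran
module func_mod
  implicit none
contains
  function func(x, y, z, xp, yp, zp)
    ! Ask the compiler to also generate a SIMD version of this function
    ! that can be called from a vectorized loop.
    !DIR$ ATTRIBUTES VECTOR :: func
    real, intent(in) :: x, y, z, xp, yp, zp
    real :: func
    ! Assumed integrand: inverse distance to the external point.
    func = 1.0 / sqrt((x-xp)**2 + (y-yp)**2 + (z-zp)**2)
  end function func
end module func_mod

program vector_func_sketch
  use func_mod
  implicit none
  integer, parameter :: n = 64     ! hypothetical number of grid points
  real :: x(n), total
  integer :: i

  do i = 1, n
     x(i) = real(i-1) / real(n-1)
  end do

  ! The call no longer has to be inlined for this loop to vectorize,
  ! provided the compiler honors the VECTOR attribute.
  total = 0.0
  do i = 1, n
     total = total + func(x(i), 0.5, 0.5, 2.0, 0.5, 0.5)
  end do

  print *, 'sum over the line: ', total
end program vector_func_sketch
```

Whether this vectorizes as well as the inlined version depends on how the arguments are passed, so the vectorization report is still the place to check the outcome.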


The assembly for the non-vectorized code version can be seen by building with

ifort -openmp -offload-option,mic,compiler,"-S" func.f90

and inspecting the assembly file for the offload, funcMIC.s. SIMD instructions in the loop, such as adds, multiplies and FMAs, are masked down to a single element, e.g.

vmulps %zmm4, %zmm4, %zmm5{%k1}

The inverse square root is computed by a masked call to the Short Vector Math Library (SVML):

call __svml_invsqrtf16_mask@PLT

The SVML contains SIMD versions of math functions that can be used to vectorize loops containing calls to math functions. In the vectorized loop kernel, the corresponding instructions do not include the mask {k1} and the called function __svml_invsqrtf16@PLT is also not masked.

It’s not yet possible to generate the offload assembly when using IPO; eventually -ipo -S should make this possible. The SVML call can be seen if FUNC is inlined by hand and trap_int is built without -ipo.

Results Summary

The table shows the approximate time per point for runs with and without threading and with or without vectorization.



Threading    Vectorization    Time per point (approx)
No           No               120 s
No           Yes              8 s
Yes          No               1 s
Yes          Yes              0.06 s




Timings are measured for an Intel® Xeon Phi™ coprocessor with 61 cores hosted by an Intel® Core™ i7 system (formerly code named Nehalem) running Red Hat* Enterprise Linux Server 6.2.

The speedup due to OpenMP multithreading is more than the number of cores, because a single thread can only initiate one floating-point instruction every other cycle. Creating at least twice as many threads as available cores allows this limitation to be circumvented.

Requirements for a Loop to be Vectorizable:

If a loop is part of a loop nest, it must be the inner loop. Outer loops can be parallelized using OpenMP, but they can rarely be vectorized unless the compiler is able either to fully unroll the inner loop, or to interchange the inner and outer loops.

The loop must contain straight-line code (a single basic block). There should be no jumps or branches, but masked assignments are allowed.

The loop must be countable, i.e. the number of iterations must be known before the loop starts to execute, though it need not be known at compile time. Consequently, there must be no data-dependent exit conditions.

There should be no backward loop-carried dependencies. For example, the loop must not require statement 2 of iteration 1 to be executed before statement 1 of iteration 2 for correct results. This allows consecutive iterations of the original loop to be executed simultaneously in a single iteration of the unrolled, vectorized loop.

OK (vectorizable): A(i-1) is always computed before it is used:

DO i=2,MAX
 A(i) = B(i) + C(i)
 D(i) = E(i) - A(i-1)
END DO

Not OK (unvectorizable): A(i-1) might be needed before it has been computed:

DO i=2,MAX
 D(i) = E(i) - A(i-1)
 A(i) = B(i) + C(i)
END DO

However, the compiler may sometimes be able to transform the loop, e.g. by reordering the loop or splitting it into sub-loops, so that it becomes vectorizable.
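As a sketch of what splitting into sub-loops means for the unvectorizable example, the program below runs the loop in its original form and then as two separate loops, each free of loop-carried dependencies, and compares the results. The program name, array names, and size are assumptions for illustration; whether the compiler applies this transformation automatically in a given case is not guaranteed.

```fortran
program fission_sketch
  implicit none
  integer, parameter :: n = 1000   ! hypothetical array size
  real :: a_ref(n), a_split(n), b(n), c(n), e(n), d_ref(n), d_split(n)
  integer :: i

  call random_number(a_ref)
  call random_number(b)
  call random_number(c)
  call random_number(e)
  a_split = a_ref
  d_ref = 0.0
  d_split = 0.0

  ! Original form: D(i) needs the A(i-1) written by the previous iteration.
  do i = 2, n
     d_ref(i) = e(i) - a_ref(i-1)
     a_ref(i) = b(i) + c(i)
  end do

  ! Split form, loop 1: compute every A(i); no dependence between iterations.
  do i = 2, n
     a_split(i) = b(i) + c(i)
  end do

  ! Split form, loop 2: every A(i-1) is now available before it is read.
  do i = 2, n
     d_split(i) = e(i) - a_split(i-1)
  end do

  print *, 'max difference in D: ', maxval(abs(d_ref - d_split))
  print *, 'max difference in A: ', maxval(abs(a_ref - a_split))
end program fission_sketch
```

The split is valid here because A(1) is never written and each D(i) for i greater than 2 reads a value of A that the original loop had already updated in the preceding iteration.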

There should be no special operators and no function or subroutine calls, unless these are inlined, either manually or automatically by the compiler. Intrinsic math functions such as sin(), log(), max(), etc. are allowed since the compiler runtime library contains vectorized versions of these functions.

Both reductions and vector assignments to arrays are allowed.
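The requirements above can be illustrated with three small loops. This is a sketch with assumed names and data, not code from the application: the first loop contains a conditional that can be implemented as a masked assignment, the second is a sum reduction, and the third has a data-dependent exit and is therefore not countable.

```fortran
program loop_forms_sketch
  implicit none
  integer, parameter :: n = 1000   ! hypothetical array size
  real :: a(n), b(n), total
  integer :: i, first_neg

  call random_number(a)
  a = a - 0.25                     ! shift so that some elements are negative

  ! Masked assignment: the condition is evaluated per element and both
  ! branches only assign to b(i), so a vector mask can replace the branch.
  do i = 1, n
     if (a(i) > 0.0) then
        b(i) = sqrt(a(i))
     else
        b(i) = 0.0
     end if
  end do

  ! Reduction: the dependence on total is a recognized pattern.
  total = 0.0
  do i = 1, n
     total = total + b(i)
  end do

  ! Not countable: the trip count depends on the data, because the loop
  ! may exit early.
  first_neg = 0
  do i = 1, n
     if (a(i) < 0.0) then
        first_neg = i
        exit
     end if
  end do

  print *, 'sum of square roots:    ', total
  print *, 'first negative element: ', first_neg
end program loop_forms_sketch
```

The vectorization report is the way to confirm how a particular compiler version treats each of these loops.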

Helping the Compiler to Vectorize:

Try to avoid mixing vectorizable data types in the same loop (except for integer arithmetic on array subscripts). Vectorization of type conversions can be inefficient.

Try to access contiguous memory locations. (So for Fortran, the innermost, vectorizable loop should be over the first array index.) Whilst the compiler may sometimes be able to vectorize loops with indirect or non-unit stride memory addressing, the cost of gathering data from or scattering back to memory may be considerable.

The directive !DIR$ IVDEP may be used to advise the compiler that there are no loop-carried dependencies that would make vectorization unsafe.

The directive !DIR$ VECTOR ALWAYS may be used to override the compiler’s heuristics that determine whether vectorization of a loop is likely to yield a performance benefit.

See the main compiler documentation for other directives such as LOOPCOUNT or SIMD.

The compiler option -guide may be used to obtain vectorization advice for your application.
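Two of these points can be sketched in a few lines. The program below is illustrative only, with assumed names, sizes, and index pattern: the first loop nest runs the innermost loop over the first array index so that memory accesses are contiguous, and the second loop uses indirect addressing, where the programmer knows that the index array contains no repeated values and passes that knowledge to the compiler with IVDEP. Compilers that do not recognize the directive treat it as a comment.

```fortran
program helping_sketch
  implicit none
  integer, parameter :: n = 256, m = 128   ! hypothetical array extents
  real :: a(n, m), x(n), y(n)
  integer :: idx(n)
  integer :: i, j

  call random_number(a)
  call random_number(x)
  y = 0.0

  ! A permutation: every element of y is updated by exactly one iteration.
  do i = 1, n
     idx(i) = n + 1 - i
  end do

  ! Contiguous access: Fortran stores arrays column by column, so the
  ! innermost loop runs over the first index.
  do j = 1, m
     do i = 1, n
        a(i, j) = 2.0 * a(i, j)
     end do
  end do

  ! Indirect addressing: the compiler cannot prove that idx has no
  ! repeated values, so it must assume a possible dependence.
  ! IVDEP asserts that there is none.
  !DIR$ IVDEP
  do i = 1, n
     y(idx(i)) = y(idx(i)) + x(i)
  end do

  print *, 'sum of a: ', sum(a)
  print *, 'sum of y: ', sum(y)
end program helping_sketch
```

IVDEP is a promise rather than a check: if idx did contain repeated values, the vectorized loop could produce wrong results, and even when it is safe, the scatter to y may cost more than the contiguous loop above.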


The compiler vectorizer can help you to get good performance out of Intel® MIC architecture through effective use of the SIMD hardware, in addition to the benefits of threading over the many cores.

Users should look for hot inner loops, use the reports to see whether they are vectorized, and if necessary, help the compiler vectorize them. For applications dominated by vectorizable kernels, the speedups may be large.

Additional Resources

Requirements for Vectorizable Loops:

Getting Started Tutorial, “Using Auto-Vectorization”, available with the Intel® Fortran Compiler or at

Vectorization Guide:

The Intel Fortran Compiler User and Reference Guide, in the product or at

About the Author

Martyn Corden is a Technical Consulting Engineer in the Developer Products Division within the Software Services Group at Intel. He provides technical support for the Intel Fortran and C/C++ compilers for Windows*, Linux* and Mac OS* X, with particular focus on HPC applications. Martyn came to Intel from the Supercomputer Computations Research Institute at Florida State University, where he had extensive experience with high performance scientific applications. He has 25 years of experience in writing, debugging, maintaining, porting and optimizing software for high energy physics, including for several experiments at CERN in Geneva. Martyn holds a BA in Physics from Oxford University and a Ph.D. in High Energy Physics from the University of Birmingham. In his spare time, he is an avid chess player.


1 comment

Pawel A.

Thank you Martyn for this useful guide and results from Xeon Phi. I have access to Phi but was dragging my feet with the installation of the rather expensive software stack for it. (Well, I know the Phi isn't cheap either and I should have known.) Well, I tried to make your code sample work (cosmetic changes mostly, like different timing procedures) on the E5-2690v2 10-core 3GHz cpu (~$2200), and got the following results for the original code and a code slightly streamlined by hand, since I really don't like calling functions that contain 1 line of code (I inlined one manually and deleted). I also merged all the little files of source code into one. I used all the threads (I got good scaling with their number).

with gfortran -fopenmp -O3 -ffast-math -mavx -march=native   I get 0.9 s per point (original, not well vectorized), and 0.12 s/pt (streamlined, well vectorized)

with ifort -O2 -xhost -qopenmp -no-vec    I get ~0.6 s per point (streamlined but no vectorization),

and with ifort -O2 -xhost -qopenmp  I got 0.048 s/pt (streamlined, good vectorization).

That last number shows that a well-used 10-core CPU beats the Phi card (0.06 s/pt in your table). That's only partly because the CPU clock is 3 times faster (but Phi has about 6 times as many x86_64 cores).

This also shows that GNU Fortran yielded a 2.5 times slower code than the best ifort (which, BTW, was run for 240 points, number originally in the program, meant for Phi). I knew that this would be so but not that the difference would be so large. I'm encouraged to try ifort on my "really big" Fortran codes.

I started paying attention to CPU and Phi after achieving with my student very fast executions of our fortran codes rewritten in CUDA C. This however took a long time to do and optimize. It turns out that Nvidia Titan cards are good for our purposes and are fast, but with good optimization of a CPU code as above, 1 Titan is only maybe a factor of ~2 faster than E5-2690. I'm going to continue the comparisons and... maybe go back to my fortran codes or even build a small cluster of a few 10-core dual-cpu machines, who knows. Titans have the advantage for now, because we stick 3 of them in one linux box, and they aren't so expensive as the top cpu's.

