The Importance of Vectorization for Intel Microarchitectures (Fortran Example)

Reference Link and Download

Intel Vectorization Tools
Download the source code referenced here from the link at the end of this article.

Introduction

To get good performance out of Intel® processors, including the Intel® Xeon® Scalable Processors, applications need to take advantage of the SIMD (single instruction, multiple data) registers as well as multiple cores. Ways of doing this range from calling suitably optimized library functions to writing assembly code or calling intrinsic functions that mimic assembly. The former is straightforward but limited to standard library functions; the latter requires a great deal of effort and expertise. The compiler vectorizer provides a middle way that allows effective optimization of a wide range of codes without requiring a large effort or an in-depth understanding of the microarchitecture.

For the Intel® Fortran and C++ Compilers, vectorization is the unrolling of a loop combined with the generation of packed SIMD instructions. Because the packed instructions operate on more than one data element at a time, the loop can execute more efficiently. The programmer may be able to help the compiler vectorize more loops by following a simple programming style and by giving explicit hints through compiler directives.

This article illustrates the use of the vectorizer and how it can greatly improve the performance of a simple Fortran application that performs a three-dimensional integration and is already parallelized using OpenMP* directives. It concludes with some general advice for getting loops to vectorize.

Baseline

The small application cube_charge calculates the electrostatic potential at a series of points outside a uniformly charged cube. It does this by performing a three-dimensional integral over the cube using the trapezoidal rule. The application is threaded using OpenMP* over the number of external points. (It could alternatively be threaded over the outermost integral.) It is computation-intensive, with very little data movement.
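The innermost level of such an integration can be sketched as follows. This is not the source from cube_charge.tar.gz: the module name trap_sketch, the routine names, the argument lists, and the integrand are all assumptions made for illustration. The point is only the shape of the code, a trapezoidal-rule loop that calls a function once per iteration.

```fortran
! Hypothetical sketch of a 1-D trapezoidal rule; not the article's trap_int.f90.
module trap_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains

  ! Illustrative integrand (an assumption): inverse distance from the
  ! point x on the integration axis to an external point at (x0, r0).
  function func_sketch(x, x0, r0) result(f)
    real(dp), intent(in) :: x, x0, r0
    real(dp) :: f
    f = 1.0_dp / sqrt((x - x0)**2 + r0**2)
  end function func_sketch

  ! Integrate func_sketch over [a, b] using n equal intervals.
  function trap_sketch_int(a, b, n, x0, r0) result(total)
    real(dp), intent(in) :: a, b, x0, r0
    integer,  intent(in) :: n
    real(dp) :: total, h
    integer  :: i

    h = (b - a) / real(n, dp)
    ! The two end points carry weight 1/2.
    total = 0.5_dp * (func_sketch(a, x0, r0) + func_sketch(b, x0, r0))
    ! Interior points: one function call per loop iteration.
    do i = 1, n - 1
      total = total + func_sketch(a + real(i, dp) * h, x0, r0)
    end do
    total = total * h
  end function trap_sketch_int

end module trap_sketch
```

As a quick check of the sketch, integrating over [0, 1] with x0 = 0 and r0 = 1 should approximate ln(1 + sqrt(2)).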

The application is built for parallelization by default:

ifort -O2 -qopenmp -xhost cube_charge.f90 threed_int.f90 twod_int.f90 trap_int.f90 func.f90
./a.out < points.dat

It runs in roughly 0.09 seconds per data point on an Intel® Xeon® Platinum 8180M CPU @ 2.50GHz (formerly code named Skylake) system running 56 OpenMP threads. For comparison, it runs in roughly 3.19 seconds per point for a single OpenMP thread. The OpenMP environment, including the number of threads, can be set explicitly via environment variables, for example on Linux*:

export OMP_NUM_THREADS=56
export OMP_PROC_BIND=spread
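Threading over the external points might look roughly like the sketch below. The names points_sketch, work_for_point, and all_points are hypothetical, and the per-point work is a stand-in for the real integration, not the article's code. Each point is independent, so the loop over points can be shared among threads with a single directive.

```fortran
! Hypothetical sketch of threading over independent points with OpenMP.
module points_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains

  ! Stand-in for the expensive per-point integration (an assumption).
  pure function work_for_point(x) result(v)
    real(dp), intent(in) :: x
    real(dp) :: v
    integer  :: k
    v = 0.0_dp
    do k = 1, 1000
      v = v + 1.0_dp / (x + real(k, dp))**2
    end do
  end function work_for_point

  ! Compute one value per point; iterations are independent of each other.
  subroutine all_points(x, v, npts)
    integer,  intent(in)  :: npts
    real(dp), intent(in)  :: x(npts)
    real(dp), intent(out) :: v(npts)
    integer :: ip
    ! Ignored as a comment unless OpenMP is enabled (e.g. with -qopenmp).
    !$omp parallel do
    do ip = 1, npts
      v(ip) = work_for_point(x(ip))
    end do
    !$omp end parallel do
  end subroutine all_points

end module points_sketch
```

The number of threads used by the parallel loop is then taken from OMP_NUM_THREADS, as set above.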

Reports

A report of which loops were vectorized, which were not, and the reasons why not may be obtained by specifying the compiler options -qopt-report -qopt-report-phase=vec.


ifort -O2 -qopenmp -qopt-report -qopt-report-phase=vec -xhost cube_charge.f90 threed_int.f90 twod_int.f90 trap_int.f90 func.f90

The report shows that no loops were vectorized. For example:


Non-optimizable loops:

LOOP BEGIN at trap_int.F90(28,5)
   remark #15543: loop was not vectorized: loop with function call not considered an optimization candidate.
LOOP END

The call to FUNC inside the DO-loop in TRAP_INT is preventing vectorization of the innermost loop of the integration.

Inlining

Building with interprocedural optimization, -ipo, allows FUNC to be inlined. Interprocedural optimization is done at link time, and the optimization report is written to ipo.optrpt by default.


ifort -O2 -qopenmp -ipo -qopt-report -qopt-report-phase=vec -xhost cube_charge.f90 threed_int.f90 twod_int.f90 trap_int.f90 func.f90
…
Begin optimization report for: TRAP_INT

    Report from: Vector optimizations [vec]

LOOP BEGIN at func.F90(27,41) inlined into trap_int.F90(34,22)
   remark #15300: LOOP WAS VECTORIZED
LOOP END

The program now executes in around 0.01 seconds per data point using 56 OpenMP threads and vectorization.

Inlining the source from FUNC by hand into TRAP_INT would have achieved the same result. If FUNC and TRAP_INT had been in the same source file, the compiler would have succeeded in inlining without -ipo at the -O2 optimization level. Another alternative would be to declare FUNC as a vector function, using the !$OMP DECLARE SIMD directive.
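The vector-function alternative could be sketched as follows. The routine names func_v and sum_func, the argument list, and the integrand are assumptions, not the contents of func.f90. The UNIFORM clause states that x0 and r0 have the same value for every SIMD lane. Whether the compiler actually calls a vector version depends on the compiler and on OpenMP SIMD support being enabled, so the optimization report should be checked.

```fortran
! Hypothetical sketch of a function declared as an OpenMP SIMD function.
module func_simd_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains

  function func_v(x, x0, r0) result(f)
    ! Ask the compiler to also generate a vector version of func_v.
    !$omp declare simd(func_v) uniform(x0, r0)
    real(dp), intent(in) :: x, x0, r0
    real(dp) :: f
    f = 1.0_dp / sqrt((x - x0)**2 + r0**2)
  end function func_v

  ! Sum func_v at the points a + h, a + 2h, ..., a + n*h.
  function sum_func(a, h, n, x0, r0) result(total)
    real(dp), intent(in) :: a, h, x0, r0
    integer,  intent(in) :: n
    real(dp) :: total
    integer  :: i
    total = 0.0_dp
    ! The loop still contains a call, but a vector version may now be used.
    !$omp simd reduction(+:total)
    do i = 1, n
      total = total + func_v(a + real(i, dp) * h, x0, r0)
    end do
  end function sum_func

end module func_simd_sketch
```

Unlike inlining, this approach keeps FUNC as a separate routine, which may be preferable when it is called from many places.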

Use ZMM Vector Registers and 512 Bit SIMD Instructions

Another compiler option may improve performance on Intel® Xeon® Scalable Processors. The option -qopt-zmm-usage=[high|low] sets the heuristics that the compiler uses to choose the SIMD width. The optimization report may suggest when to try it, as in the remark below. It is not always easy to predict which setting will perform better, so it is best to try both and measure.

LOOP BEGIN at func.f90(19,41) inlined into trap_int.f90(27,22)
   remark #15305: vectorization support: vector length 8
   remark #15309: vectorization support: normalized vectorization overhead 0.507
   remark #15300: LOOP WAS VECTORIZED
   remark #26013: Compiler has chosen to target XMM/YMM vector. Try using -qopt-zmm-usage=high to override
   ...
LOOP END

Results Summary

The table shows the approximate time per point for runs with and without threading and with or without vectorization.

Multithreaded (56 threads)    Vectorized    Time per point (approx.)
No                            No            3.19 s
No                            Yes           0.34 s
Yes                           No            0.09 s
Yes                           Yes           0.01 s

Timings are measured for an Intel® Xeon® Platinum 8180M CPU @ 2.50GHz system with 56 cores (112 threads) running Fedora release 25 (Twenty Five).

Guidelines for a Loop to be Auto-Vectorizable

If a loop is part of a loop nest, it should normally be the inner loop. Outer loops can be parallelized using OpenMP directives, but they can rarely be auto-vectorized, unless the compiler is able either to fully unroll the inner loop or to interchange the inner and outer loops.

The loop should contain straight-line code (a single basic block). There should be no jumps or branches, but masked assignments are allowed.

The loop should be countable, that is, the number of iterations should be known before the loop starts to execute, though it need not be known at compile time. Consequently, there should be no data-dependent exit conditions.
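The two guidelines above can be illustrated with a pair of small loops. The routine names are hypothetical. The first loop is countable and its IF only selects which value each element receives, which is the masked-assignment pattern. The second loop leaves as soon as it finds a negative value, so its trip count depends on the data. Whether a given compiler version vectorizes either loop should be confirmed in the optimization report.

```fortran
! Hypothetical sketch contrasting a masked assignment with a data-dependent exit.
module loop_shape_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains

  ! Countable loop: n iterations are known on entry, and the IF acts
  ! as a mask on the assignment rather than as a jump out of the loop.
  subroutine safe_sqrt(a, b, n)
    integer,  intent(in)  :: n
    real(dp), intent(in)  :: a(n)
    real(dp), intent(out) :: b(n)
    integer :: i
    do i = 1, n
      if (a(i) > 0.0_dp) then
        b(i) = sqrt(a(i))
      else
        b(i) = 0.0_dp
      end if
    end do
  end subroutine safe_sqrt

  ! Data-dependent exit: the number of iterations executed is not
  ! known before the loop starts. Returns 0 if no element is negative.
  function first_negative(a, n) result(idx)
    integer,  intent(in) :: n
    real(dp), intent(in) :: a(n)
    integer :: idx, i
    idx = 0
    do i = 1, n
      if (a(i) < 0.0_dp) then
        idx = i
        exit
      end if
    end do
  end function first_negative

end module loop_shape_sketch
```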

There should be no backward loop-carried dependencies. For example, the loop must not require statement 2 of iteration 1 to be executed before statement 1 of iteration 2 for correct results. This allows consecutive iterations of the original loop to be executed simultaneously in a single iteration of the unrolled, vectorized loop.

OK (vectorizable): A(i-1) is always computed before it is used:


DO i=2,MAX
   A(i) = B(i) + C(i)
   D(i) = E(i) - A(i-1)
ENDDO

Not OK (unvectorizable): A(i-1) might be needed before it has been computed:


DO i=2,MAX
   D(i) = E(i) - A(i-1)
   A(i) = B(i) + C(i)
ENDDO

However, the compiler may sometimes be able to transform the loop, for example, by reordering the loop or splitting it into sub-loops, so that it becomes vectorizable.
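One such transformation, splitting the loop, can also be done by hand. The sketch below wraps the "Not OK" loop and a split version in hypothetical routines (fused_loop and split_loops are illustrative names). Because A(1) is never written and every other A(i-1) is read only after it has been updated, the two versions give the same results, and neither of the split loops has a backward dependency.

```fortran
! Hypothetical sketch: splitting a loop to remove a backward dependency.
module split_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains

  ! Original form: D(i) reads A(i-1), which the previous iteration wrote.
  subroutine fused_loop(a, b, c, d, e, n)
    integer,  intent(in)    :: n
    real(dp), intent(inout) :: a(n), d(n)
    real(dp), intent(in)    :: b(n), c(n), e(n)
    integer :: i
    do i = 2, n
      d(i) = e(i) - a(i-1)
      a(i) = b(i) + c(i)
    end do
  end subroutine fused_loop

  ! Split form: all of A is updated first, then D is computed from it.
  subroutine split_loops(a, b, c, d, e, n)
    integer,  intent(in)    :: n
    real(dp), intent(inout) :: a(n), d(n)
    real(dp), intent(in)    :: b(n), c(n), e(n)
    integer :: i
    do i = 2, n
      a(i) = b(i) + c(i)
    end do
    do i = 2, n
      d(i) = e(i) - a(i-1)
    end do
  end subroutine split_loops

end module split_sketch
```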

There should be no special operators and no function or subroutine calls, unless the calls are inlined, either manually or automatically by the compiler, or the functions or subroutines have OpenMP SIMD versions. Intrinsic math functions such as sin(), log(), max(), etc. are allowed since the compiler runtime library contains vectorized versions of these functions.

Both reductions and vector assignments to arrays are allowed.
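A minimal sketch of both patterns follows. The names dot_sketch and add_arrays are hypothetical. The first routine accumulates into a single scalar, which is a sum reduction. The second uses Fortran array syntax, which is an implicit loop over the array elements.

```fortran
! Hypothetical sketch of a sum reduction and an array (vector) assignment.
module reduction_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains

  ! Sum reduction: every iteration adds into the same scalar s.
  function dot_sketch(x, y, n) result(s)
    integer,  intent(in) :: n
    real(dp), intent(in) :: x(n), y(n)
    real(dp) :: s
    integer  :: i
    s = 0.0_dp
    do i = 1, n
      s = s + x(i) * y(i)
    end do
  end function dot_sketch

  ! Array assignment: element-wise sum written with array syntax.
  subroutine add_arrays(x, y, z, n)
    integer,  intent(in)  :: n
    real(dp), intent(in)  :: x(n), y(n)
    real(dp), intent(out) :: z(n)
    z(1:n) = x(1:n) + y(1:n)
  end subroutine add_arrays

end module reduction_sketch
```

A vectorized reduction adds the terms in a different order from the scalar loop, so floating-point results may differ slightly in the last digits.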

Helping the Compiler to Vectorize

Try to avoid mixing vectorizable data types in the same loop (except for integer arithmetic on array subscripts). Vectorization of type conversions can be inefficient.

Try to access contiguous memory locations. For Fortran, the innermost, vectorizable loop should be over the first array index. Whilst the compiler may sometimes be able to vectorize loops with indirect or non-unit stride memory addressing, the cost of gathering data from or scattering back to memory may be considerable.
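For a two-dimensional array, this means putting the loop over the first index innermost, as in the following sketch. The routine name scale_columns is hypothetical. Fortran stores arrays in column-major order, so consecutive values of i address consecutive memory locations.

```fortran
! Hypothetical sketch: unit-stride access in the innermost loop.
module stride_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains

  ! Multiply column j of a by s(j).
  subroutine scale_columns(a, s, m, n)
    integer,  intent(in)    :: m, n
    real(dp), intent(inout) :: a(m, n)
    real(dp), intent(in)    :: s(n)
    integer :: i, j
    do j = 1, n
      ! Inner loop over the first index: a(i,j) and a(i+1,j) are adjacent.
      do i = 1, m
        a(i, j) = a(i, j) * s(j)
      end do
    end do
  end subroutine scale_columns

end module stride_sketch
```

Swapping the two loops would give the same result but a stride of m elements between consecutive inner iterations.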

The directive !DIR$ IVDEP may be used to advise the compiler that there are no loop-carried dependencies that would make auto-vectorization unsafe.

The directive !DIR$ VECTOR ALWAYS may be used to override the compiler’s heuristics that determine whether auto-vectorization of a loop is likely to yield a performance benefit.
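The sketch below shows where these two directives are placed. The routine names and the scenarios are assumptions chosen for illustration. In shift_add, the compiler cannot know the sign of k, so it cannot rule out a dependency by itself; the caller guarantees k >= 0, which makes the loop safe to vectorize. In gather_strided, the non-unit stride may lead the compiler to judge vectorization unprofitable, and the directive asks it to vectorize anyway. Other compilers treat both directives as comments.

```fortran
! Hypothetical sketch of !DIR$ IVDEP and !DIR$ VECTOR ALWAYS placement.
module directive_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains

  ! a(i) = a(i+k) + b(i). Correct as vector code only if k >= 0,
  ! which the caller must guarantee (an assumption of this sketch).
  subroutine shift_add(a, b, n, k)
    integer,  intent(in)    :: n, k
    real(dp), intent(inout) :: a(n + k)
    real(dp), intent(in)    :: b(n)
    integer :: i
    !DIR$ IVDEP
    do i = 1, n
      a(i) = a(i + k) + b(i)
    end do
  end subroutine shift_add

  ! y(i) = 2 * x(1 + (i-1)*stride): strided loads, contiguous stores.
  subroutine gather_strided(x, y, n, stride)
    integer,  intent(in)  :: n, stride
    real(dp), intent(in)  :: x(1 + (n - 1) * stride)
    real(dp), intent(out) :: y(n)
    integer :: i
    !DIR$ VECTOR ALWAYS
    do i = 1, n
      y(i) = 2.0_dp * x(1 + (i - 1) * stride)
    end do
  end subroutine gather_strided

end module directive_sketch
```

IVDEP is an assertion by the programmer, not a check: if it is applied to a loop that does have a loop-carried dependency, the vectorized code may give wrong results.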

See the main compiler documentation for other directives such as LOOPCOUNT and for explicit vectorization using OpenMP SIMD directives.

Conclusion

The compiler vectorizer can help you to get good performance out of the SIMD architecture, in addition to the benefits of threading over many cores.

Users should look for hot inner loops, use the reports to see whether they are vectorized, and, if necessary, help the compiler vectorize them. For applications dominated by vectorizable kernels, the speedups may be large.

Additional Resources

Vectorizing Loops

Requirements for Vectorizable Loops

Requirements for Vectorizing Loops with #pragma SIMD

Vectorization Guides

A Guide to Auto-Vectorization with Intel C Compilers

Tuning SIMD Vectorization When Targeting Intel Xeon Scalable Processor Family

Explicit Vector Programming in Fortran

Compiler Documentation

The Intel® Fortran Compiler Developer Guide and Reference online.

Performance Notice

The benchmark results reported above may need to be revised as additional testing is conducted. The results depend on the specific platform configurations and workloads utilized in the testing, and may not be applicable to any particular user's components, computer system or workloads. The results are not necessarily representative of other benchmarks and other benchmark results may show greater or lesser impact from mitigations.

For more complete information about performance and benchmark results, visit www.intel.com/benchmarks

Attachment: cube_charge.tar.gz (1.89 KB)