Tuning SIMD vectorization when targeting Intel® Xeon® Processor Scalable Family

Introduction

The Intel® Xeon® Processor Scalable Family is based on the server microarchitecture codenamed Skylake.

For the best possible performance on the Intel Xeon Processor Scalable Family, applications should be compiled with the processor-specific option [Q]xCORE-AVX512 using the Intel® C++ and Fortran compilers. Note that applications built with this option will not run on non-Intel processors or on older processors that lack the required instruction sets.

Alternatively, applications may be compiled for multiple instruction sets targeting multiple processors; for example, [Q]axCORE-AVX512,CORE-AVX2 can generate a fat binary with code paths optimized for both CORE-AVX512 (codenamed Skylake server) and CORE-AVX2 (codenamed Haswell or Broadwell) target processors, along with the default Intel® SSE2 code path. To generate a common binary for the Intel Xeon Processor Scalable Family and the Intel® Xeon Phi™ x200 processor family, applications should be compiled with the option [Q]xCOMMON-AVX512.

 

What has changed?

It is important to note that choosing the widest possible vector width, 512-bit on the Intel Xeon Processor Scalable Family, may not always result in the best vectorized code for all loops, especially for loops with low trip-counts commonly seen in non-HPC applications.
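
To make the trip-count concern concrete, the following sketch splits a loop's trip count into full-width vector iterations and leftover elements. `LoopSplit` and `split_trip_count` are hypothetical names introduced for illustration only, and the model is a deliberate simplification: it ignores alignment peeling and masked remainder handling, both of which a real compiler may use.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper type for illustration; not part of any Intel API.
struct LoopSplit {
    std::size_t vector_iterations;   // iterations of the full-width vector body
    std::size_t remainder_elements;  // elements left over for a remainder loop
};

// Split a trip count into full vector iterations and leftover elements.
// Simplifying assumption: no alignment peeling and no masked remainder.
inline LoopSplit split_trip_count(std::size_t trip_count, std::size_t vector_length)
{
    assert(vector_length > 0);
    return { trip_count / vector_length, trip_count % vector_length };
}
```

Under this simplified model, a loop over 6 doubles never executes a full 8-wide (512-bit) iteration: `split_trip_count(6, 8)` yields 0 vector iterations and 6 leftover elements. A 4-wide (256-bit) body, `split_trip_count(6, 4)`, executes once and leaves 2 elements.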

Based on a careful study of applications from several domains, flexibility was introduced into SIMD vectorization for the Intel Xeon Processor Scalable Family: 512-bit ZMM usage defaults to low and can be tuned higher where beneficial. Developers may use the Intel compilers' optimization reports or Intel® Advisor to assess SIMD vectorization quality and look for further opportunities.

Starting with the 18.0 and 17.0.5 Intel compilers, a new compiler option [Q/q]opt-zmm-usage=low|high is available to enable a smooth transition from Intel® Advanced Vector Extensions 2 (Intel® AVX2), with 256-bit wide vector registers, to Intel® Advanced Vector Extensions 512 (Intel® AVX-512), with 512-bit wide vector registers. This option should be used in conjunction with the [Qa]xCORE-AVX512 option.

By default with [Qa]xCORE-AVX512, the Intel compilers will opt for more restrained ZMM register usage which works best for some types of applications. Other types of applications, such as those involving intensive floating-point calculations, may benefit from using the new option [Q/q]opt-zmm-usage=high for more aggressive 512-bit SIMD vectorization using ZMM registers.

 

What to do to achieve higher ZMM register usage for more 512-bit SIMD vectorization?

There are three potential paths to achieve this objective. Here is a trivial code example, for demonstration purposes only:

$ cat Loop.cpp
#include <math.h>
void work(double *a, double *b, int size)
{
    #pragma omp simd
    for (int i=0; i < size; i++)
    {
        b[i]=exp(a[i]);
    }
}

 

1. The first option, starting with the 18.0 and 17.0.5 compilers, is to use the new compiler option [Q/q]opt-zmm-usage=high in conjunction with [Qa]xCORE-AVX512 for higher usage of ZMM registers for potentially full 512-bit SIMD vectorization. Using this new option requires no source-code changes, and hence is much easier to use in achieving more aggressive ZMM usage for the entire compilation unit.

When compiling with default options, the compiler emits a remark suggesting the new option:

    $ icpc -c -xCORE-AVX512 -qopenmp -qopt-report:5 Loop.cpp
    …
    remark #15305: vectorization support: vector length 4
    …
    remark #15321: Compiler has chosen to target XMM/YMM vector. Try using -qopt-zmm-usage=high to override
    …
    remark #15476: scalar cost: 107
    remark #15477: vector cost: 19.500
    remark #15478: estimated potential speedup: 5.260
    …

When compiling with the recommended option, the remark above goes away and the estimated speedup increases for this example, thanks to better SIMD gains with higher ZMM usage:

    $ icpc -c -xCORE-AVX512 -qopt-zmm-usage=high -qopenmp -qopt-report:5 Loop.cpp
    …
    remark #15305: vectorization support: vector length 8
    …
    remark #15476: scalar cost: 107
    remark #15477: vector cost: 9.870
    remark #15478: estimated potential speedup: 10.110
    …

2. As an alternative to using this new compiler option, applications may use the simdlen clause with the OpenMP simd construct to specify a higher vector length and achieve 512-bit SIMD vectorization. Note that this type of change is localized to the loop in question and needs to be applied to other such loops as needed, following typical hotspot tuning practices. So, using this path requires modest source-code changes.

Using the simdlen clause we get better SIMD gains for this example:

    #pragma omp simd simdlen(8)
    for (int i=0; i < size; i++) …
    
    $ icpc -c -xCORE-AVX512 -qopenmp -qopt-report:5 Loop.cpp
    …
    remark #15305: vectorization support: vector length 8
    …
    remark #15476: scalar cost: 107
    remark #15477: vector cost: 9.870
    remark #15478: estimated potential speedup: 10.110
    …

3. Applications built with the [Qa]xCOMMON-AVX512 option already get higher ZMM register usage and therefore don't need to take any additional action using either of the two paths above. Note, however, that while such applications have the advantage of being able to run on a common set of processors supporting Intel AVX-512, such as the Intel Xeon Processor Scalable Family and the Intel® Xeon Phi™ x200 processor family, they may miss out on the smaller subset of processor-specific Intel AVX-512 instructions that are not generated with [Qa]xCOMMON-AVX512. Note also that some types of applications may perform better with the default [Q/q]opt-zmm-usage=low option.
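
For the second path, a self-contained version of the example loop might look like the sketch below. The `lanes` helper is a hypothetical name added for illustration; it shows why simdlen(8) corresponds to 512-bit vectors for double, assuming a 64-bit double. The `const` qualifier on the input pointer and the precondition check are also additions not present in the original Loop.cpp.

```cpp
#include <cassert>
#include <climits>
#include <cmath>
#include <cstddef>

// Hypothetical helper: number of elements of type T that fit in a vector
// register of the given width in bits.
template <typename T>
constexpr std::size_t lanes(std::size_t register_bits)
{
    return register_bits / (sizeof(T) * CHAR_BIT);
}

// Assuming a 64-bit double: 8 doubles fill a 512-bit ZMM register,
// and 4 doubles fill a 256-bit YMM register.
static_assert(lanes<double>(512) == 8, "simdlen(8) matches 512-bit vectors of double");
static_assert(lanes<double>(256) == 4, "vector length 4 matches 256-bit vectors of double");

void work(const double *a, double *b, int size)
{
    assert(size >= 0);  // added precondition check, not in the original example

    // Request 8 elements per vector iteration. If OpenMP SIMD support is not
    // enabled, the pragma is ignored and the loop computes the same values.
    #pragma omp simd simdlen(8)
    for (int i = 0; i < size; i++)
    {
        b[i] = std::exp(a[i]);
    }
}
```

The vector lengths 4 and 8 shown in the optimization reports above are consistent with this arithmetic: 256 bits and 512 bits, respectively, divided by the 64 bits of a double.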

 

Conclusion

When compiling for the Intel® Xeon® Processor Scalable Family with -xcore-avx512 (/Qxcore-avx512), applications may benefit from more aggressive optimization with the additional option -qopt-zmm-usage=high (/Qopt-zmm-usage:high) if significant time is spent in vectorized loops, unless typical loop trip counts are small or the application is significantly memory bound.

Applications with few vectorized loops or low trip counts may perform better with -qopt-zmm-usage=low (/Qopt-zmm-usage:low) which is the default for -xcore-avx512 (/Qxcore-avx512).

 
