AVX512 slower than AVX2 with Intel MKL dgemm on Intel Gold 5118

We have been evaluating Intel Parallel Studio 19, Intel MKL and Intel Gold 5118 processors.
The Intel Gold 5118 processor supports AVX512.

During our investigation, we noticed that, when using AVX512, the wall clock solution time of
our product is no better than, and sometimes worse than, the wall clock solution
time when using AVX2.

We investigated in detail and we have created a small test case that replicates the issue.
This test case is based on John D. McCalpin's program simple-MKL-DGEMM-test, which we obtained
from github.

Please see file dgemm-test01.tgz.  This tarfile includes the source code, make script and results obtained
on our Linux computer.  The compilation and linking options are in the file make.sh (run with: sh make.sh).
The compilation is done using Intel Parallel Studio 19, version 19.0.4.243, and the corresponding
Intel MKL libraries are statically linked into the executable.

The file output.out gives the results from running the test program using script runtest.sh:

sh runtest.sh >& output.out

output.out shows the Linux version, output from /proc/cpuinfo and results.

The test program was run with only one core.  

The test program was run with 4 options:

MKL_ENABLE_INSTRUCTIONS=SSE4_2
MKL_ENABLE_INSTRUCTIONS=AVX
MKL_ENABLE_INSTRUCTIONS=AVX2
MKL_ENABLE_INSTRUCTIONS=AVX512

perf stat was used to obtain detailed statistics for each option and the results are given in file output.out. 

The results are summarized in this table:

ISA       wall clock time    instructions    instructions/cycle    CPU frequency (GHz)
--------------------------------------------------------------------------------------
SSE4_2    61.6               6.2E12          3.16                  3.2
AVX       31.8               3.0E12          3.03                  3.1
AVX2      16.8               1.8E12          3.47                  3.1
AVX512    17.2               7.6E11          1.52                  2.9

It is clear that the number of instructions / cycle is much worse for AVX512, and this causes the slowdown
compared to AVX2.
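
As a rough cross-check of the table above, here is a small Python sketch. It assumes that run time scales as instructions / (instructions per cycle x frequency) and uses only the AVX2 and AVX512 rows; the variable and function names are illustrative and not part of the test case.

```python
# Values copied from the AVX2 and AVX512 rows of the table above.
rows = {
    "AVX2":   {"time": 16.8, "instructions": 1.8e12, "ipc": 3.47, "ghz": 3.1},
    "AVX512": {"time": 17.2, "instructions": 7.6e11, "ipc": 1.52, "ghz": 2.9},
}


def ratio(field):
    """AVX512 value divided by AVX2 value for one table column."""
    return rows["AVX512"][field] / rows["AVX2"][field]


# cycles = instructions / IPC, so the cycle ratio is a ratio of ratios.
cycle_ratio = ratio("instructions") / ratio("ipc")

# time = cycles / frequency (assumed model, see the note above).
predicted_time_ratio = cycle_ratio / ratio("ghz")
measured_time_ratio = ratio("time")

print(f"instruction ratio (AVX512/AVX2): {ratio('instructions'):.3f}")
print(f"IPC ratio         (AVX512/AVX2): {ratio('ipc'):.3f}")
print(f"cycle ratio       (AVX512/AVX2): {cycle_ratio:.3f}")
print(f"predicted time ratio:            {predicted_time_ratio:.3f}")
print(f"measured time ratio:             {measured_time_ratio:.3f}")
```

Under this model, AVX512 retires somewhat less than half as many instructions as AVX2 but at somewhat less than half the instructions/cycle, so the cycle counts nearly cancel, and the predicted time ratio lands within about one percent of the measured one.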

The CPU frequency is nearly the same for all of the ISAs (only one core is used, so the CPU is in turbo mode),
so the effect of slower CPU frequency for AVX512 workloads is not so important here.

My questions are

1) What is the cause of the low instructions/cycle for AVX512?

2) Is there anything that we can do to increase the instructions/cycle for AVX512 on our computer?

3) Is this decrease in instructions/cycle for AVX512 common to all Skylake processors, or are there Skylake processors that do
not show it?

Thanks in advance,

Victor

This is the expected result on the Xeon Gold 5000-series processors (except the Xeon Gold 5122/5222).

If you look on the Intel product page for your processor (https://ark.intel.com/content/www/us/en/ark/products/120473/intel-xeon-g...), there is an entry near the bottom of the page for "# of AVX-512 FMA Units", where it reports that the processor has 1.    

A single AVX-512 unit has the same peak performance per cycle as the two 256-bit AVX2 units in the processor, but typically runs at a lower frequency. 

The Xeon Bronze 3000, Xeon Silver 4000, and Xeon Gold 5000 processors have one AVX-512 unit per core (except the Xeon Gold 5122/5222, which has two).

The Xeon Gold 6000 processors and Xeon Platinum 8000 processors all have two AVX-512 units per core.
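
The per-cycle peak comparison above can be written out as a short calculation. This is only a sketch in Python: it assumes each FMA unit completes one fused multiply-add per 64-bit lane per cycle (counted as 2 floating-point operations), and the function name is made up for illustration.

```python
def peak_dp_flops_per_cycle(vector_bits, fma_units):
    """Peak double-precision FLOPs per cycle per core for the given FMA setup."""
    lanes = vector_bits // 64   # number of 64-bit doubles in one vector register
    flops_per_fma = 2           # one multiply plus one add per lane
    return lanes * flops_per_fma * fma_units


avx2_two_units = peak_dp_flops_per_cycle(256, 2)     # two 256-bit FMA units
avx512_one_unit = peak_dp_flops_per_cycle(512, 1)    # one AVX-512 FMA unit
avx512_two_units = peak_dp_flops_per_cycle(512, 2)   # two AVX-512 FMA units

print("AVX2, two 256-bit units:", avx2_two_units)
print("AVX-512, one unit:      ", avx512_one_unit)
print("AVX-512, two units:     ", avx512_two_units)
```

With one AVX-512 unit the per-cycle peak matches that of the two 256-bit units, so any frequency reduction under AVX-512 leaves it at or below AVX2 for a DGEMM-like workload; only the two-unit parts double the per-cycle peak.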

"Dr. Bandwidth"

Hello John,

Thank you for posting.  Can you confirm that my understanding is correct?  I apologize for my limited understanding of this issue.

On a single core of an Intel 5118 Gold processor, a stream of AVX2 instructions is passed to the two AVX2 units, so that the AVX2 instructions are processed in parallel.  This parallel effect increases the number of instructions per cycle.  But a stream of AVX512 instructions cannot be processed in parallel because there is only one AVX512 unit on the core, so the number of instructions per cycle is not increased.

So there is no advantage in using AVX512 on this processor, at least not for dgemm calculations.  And AVX512 becomes even worse relative to AVX2 when using SMP because the cpu frequency slowdown is more pronounced with AVX512 workloads.

Thanks in advance,

Victor

Your summary is correct.

On the processors with one AVX-512 unit, the AVX-512 instruction set might provide some performance advantages for codes that can exploit its special features (masking, gather/scatter instructions, etc), but I can't point to any specific examples.

"Dr. Bandwidth"

Hello John,

I looked at the Intel product page for the 5118 processor and I can see the entry "# of AVX-512 FMA Units = 1". But I don't see anywhere on that product page the number of 256-bit AVX2 units.  How can I determine the number of AVX2 units for a given processor?  Do all Gold processors have 2 AVX2 units?

Thanks in advance,

Victor

As far as I can tell, Intel Haswell, Broadwell, Skylake (client), Skylake (server), and the client and server Skylake follow-on processors all have two 256-bit AVX2+FMA units.

One place where the distinction is mentioned is in Chapter 2 of the "Intel 64 and IA-32 Architectures Optimization Reference Manual" (document 248966-041, April 2019).   In section 2.1 "The Skylake Server Microarchitecture", the text notes:

The green stars in Figure 2-1 represent new features in Skylake Server microarchitecture compared to Skylake microarchitecture for client; a 1 MiB L2 cache and an additional Intel AVX-512 FMA unit on port 5 which is available on some parts. (emphasis added)

There is also a footnote above the new feature list containing a typical disclaimer, "Some features may not be available on all products."

In addition to the "green stars" in Figure 2-1, there is a red box labelled "AVX-512 Port Fusion", that includes all the vector functions of Port 0 and Port 1.  Port 0 and Port 1 are the locations of the 256-bit vector FMA units for Haswell/Broadwell, Skylake (client), Skylake (server), and newer processors. These two units are logically combined to create the single AVX-512 unit on the "low end" Xeon Scalable processors.   While it is possible to implement AVX-256 instructions using 128-bit FMA units (as in AMD's first-generation EPYC processors), I don't know of any Intel processors that implement the AVX2 instruction set without also including two full 256-bit pipelines.

"Dr. Bandwidth"

Hello John,

Thank you for your answer. My understanding is that, for a CPU with only one AVX-512 unit/core, it is better not to use AVX512 instructions, but to use AVX2 instructions instead.  However, for a CPU with two AVX-512 units/core, it is better to use AVX512 instructions.  (The above observation is based on workflows dominated by DGEMM calls.)

So I would like to auto-detect at run time whether the CPU has one or two AVX-512 units/core.  I can use the _may_i_use_cpu_feature intrinsic to detect if the CPU supports AVX512 instructions.  But I don't know how to detect the number of AVX-512 units/core. 

I suppose that I could time DGEMM calls with and without enabling AVX512 instructions, and if the calls are faster with AVX512, then I can assume that there are two AVX-512 units/core.  But this approach seems inelegant.
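
For concreteness, the timing approach could look roughly like the Python sketch below. It is only an outline under stated assumptions: the benchmark runs as a separate process so that MKL_ENABLE_INSTRUCTIONS is set before MKL starts up, the 5% margin is an arbitrary choice, and ./dgemm_test is a hypothetical executable name.

```python
import os
import subprocess
import sys
import time


def time_command(cmd, isa):
    """Run cmd once with MKL_ENABLE_INSTRUCTIONS=isa; return wall clock seconds."""
    env = dict(os.environ, MKL_ENABLE_INSTRUCTIONS=isa)
    start = time.perf_counter()
    subprocess.run(cmd, env=env, check=True, stdout=subprocess.DEVNULL)
    return time.perf_counter() - start


def pick_isa(timings, margin=0.05):
    """Prefer AVX512 only if it beats AVX2 by more than `margin` (a fraction)."""
    if timings["AVX512"] < timings["AVX2"] * (1.0 - margin):
        return "AVX512"
    return "AVX2"


# On a real system the timings would come from something like
#   {isa: time_command(["./dgemm_test"], isa) for isa in ("AVX2", "AVX512")}
# where ./dgemm_test is a hypothetical benchmark executable.
# Here the wall clock times from the table in the first post are used instead.
choice = pick_isa({"AVX2": 16.8, "AVX512": 17.2})
print("preferred ISA:", choice)
```

The margin is there so that timing noise between two nearly equal runs does not flip the choice to AVX512.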

Thanks in advance,

Victor

I don't know of any feature that will tell you the number of AVX512 units.  If there was such an interface, I would have expected Intel to use it in MKL and in their optimized LINPACK benchmark to switch to 256-bit vectors on the Xeon Silver and Gold processors that have only one AVX512 unit.

Even if a processor has two AVX512 units, it is often faster to use 256-bit SIMD instructions because of their advantage in maximum Turbo frequency.  Intel compilers will typically generate 256-bit SIMD instructions for the CORE-AVX512 target.  The documentation suggests that the compiler attempts to model performance with different vector widths and chooses the best.   On the Xeon Platinum 8160, for example, the maximum all-core (24-core) Turbo frequency for "high-current" AVX512 instructions is 2.0 GHz, while the corresponding value for "high-current" 256-bit instructions is 2.5 GHz.  The 512-bit SIMD instructions would have to reduce estimated execution time (in cycles) by more than 20% relative to the 256-bit SIMD version to make AVX-512 worthwhile.  
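
The 20% figure follows directly from the two frequencies. A minimal Python sketch, assuming run time is simply cycles divided by frequency (the function name is illustrative):

```python
def required_cycle_reduction(freq_256_ghz, freq_512_ghz):
    """Fraction by which 512-bit code must cut its cycle count to break even."""
    # time = cycles / frequency, so equal run time requires
    # cycles_512 / cycles_256 = freq_512 / freq_256.
    return 1.0 - freq_512_ghz / freq_256_ghz


# Xeon Platinum 8160 all-core Turbo frequencies quoted above.
break_even = required_cycle_reduction(freq_256_ghz=2.5, freq_512_ghz=2.0)
print(f"cycle reduction needed to break even: {break_even:.0%}")
```

Any cycle-count reduction smaller than this break-even value means the 512-bit version takes longer in wall clock time despite executing in fewer cycles.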

"Dr. Bandwidth"

Hi John, I doubt that the second AVX512 unit can run AVX2 instructions. Do you have data from your Xeon Platinum 8160 with MKL_ENABLE_INSTRUCTIONS=AVX2?

Thanks

Jason Ye

The second AVX-512 unit does not execute AVX-256 instructions -- these are executed in the 256-bit units behind ports 0 and 1.   
In AVX-512 mode, the 256-bit execution units behind ports 0 & 1 are "fused" into a 512-bit AVX-512 unit, and (on parts with a second AVX-512 unit) the second unit is accessed via port 5.  This is all described in Figure 2-2 of the Intel Architectures Optimization Reference Manual (document 248966-042b, September 2019).

"Dr. Bandwidth"

Yes, :).  For GEMM or LINPACK-like applications, we definitely need AVX512 code to get the best perf on processors with 2 AVX512 units, while using AVX2 on processors with one AVX512 unit may give slightly higher perf due to the higher base and turbo frequency.

In your previous comments, you mentioned "Even if a processor has two AVX512 units, it is often faster to use 256-bit SIMD instructions because of their advantage in maximum Turbo frequency. " which implies both AVX512 units can run fused AVX2 instructions.

> [...] "Even if a processor has two AVX512 units, it is often faster to use 256-bit SIMD instructions because of their advantage in maximum Turbo frequency. " which implies both AVX512 units can run fused AVX2 instructions [...]

I don't think that implication can be derived from my statement.   Sorry if I was not clear.

I have not tested the CORE_POWER.LVL*_TURBO_LICENSE values for AVX-512 instructions using 256-bit register operands (or smaller).  The compiler generates AVX or AVX2 instructions for 256-bit SIMD unless specific AVX-512 functionality is required (e.g., masks).     If these instructions correspond to a "Level 2 Turbo License", then the core will run at the same frequency as it does with AVX/AVX2 instructions, but if they require "Level 3 Turbo License" then performance would probably be better using 512-bit register operands (since the frequency would be the same).

I did notice in the STREAM benchmark that when the loops run for much longer than 1 millisecond the "Copy" kernel (which uses only load and store instructions) runs at a "Turbo License" level that is one step lower than the "Scale", "Add", and "Triad" kernels (all of which perform 64-bit arithmetic in addition to loads and stores), with correspondingly higher frequency.  This contributes to an average frequency that does not make much sense -- e.g., about 20% of the time running at 2.5 GHz and 80% of the time running at 2.0 GHz (on the Xeon Platinum 8160, using all cores and COMMON-AVX512 code).

"Dr. Bandwidth"
