How Intel® AVX2 Improves Performance on Server Applications

The latest Intel® Xeon® processor E5 v3 family includes a feature called Intel® Advanced Vector Extensions 2 (Intel® AVX2), which can potentially improve application performance related to high performance computing, databases, and video processing. Here we will explain the context, and provide an example of how using Intel® AVX2 improved performance for a commonly known benchmark.

For existing vectorized code that uses floating-point operations, you can gain a potential performance boost when running on newer platforms such as the Intel® Xeon® processor E5 v3 family by doing one of the following:

  1. Recompile your code using the Intel® compiler with the proper AVX2 switch to convert existing SSE code. See the Intel® Compiler Options for Intel® SSE and Intel® AVX generation (SSE2, SSE3, SSSE3, ATOM_SSSE3, SSE4.1, SSE4.2, ATOM_SSE4.2, AVX, AVX2) and processor-specific optimizations for more details.
  2. Modify your code's function calls to leverage the Intel® Math Kernel Library (Intel® MKL), which is already optimized to use Intel AVX2 where supported.
  3. Use the AVX2 intrinsic instructions. Developers working in a high-level language (such as C or C++) can call Intel® intrinsic instructions and recompile the code. See the Intel® Intrinsics Guide and the Intel® 64 and IA-32 Architectures Optimization Reference Manual for more details.
  4. Code in assembly instructions directly. Developers working in a low-level language (such as assembly) can replace existing SSE instructions with their AVX2 equivalents. See the Intel® 64 and IA-32 Architectures Optimization Reference Manual for more details.

In this article, I will share a simple experiment using the Intel® Optimized LINPACK benchmark to demonstrate the performance gain of three different sized workloads (30K, 75K, and 100K) from Intel AVX2 running on Windows* and Linux* operating systems. I will also share the list of AVX2 instructions that were executed and the equivalent SSE instructions for developers who are interested in direct coding.

I used the following platform for the experiment:

CPU & Chipset

Model/Speed/Cache: E5-2699 v3 QGN1, 2.3GHz, 45MB Cache, 145W TDP, C-1 Step

  • # of cores per chip: 18
  • # of sockets: 2
  • Chipset: Intel C610 "Wellsburg" series chipset, QS (B-1 step)
  • System bus: 9.6GT/s QPI

Platform

Brand/model: Intel EPSD Wildcat Pass

  • Chassis: Intel 2U Rackable
  • Baseboard: codenamed Wildcat Pass, 3 SPC DDR4
  • Board revision: Qual / PBA H30334-200
  • BIOS: SE5C610.86B.01.01.556.061320140714 BMC 0.20.6013 ME 03.00.05.402 SDR 0.10
  • Dimm slots: 24
  • Power supply: 1x 1100W Removable S-1100ADU00-201 (Rev S3)
  • CD ROM: TEAC Slim
  • Network (NIC): Onboard 10GbE

Memory

Memory Size: 128GB (8x16GB) DDR4 2133P
Brand/model: Samsung M393A2G40DB0-CPB
DIMM info: 16GB 2Rx4 PC4-2133P-RA0-10-DC0

Mass storage

Brand & model: Intel SSD S3500 Series (SSDSC2BB240G401)
Number/size/RPM/Cache: 1ea - 240GB
Plus Intel SSD P3700 Series (SSDPEDMD400G4)

Operating system

Microsoft* Windows Server 2012 R2 / SLES 11 SP3 Linux*

Procedure for running LINPACK:

  1. Download and install the following:
    1. Intel® Math Kernel Library – LINPACK Download
      http://software.intel.com/en-us/articles/intel-math-kernel-library-linpack-download
    2. Intel® Math Kernel Library (Intel® MKL)
      http://software.intel.com/en-us/intel-math-kernel-library-evaluation-options
  2. Create input files for 30K, 75K, and 100K from the “...\linpack” directory.
  3. For optimal performance, make the following operating system and BIOS setting changes before running LINPACK:
    1. Turn off Hyper-Threading in the BIOS.
    2. For Windows, set “MKL_CBWR=AVX2” on the command line and update the runme_xeon64.bat file to use the input files you previously created. For Linux, export the “MKL_CBWR=AVX2” setting on the command line and update the runme_xeon64 shell script to use the input files you created.
    3. The results will be reported in Gflops, similar to Table 2.
  4. For Intel AVX runs, set “MKL_CBWR=AVX” and repeat the above steps.
  5. For Intel SSE runs, set “MKL_CBWR=SSE4_2” and repeat the above steps.

What are the equivalent instructions for Intel AVX2, AVX, and SSE that were executed?

Table 1 has a list of equivalent instructions for Intel AVX2, AVX, and SSE (SSE/SSE2/SSE3/SSE4), which is a useful reference if you are thinking of moving your existing code to Intel AVX2.

| Intel AVX2 Instructions from the LINPACK Runs | Equivalent Intel AVX Instructions | Equivalent Intel SSE Instructions (SSE/SSE2/SSE3/SSE4) | Definitions |
|---|---|---|---|
| VADDPD | VADDPD | ADDPD | Add Packed Double-Precision Floating-Point Values |
| VADDSD | VADDSD | N/A | Add Scalar Double-Precision Floating-Point Values |
| VBROADCASTSD | VBROADCASTSD | N/A | Broadcast a 64-bit double-precision value from the source operand to all elements of the YMM register |
| VCMPPD | VCMPPD | N/A | Compare packed double-precision floating-point values |
| VCOMISD | VCOMISD | N/A | Perform an ordered comparison of scalar double-precision floating-point values and set flags in the EFLAGS register |
| VDIVSD | VDIVSD | DIVSD | Divide the low double-precision floating-point value in xmm2 by the low double-precision floating-point value in xmm3/m64 |
| VEXTRACTF128 | VEXTRACTF128 | N/A | Extract 128 bits of float data from ymm2 and store the result in xmm1/mem |
| VEXTRACTI128 | N/A | N/A | Extract 128 bits of integer data from ymm2 and store the result in xmm1/mem |
| VFMADD213PD | N/A | N/A | Multiply packed double-precision floating-point values from xmm0 and xmm1, add to xmm2/mem, and put the result in xmm0 |
| VFMADD213SD | N/A | N/A | Multiply the scalar double-precision floating-point value from xmm0 and xmm1, add to xmm2/mem, and put the result in xmm0 |
| VFMADD231PD | N/A | N/A | Multiply packed double-precision floating-point values from xmm1 and xmm2/mem, add to xmm0, and put the result in xmm0 |
| VFMADD231SD | N/A | N/A | Multiply the scalar double-precision floating-point value in xmm1 and xmm2/mem, add to xmm0, and put the result in xmm0 |
| VFNMADD213PD | N/A | N/A | Multiply packed double-precision floating-point values from xmm0 and xmm1, negate the product, add to xmm2/mem, and put the result in xmm0 |
| VFNMADD213SD | N/A | N/A | Multiply the scalar double-precision floating-point value from xmm0 and xmm1, negate the product, add to xmm2/mem, and put the result in xmm0 |
| VFNMADD231PD | N/A | N/A | Multiply packed double-precision floating-point values from ymm1 and ymm2/mem, negate the product, add to ymm0, and put the result in ymm0 |
| VINSERTF128 | VINSERTF128 | N/A | Replace either the lower or the upper half of a 256-bit YMM register with the value of a 128-bit source operand; the other half of the destination is unchanged |
| VMAXPD | VMAXPD | N/A | Determine the maximum of float64 vectors |
| VMAXSD | VMAXSD | N/A | Determine the maximum of scalar float64 values |
| VMOVAPD | VMOVAPD | MOVAPD | Move Aligned Packed Double-Precision Floating-Point Values |
| VMOVAPS | VMOVAPS | MOVAPS | Move Aligned Packed Single-Precision Floating-Point Values |
| VMOVD | VMOVD | N/A | Move Doubleword |
| VMOVDQU | VMOVDQU | MOVDQU | Move Unaligned Double Quadword |
| VMOVHPD | VMOVHPD | MOVHPD | Move High Packed Double-Precision Floating-Point Value |
| VMOVQ | VMOVQ | N/A | Move Quadword |
| VMOVSD | VMOVSD | MOVSD | Move Scalar Double-Precision Floating-Point Value |
| VMOVUPD | VMOVUPD | MOVUPD | Move Unaligned Packed Double-Precision Floating-Point Values |
| VMOVUPS | VMOVUPS | N/A | Move Unaligned Packed Single-Precision Floating-Point Values |
| VMULPD | VMULPD | MULPD | Multiply Packed Double-Precision Floating-Point Values |
| VMULSD | VMULSD | N/A | Multiply Scalar Double-Precision Floating-Point Value |
| VPADDQ | N/A | N/A | Add Packed Quadword Integers |
| VPBLENDVB | N/A | N/A | Conditionally blend byte elements of the source vectors depending on bits in a mask vector |
| VPBROADCASTQ | N/A | N/A | Take a qword from the source operand and broadcast it to all elements of the result vector |
| VPCMPEQD | N/A | N/A | Compare packed doublewords of two source vectors for equality |
| VPCMPGTQ | N/A | N/A | Compare packed quadwords of two source vectors for greater-than |
| VPERM2F128 | VPERM2F128 | N/A | Permute 128-bit floating-point fields selected from the two source operands |
| VPSHUFD | VPSHUFD | N/A | Permute 32-bit blocks of an int32 vector |
| VPXOR | VPXOR | PXOR | Logical Exclusive OR |
| VUCOMISD | VUCOMISD | UCOMISD | Unordered Compare Scalar Double-Precision Floating-Point Values and Set EFLAGS |
| VUNPCKHPD | VUNPCKHPD | UNPCKHPD | Unpack and Interleave High Packed Double-Precision Floating-Point Values |
| VUNPCKLPD | VUNPCKLPD | UNPCKLPD | Unpack and Interleave Low Packed Double-Precision Floating-Point Values |
| VXORPD | VXORPD | XORPD | Bitwise Logical XOR of Double-Precision Floating-Point Values |
| VXORPS | VXORPS | N/A | Perform a bitwise logical XOR operation on float32 vectors |
| VZEROUPPER | VZEROUPPER | N/A | Set the upper half of all YMM registers to zero; used when switching between 128-bit code and 256-bit code |

Table 1 – Intel AVX2, AVX, and Intel SSE Equivalent Instructions

The list in Table 1 is just a subset. The full list can be obtained from the Intel® 64 and IA-32 Architectures Optimization Reference Manual. Intel AVX2 and Intel AVX complement each other; many instructions are shared between the two to provide the necessary functionality.

What is the performance gain from running the LINPACK benchmark with Intel AVX2 vs. Intel SSE enabled and Intel AVX2 vs. Intel AVX on the Intel Xeon E5-2699 v3 processor-based server?

Table 2 shows the results from the three different workloads running on Windows* and Linux*. In the “Ratio Intel AVX2 vs. Intel SSE” column, the numbers show that the LINPACK benchmark produces ~2.2x-2.8x better performance when running with the combination of an Intel AVX2 optimized LINPACK and an Intel AVX2 capable processor. For the Intel AVX2 vs. Intel AVX column, the numbers show that the LINPACK benchmark produces ~1.3x-1.6x better performance. This is just an example of the potential performance boost for LINPACK. For other applications, the performance gain will vary depending on the optimized code and the hardware environment.

Windows*

| Workload | Intel AVX2 (Gflops) | Intel AVX (Gflops) | Intel SSE4 (Gflops) | Ratio: Intel AVX2 vs. Intel SSE | Ratio: Intel AVX2 vs. Intel AVX |
|---|---|---|---|---|---|
| LINPACK 30K v11.1.3 | 735.59 | 562.68 | 331.75 | 2.2 | 1.3 |
| LINPACK 75K v11.1.3 | 952.93 | 589.18 | 347.99 | 2.7 | 1.6 |
| LINPACK 100K v11.1.3 | 959.90 | 597.66 | 350.51 | 2.7 | 1.6 |

Linux*

| Workload | Intel AVX2 (Gflops) | Intel AVX (Gflops) | Intel SSE4 (Gflops) | Ratio: Intel AVX2 vs. Intel SSE | Ratio: Intel AVX2 vs. Intel AVX |
|---|---|---|---|---|---|
| LINPACK 30K v11.1.3 | 822.35 | 574.78 | 335.41 | 2.3 | 1.4 |
| LINPACK 75K v11.1.3 | 964.23 | 610.63 | 346.73 | 2.8 | 1.6 |
| LINPACK 100K v11.1.3 | 985.31 | 611.71 | 353.34 | 2.8 | 1.6 |

Table 2 – Results and Performance Gain from the LINPACK benchmark running on a two-socket Intel® Xeon® E5-2699 v3 server

With the new AVX2 instructions and 256-bit registers in the Intel® Xeon® processor E5 v3 family, LINPACK was able to achieve over 2x the performance of LINPACK running with SSE instructions and over 1.3x the performance of LINPACK running with AVX instructions.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Configurations: Intel® Xeon® processor E5-2699 v3 @ 2.30GHz, 45MB L3 cache, 18 core pre-production system. Intel SSD S3500 Series (SSDSC2BB240G401) + Intel® SSD DC P3700 Series @ 400GB, 128GB memory (8x16GB DDR4 -2133Mhz), BIOS by Intel Corporation Version: SE5C610.86B.01.01.556.061320140714 BMC 0.20.6013 ME 03.00.05.402 SDR 0.10, Power supply: 1x 1100W Removable S-1100ADU00-201, running Microsoft* Windows Server 2012 R2 / SLES 11 SP3 Linux*

For more information go to http://www.intel.com/performance

Conclusion

From our LINPACK experiment, we see compelling performance benefits when moving to an AVX2-enabled Intel® Xeon® processor. In this specific case, we measured a performance increase of ~2.2x-2.8x for AVX2 vs. SSE and ~1.3x-1.6x for AVX2 vs. AVX in our test environment. That is a strong case for developers who have SSE-enabled code and are weighing the benefit of moving to a newer Intel® Xeon® processor-based system with AVX2. The reference materials below can help developers learn how to migrate existing SSE code to Intel AVX2 code.

References

 

NOTICES

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

For more complete information about compiler optimizations, see our Optimization Notice.