How Intel® AVX Improves Performance on Server Applications

The latest Intel® Xeon® processor E7 v2 family includes a feature called Intel® Advanced Vector Extensions (Intel® AVX), which can potentially improve application performance. Here we will explain the context, and provide an example of how using Intel® AVX improved performance for a commonly known benchmark.

For existing vectorized code that uses floating-point operations, you can potentially boost performance when running on newer platforms such as the Intel® Xeon® processor E7 v2 family by doing one of the following:

  1. Recompile your code using the Intel® compiler with the proper AVX switch to convert existing Intel SSE code. See Intel® AVX State Transitions: Migrating SSE Code to AVX for details.

  2. Modify your code's function calls to leverage the Intel® Math Kernel Library (Intel® MKL), which is already optimized to use Intel AVX where supported.

  3. Use Intel AVX intrinsics. Developers working in a high-level language such as C or C++ can call Intel® intrinsic functions and recompile. See the Intel® Intrinsics Guide and the Intel® 64 and IA-32 Architectures Optimization Reference Manual for more details.
  4. Code in assembly instructions directly. Developers working in assembly can replace their existing Intel SSE instructions with the equivalent Intel AVX instructions. See the Intel® 64 and IA-32 Architectures Optimization Reference Manual for more details.

In this article, I will share a simple experiment using the Intel® Optimized LINPACK benchmark to demonstrate the performance gain from Intel AVX for three different workload sizes (30K, 40K, and 75K) on the Windows* and Linux* operating systems. I will also share the list of Intel AVX instructions that were executed and the equivalent Intel SSE instructions for developers who are interested in direct coding.

I used the following platform for the experiment:

CPU & Chipset

Model/Speed/Cache: E7-4890 v2 QFJY 2.8 GHz, 37.5 MB cache, 155 W TDP, D1

  • # of cores per chip: 15
  • # of CPU sockets: 4
  • Chipset: Patsburg SSB-J C1
  • System bus: 8GT/s QPI

Platform

Brand/model: Intel SDP S4TR1SY2B Brickland IVT-EX Qual MM # 931237

  • Chassis: Intel 4U Rackable
  • Baseboard: Intel CRB baseboard codenamed Thunder Ridge
  • BIOS: BIVTSDP1.86B.0042.R04.1309061422, BMC 70.06.r5145 w/ Closed Chassis SDR, ME 2.3.0, FRU D.00, CPLD 1.06
  • DIMM slots: 96
  • PCI slots: 1 x4, 7 x8, 4 x16
  • Drive controller: LSI SAS9217-8i (with custom FW)
  • Power supply: 2x1200W NON-REDUNDANT (+2 empty slots)
  • CD ROM: TEAC Slim
  • Network (nic): 1x Intel Ethernet Converged Network Adapter x540-T2 "Twin Pond" (OEM-GEN)

Memory

Memory Size: 256GB (32 x 8GB) - 4 DIMMS per memory riser card (Slot A0, B0, C0, D0)

Brand/model: Samsung M393B1K70DH0-YK0 1309

DIMM info: 8GB 2Rx4 PC3L-1200R-11-11-E2-P2

Mass storage

Brand & model: Intel SSD DC S3700 Series SSDSC2BA800G3

Number/Size: 2 x 800 GB in RAID 0

Operating system

Windows Server 2012 R2 / SLES 11 SP3

Procedure for running LINPACK:

  1. Download and install the following:
    1. Intel® Math Kernel Library – LINPACK Download
      https://software.intel.com/en-us/articles/intel-math-kernel-library-linpack-download
    2. Intel® Math Kernel Library (MKL)
      https://software.intel.com/en-us/intel-math-kernel-library-evaluation-options
  2. Create three different input files for the 30K, 40K, and 75K problem sizes in the “...\linpack” directory
  3. For Intel AVX runs, update the files as follows:
    1. For Windows, update the runme_xeon64.bat file to use the new input files you created. For Linux, update the runme_xeon64 shell script to use the new input files.
    2. The results will be in Gflops, similar to Table 2
  4. For Intel SSE runs, you will need a processor with Intel AVX disabled; then repeat the above steps.
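For reference, a LINPACK input file follows the layout sketched below. The values shown are an illustrative guess covering the three workload sizes used in this experiment; check the lininput_xeon64 sample file shipped with the benchmark for the authoritative format:

```
Sample Intel(R) LINPACK data file (lininput_xeon64)
Intel(R) LINPACK data
3                          # number of tests
30000 40000 75000          # problem sizes
30000 40000 75000          # leading dimensions
1 1 1                      # number of times to run each test
4 4 4                      # alignment values (in KB)
```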

What are the Intel AVX and the equivalent Intel SSE instructions that were executed?

Table 1 has a list of Intel AVX instructions that were executed during the Intel AVX runs. I have provided the equivalent Intel SSE instructions for those developers who are thinking of moving their existing Intel SSE code to Intel AVX.

| Intel AVX Instructions from the LINPACK Runs | Equivalent Intel SSE Instructions (SSE/SSE2/SSE3/SSE4) | Definition |
|---|---|---|
| VADDPD | ADDPD | Add Packed Double-Precision Floating-Point Values |
| VBLENDPD | BLENDPD | Blend Packed Double-Precision Floating-Point Values |
| VBROADCASTSD | N/A – New in Intel AVX | Copy a 64-bit memory operand to all elements of a YMM vector register. |
| VDIVSD | DIVSD | Divide the low double-precision floating-point value in xmm2 by the low double-precision floating-point value in xmm3/m64 |
| VEXTRACTF128 | N/A – New in Intel AVX | Extract either the lower or the upper half of a 256-bit YMM register and copy the value to a 128-bit destination operand. |
| VINSERTF128 | N/A – New in Intel AVX | Replace either the lower or the upper half of a 256-bit YMM register with the value of a 128-bit source operand; the other half of the destination is unchanged. |
| VMOVAPD | MOVAPD | Move Aligned Packed Double-Precision Floating-Point Values |
| VMOVAPS | MOVAPS | Move Aligned Packed Single-Precision Floating-Point Values |
| VMOVDDUP | MOVDDUP | Move One Double-FP and Duplicate |
| VMOVDQU | MOVDQU | Move Unaligned Double Quadword |
| VMOVHPD | MOVHPD | Move High Packed Double-Precision Floating-Point Value |
| VMOVLPD | MOVLPD | Move Low Packed Double-Precision Floating-Point Value |
| VMOVSD | MOVSD | Move or Merge Scalar Double-Precision Floating-Point Value |
| VMOVUPD | MOVUPD | Move Unaligned Packed Double-Precision Floating-Point Values |
| VMOVUPS | MOVUPS | Move Unaligned Packed Single-Precision Floating-Point Values |
| VMULPD | MULPD | Multiply Packed Double-Precision Floating-Point Values |
| VMULSD | MULSD | Multiply Scalar Double-Precision Floating-Point Values |
| VPERM2F128 | N/A – New in Intel AVX | Shuffle the four 128-bit vector elements of two 256-bit source operands into a 256-bit destination operand, with an immediate constant as selector. |
| VPERMILPD | N/A – New in Intel AVX | Shuffle the 64-bit vector elements of one input operand with an immediate operand as selector. |
| VPXOR | PXOR | Bitwise Logical Exclusive OR |
| VSUBPD | SUBPD | Subtract Packed Double-Precision Floating-Point Values |
| VSUBSD | SUBSD | Subtract Scalar Double-Precision Floating-Point Values |
| VUCOMISD | UCOMISD | Unordered Compare Scalar Double-Precision Floating-Point Values and Set EFLAGS |
| VUNPCKHPD | UNPCKHPD | Unpack and Interleave High Packed Double-Precision Floating-Point Values |
| VUNPCKLPD | UNPCKLPD | Unpack and Interleave Low Packed Double-Precision Floating-Point Values |
| VXORPD | XORPD | Bitwise Logical XOR of Packed Double-Precision Floating-Point Values |
| VZEROUPPER | N/A – New in Intel AVX | Set the upper half of all YMM registers to zero; used when transitioning between 128-bit and 256-bit code. |

Table 1 – Intel® AVX and Equivalent Intel SSE Instructions


The list in Table 1 is just a subset of the AVX instructions available. The full list can be obtained from the Intel® 64 and IA-32 Architectures Optimization Reference Manual.


What is the performance gain from running the LINPACK benchmark with Intel AVX vs. Intel SSE enabled on the Intel Xeon processor E7-4890 v2 server?

Table 2 shows the results from the three different workloads running on Windows* and Linux*. In the Ratio column, the numbers show that the LINPACK benchmark produces ~1.6x-1.7x better performance when running with the combination of an Intel AVX optimized LINPACK and an Intel AVX capable processor. This is just an example of the potential performance boost for LINPACK. For other applications, the performance gain will vary depending on the optimized code and the hardware environment.

| Workload | Intel AVX (Gflops) | Intel SSE (Gflops) | Ratio: Intel AVX/Intel SSE |
|---|---|---|---|
| Windows* | | | |
| LINPACK 30K v11.1.1 | 631.8 | 400.3 | 1.6 |
| LINPACK 40K v11.1.1 | 756.4 | 480.6 | 1.6 |
| LINPACK 75K v11.1.1 | 829.3 | 514.3 | 1.6 |
| Linux* | | | |
| LINPACK 30K v11.1.1 | 913.6 | 534.3 | 1.7 |
| LINPACK 40K v11.1.1 | 1023.5 | 621.2 | 1.6 |
| LINPACK 75K v11.1.1 | 1128.8 | 657.0 | 1.7 |

Table 2 – Results and Performance Gain from the LINPACK benchmark

Conclusion

From our LINPACK experiment, we see compelling performance benefits from moving to an Intel AVX-enabled Intel Xeon processor. In this specific case, we measured a performance increase of ~1.6x-1.7x in our test environment, which makes a strong case for developers who have Intel SSE-enabled code and are weighing the benefit of moving to a newer Intel® Xeon® processor-based system with Intel AVX. The reference materials below can help developers learn how to migrate existing Intel SSE code to Intel AVX.

References:


For more complete information about compiler optimizations, see our Optimization Notice.