By Quoc-Thai V Le
Published: 02/27/2014 | Last Updated: 02/27/2014
The latest Intel® Xeon® processor E7 v2 family supports Intel® Advanced Vector Extensions (Intel® AVX), which can potentially improve application performance. Here we will explain the context and provide an example of how using Intel® AVX improved performance for a commonly known benchmark.
If you have existing vectorized code that uses floating-point operations, you can gain a potential performance boost when running on newer platforms such as the Intel® Xeon® processor E7 v2 family by doing one of the following:
Recompile your code with the Intel® compiler, using the appropriate AVX switch to convert existing SSE code. See Intel® AVX State Transitions: Migrating SSE Code to AVX
Modify your code's function calls to leverage the Intel® Math Kernel Library (Intel® MKL), which is already optimized to use AVX where supported
Code in assembly directly. If you develop in a low-level language such as assembly, you can replace your existing SSE instructions with their AVX equivalents. See the Intel® 64 and IA-32 Architectures Optimization Reference Manual for more details
In this article, I will share a simple experiment using the Intel® Optimized LINPACK benchmark to demonstrate the performance gain from Intel AVX for three workloads of different sizes (30K, 40K, and 75K) running on Windows* and Linux* operating systems. I will also share the list of AVX instructions that were executed and the equivalent SSE instructions for developers who are interested in direct coding.
I used the following platform for the experiment:
| Component | Details |
|---|---|
| CPU & Chipset | Model/Speed/Cache: E7-4890 v2 QFJY, 2.8 GHz, 37.5 MB cache, 155 W TDP, D1 stepping |
| Platform | Brand/model: Intel SDP S4TR1SY2B Brickland IVT-EX Qual, MM# 931237 |
| Memory | Size: 256 GB (32 x 8 GB), 4 DIMMs per memory riser card (slots A0, B0, C0, D0). Brand/model: Samsung M393B1K70DH0-YK0 1309. DIMM info: 8GB 2Rx4 PC3L-1200R-11-11-E2-P2 |
| Mass storage | Brand/model: Intel SSD DC S3700 Series SSDSC2BA800G3, 2 x 800 GB in RAID 0 |
| Operating system | Windows Server 2012 R2 / SLES 11 SP3 |
Procedure for running LINPACK:
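The article does not reproduce the exact commands, so the following is a hedged sketch of a typical run, assuming the standard layout of the Intel Optimized LINPACK package for Linux (the `runme_xeon64` script and `lininput_xeon64` input file ship with that package; paths may differ by version). `MKL_ENABLE_INSTRUCTIONS` is the Intel MKL environment variable that limits which instruction-set code paths the library may dispatch to, which is one way to compare AVX against an SSE baseline on the same machine:

```shell
# Sketch only: assumes the standard Intel Optimized LINPACK package layout.
cd benchmarks/linpack

# Edit lininput_xeon64 to set the problem sizes (e.g., 30000, 40000, 75000).

# AVX run (the default dispatch on an AVX-capable CPU):
MKL_ENABLE_INSTRUCTIONS=AVX ./runme_xeon64

# SSE baseline for comparison:
MKL_ENABLE_INSTRUCTIONS=SSE4_2 ./runme_xeon64
```

The benchmark reports sustained Gflops for each problem size, which is what Table 2 below compares.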
What are the Intel AVX and the equivalent Intel SSE instructions that were executed?
Table 1 has a list of Intel AVX instructions that were executed during the Intel AVX runs. I have provided the equivalent Intel SSE instructions for those developers who are thinking of moving their existing Intel SSE code to Intel AVX.
| Intel AVX Instruction | Equivalent Intel SSE Instruction | Definition |
|---|---|---|
| VADDPD | ADDPD | Add Packed Double-Precision Floating-Point Values |
| VBLENDPD | BLENDPD | Blend Packed Double-Precision Floating-Point Values |
| VBROADCASTSD | N/A – new in AVX | Broadcast a 64-bit memory operand to all elements of a YMM register |
| VDIVSD | DIVSD | Divide Scalar Double-Precision Floating-Point Values |
| VEXTRACTF128 | N/A – new in AVX | Extract either the lower or the upper 128 bits of a 256-bit YMM register into a 128-bit destination operand |
| VINSERTF128 | N/A – new in AVX | Replace either the lower or the upper 128 bits of a 256-bit YMM register with a 128-bit source operand; the other half of the destination is unchanged |
| VMOVAPD | MOVAPD | Move Aligned Packed Double-Precision Floating-Point Values |
| VMOVAPS | MOVAPS | Move Aligned Packed Single-Precision Floating-Point Values |
| VMOVDDUP | MOVDDUP | Move One Double-FP and Duplicate |
| VMOVDQU | MOVDQU | Move Unaligned Double Quadword |
| VMOVHPD | MOVHPD | Move High Packed Double-Precision Floating-Point Value |
| VMOVLPD | MOVLPD | Move Low Packed Double-Precision Floating-Point Value |
| VMOVSD | MOVSD | Move or Merge Scalar Double-Precision Floating-Point Value |
| VMOVUPD | MOVUPD | Move Unaligned Packed Double-Precision Floating-Point Values |
| VMOVUPS | MOVUPS | Move Unaligned Packed Single-Precision Floating-Point Values |
| VMULPD | MULPD | Multiply Packed Double-Precision Floating-Point Values |
| VMULSD | MULSD | Multiply Scalar Double-Precision Floating-Point Values |
| VPERM2F128 | N/A – new in AVX | Shuffle the four 128-bit halves of two 256-bit source operands into a 256-bit destination operand, with an immediate constant as selector |
| VPERMILPD | N/A – new in AVX | Shuffle the 64-bit vector elements of one input operand, with an immediate operand as selector |
| VPXOR | PXOR | Bitwise Logical Exclusive OR |
| VSUBPD | SUBPD | Subtract Packed Double-Precision Floating-Point Values |
| VSUBSD | SUBSD | Subtract Scalar Double-Precision Floating-Point Values |
| VUCOMISD | UCOMISD | Unordered Compare Scalar Double-Precision Floating-Point Values and Set EFLAGS |
| VUNPCKHPD | UNPCKHPD | Unpack and Interleave High Packed Double-Precision Floating-Point Values |
| VUNPCKLPD | UNPCKLPD | Unpack and Interleave Low Packed Double-Precision Floating-Point Values |
| VXORPD | XORPD | Bitwise Logical XOR for Double-Precision Floating-Point Values |
| VZEROUPPER | N/A – new in AVX | Zero the upper halves of all YMM registers; used when switching between 128-bit and 256-bit code |
Table 1 – Intel® AVX and Equivalent Intel SSE Instructions
The list in Table 1 is just a subset of the AVX instructions available. The full list can be obtained from the Intel® 64 and IA-32 Architectures Optimization Reference Manual.
What is the performance gain for running the LINPACK benchmark with Intel AVX vs. Intel SSE enabled on the Intel Xeon E7 4890 v2 server?
Table 2 shows the results from the three different workloads running on Windows* and Linux*. In the Ratio column, the numbers show that the LINPACK benchmark produces ~1.6x-1.7x better performance when running with the combination of an Intel AVX optimized LINPACK and an Intel AVX capable processor. This is just an example of the potential performance boost for LINPACK. For other applications, the performance gain will vary depending on the optimized code and the hardware environment.
| Workload | Intel AVX (Gflops) | Intel SSE (Gflops) | Ratio: Intel AVX/Intel SSE |
|---|---|---|---|
| **Windows*** | | | |
| LINPACK 30K v11.1.1 | 631.8 | 400.3 | 1.6 |
| LINPACK 40K v11.1.1 | 756.4 | 480.6 | 1.6 |
| LINPACK 75K v11.1.1 | 829.3 | 514.3 | 1.6 |
| **Linux*** | | | |
| LINPACK 30K v11.1.1 | 913.6 | 534.3 | 1.7 |
| LINPACK 40K v11.1.1 | 1023.5 | 621.2 | 1.6 |
| LINPACK 75K v11.1.1 | 1128.8 | 657.0 | 1.7 |
Table 2 – Results and Performance Gain from the LINPACK benchmark
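As a sanity check, the Ratio column of Table 2 can be reproduced directly from the two Gflops columns. A small Python snippet, using the Linux numbers reported above:

```python
# Reproduce Table 2's Ratio column (Linux rows): AVX Gflops / SSE Gflops.
results = {
    "30K": (913.6, 534.3),
    "40K": (1023.5, 621.2),
    "75K": (1128.8, 657.0),
}
for size, (avx, sse) in results.items():
    print(f"LINPACK {size}: {avx / sse:.1f}x")  # 1.7x, 1.6x, 1.7x
```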
Conclusion
Our LINPACK experiment shows compelling performance benefits from moving to an AVX-enabled Intel Xeon processor: in this specific case, we saw a performance increase of ~1.6x-1.7x in our test environment. That is a strong case for developers who have SSE-enabled code and are weighing the benefit of moving to a newer Intel® Xeon® processor-based system with AVX. The reference materials below can help developers learn how to migrate existing SSE code to Intel AVX code.
References:
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804