1. Executive Summary
3. Intel Xeon processor E5-2600 V3 product family enhancements.
3.1 Intel® Advanced Vector Extensions 2 (Intel® AVX2) Instructions
3.2 Haswell New Instructions (HNI)
3.3 Support for DDR4 memory
3.4 Power Improvements
4. Grantley platform improvements
4.1 Intel® C610 Series Chipset (Wellsburg)
4.2 Virtualization features
4.3 New Security Features
4.4 Intel® Node Manager 3.0
About the Author
The Intel® Xeon® processor E5-2600 V3 product family, codenamed “Haswell EP”, is a 2-socket platform based on Intel’s most recent microarchitecture. It is the new “tock” in Intel’s tick-tock cadence, built on 22 nm process technology. This product brings additional capabilities for data centers: more cores, more memory and more bandwidth. As a result, platforms based on the Intel Xeon processor E5-2600 V3 product family can yield up to 33% better performance¹ than the previous-generation “Ivy Bridge EP”. The platform has many new hardware and software features. On the hardware side, there are additional cores and memory bandwidth, DDR4 memory support, power enhancements, virtualization enhancements and security enhancements (such as the System Management Mode external call trap); these can improve application performance significantly without any enabling effort from developers. On the software side, there are Haswell New Instructions (HNI) and Intel® Advanced Vector Extensions 2 (Intel® AVX2), which require developers to enable their applications.
The Intel Xeon processor E5-2600 V3 product family is based on the Haswell microarchitecture, which brings several enhancements to the Ivy Bridge EP microarchitecture (http://software.intel.com/en-us/articles/intel-xeon-processor-e5-2600-v2-product-family-technical-overview). The platform supporting the Intel Xeon processor E5-2600 V3 product family is named “Grantley.” This paper discusses the new features available in the Intel Xeon processor E5-2600 V3 product family compared to the Intel Xeon processor E5-2600 V2 product family. Each section includes information about what developers need to do to take advantage of new features to improve application performance and security.
Figure 1: Intel® Xeon® processor E5-2600 V3 product family overview
Some of the new features that come with the Intel Xeon processor E5-2600 V3 product family include:
- Intel® Advanced Vector Extensions 2 (Intel® AVX2) instructions
- Haswell New Instructions (HNI)
- Support for DDR4 memory
- Power Management feature improvements
Figure 1 shows an overview of the Intel Xeon processor E5-2600 V3 product family microarchitecture. Processors in the family have up to 18 cores (compared to up to 12 cores in the predecessor), which brings additional computing power to the table. They also have additional cache (the top-bin SKU, the Intel® Xeon® E5-2699 v3, has 45 MB compared to 30 MB in Ivy Bridge) and more memory bandwidth.
Table 1. Comparison of the Intel® Xeon® processor E5–2600 V3 product family to the Intel® Xeon® processor E5–2600 V2 product family
The rest of this paper discusses some of the main enhancements in this product family.
With Intel AVX, the floating-point vector instructions were extended from 128 bits to 256 bits. Intel AVX2 extends the integer vector instructions to 256 bits as well, using the same 256-bit YMM registers as Intel AVX. Intel AVX2 instructions benefit High Performance Computing (HPC) applications, databases, and audio and video applications. The new instructions include Fused Multiply Add (FMA), Gather, Shift and Permute instructions.
The Fused Multiply Add (FMA) instruction computes ±(a×b)±c with only one rounding. The intermediate result a×b is not rounded, which gives better accuracy than separate MUL and ADD instructions. FMA increases the performance and accuracy of many floating-point computations, such as matrix multiplication, dot products and polynomial evaluation. A 256-bit register holds 8 single-precision or 4 double-precision elements, so one instruction performs 8 single-precision or 4 double-precision FMA operations. Since FMA combines two operations into one, Floating Point Operations Per Second (FLOPS) are increased; additionally, because Haswell has 2 FMA units, the peak FLOPS are doubled.
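The effect of the single rounding can be illustrated with a small scalar model. The sketch below is not how the hardware is programmed; it only imitates the arithmetic in Python, using exact rational numbers to stand in for the unrounded intermediate product. The function names are hypothetical, and the per-cycle figure simply multiplies out the lane and unit counts stated above (counting each FMA as two floating-point operations).

```python
from fractions import Fraction


def mul_then_add(a: float, b: float, c: float) -> float:
    """Separate MUL and ADD: the product is rounded, then the sum is rounded."""
    return a * b + c


def fused_multiply_add(a: float, b: float, c: float) -> float:
    """Model of FMA: form a*b+c exactly, then round once to double precision."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)  # one correctly rounded conversion back to a double


def fma_lanes(vector_bits: int, element_bits: int) -> int:
    """Number of elements processed by one vector FMA instruction."""
    return vector_bits // element_bits


def flops_per_cycle(vector_bits: int, element_bits: int, fma_units: int) -> int:
    """Peak floating-point operations per cycle per core (each FMA counts as 2)."""
    return fma_lanes(vector_bits, element_bits) * 2 * fma_units


# Inputs chosen so that the exact product, 1 - 2**-54, is not representable
# as a double and rounds to 1.0 when MUL and ADD are issued separately.
a = 1.0 + 2.0**-27
b = 1.0 - 2.0**-27
c = -1.0

print(mul_then_add(a, b, c))        # the low-order term is lost
print(fused_multiply_add(a, b, c))  # the low-order term survives
print(flops_per_cycle(256, 32, 2), flops_per_cycle(256, 64, 2))
```

With these inputs the separate multiply and add return zero, while the fused model keeps the small residual term, which is the accuracy benefit described above.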
The Gather instruction loads sparse elements from memory into a single vector. It can gather 8 single-precision (Dword) or 4 double-precision (Qword) data elements into a vector register in a single operation. A base address points to the data structure in memory, and an index vector gives the offset of each element from the base address. The mask register tracks which elements still need to be gathered; the gather is complete when the mask register is all zeros. The Gather instruction enables vectorization of workloads that could not previously be vectorized, for example because their data is not contiguous in memory.
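The roles of the base address, index vector and mask can be sketched with a scalar model. This is only an illustration of the semantics described above, not the instruction itself: memory is modeled as a Python list, the base and indices are element positions rather than byte addresses, and the function name and parameters are hypothetical.

```python
def gather(memory, base, indices, mask, dest, scale=1):
    """Scalar model of a vector gather.

    For every lane whose mask bit is set, load memory[base + index * scale]
    into that lane and clear the mask bit. Lanes whose mask bit is clear
    keep the value already present in the destination.
    Returns the new destination vector and the final mask.
    """
    result = list(dest)
    remaining = list(mask)
    for lane, index in enumerate(indices):
        if remaining[lane]:
            result[lane] = memory[base + index * scale]
            remaining[lane] = False  # this element has been gathered
    return result, remaining


# Hypothetical data: 16 elements in memory, 8 lanes, lane 3 masked off.
memory = [10 * i for i in range(16)]
indices = [0, 3, 5, 1, 9, 2, 7, 4]
mask = [True, True, True, False, True, True, True, True]
dest = [-1] * 8

values, final_mask = gather(memory, base=2, indices=indices, mask=mask, dest=dest)
print(values)
print(any(final_mask))  # False once the gather is complete
```

The loop that this model spells out is what a compiler would otherwise have to emit as a sequence of scalar loads and inserts.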
Other new operations in Intel AVX2 include integer versions of the Permute instructions, and new Broadcast and Blend instructions.
Haswell New Instructions include 4 crypto instructions to speed up public-key and SHA algorithms, and 12 bit manipulation instructions to speed up compression or signal processing algorithms. The bit manipulation instructions perform arbitrary bit field manipulations, leading and trailing zero bit counts, trailing set bit manipulations, improved rotates and arbitrary-precision multiplies. They speed up algorithms that perform bit field extraction and packing, bit-granular encoded data processing (such as universal coding in compression algorithms), arbitrary-precision multiplication and hashing.
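To make these operations concrete, the following sketch models a few of them on Python integers: trailing and leading zero counts, resetting the lowest set bit, and the bit field extract and deposit operations. The function names follow the instruction mnemonics TZCNT, LZCNT, BLSR, PEXT and PDEP, but the code is only a reference model of their results for non-negative operands, written for illustration; it says nothing about how a compiler exposes them.

```python
def tzcnt(value: int, width: int = 64) -> int:
    """Count trailing zero bits; returns width when the value is zero."""
    value &= (1 << width) - 1
    if value == 0:
        return width
    return (value & -value).bit_length() - 1


def lzcnt(value: int, width: int = 64) -> int:
    """Count leading zero bits; returns width when the value is zero."""
    value &= (1 << width) - 1
    return width - value.bit_length()


def blsr(value: int) -> int:
    """Reset (clear) the lowest set bit."""
    return value & (value - 1)


def pext(source: int, mask: int) -> int:
    """Extract the source bits selected by mask and pack them into the low bits."""
    result = 0
    out_bit = 0
    while mask:
        lowest = mask & -mask          # isolate the lowest set bit of the mask
        if source & lowest:
            result |= 1 << out_bit
        out_bit += 1
        mask &= mask - 1               # move on to the next mask bit
    return result


def pdep(source: int, mask: int) -> int:
    """Deposit the low bits of source into the positions selected by mask."""
    result = 0
    in_bit = 0
    while mask:
        lowest = mask & -mask
        if (source >> in_bit) & 1:
            result |= lowest
        in_bit += 1
        mask &= mask - 1
    return result


# Pull the bits at positions 0, 2, 4 and 6 out of a byte, then put them back.
packed = pext(0b11010110, 0b01010101)
print(bin(packed), bin(pdep(packed, 0b01010101)))
```

Each of these models needs a loop or several shifts and masks in software, whereas the corresponding instruction produces the result in a single operation.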
In order to use HNI, you need to use an updated compiler, as shown in the table below:
Table 2: Various Compiler support options for the new instructions
For more details on Intel® C++ Compiler, visit https://software.intel.com/en-us/intel-parallel-studio-xe
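A compiler that can emit the new instructions is one half of enabling; an application that ships a single binary for several processor generations typically also checks at run time that the processor supports them. The sketch below shows one way this could be done on Linux, assuming the flag names `avx2`, `fma`, `bmi1` and `bmi2` that the kernel reports in `/proc/cpuinfo`. The helper names and the sample text are hypothetical and used only for illustration.

```python
REQUIRED_FLAGS = {"avx2", "fma", "bmi1", "bmi2"}


def cpu_flags(cpuinfo_text: str) -> set:
    """Return the feature flags listed on the first 'flags' line of the text."""
    for line in cpuinfo_text.splitlines():
        key, separator, value = line.partition(":")
        if separator and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_features(cpuinfo_text: str, required=REQUIRED_FLAGS) -> list:
    """Return the required flags that the processor does not report, sorted."""
    return sorted(set(required) - cpu_flags(cpuinfo_text))


# Hypothetical, heavily shortened samples of /proc/cpuinfo content.
newer_cpu = "processor\t: 0\nflags\t\t: fpu sse2 avx fma avx2 bmi1 bmi2\n"
older_cpu = "processor\t: 0\nflags\t\t: fpu sse2 avx\n"

print(missing_features(newer_cpu))
print(missing_features(older_cpu))
```

On a real system the text would come from reading `/proc/cpuinfo`; an application would fall back to a non-AVX2 code path when the returned list is not empty.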
The Intel® Xeon® processor E5–2600 V3 product family supports both DDR3 and DDR4 memory. DDR4 can save up to 35% power compared to DDR3 (2 DIMMs per channel) and can bring up to a 50% boost in bandwidth¹.
The power improvements in Intel® Xeon® processor E5–2600 V3 product family include:
- Per core P-states (PCPS)
  - Each core can be programmed to the Operating System (OS) requested P-state
- Uncore frequency scaling (UFS)
  - The uncore frequency is controlled independently of the cores' frequencies
  - Performance is optimized by applying power where it is most needed
- Faster C-states
  - Waking a core out of a C3 or C6 state takes time. That time is shorter on Haswell (HSX), which reduces the overhead of using C-states
- Lower idle power
Contact your Operating System (OS) provider for details on which Operating Systems will support these features.
Some of the new features that come with Grantley platform include:
- Intel® C610 Series Chipset (Wellsburg)
- New virtualization features
- New security features
- Intel® Node Manager 3.0
The Grantley platform comes with Intel® C610 Series Chipset (Wellsburg) as opposed to Intel® C600 chipset (Patsburg) in the previous-generation Romley platform. The C610 chipset improves TDP and average power per package compared to C600.
Table 3 gives a comparison of features of C600 and C610 chipsets.
Table 3: Comparison of Patsburg and Wellsburg features
Virtualization feature improvements in Grantley platform include:
- Virtual Machine Control Structure (VMCS) shadowing
Nested virtualization allows a root Virtual Machine Monitor (VMM) to support guest VMMs. However, the additional Virtual Machine (VM) exits can impact performance. VMCS shadowing directs the guest VMM's VMREAD/VMWRITE instructions to a VMCS shadow structure. This reduces nesting-induced VM exits, so VMCS shadowing increases efficiency by reducing virtualization latency.
Figure 2: VMCS Shadowing
This feature requires VMM enabling. Contact your VM provider to find out when this feature will be supported.
- Cache Monitoring Technology (CMT)
Cache Monitoring Technology (also known as “Noisy Neighbor” management) provides last level cache occupancy monitoring. This allows the VMM to identify cache occupancy at an individual application or VM level. With this information, virtualization software can make better decisions on workload scheduling and migration.
Figure 3: Cache Monitoring on Grantley platform
This feature requires VMM enabling. Contact your VM provider to find out when this feature will be supported.
- Extended Page Table (EPT) Access/Dirty (A/D) bits
In the previous generation Romley platform, Accessed and Dirty bits (A/D bits) are emulated in VMM and accessing them causes VM exits. Grantley implements EPT A/D bits in hardware to reduce VM exits. This enables efficient live migration of Virtual Machines and fault tolerance.
Figure 4: EPT A/D in HW
This feature requires VMM enabling. Contact your VM provider to find out when this feature will be supported.
- VT-x latency reduction
Performance overheads arise from virtualization transition round trips: “exits” from VM to VMM and “entries” from VMM to VM, caused by the handling of privileged instructions. There have been continuing efforts to bring down transition times from platform generation to generation. Grantley reduces the VMM overheads further and increases virtualization performance.
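Of the virtualization features above, the EPT Accessed/Dirty bits lend themselves to a small illustration of why they help live migration. The toy model below is a sketch under simplifying assumptions, not VMM code: each guest page has one entry with an accessed and a dirty flag, the flags are set as a side effect of guest reads and writes (standing in for what the hardware does without a VM exit), and a migration loop periodically collects and clears the dirty flags to decide which pages to copy again. All class and method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EptEntry:
    """One page-table entry with the two status flags discussed above."""
    accessed: bool = False
    dirty: bool = False


class GuestMemory:
    """Toy model of guest pages whose A/D flags are maintained automatically."""

    def __init__(self, num_pages: int):
        self.entries = [EptEntry() for _ in range(num_pages)]

    def read(self, page: int) -> None:
        self.entries[page].accessed = True

    def write(self, page: int) -> None:
        entry = self.entries[page]
        entry.accessed = True
        entry.dirty = True

    def collect_dirty_pages(self) -> list:
        """Return the pages written since the last call and clear their flags."""
        dirty = [i for i, entry in enumerate(self.entries) if entry.dirty]
        for i in dirty:
            self.entries[i].dirty = False
        return dirty


# One pre-copy round of a hypothetical live migration.
guest = GuestMemory(num_pages=8)
guest.write(1)
guest.read(2)
guest.write(5)
first_round = guest.collect_dirty_pages()   # pages to copy again
second_round = guest.collect_dirty_pages()  # nothing written in between
print(first_round, second_round)
```

The point of the model is that the migration loop only scans flags; it does not have to intercept each guest write, which is the source of the VM exits when the bits are emulated in software.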
Security feature enhancements in the Grantley platform include:
- System Management Mode (SMM) external call trap (SECT)
System Management Mode (SMM) is an operating mode in which all normal execution (including the operating system) is suspended, and special separate software (usually firmware or a hardware-assisted debugger) is executed in high-privilege mode. SMM is entered to run handler code due to the SMI (system management interrupt). Without SMM External Call Trap (SECT), SMI Handler can execute code in user memory that can be malicious code. With SECT, SMI Handler can't invoke code in user memory.
Figure 5: SMM external call trap
BIOS-level enabling is required to turn on this feature.
- General Crypto Assist - AVX2, 4th ALU, RORX for hashing
AVX2 (256-bit integer operations, better bit manipulation, finer permute granularity) and the 4th ALU (Arithmetic & Logical Unit) speed up all crypto algorithms. RORX accelerates hash algorithms. Please refer to the Intel® Architecture Instruction Set Extensions Programming Reference for more details on this instruction.
- Asymmetric Crypto Assist – MULX for public key
A new instruction (MULX) improves asymmetric crypto performance and helps with other crypto workloads. Please refer to the Intel® Architecture Instruction Set Extensions Programming Reference for more details on this instruction.
- Symmetric Crypto Assist – AES-NI optimization
Grantley includes enhancements and extensions for symmetric cryptography: Intel® AES-NI and beyond. Please refer to this article to find out more on AES-NI and how to use it.
- PCH-ME Digital Random Number Generator (DRNG)
The Manageability Engine (ME) is an independent and autonomous controller in the platform's architecture. ME requires well secured communication methods given its autonomy and level of access to low level platform mechanisms. Providing the ME with a high quality randomization source is necessary to maximize platform security. PCH-ME DRNG Technology provides real entropy and generates highly unpredictable random numbers for encryption use by ME, isolated from other system resources.
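As a rough illustration of where the RORX and MULX instructions named in the crypto assist items above fit, the sketch below models their results on Python integers. It assumes the standard SHA-256 definition of the Σ0 function (rotations by 2, 13 and 22 bits) as an example of a rotate-heavy hash step, and a word-by-word big-number multiplication as an example of the public-key arithmetic that a widening multiply serves. The function names are hypothetical and the code is a reference model, not an optimized implementation.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def rorx32(value: int, count: int) -> int:
    """Rotate a 32-bit value right by count bits."""
    count %= 32
    value &= MASK32
    return ((value >> count) | (value << (32 - count))) & MASK32


def sha256_big_sigma0(x: int) -> int:
    """SHA-256 Sigma0: three rotations of the same input, XORed together."""
    return rorx32(x, 2) ^ rorx32(x, 13) ^ rorx32(x, 22)


def mulx64(a: int, b: int) -> tuple:
    """Unsigned 64 x 64 -> 128-bit multiply, returned as (high, low) halves."""
    product = (a & MASK64) * (b & MASK64)
    return product >> 64, product & MASK64


def multiply_by_word(limbs: list, word: int) -> list:
    """Multiply a little-endian list of 64-bit limbs by one 64-bit word."""
    result = []
    carry = 0
    for limb in limbs:
        high, low = mulx64(limb, word)
        total = low + carry
        result.append(total & MASK64)
        carry = high + (total >> 64)   # always fits in 64 bits
    result.append(carry)
    return result


def limbs_to_int(limbs: list) -> int:
    """Reassemble a little-endian limb list into a Python integer."""
    return sum(limb << (64 * i) for i, limb in enumerate(limbs))


number = [MASK64, 0x0123456789ABCDEF, 7]   # hypothetical 192-bit operand
word = 0xFEDCBA9876543210
product = multiply_by_word(number, word)
print(limbs_to_int(product) == limbs_to_int(number) * word)
```

In hardware, both instructions leave the arithmetic flags untouched and write to a destination separate from their sources, which makes it easier to interleave them with add-with-carry chains and with other rotations of the same input.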
Grantley comes with the latest version of Intel® Node Manager, 3.0. The improvements in Node Manager 3.0 include:
- Predictive power limiting
- Power throttles engage predictively as system power approaches limit
- Power limit enforced during boot
- "Boot spike" is controlled without complex IT processes or disabling cores
- Power Management for Intel® Xeon Phi™ coprocessor
- Separate power limits & controls for Intel Xeon Phi coprocessor domain and rest-of-platform
- Node Manager Power Thermal Utility (PTU)
- Establishes key power characterization values for CPU and memory domains
- Delivered as firmware
Please visit this link to get more details on Node Manager.
In summary, the Intel Xeon processor E5-2600 V3 product family combined with the Grantley platform provides many new and improved features that could significantly change your performance and power experience on enterprise platforms.
Sree Syamalakumari is a software engineer in the Software & Service Group at Intel Corporation. Sree holds a Master's degree in Computer Engineering from Wright State University, Dayton, Ohio.