The Intel® Xeon® processor E5-2600 v4 product family, code-named Broadwell EP, is a two-socket platform based on Intel’s most recent microarchitecture. Intel uses a “tick-tock” model associated with its generation of processors. This new generation is a “tick” based on 14nm process technology. Major architecture changes take place on a “tock,” while minor architecture changes and a die shrink occur on a “tick.”
Figure 1: “Tick-Tock” model.
In addition to a die shrink, an increase in processor cores, an increase in memory bandwidth, and power enhancements, Broadwell has many new features compared to the previous-generation Haswell EP microarchitecture (Intel® Xeon® processor E5-2600 v3 product family). These features include architecture improvements with a Gather Index Table, Translation Lookaside Buffer (TLB), Instruction Set Architecture (ISA), Floating Point Instructions, and Intel® Transactional Synchronization Extensions (Intel® TSX), as well as new virtualization, cryptographic, and security enhancements.
This paper discusses the new features available in the Intel Xeon processor E5-2600 v4 product family compared to the Intel Xeon processor E5-2600 v3 product family. It also describes what developers need to do to take advantage of these new features.
Figure 2: Overview of the Intel® Xeon® processor E5-2600 v4 product family microarchitecture.
The Intel Xeon processor E5-2600 v4 product family provides up to 22 cores, compared to the 18 cores of its predecessor. Additional improvements include an expanded last level cache (LLC), faster 2400 MHz DDR4 memory, support for 3D LRDIMMs, improved data integrity with detection of DDR4 bus faults during a write, a reduced Thermal Design Power (TDP), hardware-managed power states, the new RDSEED instruction, end-to-end data protection for transaction layer packets in the PCIe* I/O subsystem, new virtualization technologies, and more.
Table 1: Generational comparison of the Intel® Xeon® processor E5-2600 v4 product family to the Intel® Xeon® processor E5-2600 v3 product family.
The rest of this paper discusses some of the new features in the Intel Xeon processor E5-2600 v4 product family. These features provide additional performance improvements, new capabilities, security enhancements, and virtualization enhancements.
Table 2: New features and technologies of the Intel® Xeon® processor E5-2600 v4 product family.
Intel AVX workloads run at a lower base and maximum turbo frequency than non-Intel AVX workloads. On Haswell, when a workload mixes Intel AVX and non-Intel AVX code, all the cores on a processor socket are limited to the lower processor frequencies of the Intel AVX workload. Broadwell improves this situation by allowing the non-Intel AVX code in a mixed workload to run at its optimum processor frequency.
Figure 3: Processor frequency comparisons for non-Intel® Advanced Vector Extensions (Intel® AVX), Intel AVX, and mixed workloads.
Broadwell introduces several improvements to floating point operations, including a reduction in latency for the vector floating point multiply operations MULPS and MULPD from five cycles to three cycles. There have also been latency reductions for the floating point divide operations DIVSS, DIVSD, DIVPS, and DIVPD. This benefits workloads that require precision when dividing large floating point numbers, such as some financial and scientific calculations. The latency reduction is possible because the divider has been enlarged to a Radix-1024 design, which computes 10 bits in each step.
A new split scalar operation allows scalar divides to be split into two segments and processed simultaneously, improving throughput. See below for a multi-generation comparison of the benefit of the split operation on Broadwell versus Nehalem (Intel® Xeon® Processor 5500 Series), Sandy Bridge (Intel® Xeon® processor E5-2600 product family), Ivy Bridge (Intel® Xeon® processor E5-2600 v2 product family), and Haswell (Intel® Xeon® processor E5-2600 v3 product family).
Table 3: Generational comparison of latency and throughput (cycles) for floating point divide operations.
No recompilation is required to take advantage of these enhancements, allowing immediate benefits for existing code that already utilizes these types of operations. The Intel® Compiler 14.1+ and GCC 4.7+ support these instructions for those who want to gain access to additional benefits provided by Broadwell.
Second-level TLB (STLB) improvements include an increase in size from 1K to 1.5K entries and a native 16-entry array that handles 1 GB page translations, which helps in situations with large code or data footprints that have locality. The Branch Prediction Unit Target Array has been increased from 8 ways to 10, along with other improvements that help with address prediction for branches and returns. Included is a “bottomless” return stack buffer (RSB), which uses the indirect predictor to predict the return address if the return stack underflows. Lastly, store-to-load forwarding benefits from an increase in the size of the out-of-order scheduler from 60 entries to 64.
Instruction Set Architecture (ISA) changes include micro-op reductions for several instructions, which speed up cryptography. The ADC, CMOV, and PCLMULQDQ instructions have each been reduced to one micro-op. The ADC instruction is helpful when emulating large-number arithmetic, the CMOV instruction performs a conditional move, and the PCLMULQDQ instruction helps with cryptography and hashing. The VCVTPS2PH (mem form) instruction has been reduced from 4 micro-ops to 3 micro-ops.
No recompilation is required to take advantage of these enhancements, allowing immediate benefits for existing code that already utilizes these types of operations. The Intel Compiler 14.1+ and GCC 4.7+ support these instructions for those who want to gain access to additional benefits provided by Broadwell.
Broadwell adds a gather index table (GIT) in hardware to improve the performance of gather operations. The GIT provides storage for full-width indices near the address generation unit. A special load grabs the correct index, simplifying the index handling, and loaded elements are merged directly into the destination. Compared to the previous generation of silicon, these improvements cut the number of micro-ops by approximately 60 percent, and the reduced overhead can lower the latency of the gather operation by approximately 60 percent.
No recompilation is required to take advantage of this new feature, allowing immediate benefits for existing code that already utilizes these types of operations. The Intel Compiler 14.1+ and GCC 4.7+ support these instructions for those who want to gain access to additional benefits provided by Broadwell.
Figure 4: Gather index table conceptual block diagram.
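To make the gather operation described above concrete, here is a minimal scalar sketch of what a gather computes: each destination element is loaded from the base array at the position given by the matching index. The function name gather_f32 is hypothetical and chosen only for illustration. On AVX2 hardware the same result for eight floats at a time is produced by the _mm256_i32gather_ps intrinsic, and it is that vector form that the GIT accelerates.

```c
#include <stddef.h>

/*
 * Scalar reference for a gather: dst[i] = base[index[i]].
 * gather_f32 is a hypothetical helper name, not an Intel API.
 * The caller must ensure every index is within the bounds of base.
 */
void gather_f32(float *dst, const float *base, const int *index, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = base[index[i]];   /* one indexed load per element */
    }
}
```

Because the indices are arbitrary, the loads cannot be replaced by one contiguous vector load, which is why the index handling cost matters.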
This technology was previously introduced on the Intel® Xeon® processor E7 v3 family and is now available on the Intel Xeon processor E5-2600 v4 product family. Intel TSX provides a set of instruction set extensions that allow programmers to specify regions of code for transactional synchronization. Programmers can use these extensions to achieve the performance of fine-grain locking while actually programming using coarse-grain locks.
Intel TSX provides two software interfaces. The first, called Hardware Lock Elision (HLE), is a legacy-compatible instruction set extension (comprising the XACQUIRE and XRELEASE prefixes) that is used to specify transactional regions. HLE is compatible with the conventional lock-based programming model. Software written using the HLE hints can run on both legacy hardware without Intel TSX and new hardware with Intel TSX. The second, called Restricted Transactional Memory (RTM), is a new instruction set interface (comprising the XBEGIN, XEND, and XABORT instructions) that allows programmers to define transactional regions in a more flexible manner than is possible with HLE. Unlike the HLE extensions, but just like most new instruction set extensions, the RTM instructions will generate an undefined instruction exception (#UD) on older processors that do not support RTM. RTM also requires the programmer to provide an alternate code path for a transactional execution that is not successful.
Figure 5: Lock boundaries for critical sections of code for a given thread and how the lock appears free throughout from the perspective of the hash table.
For an overview on Intel TSX, see Transactional Synchronization in Haswell. The Intel® Architecture Instruction Set Extensions Programming Reference describes these extensions in detail and outlines various programming considerations to get the most out of them.
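Because RTM instructions raise #UD on processors without RTM, software normally checks for support before taking the transactional path. The sketch below decodes the HLE and RTM feature bits from the EBX value of CPUID leaf 07H, subleaf 0 (bits 4 and 11 respectively, per the SDM). The helper names are hypothetical, and the optional reader function assumes a GCC-compatible compiler that ships <cpuid.h>.

```c
#include <stdbool.h>
#include <stdint.h>

/* Feature bits in CPUID.(EAX=07H, ECX=0):EBX */
#define CPUID7_EBX_HLE (UINT32_C(1) << 4)
#define CPUID7_EBX_RTM (UINT32_C(1) << 11)

/* Hypothetical helpers: decode an EBX value that was read elsewhere. */
bool ebx_has_hle(uint32_t ebx) { return (ebx & CPUID7_EBX_HLE) != 0; }
bool ebx_has_rtm(uint32_t ebx) { return (ebx & CPUID7_EBX_RTM) != 0; }

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>

/*
 * Reads leaf 7, subleaf 0 on the running CPU.
 * Returns false when the processor does not report leaf 7 at all.
 */
bool read_cpuid7_ebx(uint32_t *ebx_out)
{
    unsigned int eax, ebx, ecx, edx;

    if (__get_cpuid_max(0, 0) < 7) {
        return false;
    }
    __cpuid_count(7, 0, eax, ebx, ecx, edx);
    (void)eax; (void)ecx; (void)edx;   /* only EBX is needed here */
    *ebx_out = ebx;
    return true;
}
#endif
```

A caller would take the RTM path only when ebx_has_rtm reports true and keep the conventional lock as the fallback path in every case, since a transaction can abort even on supporting hardware.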
Virtual Technology Enhancements with Cache Monitoring Technology, Cache Allocation Technology, and Memory Bandwidth Monitoring
Intel® Resource Director Technology (Intel® RDT) is a set of technologies designed to help monitor and manage shared resources. See Optimize Resource Utilization with Intel® Resource Director Technology for an animation illustrating the key principles behind Intel RDT. Haswell introduced a new Cache Monitoring Technology (CMT) feature. Broadwell provides further expansion of virtual technology with Cache Allocation Technology (CAT), Memory Bandwidth Monitoring (MBM), and Code and Data Prioritization (CDP). These new features help to address the lack of hardware support for the operating system or the Virtual Machine Manager (VMM) to deal with shared resources on the server. Chapters 17.15 and 17.16 in volume 3 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual (SDM) cover programming details on CMT, CAT, MBM, and CDP.
Figure 6: Cache and memory bandwidth monitoring and enforcement vectors.
Cache Monitoring Technology allows for monitoring of the Last Level Cache on a per-thread, application, or virtual machine (VM) basis. Misbehaving threads can be isolated to increase performance. On Haswell, the information gleaned via CMT could be used by a scheduler to migrate a problematic thread, application, or VM. With Broadwell, CAT makes this process easier. For more detailed information, see Chapter 17.15 in volume 3 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual (SDM). Using this feature requires enabling at the OS or VMM level, and the Intel® Virtualization Technology (Intel® VT) for IA-32, Intel® 64 and Intel® Architecture (Intel® VT-x) feature must be enabled at the BIOS level. For instructions on setting Intel VT-x, refer to your OEM BIOS guide.
Figure 7: Generational comparison with and without Cache Allocation Technology.
Cache Allocation Technology allows the OS to specify how much cache space an application can utilize on a per-thread, application, or VM basis allowing the VMM or OS scheduler to make changes based on policy enforcement. This feature can be beneficial in a multi-tenant environment when a VM is causing a lot of thrash with the cache. The VMM or OS can migrate this “noisy neighbor” to a different location where it may have less of an impact on other VMs. CAT introduces a new capability to manage the processor LLC based on pre-defined levels of service, independent of the OS or VMM. A QoS mask can be used to provide 16 different levels of enforcement to limit the amount of cache that a thread can consume. The CPUID function is used for enumeration of cache allocation functionality. IA32_L3_QOS_MASK_n is a model-specific register used to configure the class of service. IA32_PQR_ASSOC is a model-specific register used to associate a core/thread/application with a configuration. For more detailed information see Chapter 17.16 in volume 3 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual (SDM). Using this feature requires enabling at the OS or VMM level, and the Intel VT-x feature must be enabled at the BIOS level. For instructions on setting Intel VT-x, refer to your OEM BIOS guide.
Figure 8: Generation-to-generation comparison with and without Cache Allocation Technology.
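The class-of-service masks described above are capacity bitmasks in which each set bit grants a portion of the LLC. As a rough sketch, the helpers below build and validate such a mask, assuming the SDM rule that the set bits must be contiguous and that the mask length is enumerated through CPUID. The function names are hypothetical; the MSR addresses 0xC8F (IA32_PQR_ASSOC) and 0xC90 (IA32_L3_QOS_MASK_0) are the architectural values as I understand them, and actually writing the registers requires ring 0 or an OS interface, which is not shown.

```c
#include <stdbool.h>
#include <stdint.h>

#define IA32_PQR_ASSOC     0xC8Fu  /* associates a logical processor with a COS */
#define IA32_L3_QOS_MASK_0 0xC90u  /* capacity bitmask for COS 0; COS n follows */

/*
 * Build a mask of n_ways contiguous bits starting at bit first_way.
 * cbm_len is the mask length reported by the processor (assumed <= 32).
 * Returns 0 when the request does not fit.
 */
uint32_t cat_make_cbm(unsigned first_way, unsigned n_ways, unsigned cbm_len)
{
    if (n_ways == 0 || cbm_len == 0 || cbm_len > 32 ||
        first_way >= cbm_len || n_ways > cbm_len - first_way) {
        return 0;
    }
    uint32_t ones = (n_ways == 32) ? UINT32_MAX
                                   : ((UINT32_C(1) << n_ways) - 1);
    return ones << first_way;
}

/* A mask is usable when it is non-zero, in range, and contiguous. */
bool cat_cbm_is_valid(uint32_t cbm, unsigned cbm_len)
{
    if (cbm == 0 || cbm_len == 0 || cbm_len > 32) {
        return false;
    }
    if (cbm_len < 32 && (cbm >> cbm_len) != 0) {
        return false;                  /* bits set beyond the mask length */
    }
    while ((cbm & 1u) == 0) {
        cbm >>= 1;                     /* drop trailing zero bits */
    }
    return (cbm & (cbm + 1u)) == 0;    /* remaining bits must form 2^k - 1 */
}

/* MSR address holding the mask for a given class of service (CAT only). */
uint32_t cat_mask_msr(unsigned cos)
{
    return IA32_L3_QOS_MASK_0 + cos;
}
```

For example, with an assumed 20-bit mask, cat_make_cbm(0, 4, 20) gives a noisy neighbor the four lowest portions, while cat_make_cbm(4, 16, 20) reserves the rest for other classes.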
Memory Bandwidth Monitoring enables the OS or VMM to monitor memory bandwidth on a per-core or per-thread basis, allowing the OS or VMM to make scheduling decisions. An example of this situation is when one core is being heavily utilized by two applications, while another core is being underutilized by two other applications. With memory bandwidth monitoring, the OS or VMM can schedule a VM or an application to a different core to balance out memory bandwidth utilization. In Figure 9, high memory bandwidth applications are competing for the same resource. The OS or VMM can move one of the high memory bandwidth applications to another resource to balance out the load. For more detailed information, see Chapter 17.15 in volume 3 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual (SDM). Using this feature requires enabling at the OS or VMM level, and the Intel VT-x feature must be enabled at the BIOS level. For instructions on setting Intel VT-x, refer to your OEM BIOS guide.
Figure 9: Generation-to-generation comparison with and without Memory Bandwidth Monitoring.
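A scheduler that uses MBM typically samples a monitoring counter twice and converts the difference into a bandwidth figure. The following sketch shows that arithmetic under stated assumptions: the counter is a free-running value of counter_bits width that may wrap once between samples, and upscale is the factor the processor reports for converting counter units to bytes (enumerated through CPUID leaf 0FH, as I understand the SDM). The function name and parameters are hypothetical, and reading the counter itself is not shown.

```c
#include <stdint.h>

/*
 * Convert two samples of a monitoring counter into bytes per second.
 * prev, curr    raw counter samples
 * counter_bits  width of the hardware counter (assumption, 1..63)
 * upscale       bytes represented by one counter unit (assumption)
 * seconds       time between the two samples
 * Returns 0.0 for invalid arguments.
 */
double mbm_bandwidth_bytes_per_sec(uint64_t prev, uint64_t curr,
                                   unsigned counter_bits,
                                   uint32_t upscale, double seconds)
{
    if (seconds <= 0.0 || counter_bits == 0 || counter_bits > 63) {
        return 0.0;
    }
    uint64_t mask  = (UINT64_C(1) << counter_bits) - 1;
    uint64_t delta = (curr - prev) & mask;   /* tolerates a single wrap */

    return (double)delta * (double)upscale / seconds;
}
```

Sampling often enough that the counter cannot wrap more than once per interval keeps the masked subtraction valid.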
Code and Data Prioritization technology is an extension of CAT. CDP enables isolation and separate prioritization of code and data fetches to the L3 cache in a software configurable manner, which can enable workload prioritization and tuning of cache capacity to the characteristics of the workload. CDP extends CAT by providing separate code and data masks per Class of Service (COS). This can assist with optimizing the relationship between the last level cache and a given workload. For more detailed information, see Chapter 17.16.2 in volume 3 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual (SDM). Using this feature requires enabling at the OS or VMM level, and the Intel VT-x feature must be enabled at the BIOS level. For instructions on setting Intel VT-x, refer to your OEM BIOS guide.
Figure 10: Capacity bitmasks allow for separation of the code and data.
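The separate code and data masks per class of service can be pictured as pairs of mask registers. The sketch below assumes the layout I recall from the SDM, in which enabling CDP re-purposes the L3 mask registers so that class n uses register 2n for data and register 2n+1 for code; treat that mapping, and the hypothetical helper names, as assumptions to verify against the SDM chapter cited above.

```c
#include <stdbool.h>
#include <stdint.h>

#define IA32_L3_QOS_MASK_0 0xC90u  /* first L3 capacity mask register */

/* With CDP enabled (assumed layout): data mask of class cos. */
uint32_t cdp_data_mask_msr(unsigned cos)
{
    return IA32_L3_QOS_MASK_0 + 2u * cos;
}

/* With CDP enabled (assumed layout): code mask of class cos. */
uint32_t cdp_code_mask_msr(unsigned cos)
{
    return IA32_L3_QOS_MASK_0 + 2u * cos + 1u;
}

/* True when code and data fetches are given fully separate cache portions. */
bool cdp_masks_isolated(uint32_t code_mask, uint32_t data_mask)
{
    return (code_mask & data_mask) == 0;
}
```

Overlapping masks are also legal; isolation is a tuning choice for workloads whose code footprint would otherwise be evicted by streaming data.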
ADCX (unsigned integer add with carry) and ADOX (unsigned integer add with overflow) have been introduced for Asymmetric Crypto Assist¹, in addition to faster ADC/SBB instructions (no recompilation is required for the ADC/SBB benefits). ADCX and ADOX are extensions of the ADC (add with carry) instruction for use in large integer arithmetic on values wider than 64 bits. ADCX reads and writes only the carry flag, while ADOX uses the overflow flag as its carry, so two carry chains can be in flight at the same time, which is where the performance improvement comes from. ADOX/ADCX can be combined with MULX for additional performance improvements with public key encryption such as RSA. Large integer arithmetic is also used for Elliptic Curve Cryptography (ECC) and Diffie-Hellman (DH) Key Exchange. Beyond cryptography, there are many use cases in complex research and high-performance computing. The demand for this functionality is high enough to warrant a number of commonly used optimized libraries, such as the GNU Multi-Precision (GMP) library (used, for example, by Mathematica); see New Instructions Supporting Large Integer Arithmetic on Intel® Architecture Processors. To take advantage of these new instructions, you need to obtain a new software library and recompile (Intel Compiler 14.1+ and GCC 4.7+). For more information about these instructions, see the Intel® 64 and IA-32 Architectures Developer’s Manual.
¹ Intel® processors do not contain crypto algorithms, but support math functionality that accelerates the sub-operations.
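The carry chain that these instructions accelerate is easiest to see in a multi-limb addition. The portable sketch below adds two large integers stored as arrays of 64-bit limbs, least-significant limb first, and propagates the carry by hand. The name bigint_add is hypothetical. Whether a compiler turns this loop into ADC, ADCX, or ADOX depends on the compiler and its flags, so this illustrates the dependency chain rather than guaranteeing any particular instruction.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * r = a + b over n 64-bit limbs, least-significant limb first.
 * Returns the carry out of the most significant limb (0 or 1).
 * bigint_add is a hypothetical helper, not a library function.
 */
unsigned bigint_add(uint64_t *r, const uint64_t *a, const uint64_t *b, size_t n)
{
    unsigned carry = 0;

    for (size_t i = 0; i < n; i++) {
        uint64_t s  = a[i] + carry;
        unsigned c1 = (s < a[i]);   /* wrapped while adding the carry-in */
        uint64_t t  = s + b[i];
        unsigned c2 = (t < s);      /* wrapped while adding b[i] */

        r[i]  = t;
        carry = c1 | c2;            /* at most one of the two can be set */
    }
    return carry;
}
```

Each iteration depends on the carry from the previous one. In a multiplication, two such chains (one for the low halves and one for the high halves of the partial products) run side by side, which is the case the two independent flags of ADCX and ADOX are meant to serve.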
The RDSEED instruction is intended for seeding a Pseudorandom Number Generator (PRNG) of arbitrary width, which can be useful when you want to create stronger cryptography keys. If you do not need to seed another PRNG, use the RDRAND instruction. For more information see Table 4, Figure 11, and The Difference Between RDRAND and RDSEED.
Table 4: RDSEED and RDRAND compliance and source information.
Instruction | Source | NIST compliance
RDRAND | Cryptographically secure pseudorandom number generator | SP 800-90A
RDSEED | Non-deterministic random bit generator | SP 800-90B & C (drafts)
Figure 11: RDSEED and RDRAND conceptual block diagram.
The Intel® Compiler 15+ and GCC 4.8+ support RDSEED.
RDSEED loads a hardware-generated random value and stores it in the destination register. The random value is generated from an Enhanced NRBG (Non-deterministic Random Bit Generator) that is compliant with NIST SP 800-90B and NIST SP 800-90C in the XOR construction mode.
In order for the hardware design to meet its security goals, the random number generator continuously tests itself and the random data it is generating. The self-test hardware detects runtime failures in the random number generator circuitry or statistically anomalous data occurring by chance and flags the resulting data as bad. In such extremely rare cases, the RDSEED instruction will return no data instead of bad data.
Intel C/C++ Compiler Intrinsic Equivalent:
RDSEED: int _rdseed16_step( unsigned short * );
RDSEED: int _rdseed32_step( unsigned int * );
RDSEED: int _rdseed64_step( unsigned __int64 * );
As with RDRAND, RDSEED will avoid any OS- or library-enabling dependencies and can be used directly by any software at any protection level or processor state.
For more information, see the description of RDSEED in the Intel® 64 and IA-32 Architectures Software Developer’s Manual (SDM).
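Because RDSEED can return no data, callers are expected to check the result and retry. The sketch below shows one possible bounded retry loop. To keep it portable it takes the step function as a parameter; the names seed_with_retry, seed_step_fn, and demo_flaky_step are hypothetical, and the demo function is only a stand-in that fails twice before succeeding. The one property taken from the intrinsics is their return convention: 1 when a value was stored, 0 when none was available.

```c
/* Signature compatible with a wrapper around _rdseed64_step. */
typedef int (*seed_step_fn)(unsigned long long *out);

/*
 * Call step until it reports success or max_tries attempts were made.
 * Returns 1 and stores the seed in *out on success, 0 otherwise.
 */
int seed_with_retry(seed_step_fn step, unsigned long long *out,
                    unsigned max_tries)
{
    for (unsigned i = 0; i < max_tries; i++) {
        if (step(out)) {
            return 1;
        }
        /* Production code might pause or yield here before retrying. */
    }
    return 0;
}

/* Stand-in for the hardware: fails on the first two calls, then succeeds. */
unsigned demo_calls = 0;

int demo_flaky_step(unsigned long long *out)
{
    demo_calls++;
    if (demo_calls < 3) {
        return 0;            /* mimics "no data available" */
    }
    *out = 0x1234ULL;        /* fixed value, for demonstration only */
    return 1;
}

/*
 * On RDSEED-capable hardware, a wrapper such as
 *     int hw_step(unsigned long long *out) { return _rdseed64_step(out); }
 * (with <immintrin.h> and RDSEED enabled in the compiler) could be passed
 * to seed_with_retry in place of demo_flaky_step.
 */
```

A bounded loop matters because a caller that spins forever on a failing generator turns a rare hardware condition into a hang; returning 0 lets the caller decide how to proceed.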
Supervisor Mode Access Protection (SMAP) is a new CPU-based mechanism for user-mode address-space protection. It extends the protection that previously was provided by Supervisor Mode Execution Prevention (SMEP). SMEP prevents supervisor mode execution from user pages, while SMAP prevents unintended supervisor mode accesses to data on user pages. There are legitimate instances where the OS needs to access user pages, and SMAP does provide support for those situations.
Figure 12: SMAP conceptual diagram.
The ability to maintain the coherence of shared resource data stored in multiple caches has become more difficult over time with the evolving complexity of the microarchitecture. Home agents and cache agents work together to maintain coherence of the memory data and cache lines between processor sockets.
Broadwell offers four different snoop modes: a reintroduction of Home Snoop with Directory and Opportunistic Snoop Broadcast (HS with DIR + OSB), previously available on Ivy Bridge, and three snoop modes that were available on Haswell: Early Snoop, Home Snoop, and Cluster on Die Mode (COD). Table 5 maps the memory bandwidth and latency trade-offs across the different modes. Most workloads will find that Home Snoop with Directory and Opportunistic Snoop Broadcast is the best choice.
Table 5: Comparison of Snoop Mode Characteristics. Higher is better for memory bandwidth. Lower is better for memory latency.
*Depends on the directory state. Clean directory – low latency; Dirty directory – high latency
+Local latencies are snoop bound
Early Snoop mode always uses the cache agent to generate the snoop request. The request is broadcast to the other cache agents, which creates a lot of traffic. Although memory latency is better compared to Home Snoop mode, the memory bandwidth is worse due to the amount of broadcast traffic.
Home Snoop mode always uses the home agent for the memory controller to generate the snoop request. This method creates higher local memory latencies, but because the snoop is not being broadcast by the caching agent there is less snoop traffic. This means there is more available memory bandwidth as compared to Early Snoop mode.
Home Snoop with Directory and Opportunistic Snoop Broadcast combines multiple features and generally will be the best snoop mode for most workloads. Each home agent has a small cache that holds the directory state of migratory cache lines. The home agent will speculatively snoop the remote socket in parallel with the directory read. This enables low cache-to-cache latencies, low memory latencies and higher memory bandwidth. It is also used to minimize directory lookup overhead for non-temporal writes.
Cluster on Die Mode (COD) was introduced with Haswell and is found on processors with 10 or more cores. COD supplies two home agents that provide a second NUMA node per processor socket. For highly optimized NUMA workloads where latency is more important than sharing data across the caching agents, COD can help reduce average latencies for LLC hits and local memory accesses. Because the hardware threads are split between two home agents, this can also lead to higher memory bandwidth. The affinity decisions based on the number of NUMA nodes are owned by the OS or VMM.
Refer to your OEM BIOS guide for instructions on setting this feature. Typically it will be a selectable option in the Advanced CPU or QPI menus of your BIOS.
Posted Interrupts enable efficient co-migration of interrupts with virtual processors, avoiding the need for a VM-exit. When a sequence of external interrupts is sent to the VM, they are treated like a posted write and stored in memory. Since posted interrupts are directly supported by the hardware, there is a reduction in the number of VM-exits that occur as compared to using software to resolve the interrupt. Posted interrupts are also complementary to APIC Virtualization, further improving virtual-interrupt performance.
Figure 13: Comparison of software-based interrupt handling against APIC Virtualization, which was introduced on Ivy Bridge, and lastly with Broadwell, which provides additional support with posted interrupt support in the hardware.
Refer to your OEM BIOS guide for instructions on setting the Intel VT-x feature in your BIOS. Contact your VMM provider to verify support of this feature.
In Haswell, the EPT Accessed and Dirty bits were implemented in the hardware to reduce the number of VM-exits. This enabled more efficient live migration of VMs and fault tolerance. Broadwell has added Page Modification Logging to keep track of these events. This can help reduce overhead through rapid checkpointing on fault-tolerance-based VMs. It can also help maintain availability for critical workloads, while providing prioritization in a mixed workload environment.
Refer to your OEM BIOS guide for instructions on setting the Intel VT-x feature in your BIOS. Contact your VMM provider to verify support of this feature.
Broadwell introduces Hardware Power Management (HWPM), a new optional processor power management feature in the hardware that liberates the OS from making decisions about processor frequency. HWPM allows the platform to provide information on all available constraints, allowing the hardware to choose the optimal operating point. Operating independently, the hardware uses information that is not available to software and is able to make a more optimized decision regarding the P-states and C-states. For example, the performance profile improves latency response when demand changes, while the energy profile delivers optimal energy efficiency and potentially provides power savings (see Figure 14).
Figure 14: Comparison of performance and power of HWPM versus without HWPM.
Note: Performance and power differences between HWPM and without HWPM may vary based on workload.
Refer to your OEM BIOS guide for instructions on setting this feature. Typically it will be a selectable option in the power profile menu of the BIOS labeled HWPM OOB.
Intel® Processor Trace (Intel® PT) is an exciting feature supported on Broadwell that can be enormously helpful in debugging, because it exposes an accurate and detailed trace of activity with triggering and filtering capabilities to help with isolating the tracing that matters.
Intel PT provides the context around all kinds of events. Performance profilers can use Intel PT to discover the root causes of “response-time” issues—performance issues that affect the quality of execution, if not the overall runtime.
Further, the complete tracing provided by Intel PT enables a much deeper view into execution than has previously been commonly available; for example, loop behavior, from entry and exit down to specific backedges and loop tripcounts, is easy to extract and report.
Debuggers can use it to reconstruct the code flow that led to the current location, whether this is a crash site, a breakpoint, a watchpoint, or simply the instruction following a function call that was just stepped over. They may even allow navigating in the recorded execution history via reverse stepping commands.
Another important use case is debugging stack corruptions. When the call stack has been corrupted, normal frame unwinding usually fails or may not produce reliable results. Intel PT can be used to reconstruct the stack back trace based on actual CALL and RET instructions.
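The idea of rebuilding a back trace from recorded CALL and RET instructions can be sketched in a few lines. The code below assumes the raw trace has already been decoded into a simple list of call and return events; the branch_event structure and the function name are hypothetical and do not reflect the actual Intel PT packet format, which requires a dedicated decoder.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoded trace event: a CALL with its target, or a RET. */
typedef enum { EV_CALL, EV_RET } ev_kind;

typedef struct {
    ev_kind  kind;
    uint64_t target;   /* entry address of the callee; unused for EV_RET */
} branch_event;

/*
 * Replay the events in order and rebuild the active call chain.
 * stack[] receives callee addresses, outermost first, up to cap entries.
 * Returns the number of valid entries written.
 */
size_t rebuild_call_stack(const branch_event *ev, size_t n,
                          uint64_t *stack, size_t cap)
{
    size_t depth = 0;

    for (size_t i = 0; i < n; i++) {
        if (ev[i].kind == EV_CALL) {
            if (depth < cap) {
                stack[depth] = ev[i].target;
            }
            depth++;
        } else if (depth > 0) {
            depth--;       /* matching return pops the innermost frame */
        }
        /* A RET at depth 0 means tracing began inside a function; skip it. */
    }
    return depth < cap ? depth : cap;
}
```

Because the result is derived from control flow that actually executed, it does not depend on the contents of the possibly corrupted stack memory.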
Operating systems could include Intel PT into core files. This would allow debuggers to not only inspect the program state at the time of the crash, but also to reconstruct the control flow that led to the crash. It is also possible to extend this to the whole system to debug kernel panics and other system hangs. Intel PT can trace globally so that when an OS crash occurs, the trace can be saved as part of an OS crash dump mechanism and then used later to reconstruct the failure.
Intel PT can also help to narrow down data races in multi-threaded operating systems and user program code. It can log the execution of all threads with a rough time indication. While it is not precise enough to detect data races automatically, it can give enough information to aid in the analysis.
To use Intel PT, you need Intel® VTune™ Amplifier version 2015 Update 1 or later.
For more information see Debug and fine-grain profiling with Intel processor trace given by Beeman Strong, Senior and Processor tracing by James Reinders.
Intel® Node Manager
Intel® Node Manager is a core set of power management features that provide a smart way to optimize and manage power, cooling, and compute resources in the data center. This server management technology extends component instrumentation to the platform level and can be used to make the most of every watt consumed in the data center. First, Intel Node Manager reports vital platform information, such as power, temperature, and resource utilization using standards-based, out-of-band communications. Second, it provides fine-grained controls to limit platform power in compliance with IT policy. This feature can be found across Intel’s product segments, including Broadwell, providing consistency within the data center.
To use this feature you must enable the BMC LAN and the associated BMC user configuration at the BIOS level, which should be available under the server management menu. The Programmer’s Reference Kit is simple to use and requires no additional external libraries to compile or run. All that is needed is a C/C++ compiler and to then run the configuration and compilation scripts.
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804