Intel® Xeon® Processor E5-2600/4600 Product Family Technical Overview

The Intel® Xeon® processor E5-2600/4600 product family is based on Sandy Bridge EP microarchitecture which is an evolution of the Nehalem EP microarchitecture. The platform supporting the Intel® Xeon® processor E5-2600/4600 product family is named “Romley.”

Contents

1. Executive Summary

2. Introduction

3. The Intel Xeon processor E5-2600/4600 product family Enhancements

3.1 Major Processor Core Enhancements

3.1.1 Front End Enhancements

3.1.2 Out of Order/Execution Enhancements

3.1.3 Memory Cluster Enhancements

3.2 Intel® Advanced Vector Extensions (Intel® AVX)

3.3 Scalable Ring/Cache Architecture

3.4 Intel® Turbo Boost Technology 2.0 (Intel Turbo 2.0)

4. Romley Platform Enhancements

4.1 Intel® C600 Chipset

4.2 PCI Express* 3.0

4.3 Intel® Node Manager 2.0 (Node Manager 2.0)

4.4 Security Technologies Enhancements

4.5 Intel® Virtualization Technology Enhancements (Intel® VT)

5. Conclusion

6. About the Author

1. Executive Summary

The Intel® Xeon® processor E5-2600 product family, formerly codenamed “Sandy Bridge EP”, is a 2-socket platform based on Intel’s most recent microarchitecture. This product brings additional capabilities for data centers: more cores, more memory, more integration and more bandwidth. As a result, platforms based on the Intel Xeon processor E5-2600 product family will yield up to 80% improvement in performance and up to 50% improvement in power efficiency compared to the previous generation Intel Xeon processor 5600 series. The Intel Xeon processor E5-4600 product family is the 4-socket version of the E5-2600 product family, with additional sockets and memory. Platforms based on the Intel Xeon processor E5-4600 product family will yield up to 80% improvement in performance compared to the E5-2600 product family.

2. Introduction

The Intel Xeon processor E5-2600/4600 product family is based on Sandy Bridge EP microarchitecture which is an evolution of the Nehalem EP microarchitecture. The platform supporting the Intel Xeon processor E5-2600/4600 product family is named “Romley”.

Figure 1: Intel® Xeon® processor E5-2600 product family

Figure 2: Intel® Xeon® processor E5-4600 product family

The first half of this paper discusses the new features available in the Intel Xeon processor E5-2600/4600 product family cores. The second half discusses the new features available with the Romley platform. Each section explains what a developer has to do to use the feature to improve application performance.

3. The Intel Xeon processor E5-2600/4600 product family Enhancements

Some of the new features that come with the Intel Xeon processor E5-2600/4600 product family include:

  1. 32 nm process technology
  2. Intel® Advanced Vector Extensions (Intel® AVX)
  3. Intel® Turbo Boost Technology 2.0
  4. High Bandwidth Last Level Cache
  5. High Bandwidth/Low Latency modular on-die Ring Interconnect
  6. Integrated Memory Controller with 4 channel DDR3
  7. CPU and PCI Express* integrated on single chip

Figure 3: The Intel® Xeon® processor E5-2600/4600 product family Microarchitecture

Figure 3 is a block diagram of the Intel Xeon processor E5-2600/4600 product family microarchitecture showing some of the new features. The Intel Xeon processor E5-2600/4600 product family comes with up to 8 cores (compared to 6 cores in its predecessor, the Intel Xeon processor 5600 series), which bring additional computing power, and includes features such as an on-die interconnect, greater socket-to-socket bandwidth, and higher cache bandwidth. With all the power savings that come with the new microarchitecture, this platform delivers more performance at the same Thermal Design Power (TDP). Note that except for Intel AVX, a developer can make use of most of these enhancements without making any changes to their software applications.

Table 1 shows a comparison of the Intel Xeon processor E5-2600/4600 product family features compared to the predecessor Intel Xeon processor 5600 series.

| Feature | Intel® Xeon® 5600 Series | Intel® Xeon® E5-2600/4600 Product Family |
| --- | --- | --- |
| Cores | Up to 6 cores / 12 threads | Up to 8 cores / 16 threads |
| Physical Addressing | 40b (Uncore limited) | 46b (Core and Uncore) |
| Cache Size | Up to 12 MB | Up to 20 MB |
| L3 Cache bandwidth | 32 bytes/clock | 192 bytes/clock |
| Max Memory Channels per Socket | 3 | 4 |
| Max Memory Speed | 1333 MT/s | 1600 MT/s |
| Virtualization Technology | Adds Real Mode support and transition latency reduction | Adds Large VT pages |
| New Instructions | Intel® AES-NI, PCLMULQDQ | Adds Intel® AVX |
| QPI frequency | 6.4 GT/s | 8.0 GT/s |
| Server/Workstation TDP | Server/Workstation: 130W, 95W, 80W, LV (Low Power) | 150W (E5-2600 Workstation Only); 135W (E5-2600), 130W, 115W, 95W, 80W, LV (Low Power) |

Table 1: Comparison of the Intel® Xeon® processor 5600 series and the Intel® Xeon® processor E5-2600/4600 product family

The rest of this paper discusses some of the main enhancements in this processor/platform.

3.1 Major Processor Core Enhancements

The main microarchitecture improvements in the Intel Xeon processor E5-2600/4600 product family core include:

  • Front end enhancements
  • Out of Order/Execution enhancements
  • Memory cluster enhancements

3.1.1 Front End Enhancements

The front end unit of the Intel Xeon processor E5-2600/4600 product family core includes the same 32K 8-way associative I-cache and the 4 decoder units which can decode up to 4 instructions per cycle as its predecessor. Enhancements include an additional decoded Uop cache and a ‘ground up’ rebuild of the branch predictor.

The decoded Uop cache holds 1.5K Uops (equivalent to a 6K I-cache) and is fully included in the 32K I-cache. It stores already decoded Uops, so the same instructions do not have to be decoded repeatedly. Most applications experience a hit rate of about 80% on this cache, which results in a considerable performance improvement. The legacy decode pipeline is also shut down while Uops are served from this cache, which saves power. Overall, with the addition of the decoded Uop cache, applications can achieve better performance while consuming less power. A developer benefits from these features without making any changes to application code.

The improved branch predictor in the Intel Xeon processor E5-2600/4600 product family core includes twice as many targets, lots of compression, longer history and longer memory. This will reduce the number of branch mispredictions, again helping applications achieve better performance with no code changes.

3.1.2 Out of Order/Execution Enhancements

The “Sandy Bridge” microarchitecture introduces the concept of Physical Register File (PRF) to replace Nehalem’s centralized Retirement Register File (RRF).

This approach is substantially more power efficient because it keeps a single copy of every data value and eliminates the movement of data values after calculation. The PRF is a key enabler for making the Out of Order (OOO) unit larger and for Intel Advanced Vector Extensions (Intel AVX). The PRF is enabled in the Sandy Bridge processor by default.

To double floating point throughput, the “Sandy Bridge” microarchitecture introduces Intel Advanced Vector Extensions (Intel AVX) instructions. Intel AVX extends the Intel® SSE floating point instruction set to a 256-bit operand size and, as a result, the execution units are able to handle 256-bit floating point operands.

To double the output of the execution stacks for 256-bit Intel AVX instructions, the Intel Xeon processor E5-2600/4600 product family repurposes existing data paths for dual use. Single Instruction Multiple Data (SIMD) integer and legacy SIMD FP instructions use the stacks in the legacy style, while Intel AVX instructions use both 128-bit execution stacks together. Combined with the larger OOO execution window that the PRF enables, this results in better performance.

Please refer to section 3.2 to learn more about how to take advantage of Intel AVX instructions.

3.1.3 Memory Cluster Enhancements

There are 3 ports going to the memory unit: load, store address and store data. To serve twice the load bandwidth, the memory ports were made symmetric in the Intel Xeon processor E5-2600/4600 product family so that the memory unit can service 3 data accesses per cycle: 2 loads of up to 16 bytes each and 1 store of 16 bytes (the predecessor serviced 2 data accesses per cycle).

Since a typical application performs more loads than stores, doubling the number of loads per cycle makes perfect sense and should result in performance improvement for most applications.
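The port figures above translate into peak bytes per cycle with simple arithmetic. The sketch below is illustrative only: it assumes the predecessor's 2 accesses per cycle were 1 load and 1 store of 16 bytes each, which this paper does not state explicitly, and the function name is made up for the example.

```python
# Peak data traffic between a core and its memory unit, per cycle,
# using the access widths quoted in the text above.
LOAD_BYTES = 16   # each load moves up to 16 bytes
STORE_BYTES = 16  # each store moves 16 bytes


def bytes_per_cycle(loads: int, stores: int) -> int:
    """Peak bytes moved per cycle for the given number of loads and stores."""
    return loads * LOAD_BYTES + stores * STORE_BYTES


# Assumption: the predecessor serviced 1 load + 1 store per cycle.
previous = bytes_per_cycle(loads=1, stores=1)
# E5-2600/4600: 2 loads + 1 store per cycle.
current = bytes_per_cycle(loads=2, stores=1)

print(f"previous generation: {previous} bytes/cycle")
print(f"E5-2600/4600:        {current} bytes/cycle")
```

Under these assumptions the load side doubles while the store side is unchanged, which matches the reasoning that loads dominate in typical applications.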

3.2 Intel® Advanced Vector Extensions (Intel® AVX)

The “Sandy Bridge” microarchitecture introduces Intel AVX, a new 256-bit instruction set extension to Intel SSE designed for applications that are Floating Point (FP) intensive.

Figure 4 Intel® AVX Instruction Format

Intel AVX introduces the following architectural enhancements:

  • Support for 256-bit wide vectors and SIMD register set.
  • Instruction syntax support for generalized three-operand syntax to improve instruction programming flexibility and efficient encoding of new instruction extensions.
  • Enhancement of legacy 128-bit SIMD instruction extensions to support three-operand syntax and to simplify compiler vectorization of high-level language expressions.
  • Instruction encoding format using a new prefix (referred to as VEX) to provide compact, efficient encoding for three-operand syntax, vector lengths, compaction of SIMD prefixes and REX functionality.
  • FMA extensions and enhanced floating-point compare instructions add support for IEEE-754-2008 standard.

Intel AVX employs an instruction encoding scheme using a new prefix (known as a “VEX” prefix). Instruction encoding using the VEX prefix can directly encode a register operand within the VEX prefix. This supports two new instruction syntaxes in Intel 64 architecture:

  • A non-destructive operand (in a three-operand instruction syntax): The non-destructive source reduces the number of registers, register-register copies and explicit load operations required in typical SSE loops, reduces code size, and improves micro-fusion opportunities.
  • A third source operand (in a four-operand instruction syntax) via the upper 4 bits in an 8-bit immediate field.

Two-operand instruction syntax previously expressed as

ADDPS xmm1, xmm2/m128

now can be expressed in three-operand syntax as

VADDPS xmm1, xmm2, xmm3/m128

In four-operand syntax, the extra register operand is encoded in the immediate byte. The introduction of three-operand and four-operand syntaxes helps to reduce the number of register to register copies, thus making the programming more efficient.
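To make the saved register copy concrete, the following toy sketch emits the instruction sequence needed to compute dst = src1 + src2 in each syntax. It is not an assembler; the helper names `sse_add` and `avx_add` are hypothetical, and the only instructions it prints are MOVAPS, ADDPS and VADDPS.

```python
def sse_add(dst: str, src1: str, src2: str) -> list:
    """Two-operand (destructive) form: ADDPS overwrites its first operand,
    so a source that must be preserved has to be copied into dst first."""
    if dst == src1:
        return [f"ADDPS {dst}, {src2}"]
    if dst == src2:
        # Addition is commutative, so the other source can be added in place.
        return [f"ADDPS {dst}, {src1}"]
    return [f"MOVAPS {dst}, {src1}", f"ADDPS {dst}, {src2}"]


def avx_add(dst: str, src1: str, src2: str) -> list:
    """Three-operand (non-destructive) form: both sources stay intact."""
    return [f"VADDPS {dst}, {src1}, {src2}"]


if __name__ == "__main__":
    for emit in (sse_add, avx_add):
        print(emit.__name__)
        for line in emit("xmm0", "xmm1", "xmm2"):
            print("   ", line)
```

When the destination differs from both sources, the SSE form needs an extra MOVAPS while the VEX-encoded form needs a single instruction.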

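Intel AVX also brings some new data manipulation and arithmetic compute primitives, including broadcast and permute. Fused multiply-add (FMA) is specified as a companion extension, but it is not implemented in the Sandy Bridge microarchitecture.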

Intel AVX improves performance due to wider vectors, new extensible syntax, and rich functionality which results in better data management. Applications that could benefit from Intel AVX include general purpose applications like image, audio/video processing, scientific simulations, financial analytics and 3D modeling and analysis.

Operating system and compiler support are needed to execute applications that use Intel AVX. Supporting operating systems include Linux* 2.6.30 or later, Windows 7* SP1 or later and Windows Server* 2008 R2 SP1 or later. Supporting compilers include the Intel C/C++ and Fortran Compilers version 11.1 or later, Microsoft* Visual Studio 2010 or later and GCC* 4.4.1 or later.
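On Linux, one quick way to check whether a machine is ready for Intel AVX is to look for the `avx` entry in the `flags` line of `/proc/cpuinfo`. The sketch below is a minimal illustration under that assumption; the function names are hypothetical, and it treats a missing or unreadable file (for example on a non-Linux system) as "not detected" rather than as a definitive answer.

```python
def cpu_flags(cpuinfo_text: str) -> set:
    """Return the feature flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def has_avx(cpuinfo_text: str) -> bool:
    """True if the 'avx' feature flag is listed."""
    return "avx" in cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        text = ""  # not Linux, or the file is not readable
    print("AVX detected:", has_avx(text))
```

An application that ships both SSE and AVX code paths could use a check like this at startup to decide which path to run.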

There are a couple of ways a developer can make use of Intel AVX in their applications:

  1. Re-compiling the application with the appropriate compiler: a developer who doesn’t want to modify their code can re-build the application with one of the compilers mentioned above, using the right switches to turn on AVX optimizations. With the Intel compiler on Windows, use the command line switch /QxAVX. On Linux, use -xAVX. The switches /QaxAVX (Windows) and -axAVX (Linux) may be used to build applications that will take advantage of AVX instructions on Intel systems that support these, but will use only SSE instructions on other Intel or compatible, non-Intel systems. For the Microsoft Visual Studio compiler, use the flag /arch:AVX to enable AVX optimizations.
  2. Hand-optimizing the application using intrinsics: the developer could modify relevant portions of their code using intrinsics. Please refer to http://software.intel.com/en-us/articles/intel-intrinsics-guide for more details on the intrinsics. The Intel® C++ Compiler supports Intel AVX-based intrinsics via the header file immintrin.h.

To best illustrate how AVX can be used, here is an example of how AVX was used to significantly improve performance of a financial services application: http://software.intel.com/en-us/articles/case-study-computing-black-scholes-with-intel-advanced-vector-extensions

For more details on Intel AVX, please go to http://software.intel.com/en-us/avx

3.3 Scalable Ring/Cache Architecture

Intel introduced the ring topology in Nehalem-EX and Westmere-EX. The Intel Xeon processor E5-2600/4600 product family improves scalability with a ring-style interconnect that links the cores, last level cache, PCIe and Integrated Memory Controller (IMC). The bus is made up of four independent rings: a data ring, request ring, acknowledge ring and snoop ring.

Figure 5 Ring architecture in the Intel® Xeon® processor E5-2600/4600 product family

The L3 cache is divided into slices, one associated with each core although each core can address the entire cache. Each slice gets its own stop and each slice has a full cache pipeline. In the Intel Xeon processor 5600 series there was a single cache pipeline and queue that all cores forwarded requests to. In the Intel Xeon processor E5-2600/4600 product family it’s distributed per cache slice.

Because each core in the Intel Xeon processor E5-2600/4600 product family has some L3 cache and a ring stop associated with it, cache bandwidth grows with the core count. At 3GHz, each stop can transfer up to 96 GB/s, so a dual-core implementation of this ring design peaks at 192 GB/s of last-level cache bandwidth, while a quad-core variant peaks at 384 GB/s. L3 latency is significantly reduced, from around 36 cycles in the prior generation to 26 - 31 cycles. Also unlike the prior generation, the L3 cache now runs at the core clock speed. The concept of the un-core still exists, but Intel calls it the “system agent” instead and it no longer includes the L3 cache.
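The scaling of cache bandwidth with core count can be worked out from the per-stop figure. The sketch below derives 32 bytes per clock per stop from the quoted 96 GB/s at 3 GHz and multiplies it out; the 8-stop row is an extrapolation of the same formula to an 8-core part, not a number stated in this paper, and the function name is made up for the example.

```python
# 96 GB/s per stop at 3 GHz implies 96 / 3 = 32 bytes per clock per stop.
BYTES_PER_CLOCK_PER_STOP = 32


def l3_peak_bandwidth_gbs(ring_stops: int, core_ghz: float) -> float:
    """Peak last-level cache bandwidth in GB/s, assuming one stop per core
    and an L3 that runs at the core clock."""
    return ring_stops * BYTES_PER_CLOCK_PER_STOP * core_ghz


if __name__ == "__main__":
    for stops in (1, 2, 4, 8):  # 8 stops is an extrapolation
        print(f"{stops} stop(s) at 3.0 GHz: "
              f"{l3_peak_bandwidth_gbs(stops, 3.0):.0f} GB/s")
```

Because the L3 runs at the core clock, the same formula also shows that peak bandwidth scales with frequency as well as with the number of stops.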

In summary, the ring architecture in the Intel Xeon processor E5-2600/4600 product family brings high bandwidth and scalability, and low latency – extremely useful for applications limited by bandwidth.

3.4 Intel® Turbo Boost Technology 2.0 (Intel Turbo 2.0)

Until the introduction of the Intel Xeon processor E5-2600/4600 product family, thermal dissipation was treated as steady state: as soon as the power of a system component increased, the component was assumed to be at its maximum temperature. The Intel Xeon processor E5-2600/4600 product family brings a new concept into the picture, called “thermal capacitance”. The power algorithms in these processors are based on the fact that, in reality, system components don’t heat up immediately. When you raise the power, you do not immediately max out the temperature. In between, there is what we call a “thermal budget”.

Most real world applications experience periods where they require maximum processor performance, and periods where they are idle. Many exhibit a responsive behavior where idle periods are mixed with user actions. Intel Turbo 2.0 is a perfect fit for these types of applications. During idle periods, the thermal budget is built up. As processor demand goes up, instead of going to the thermal design power (TDP), the processor can boost up to a much higher power level for a brief period. The processor still operates within a closely controlled thermal envelope, controlled by fuse settings and highly-accurate sensors. The brief period in this higher power level can be up to 25 seconds depending on how idle the application was in the beginning and for how long.
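The idea of building up and spending a thermal budget can be illustrated with a toy simulation. This is not Intel's power algorithm, which this paper does not describe; every constant below (95 W TDP, 120 W turbo level, 20 W idle power, 625 J budget capacity) is a hypothetical value chosen so that a fully charged budget lasts 25 seconds.

```python
def simulate(demand, tdp_w=95.0, turbo_w=120.0, idle_w=20.0, capacity_j=625.0):
    """Toy thermal-budget model, stepped once per second.

    demand: iterable of booleans, True when the application needs full speed.
    Returns the power level (in watts) chosen for each second.
    All parameter defaults are hypothetical illustration values.
    """
    budget = 0.0  # joules of headroom accumulated while running below TDP
    trace = []
    for busy in demand:
        if not busy:
            power = idle_w
        elif budget >= turbo_w - tdp_w:
            power = turbo_w  # enough headroom: run above TDP
        else:
            power = tdp_w    # headroom exhausted: fall back to TDP
        # Running below TDP adds to the budget, running above TDP spends it.
        budget = min(capacity_j, max(0.0, budget + (tdp_w - power)))
        trace.append(power)
    return trace


if __name__ == "__main__":
    for idle_seconds in (2, 10):
        trace = simulate([False] * idle_seconds + [True] * 40)
        print(f"{idle_seconds} s idle -> {trace.count(120.0)} s above TDP")
```

In this model a longer idle period before the burst buys a longer stay above TDP, up to the cap set by the budget capacity, which mirrors the behavior described above.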

Figure 6 Intel Turbo Boost Technology 2.0

4. Romley Platform Enhancements

The following sections discuss some of the new features that come with the Romley platform and how a developer can take advantage of them.

4.1 Intel® C600 Chipset

The Intel® C600 series chipsets support the Intel Xeon processor E5 family for servers and workstations. The Intel® C600 chipset specifically supports eight flexible PCI Express* 2.0 x1 ports, up to eight SAS 3 Gb/s ports, two SATA 6 Gb/s ports, and four SATA 3 Gb/s ports, fourteen USB 2.0 ports, Legacy PCI, Intel® Rapid Storage Technology enterprise (Intel® RSTe), and hardware XOR acceleration for RAID. These chipsets also offer optional support for Intel® Intelligent Power Node Manager, Intel® Active Management Technology, or Intel® vPro™ technology.

From a developer’s perspective, all these features can be utilized just by running the application on the Romley platform – no code changes are required.

4.2 PCI Express* 3.0

PCIe 3.0 is launched first on Intel Xeon processor E5-2600/4600 product family based servers and workstations. PCIe 3.0 delivers double the bandwidth compared to its predecessor PCIe Gen2. PCIe 3.0 is backward compatible with both PCIe Gen1 and PCIe Gen2 and uses the same connectors.
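The doubling comes from a higher transfer rate combined with a more efficient line encoding. The sketch below works through the per-lane arithmetic using the published signaling rates (5 GT/s with 8b/10b encoding for Gen2, 8 GT/s with 128b/130b encoding for 3.0); these figures come from the PCI Express specifications rather than from this paper, and the names in the code are made up for the example.

```python
GENERATIONS = {
    # name: (transfer rate in GT/s, payload bits, encoded bits)
    "PCIe Gen2": (5.0, 8, 10),     # 8b/10b encoding
    "PCIe 3.0": (8.0, 128, 130),   # 128b/130b encoding
}


def lane_bandwidth_gbs(gt_per_s: float, payload_bits: int, encoded_bits: int) -> float:
    """Usable bandwidth of one lane in one direction, in GB/s."""
    return gt_per_s * payload_bits / encoded_bits / 8


if __name__ == "__main__":
    for name, params in GENERATIONS.items():
        per_lane = lane_bandwidth_gbs(*params)
        print(f"{name}: {per_lane:.3f} GB/s per lane, "
              f"{16 * per_lane:.2f} GB/s for a x16 link")
```

The raw transfer rate rises by only 60%, so most of the remaining gain comes from the lower encoding overhead.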

4.3 Intel® Node Manager 2.0 (Node Manager 2.0)

Node Manager is a power reporting and capping technology delivered as firmware running on the Intel® chipset. It provides directed power management features that complement a Baseboard Management Controller (BMC) or manageability controller functionality (Figure 7).

Figure 7 Power capping and reporting using Node Manager

Intel Xeon processor E5-2600/4600 product family based platforms can support Node Manager 2.0. Some of the improvements in Node Manager 2.0 compared to the previous version (1.5) are:

  • Node Manager 2.0 utilizes CPU and memory subsystem “knobs” to achieve policy directives (Node Manager 1.5 used Operating System Power Management (OSPM)) – this brings increased reliability since power capping happens independently from Operating System failures and the power readings come directly from CPU registers.
  • Additional power control mechanisms, including memory power limiting and dynamic core allocation
  • Power supply optimizations
  • Boot time power capping – lowers power consumption at boot time.

To use the Node Manager functionality, applications need to rely on management console software enabled by Intel® Data Center Manager or Node Manager APIs.

For more details on this technology, please visit:

www.intel.com/technology/nodemanager

Here is a link to a case study on how an application can achieve power savings using Node Manager:

http://software.intel.com/en-us/articles/energy-efficiency-using-intel-intelligent-power-node-manager/

4.4 Security Technologies Enhancements

The enhanced security solutions on Romley platform include:

  • Greater Intel® Advanced Encryption Standard New Instructions (Intel® AES-NI) efficiency
  • More robust solutions using Intel® Trusted Execution Technology (Intel® TXT)
    • Greater hypervisor and OS support
    • Launch control policy enhancements
    • Improved integration in management consoles

To understand more about Intel® AES-NI, please visit:

http://www.intel.com/content/www/us/en/architecture-and-technology/advanced-encryption-standard--aes-/data-protection-aes-general-technology.html

For more details on Intel® TXT, refer to www.intel.com/txt

4.5 Intel® Virtualization Technology Enhancements (Intel® VT)

Intel VT enhancements include large VT-d pages, which add 2MB and 1GB page sizes to VT-d implementations. This also makes it possible to share the CPU's Extended Page Tables (EPT), including super-pages, with VT-d. The benefits of large VT-d pages include:

  • Smaller memory footprint for I/O page-tables when using super-pages
  • Potential for improved performance due to shorter page-walks, allows hardware optimization for I/O Translation Lookaside Buffer (IOTLB)

Large VT-d pages are enabled by default on Romley and software applications running on this platform can automatically take advantage of this technology.
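The memory footprint benefit can be estimated with a little arithmetic. The sketch below counts only the leaf page-table entries needed to map a 1 GB I/O region and ignores the upper levels of the table; it assumes 64-bit (8-byte) entries and the 2 MB and 1 GB super-page sizes defined for VT-d, and the function name is made up for the example.

```python
KIB, MIB, GIB = 1 << 10, 1 << 20, 1 << 30
ENTRY_BYTES = 8  # assumption: each page-table entry is 64 bits


def leaf_entries(region_bytes: int, page_bytes: int) -> int:
    """Number of leaf entries needed to map a region (ceiling division)."""
    return -(-region_bytes // page_bytes)


if __name__ == "__main__":
    for label, page in (("4 KB", 4 * KIB), ("2 MB", 2 * MIB), ("1 GB", GIB)):
        n = leaf_entries(GIB, page)
        print(f"{label} pages: {n} leaf entries, {n * ENTRY_BYTES} bytes")
```

Fewer entries also mean fewer levels to walk on an IOTLB miss, which is where the potential performance benefit comes from.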

5. Conclusion

In summary, the Intel Xeon processor E5-2600/4600 product family, combined with the Romley platform, provides many new and improved features that could significantly change your performance and power experience on enterprise platforms. A developer can make use of most of these new features without making any changes to applications.

6. About the Author

Sree Syamalakumari is a software engineer in the Software & Service Group at Intel Corporation. Sree holds a Master's degree in Computer Engineering from Wright State University, Dayton, Ohio.

For more complete information about compiler optimizations, see the Optimization Notice.