Intel® Xeon® Processor E7 V3 Product Family New Reliability Features

1) Introduction

Intel introduced advanced Reliability, Availability, and Serviceability (RAS) features in the Intel® Xeon® processor E7 product family to minimize data loss and downtime caused by server failures. More information about the advanced RAS features is available in the white paper [1]. This article describes the new advanced RAS features added to the Intel Xeon processor E7 product family in 2015 that require support from the BIOS and the operating system (OS). This product family is based on Intel® Core™ microarchitecture (formerly codenamed Haswell) and is manufactured on 22-nanometer process technology.

2) New RAS Features

The following RAS features require support from the OS or Virtual Machine Manager (VMM), including software recovery mechanisms.

  • Enhanced Machine Check Architecture Gen 2 (eMCA2)

eMCA2 builds on the first generation, eMCA1 [4], which makes it possible to implement the Firmware First Model (FFM) [2] of error reporting (logging and signaling) and thus enhances the existing Machine Check Architecture (MCA). FFM is a server RAS architecture in which all error events are first signaled to platform-specific firmware. The firmware processes the error logs and decides if and when to pass the error information to the OS or application software. eMCA1 can be configured to provide enhanced error log information to the OS and VMM, which they can use to implement advanced diagnostics and predictive failure analysis (PFA) [7] for the system.

In addition to the eMCA1 capabilities, eMCA2 lets the BIOS/FW take a first-line error handling role by identifying errors and collecting richer error logs, complementing the existing hardware and OS/VMM-level error handling. With eMCA2 enabled, the BIOS SMM [8] handler can correct the error, if possible, before optionally causing a Machine Check Exception (MCE) [6] or Corrected Machine Check Interrupt (CMCI) to be signaled. eMCA2 redirects MCEs and CMCIs to the firmware first (via SMI) [10] before they reach the OS error handler. It also uses FFM, which directs system software to retrieve the enhanced error information before it attempts error recovery and before the system shuts down.
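The firmware-first flow described above can be pictured with a small simulation. This is only a conceptual sketch, not firmware code: the class names, the `firmware_correctable` flag, and the `notify_os_after_correction` policy switch are hypothetical and exist purely to illustrate the order of events (firmware logs and handles the error first, then optionally signals the OS).

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Signal(Enum):
    MCE = "MCE"    # Machine Check Exception
    CMCI = "CMCI"  # Corrected Machine Check Interrupt


@dataclass
class ErrorEvent:
    bank: int                   # machine check bank that reported the error
    signal: Signal              # signal the hardware would have raised
    firmware_correctable: bool  # hypothetical: can the SMM handler fix it?


@dataclass
class Platform:
    # Hypothetical policy: still tell the OS about errors firmware corrected.
    notify_os_after_correction: bool = False
    firmware_log: List[str] = field(default_factory=list)
    os_notifications: List[Signal] = field(default_factory=list)

    def smm_handler(self, event: ErrorEvent) -> Optional[Signal]:
        """Firmware runs first (via SMI): log, try to correct, then decide."""
        self.firmware_log.append(f"bank {event.bank}: {event.signal.value}")
        if event.firmware_correctable and not self.notify_os_after_correction:
            return None       # handled entirely in firmware
        return event.signal   # forward the signal to the OS handler

    def raise_error(self, event: ErrorEvent) -> None:
        forwarded = self.smm_handler(event)
        if forwarded is not None:
            self.os_notifications.append(forwarded)


platform = Platform()
platform.raise_error(ErrorEvent(bank=3, signal=Signal.CMCI, firmware_correctable=True))
platform.raise_error(ErrorEvent(bank=5, signal=Signal.MCE, firmware_correctable=False))
```

In this model both events reach the firmware log, but only the uncorrectable one is forwarded to the OS.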

eMCA2 adds the following capabilities to FFM:

            Attributes:

                Allows SMM [8] code to intercept MCE/CMCI

                Allows SMM code to write the MCA Status/Addr/Misc registers

                Allows SMM code to generate MCEs

                Allows a DSM-based (ACPI Device Specific Method) pointer to the enhanced error logs

                Additional IA32_MCG_CAP bit for eMCA Support

            Enhancement:

                Enhanced SMM features such as directed SMI, in-silicon SMM state save, an extended SMM memory range register (SMRR2), and SMM security.

                Unlike eMCA1, eMCA2 allows SMM code to write to the machine check banks, not only read them.

            For more information, see [11].
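As an illustration of the IA32_MCG_CAP capability bit mentioned in the list above, the following sketch decodes a raw IA32_MCG_CAP value. The bit positions (MCG_EMC_P at bit 25, MCG_ELOG_P at bit 26, and so on) reflect our reading of the Intel Software Developer's Manual and should be verified against it. The `read_mcg_cap` helper assumes a Linux system with the `msr` kernel module loaded and root privileges; the function names and dictionary keys are hypothetical.

```python
import os
import struct

IA32_MCG_CAP = 0x179        # MSR address of IA32_MCG_CAP

MCG_COUNT_MASK = 0xFF       # bits 7:0, number of machine check banks
MCG_CMCI_P = 1 << 10        # CMCI supported
MCG_SER_P = 1 << 24         # software error recovery supported
MCG_EMC_P = 1 << 25         # enhanced MCA (firmware-first signaling)
MCG_ELOG_P = 1 << 26        # extended error logging


def decode_mcg_cap(value: int) -> dict:
    """Split a raw IA32_MCG_CAP value into the fields of interest."""
    return {
        "bank_count": value & MCG_COUNT_MASK,
        "cmci": bool(value & MCG_CMCI_P),
        "software_error_recovery": bool(value & MCG_SER_P),
        "emca_signaling": bool(value & MCG_EMC_P),
        "extended_error_log": bool(value & MCG_ELOG_P),
    }


def read_mcg_cap(cpu: int = 0) -> int:
    """Read the MSR through /dev/cpu/N/msr (Linux only, needs root)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        return struct.unpack("<Q", os.pread(fd, 8, IA32_MCG_CAP))[0]
    finally:
        os.close(fd)


# Made-up sample value for illustration, not read from real hardware.
sample = MCG_ELOG_P | MCG_EMC_P | MCG_SER_P | MCG_CMCI_P | 22
caps = decode_mcg_cap(sample)
```

On a real system, `decode_mcg_cap(read_mcg_cap())` would report whether the platform advertises the eMCA capability.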

  • Memory Address Range Mirroring (MARM)

Memory mirroring is a method of dividing system memory into two channels in which the first channel is mirrored to the second channel. Memory mirroring helps avoid a system shutdown when a fault occurs in memory. With memory mirroring, if a fault occurs, the memory controller automatically switches to the second channel without affecting the OS. As soon as the issue is resolved, the two channels re-synchronize. This memory redundancy minimizes operating disruption. However, because every mirrored byte is stored twice, it increases cost and power consumption compared to a non-mirrored system.

To gain the benefit of memory mirroring while reducing memory cost and system power, the Intel® Xeon® E7 v3 product family introduces a RAS feature called Memory Address Range Mirroring. This feature allows mirroring only certain portions of memory instead of mirroring the entire memory.

The MARM feature enables the firmware or OS to determine a range of memory addresses to be mirrored, leaving the rest of the memory in non-mirrored mode. The failover to the mirrored memory occurs without the knowledge of the OS and applications. Currently, the processor supports up to two mirror ranges. OS kernel space is a likely candidate for MARM.
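The trade-off between full and partial mirroring can be made concrete with a small model. This is a sketch under stated assumptions, not a platform interface: `MirrorRange`, `MirrorConfig`, and the 64 GiB / 4 GiB figures are hypothetical, and the only detail taken from the text is the limit of two mirror ranges. The model assumes each mirrored byte consumes one additional byte of installed memory for its copy.

```python
from dataclasses import dataclass
from typing import List

MAX_MIRROR_RANGES = 2  # limit stated in the text
GIB = 1 << 30


@dataclass(frozen=True)
class MirrorRange:
    start: int  # first mirrored address (inclusive)
    size: int   # length of the mirrored range in bytes

    def contains(self, addr: int) -> bool:
        return self.start <= addr < self.start + self.size


class MirrorConfig:
    def __init__(self, installed_bytes: int, ranges: List[MirrorRange]):
        if len(ranges) > MAX_MIRROR_RANGES:
            raise ValueError("at most two mirror ranges are supported")
        self.installed_bytes = installed_bytes
        self.ranges = list(ranges)

    def is_mirrored(self, addr: int) -> bool:
        """True if a fault at this address can fail over to a mirror copy."""
        return any(r.contains(addr) for r in self.ranges)

    def usable_bytes(self) -> int:
        """Memory left for software after reserving space for mirror copies."""
        return self.installed_bytes - sum(r.size for r in self.ranges)


# Hypothetical system: 64 GiB installed, lowest 4 GiB (e.g. kernel) mirrored.
partial = MirrorConfig(64 * GIB, [MirrorRange(start=0, size=4 * GIB)])
# Full mirroring of the same system: half the memory holds the copy.
full = MirrorConfig(64 * GIB, [MirrorRange(start=0, size=32 * GIB)])
```

In this example the partially mirrored system keeps 60 GiB usable, compared with 32 GiB when everything is mirrored, while addresses inside the mirrored range remain protected.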

For more information about this feature, see [13].

3) Conclusion

These additional advanced RAS features allow the Intel Xeon processor E7-4800 V3 product family to increase uptime and reduce cost by letting the BIOS/FW handle errors before optionally signaling the OS, and by mirroring only critical parts of memory such as the kernel space. They also offer richer error logging, which allows better fault diagnosis.

4) References

[1] http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-e7-family-ras-server-paper.pdf

[2] http://www.intel.com/content/dam/www/public/us/en/documents/research/2012-vol16-iss-2-intel-technology-journal.pdf

[3] http://en.wikipedia.org/wiki/PCI_Express

[4] http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/enhanced-mca-logging-xeon-paper.pdf

[5] https://www.kernel.org/doc/ols/2007/ols2007v2-pages-297-304.pdf

[6] http://en.wikipedia.org/wiki/Machine-check_exception

[7] http://en.wikipedia.org/wiki/Predictive_failure_analysis

[8] http://en.wikipedia.org/wiki/System_Management_Mode

[9] http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/platform-level-error-strategies-paper.pdf

[10] https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_MRG/1.3/html/Realtime_Reference_Guide/sect-Realtime_Reference_Guide-Hardware_interrupts-System_management_interrupts.html

[11] eMCA2 paper

[12] http://www.intel.com/content/dam/www/public/us/en/documents/datasheets/xeon-processor-7500-series-vol-2-datasheet.pdf

[13] Memory Address Range Mirroring paper
