ECC memory percentage?

solofo r.:

Hello,

I am looking for the percentage of GDDR5 memory used when ECC is activated on Xeon Phi to check that the result of micinfo (GDDR Size) is correct.

I've read 12.5% (1 bit per byte) and 3.125%, and I observed 3.125% (32 MB per GB).
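Just to show the arithmetic behind the ratio I observed (a quick check in Python, assuming binary MiB/GiB units):

    # 32 MiB reserved out of every 1024 MiB (1 GiB) of GDDR5
    overhead = 32 / 1024
    print(overhead)        # 0.03125 -> 3.125%, i.e. exactly 1/32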

Where can I find this information? (I did not find it in the specifications of the different Xeon Phi models.)

Thanks,

solofo r.:

anybody?

jimdempseyatthecove:

While I have not seen the ECC layout for Xeon Phi, all the implementations I am familiar with have the ECC bits in memory that is not accessible as storage memory. Typically, for every 8 bits of storage, one extra bit is required to support ECC (the 12.5% figure). However, some implementations use more bits to increase the detection and correction capabilities. These bits are not stored in a table accessible to the programmer; rather, the memory bus is wider than the program-accessible data. Hardware in the memory system performs the error detection and, when possible, the correction. Status information can be read by the processor, but the detection and correction are outside the control of the CPU. When you order memory (sticks), the stated capacity does not include the ECC bits. Looking at (some) memory sticks, where a non-ECC module may show 8 memory chips, an ECC variant will show 9 chips.
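As a rough illustration of the side-band arithmetic described above (a sketch in Python, not tied to any particular product):

    # Side-band ECC: a 64-bit data word travels with 8 extra check bits
    # on a widened memory bus (one check bit per byte of data).
    data_bits = 64
    check_bits = 8
    print(check_bits / data_bits)        # 0.125 -> 12.5% overhead

    # The same ratio shows up in the chip count on a DIMM:
    # 8 data chips on a non-ECC module vs. 9 chips on the ECC variant.
    print((9 - 8) / 8)                   # 0.125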

Jim Dempsey

www.quickthreadprogramming.com
solofo r.:

Thanks Jim.

> [...] all the implementations I am familiar with have the ECC bits in memory that is not accessible as storage memory.

I thought this was different for the Xeon Phi because of this:

http://download.intel.com/newsroom/kits/xeon/phi/pdfs/Intel-Xeon-Phi_Fac...

"The Intel Xeon Phi coprocessor boasts up to 8 GB of memory capacity (7.75GB with
ECC enabled), [...]"

John D. McCalpin:

With the wide DRAMs used in Xeon Phi (and in GPUs), it is not practical to widen the interface to hold the ECC bits.  So these use what is usually referred to as "in-line" ECC.   On Xeon Phi the output of "cat /proc/meminfo" makes it clear that exactly 1/32 of the DRAM "disappears" when ECC is enabled.
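As a quick consistency check (a small sketch, assuming the 1/32 ratio applies to the full nominal capacity quoted earlier in the thread):

    # 8 GB nominal GDDR5, with 1/32 reserved for inline ECC
    nominal_gb = 8.0
    print(nominal_gb * (1 - 1/32))   # 7.75 -> matches the "7.75GB with ECC enabled" figure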

I don't know of any detailed public disclosures of the implementation of the ECC, but the article on "Hamming Codes" at Wikipedia provides the formulas for the number of bits required for using Hamming codes plus an extra parity bit to provide SECDED.  A quick calculation shows that  the standard approach requires at least 10 parity bits (protecting up to 502 data bits) to get the overhead down to less than 1/32.  If the protected data is limited to powers of 2, then 11 parity bits (protecting up to 1013 bits, but rounded down to 512) allows one cache line (512 bits) of parity bits to protect 46 cache lines of data.  This is an overhead that is well under 1/32.
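To make that counting explicit, here is a small sketch of the standard SECDED bookkeeping (illustrative only; it is not Intel's actual implementation):

    # With r total check bits (Hamming bits plus one overall parity bit),
    # SECDED can protect up to 2**(r-1) - r data bits.
    def secded_check_bits(data_bits):
        """Smallest r such that 2**(r-1) - r >= data_bits."""
        r = 1
        while 2 ** (r - 1) - r < data_bits:
            r += 1
        return r

    print(2 ** 9 - 10)               # 502:  most data bits coverable with 10 check bits
    print(secded_check_bits(512))    # 11:   one 512-bit cache line needs 11 check bits
    print(2 ** 10 - 11)              # 1013: most data bits coverable with 11 check bits

    # One 512-bit cache line of check bits covers floor(512/11) = 46 data lines,
    # i.e. an overhead of 1/47, which is comfortably below 1/32.
    print(512 // 11, 1 / (512 // 11 + 1))    # 46, ~0.0213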

It is unlikely that Intel has done anything so simple.  There are a lot of extra states that could be protected using the extra degrees of freedom, and it is common for vendors to also protect against "chip kill" syndromes (in which all of the bits from a particular DRAM are stuck at either zero or one).   The issue of updating the ECC bits is also tricky -- one would prefer not to need to perform a read-modify-write cycle on a full cache line of ECC bits every time a cache line is written to memory.   Again, Intel has not disclosed details, but I would imagine that they created a customized error-correcting polynomial that both provides the required SECDED coverage and provides other convenient features.   This note was mainly to demonstrate that the log(N,2) nature of Hamming codes allows reduction in the overhead by increasing the size of the data block covered.

An additional complication of the use of 1/32 of the memory for error-correcting bits is that "physical" addresses have to be remapped to avoid the "holes".    This might add unexpected jumps in the 8-way round-robin interleave across the 8 memory controllers.  (In the late 1990's we did something similar to provide inline NUMA directory memory on the SGI Altix 3000, and maybe on the SGI Origin3000 as well -- I was less involved with the design on that one.)
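To illustrate what such a remapping might look like (an entirely hypothetical layout, invented here for illustration; the real mapping and interleave have not been disclosed):

    # Suppose every 32nd cache-line-sized slot in DRAM holds ECC check bits.
    # Then the dense "usable" physical addresses seen by the cores have to be
    # shifted past those reserved slots.
    LINE = 64       # bytes per cache line
    GROUP = 32      # hypothetical group: 31 data lines followed by 1 ECC line

    def dram_address(phys_addr):
        """Map a dense usable address onto DRAM that reserves 1 line in 32."""
        line, offset = divmod(phys_addr, LINE)
        group, line_in_group = divmod(line, GROUP - 1)   # 31 data lines per group
        return (group * GROUP + line_in_group) * LINE + offset

    print(hex(dram_address(30 * 64)))    # 0x780: last data line before the first ECC slot
    print(hex(dram_address(31 * 64)))    # 0x800: next line skips the ECC slot at 0x7c0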

John D. McCalpin, PhD "Dr. Bandwidth"
solofo r.:

Thanks, John, for these explanations of why GDDR differs from DDR from the ECC point of view.

I realize now that my question was maybe not clear enough.

What I was simply asking:

- Is the ratio of 32 MB per GB the same for all Xeon Phi models? (That is what I observed.)

- Is it specified anywhere? (It may not be, for example because it could be subject to change with new firmware.)

John D. McCalpin:

I have not been able to find any documentation that specifies the ECC overhead.  As far as I know it is the same for all currently shipping Xeon Phi models, and based on the few details that I have been told about the implementation, it makes sense that it would be a fixed percentage.  It is probably safe to assume that it will remain 1/32 for any (hypothetical) future products in the Xeon Phi (Knights Corner) family, and there is almost certainly too much hardware support built in for this to be changed in a firmware update.

Of course since Intel has not described the details of the ECC implementation in the currently shipping product, there is no way to know how ECC might be implemented in the next-generation product (Knights Landing).

John D. McCalpin, PhD "Dr. Bandwidth"
