Does anyone know of a tool for monitoring the number of errors detected by ECC memory or the memory controller?
HP Integrated Lights-Out can report ECC memory errors.
Link to the HP whitepaper: http://h20000.www2.hp.com/bc/docs/support/SupportManual/c02878598/c02878...
I wanted to get an ECC-enabled machine to see how often DRAM errors occur in my environment.
(Interesting: you cannot tell whether you need ECC unless you already have it?)
But after reading the xeon-e5-2600-uncore-guide, this HP paper and the MS WHEA documentation, the whole ECC topic looks too intimidating.
I'll surrender for now...
Please look at this manual for Intel Xeon E7 processors. FVC events can be configured to count memory ECC errors (see page 2-126 for example). They can also count corrected/uncorrected memory request responses.
Low-level details of hardware and its programming interface are not easy to grasp quickly :)
Thanks guys. I see your point, Ilya... There's an anecdote about senior and junior toilet cleaners... ;)
My goal is to measure how often RAM errors occur on my machines and whether I want ECC.
But the DRAM controller of Xeons (and the ECC RAM itself, of course) looks much more complex than on "normal" non-ECC mobos, so there are more parts that may fail. Do you think that a RAM error rate measured on an ECC-enabled machine can be extrapolated to a simpler non-ECC Sandy/Ivy Bridge system?
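If it helps with comparing machines, one rough way to express "how often" is to normalize the corrected-error count by installed memory and observation time. This is only a sketch: the function name and the errors-per-GiB-per-year unit are my own choices, and whether such a rate carries over to a non-ECC system is exactly the open question above.

```python
def errors_per_gib_year(corrected_errors, memory_gib, hours_observed):
    """Normalize an error count to errors per GiB of RAM per year.

    All three inputs are values you collect yourself (hypothetical names):
    the number of corrected errors logged, the installed memory in GiB,
    and how many hours the machine was observed.
    """
    if memory_gib <= 0 or hours_observed <= 0:
        raise ValueError("memory_gib and hours_observed must be positive")
    hours_per_year = 24 * 365
    return corrected_errors / memory_gib * hours_per_year / hours_observed
```

For example, 4 corrected errors on a 32 GiB machine observed for one full year (8760 hours) gives 4 / 32 = 0.125 errors per GiB per year.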
Building the PCM to get the counters is not a problem.
Does PCM measure ECC errors?
So... is ECC worth the extra money... that is a good question.
My first response is, how much does it matter whether you can catch memory errors?
If you are doing something where you don't mind rebooting then you probably don't need ECC memory.
For mission critical applications where you absolutely need to know whether there are memory issues (yes, DIMMs do go bad) then ECC is a requirement. This is why servers always have ECC support.
I think you can monitor ECC errors on Windows in the System event log in the Event Viewer (eventvwr.msc).
> Does PCM measure ECC errors?
I have not checked this yet. Even if it does not, the documentation explains how to get these counters.
> So... is ECC worth the extra money... that is a good question.
The ECC RAM modules themselves do not cost much more; it is the whole new machine of a higher class that is expensive...
Finally we've got approval for a Dell server. The exact model and hardware details are not known yet.
>>>I think you can monitor ECC errors on Windows in the System event log in the Event Viewer (eventvwr.msc).>>>
This is implemented by WHEA (Windows Hardware Error Architecture).
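For anyone who wants to count these entries from a script rather than click through Event Viewer, here is a minimal sketch. It assumes the events are logged to the System log under the provider name Microsoft-Windows-WHEA-Logger and that the standard wevtutil command-line tool is available; the function names are made up for illustration. I have not mapped event IDs to specific memory error types, so it simply tallies whatever IDs it finds.

```python
import subprocess
import xml.etree.ElementTree as ET
from collections import Counter

# XML namespace used by Windows event records.
NS = {"e": "http://schemas.microsoft.com/win/2004/08/events/event"}

# Assumption: hardware error events come from this provider in the System log.
QUERY = "*[System[Provider[@Name='Microsoft-Windows-WHEA-Logger']]]"


def count_whea_events(xml_text):
    """Tally events by EventID from wevtutil XML output wrapped in a root element."""
    root = ET.fromstring(xml_text)
    ids = (
        event.findtext("e:System/e:EventID", namespaces=NS)
        for event in root.findall("e:Event", NS)
    )
    return Counter(i for i in ids if i is not None)


def query_whea_events():
    """Run wevtutil against the System log (Windows only) and tally the results."""
    result = subprocess.run(
        ["wevtutil", "qe", "System", "/q:" + QUERY, "/f:xml", "/e:Events"],
        capture_output=True,
        text=True,
        check=True,
    )
    return count_whea_events(result.stdout)
```

On a Windows machine, calling query_whea_events() returns a Counter keyed by event ID string. The parsing is kept in a separate function so it can be tried on saved XML without access to the server.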
>>But after reading the xeon-e5-2600-uncore-guide, this HP paper and the MS WHEA documentation, the whole ECC topic
>>looks too intimidating. I'll surrender for now...
In 2012 I saw some Intel equipment that, as I remember, could simulate memory errors on server platforms. Honestly, I didn't dare to ask how much it costs...