ALU/CPU Error Detection

This question is about detecting and possibly correcting arithmetic errors in the CPU.

Consider any recent Intel Core iX CPU. Suppose I wanted to multiply 8 by 7, in pseudo-code:
mov 8 to AX
mul AX by 7

Then further suppose the ALU came up with 55 instead of 56. Would the CPU issue an error of some kind? If so, how? Fundamentally, my question relates to errors in arithmetic, which, because of cosmic rays and voltage blips, are almost certain to occur. Is there any logic in the ALU/CPU to detect and possibly correct such errors?

The answer is important. At least tell me how to find the answer.

A very old Intel Pentium CPU Floating-Point Bug:

...
// Sub-Test for Intel Pentium CPU Floating-Point Bug
// Note: There is a very small set of numbers for which the quotient of two numbers
// is computed differently on Intel 486 and Pentium CPUs (the Pentium FDIV bug)

RTfloat fValue = 4195835.0f - ( ( 4195835.0f / 3145727.0f ) * 3145727.0f );
RTdouble dValue = 4195835.0L - ( ( 4195835.0L / 3145727.0L ) * 3145727.0L );

// Results:
// Intel 486 CPU = 0
// Intel Pentium CPU = 256
// All the rest Intel CPUs = 0
...

This bug was detected by a scientist. I look forward to hearing from Intel hardware and software engineers about possible error correction for integer arithmetic.

Thanks for reminding us of that incident; it was very expensive. I am supposed to be researching the reliability of real-time control systems. A microprocessor that errs in this context is almost intolerable, particularly if the error persists for more than a few cycles. Arithmetic error checking is possible in software, but it would be less expensive if it were done in hardware and the cost were spread over all purchasers rather than only those who feel they must have it, as cruel, mean, and common as that reads.
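
As an illustration of the software checking mentioned above, here is a minimal sketch of a residue check on multiplication. It is not something from this thread or from Intel documentation; the function names and the choice of modulus 3 are assumptions made for the example. The idea is that (a mod 3) * (b mod 3) mod 3 must equal (a * b) mod 3, so a product that fails the comparison is known to be wrong.

```c
#include <stdbool.h>
#include <stdint.h>

/* Modulus for the residue check (an assumption for this sketch). A single
   flipped bit changes a value by +/- 2^k, which is never a multiple of 3,
   so every single-bit error in the product is caught. */
#define RESIDUE_MODULUS 3u

/* Returns true if `product` is consistent with a * b under the residue check.
   Operands are 32-bit so the true product always fits in 64 bits. */
bool product_passes_residue_check(uint32_t a, uint32_t b, uint64_t product)
{
    uint64_t expected =
        ((uint64_t)(a % RESIDUE_MODULUS) * (b % RESIDUE_MODULUS)) % RESIDUE_MODULUS;
    return product % RESIDUE_MODULUS == expected;
}

/* Multiplies, stores the product in *out, and returns false if the
   residue check detects a mismatch. */
bool checked_multiply(uint32_t a, uint32_t b, uint64_t *out)
{
    uint64_t product = (uint64_t)a * b;
    *out = product;
    return product_passes_residue_check(a, b, product);
}
```

With the numbers from the original question, `product_passes_residue_check(8, 7, 55)` fails and `product_passes_residue_check(8, 7, 56)` passes. The check only detects; it does not correct. An error that is an exact multiple of 3 slips through, and the check runs on the same hardware it is checking.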

>>>Then further suppose the ALU came up with 55 instead of 56.>>>
I would be more afraid of the various errors related to floating-point calculations. Long looped calculations based on improperly represented values, polluted by rounding and truncation errors, can be very annoying.
If you are interested in the rounding and truncation errors of various math calculations, I would recommend a very good book titled "Real Computing Made Real".

>>>A microprocessor that errs in this context is almost intolerable>>>
It depends on the maximal error you can tolerate, as a function of machine accuracy.
http://en.wikipedia.org/wiki/Machine_epsilon
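
To make the machine accuracy mentioned above concrete, here is a small sketch that finds the machine epsilon for `double` by repeated halving. The function name is hypothetical, and it assumes the default round-to-nearest mode and an IEEE 754 `double`.

```c
#include <float.h>

/* Halve eps until adding it to 1.0 no longer changes the result, then
   return the last value that still made a difference. The volatile store
   forces each sum to be rounded to double, even on targets that evaluate
   expressions in extended precision. */
double machine_epsilon_double(void)
{
    volatile double sum;
    double eps = 1.0;
    do {
        eps /= 2.0;
        sum = 1.0 + eps;
    } while (sum > 1.0);
    return eps * 2.0;
}
```

On such a platform the result should match `DBL_EPSILON` from `<float.h>`, which is the value to compare your tolerable error against.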

Cosmic particles, heat, and similar reasons can cause a large variety of possible errors in a complex system like a computer. These range from errors in DRAM, the memory controller, QPI links, and caches to errors in the processor core. Some of the errors are correctable, some are uncorrectable. The mechanism for monitoring them is the Machine-Check Architecture (MCA). You can find a description of MCA in Chapter 15 of Volume 3B of the Software Developer's Manual:

http://www.intel.de/content/www/us/en/processors/architectures-software-developer-manuals.html

Error codes are described in Section 15.9.

Depending on how important reliability is for you, you can use ECC, DDDC, memory mirroring, QPI in lockstep mode, or even a complete lockstep system like the systems from NEC or Stratus.

Please take a look at a web-link: http://www.xbitlabs.com/articles/cpu/display/elbrus-e2k.html

...
...in 1993 Intel introduced an absolutely new 32-bit Pentium processor with lots of new features...
...
- functional redundancy checking support - two Pentiums perform the same calculations, comparing results after each step. If the results do not match, processors repeat the calculations.
...

At least it gives you some information on what was done by Intel in the past.
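
The quoted description (perform the same calculation twice, compare, repeat on mismatch) can be imitated in software. The sketch below only illustrates that idea on a single core and is not how the Pentium's functional redundancy checking is implemented. All names, including `MAX_ATTEMPTS` and the fault-injecting `flaky_multiply_u32`, are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_ATTEMPTS 3  /* assumed retry budget for this sketch */

typedef uint64_t (*compute_fn)(uint32_t, uint32_t);

/* Runs fn twice per attempt and accepts the result only if both runs agree.
   Calling through a volatile pointer keeps the compiler from merging the
   two calls into one. Returns false if no attempt produced agreement. */
bool redundant_compute(compute_fn fn, uint32_t a, uint32_t b, uint64_t *out)
{
    volatile compute_fn call = fn;
    for (int attempt = 0; attempt < MAX_ATTEMPTS; ++attempt) {
        uint64_t first = call(a, b);
        uint64_t second = call(a, b);
        if (first == second) {
            *out = first;
            return true;
        }
    }
    return false;
}

uint64_t multiply_u32(uint32_t a, uint32_t b)
{
    return (uint64_t)a * b;
}

/* Hypothetical fault injector for demonstration: off by one on its
   first call only, correct afterwards. */
uint64_t flaky_multiply_u32(uint32_t a, uint32_t b)
{
    static int calls = 0;
    uint64_t product = (uint64_t)a * b;
    return (calls++ == 0) ? product - 1 : product;
}
```

With `flaky_multiply_u32` and the operands 8 and 7, the first attempt sees 55 and 56, rejects them, and the second attempt agrees on 56. This catches transient faults only; a persistent fault in the multiplier would give the same wrong answer twice, which is why real lockstep systems use two separate processors.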

>>>Please take a look at a web-link: http://www.xbitlabs.com/articles/cpu/display/elbrus-e2k.html>>>

I answered one of your old posts, where you wrote about the anniversary of the SIMD architecture, and I found documents in which it was clearly stated that the Pentium and the SIMD vector floating-point unit architecture are based on the Soviet Elbrus CPU architecture. It simply amazes me how advanced the former Soviet research in the field of superscalar processors was.

>>>Cosmic particles, heat, and similar reasons can cause a large variety of possible errors in a complex system like a computer.>>>

Even simple clock frequency jitter, which manifests itself as a minuscule phase shift, can change the value of the bitstream.
It is interesting how, for example, floating-point calculations and memory reads/writes could be affected by such erratic behaviour, and what the probability of bus errors induced by stochastic noise is.

Quote:

iliyapolak wrote:

>>>Even simple clock frequency jitter, which manifests itself as a minuscule phase shift, can change the value of the bitstream.>>>

Where can I find information about CPU and/or data bus errors induced by random noise?

I found a small article about ECC on Wikipedia; please take a look: http://en.wikipedia.org/wiki/ECC_memory

It is not related to the original question ALU/CPU Error Detection but at least it provides some information on how memory errors could be corrected.
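
To show the principle behind correcting a memory error, here is a toy sketch of a Hamming(7,4) code, which protects 4 data bits with 3 parity bits and can correct any single flipped bit. It illustrates the general idea only. It is not the code used by any particular ECC DIMM (those typically protect 64-bit words with wider codes), and the function names are made up for this example.

```c
#include <stdint.h>

/* Codeword layout, positions 1..7 stored in bits 0..6:
   1=p1  2=p2  3=d1  4=p4  5=d2  6=d3  7=d4 */

static unsigned bit_at(uint8_t codeword, int position)
{
    return (codeword >> (position - 1)) & 1u;
}

/* Encodes the low 4 bits of `nibble` into a 7-bit codeword. */
uint8_t hamming74_encode(uint8_t nibble)
{
    unsigned d1 = nibble & 1u, d2 = (nibble >> 1) & 1u;
    unsigned d3 = (nibble >> 2) & 1u, d4 = (nibble >> 3) & 1u;
    unsigned p1 = d1 ^ d2 ^ d4;  /* parity over positions 1, 3, 5, 7 */
    unsigned p2 = d1 ^ d3 ^ d4;  /* parity over positions 2, 3, 6, 7 */
    unsigned p4 = d2 ^ d3 ^ d4;  /* parity over positions 4, 5, 6, 7 */
    return (uint8_t)(p1 | (p2 << 1) | (d1 << 2) | (p4 << 3) |
                     (d2 << 4) | (d3 << 5) | (d4 << 6));
}

/* Decodes a codeword, correcting at most one flipped bit. If
   error_position is not NULL it receives the corrected position (1..7),
   or 0 when the parity checks found nothing wrong. */
uint8_t hamming74_decode(uint8_t codeword, int *error_position)
{
    unsigned s1 = bit_at(codeword, 1) ^ bit_at(codeword, 3) ^
                  bit_at(codeword, 5) ^ bit_at(codeword, 7);
    unsigned s2 = bit_at(codeword, 2) ^ bit_at(codeword, 3) ^
                  bit_at(codeword, 6) ^ bit_at(codeword, 7);
    unsigned s4 = bit_at(codeword, 4) ^ bit_at(codeword, 5) ^
                  bit_at(codeword, 6) ^ bit_at(codeword, 7);
    int syndrome = (int)(s1 | (s2 << 1) | (s4 << 2));

    if (syndrome != 0)
        codeword ^= (uint8_t)(1u << (syndrome - 1));  /* flip the bad bit back */
    if (error_position)
        *error_position = syndrome;

    return (uint8_t)(bit_at(codeword, 3) | (bit_at(codeword, 5) << 1) |
                     (bit_at(codeword, 6) << 2) | (bit_at(codeword, 7) << 3));
}
```

Flipping any one bit of `hamming74_encode(n)` and passing the result to `hamming74_decode` returns `n` again and reports which position was repaired. Two flipped bits are miscorrected by this toy code, which is why practical schemes add an extra parity bit to at least detect double errors.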

Quote:

Sergey Kostrov wrote:

I found a small article about ECC on Wikipedia; please take a look: http://en.wikipedia.org/wiki/ECC_memory

It is not related to the original question ALU/CPU Error Detection but at least it provides some information on how memory errors could be corrected.

Thanks for the link. It partly answers my other question, which concerned the statistical rate of such errors.
