We are currently investigating a very interesting problem that manifests itself as seemingly random "blue screen of death" crashes after varying amounts of time (anywhere from 2 hours to 2 weeks). This has been seen on all the computers that we deliver (100+ so far).
Crash dumps have been analyzed both by us and by Microsoft, and all information points to a hardware problem. This conclusion is based on the fact that the CPU raises an illegal instruction exception on a perfectly legal instruction executed in the Windows kernel. The CPU in question has the following spec: Intel Celeron M, 600 MHz (ULV), S-spec SL7GE, product stepping B-1, signature 0x695. We have been able to reproduce the problem under Windows XP Embedded SP1 & SP2 and also under XP Professional SP1. The machine is an "industrial grade" PC.
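For anyone who wants to compare the signature against the identification table in the Specification Update, here is a minimal sketch that splits a CPUID signature into its fields. The bit layout is the standard CPUID leaf 1 EAX layout; the function name `decode_cpuid_signature` is made up for this post.

```python
def decode_cpuid_signature(sig: int) -> dict:
    """Split a CPUID leaf 1 EAX signature into family/model/stepping.

    Hypothetical helper for illustration; the bit positions follow the
    standard CPUID layout.
    """
    stepping = sig & 0xF            # bits 3:0
    model = (sig >> 4) & 0xF        # bits 7:4
    family = (sig >> 8) & 0xF       # bits 11:8
    cpu_type = (sig >> 12) & 0x3    # bits 13:12
    ext_model = (sig >> 16) & 0xF   # bits 19:16
    ext_family = (sig >> 20) & 0xFF # bits 27:20

    # The extended family is added only when family == 0xF; the extended
    # model is prepended when family is 0x6 or 0xF.
    if family == 0xF:
        family += ext_family
    if (sig >> 8) & 0xF in (0x6, 0xF):
        model |= ext_model << 4

    return {"family": family, "model": model,
            "stepping": stepping, "type": cpu_type}


print(decode_cpuid_signature(0x695))
# {'family': 6, 'model': 9, 'stepping': 5, 'type': 0}
```

So 0x695 decodes to family 6, model 9, stepping 5. This only identifies the part; it does not by itself show that a given erratum applies.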
A great deal of effort has been put into finding the root cause of this problem and into finding a workaround for it. During this investigation we have come up with the following two workarounds, either of which prevents the computers from crashing (~60 units have been running for about three weeks without a crash):
- Making the CPU work at 100 % load all the time by having a low priority thread do calculations.
- Switching off ACPI by changing to a non-ACPI Windows HAL.
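To illustrate the first workaround, here is a rough sketch of a low-priority "burner" thread. It is not our production code, only an assumption-laden illustration: the class name `IdleBurner` and the helper `_lower_current_thread_priority` are invented for this post, and the priority change uses the Win32 calls `GetCurrentThread` and `SetThreadPriority` through `ctypes` (it is skipped on other platforms). The idea is that a thread that is always runnable presumably keeps the OS idle loop, and therefore the HLT instruction, from being executed.

```python
import sys
import threading


def _lower_current_thread_priority() -> None:
    """Drop the calling thread to idle priority on Windows; no-op elsewhere."""
    if sys.platform != "win32":
        return
    import ctypes

    THREAD_PRIORITY_IDLE = -15  # Win32 constant
    kernel32 = ctypes.windll.kernel32
    kernel32.GetCurrentThread.restype = ctypes.c_void_p
    kernel32.SetThreadPriority.argtypes = [ctypes.c_void_p, ctypes.c_int]
    kernel32.SetThreadPriority(kernel32.GetCurrentThread(),
                               THREAD_PRIORITY_IDLE)


class IdleBurner:
    """Hypothetical helper: keeps the CPU busy from a low-priority thread."""

    def __init__(self) -> None:
        self._stop = threading.Event()
        self.iterations = 0
        self.thread = threading.Thread(target=self._run, daemon=True)

    def _run(self) -> None:
        _lower_current_thread_priority()
        x = 0
        # Meaningless arithmetic; the only goal is to stay runnable so the
        # scheduler always has something to run instead of idling.
        while not self._stop.is_set():
            x = (x * 1103515245 + 12345) & 0x7FFFFFFF
            self.iterations += 1

    def start(self) -> None:
        self.thread.start()

    def stop(self) -> None:
        self._stop.set()
        self.thread.join()
```

Calling `IdleBurner().start()` once at system startup would be enough for a single-core CPU like ours. Because the thread runs at idle priority, real work should still preempt it, but power consumption and heat go up since the CPU never rests.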
We then found the Specification Update for the Celeron M. Of particular interest is erratum W38:
[W38] Snoops during the Execution of a HLT (Halt) Instruction May Lead to Unexpected System Behavior
Problem: If during the execution of a HLT instruction an external snoop causes an eviction from the instruction fetch unit (IFU) instruction cache, the processor may, on exit from the HLT state, erroneously read stale data from the victim cache.
Implication: This erratum may lead to unexpected system behavior. Intel has only observed this condition in nonmobile configurations.
Workaround: It is possible for the BIOS to contain a workaround for this erratum.
In our opinion, the description above matches our problem really well. This naturally raises a few questions:
- Is there any way in which we can verify that we are indeed dealing with the W38 bug?
- How is it possible to work around this problem via a BIOS update? Microcode update? Has Intel provided BIOS manufacturers with such an update? The hardware vendor claims that there are no BIOS updates available and the current BIOS dates from 06/28/2006.
- Are there any other possible workarounds we can try out?
Or do you have any other suggestions for how we can continue to troubleshoot this issue? We would be most thankful for any feedback.