A few months ago, Intel® simulation engineers working on Wind River* Simics* reported a bug in the Xen* hypervisor to the public email list. The bug was that Xen did not do the right thing when enabling Intel® Memory Protection Extensions (Intel® MPX). Xen tried to use the feature inside virtual machines (VMs) even when it was not actually available to VMs, and failed as a result. It was a small bug that was quickly patched, but it offered an interesting insight into the dynamics of software testing and the importance of testing with a wide variety of target machines.
The bug as reported on the email list was:
If MPX is supported by CPUID, but MPX is not supported by VMX, XEN is failing on store CPU MSR GUEST_BNDCFGS (file xen-4.7.0/xen/arch/x86/hvm/vmx/vmx.c:798).
In other words, Xen assumed that if the host it was running on supported MPX, MPX would also be supported inside VMs. MPX support inside VMs is implemented using Intel® Virtualization Technology (Intel® VT) for IA-32, Intel® 64 and Intel® Architecture (VT-x), and is logically a separate function from MPX exposed to software running directly on the host. Most commonly, MPX support on the host and MPX support inside a VT-x-based VM appear together, but this is not true in all cases. The Intel Architecture (IA) Software Developer’s Manual (SDM) makes it clear that software needs to check specifically whether VT-x (VMX) supports MPX for virtual machines before using it. However, such a check is easy to forget and hard to test unless you actually run the software on a system that supports one but not the other.
The bug in Xen was easy to fix, since it was just a matter of adding an additional flag check.
Note that “VMX” is the name of the CPUID flag identifying support for VT-x, and is a common term used to refer to VT-x in software and discussions.
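As a rough illustration of the kind of two-part check the SDM asks for, the sketch below tests the host CPUID flag and the VMX controls separately. The bit positions are the ones documented in the SDM: CPUID.(EAX=07H,ECX=0):EBX bit 14 for MPX, bit 16 of the VM-entry controls ("load IA32_BNDCFGS"), and bit 23 of the VM-exit controls ("clear IA32_BNDCFGS"). The function and constant names are made up for this example and this is not Xen's code. The raw CPUID and VMX capability MSR values are passed in as plain integers, since reading them for real requires privileged code.

```python
# Illustrative only: names are hypothetical, bit positions follow the Intel SDM.
CPUID_7_0_EBX_MPX = 1 << 14        # CPUID.(EAX=07H,ECX=0):EBX[14]
VM_ENTRY_LOAD_BNDCFGS = 1 << 16    # VM-entry control "load IA32_BNDCFGS"
VM_EXIT_CLEAR_BNDCFGS = 1 << 23    # VM-exit control "clear IA32_BNDCFGS"


def allowed_1_settings(vmx_ctls_msr: int) -> int:
    """Upper 32 bits of a VMX capability MSR: controls that may be set to 1."""
    return (vmx_ctls_msr >> 32) & 0xFFFFFFFF


def host_has_mpx(cpuid_7_0_ebx: int) -> bool:
    """MPX as seen by software running directly on the processor."""
    return bool(cpuid_7_0_ebx & CPUID_7_0_EBX_MPX)


def guest_can_use_mpx(cpuid_7_0_ebx: int,
                      vmx_exit_ctls_msr: int,
                      vmx_entry_ctls_msr: int) -> bool:
    """MPX inside a VM needs the CPUID flag AND both VMX controls."""
    exit_ok = allowed_1_settings(vmx_exit_ctls_msr) & VM_EXIT_CLEAR_BNDCFGS
    entry_ok = allowed_1_settings(vmx_entry_ctls_msr) & VM_ENTRY_LOAD_BNDCFGS
    return host_has_mpx(cpuid_7_0_ebx) and bool(exit_ok) and bool(entry_ok)
```

A target that reports MPX in CPUID but leaves both VMX control bits clear makes `host_has_mpx` return true while `guest_can_use_mpx` returns false, which is the combination described in the bug report.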
The Test Setup
The bug was found by running the Xen VM on top of Simics, in a software stack with quite a few layers of virtualization. The picture below shows the layers:
The Simics target machine, the virtual platform, can implement any Intel instruction set extension in any way that is consistent with the SDM specification. It does not depend on what the host machine implements: the virtual platform is entirely independent of the host, and can simulate new instruction sets on top of older hardware, as well as older hardware on top of new hardware. The presence of VT-x or MPX on the host is irrelevant: no matter what the host supports, the Simics target will have the precise support we specify for it.
In this case, the target implements a variant where MPX is available to software running directly on the processor, while virtual machines running under VT-x do not have access to MPX. This particular configuration exposed the bug in Xen that was reported by our team.
The Lesson: Variation Matters
The key lesson that this bug illustrates is that software needs to be tested on many different configurations in order to be truly robust. Testing can only reveal issues that manifest themselves on the available hardware and software configurations used for testing. I have talked to many people over the years about this particular pitfall of system and software testing, and sometimes it seems like a near miracle that software works at all on any machine beyond the one it was built on…
For example, when I was a PhD student in the late 1990s, I learned this in a very concrete way: we had a program that we used for our research which ran on Linux, Microsoft* Windows*, and Sun* (now Oracle*) Solaris* (on SPARC*!). It was very common to find bugs that manifested themselves on one system but not another. In particular, uninitialized variables caused crashes much more often on one operating system. I think it was Linux that nicely zeroed memory you were given by the OS, while Windows just left it as it was… leading to rather different behaviors, as one might imagine. We also had to be very careful about integers, mixing 32-bit little-endian and 64-bit big-endian hosts as we did.
Another old example is a bug from 2008, where Windows would sometimes not use all processor cores in a machine if the number of cores in a socket was not a power of 2. It is easy to see how something like that would happen – the first wave of multicore processors all had 2 or 4 cores, and all testing would show things worked well. However, once the first processor with 3 cores came along, the bug was revealed.
Getting More Variation
It is clear that testing software on a wide variety of machines is necessary in order to make it truly robust. More variety, and more machines that differ from the developer’s own, make for software with fewer issues. Techniques like continuous integration and continuous testing should therefore strive to use a wide variety of targets for software testing. A homogeneous test farm is convenient in some ways, but will also tend to miss issues. In particular, issues related to hardware drivers and the hardware-software interface need varied test setups and unusual configurations to be comprehensively exercised.
One way of getting variation is to use simulation, which is exactly how the Xen MPX bug was found. In general, using simulation technology like Simics, we can create a much wider variety of configurations than even a physical lab full of boards could achieve. An attractive aspect of simulation is that you can create setups that do not exist in hardware at all, or that exist only in hardware that is impossible to procure. Using a simulator is also a more stable and affordable solution than procuring and managing a lab full of hardware variants and boards.
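To make the point about variation concrete, here is a small sketch that sweeps every combination of the two capabilities discussed above and reports the targets on which a check gives the wrong answer. It is a toy model under stated assumptions, not Simics or Xen code: `check_under_test` is a hypothetical stand-in for logic that trusts the CPUID flag alone, and `reference_check` stands in for the rule that both capabilities must be present.

```python
from itertools import product


def check_under_test(mpx_in_cpuid: bool, mpx_in_vmx: bool) -> bool:
    """Hypothetical stand-in for the pre-fix logic: trusts CPUID alone."""
    return mpx_in_cpuid


def reference_check(mpx_in_cpuid: bool, mpx_in_vmx: bool) -> bool:
    """MPX may be used inside a VM only if both capabilities are present."""
    return mpx_in_cpuid and mpx_in_vmx


def failing_configs(check):
    """Return every (mpx_in_cpuid, mpx_in_vmx) target where check is wrong."""
    return [
        cfg
        for cfg in product([False, True], repeat=2)
        if check(*cfg) != reference_check(*cfg)
    ]


print(failing_configs(check_under_test))  # prints [(True, False)]
```

Only one of the four targets exposes the difference, and it is the one least likely to be sitting in a test lab: MPX on the processor, but not inside VT-x. A test farm built from machines where the two features always appear together would pass every time.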