In my previous blog post “Question: Does Software Actually Use New Instruction Sets?” I looked at the kinds of instructions used by a few different Linux* setups, and how each setup was affected by changing the type of processor it was running on. As a follow-up to that post, I have now done the same for Microsoft* Windows* 10. In this post, I look at how Windows 10 behaves across processor generations, and how its behavior compares to Ubuntu* 16. I also went back to Ubuntu to look at instruction usage in different usage scenarios.
Just like before, I took our “generic PC” platform in the Wind River® Simics® virtual platform tool, and ran it with two different processor models. One model was for an Intel® Core™ i7 first-generation processor (codenamed “Nehalem”), and the other model was for an Intel Core i7 sixth-generation processor (codenamed “Skylake”).
Using these different models, I booted a Windows 10 image (build 1511, to be precise). Just like before, I ran through the first 60 seconds of the boot, ending up at an idle desktop.
Here is a screenshot from my laptop booting a couple of Windows 10 targets to gather the statistics:
During the boot, I used the Simics tooling environment to collect statistics on the types of instructions seen. I counted the instruction usage per mnemonic, just like in the previous post. In addition, I did some runs where I looked at the instruction stream based on other criteria, as discussed below.
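The per-mnemonic counting described above can be sketched in a few lines of Python. This is only an illustration of the bookkeeping, not the Simics instrumentation itself: the function name `mnemonic_shares`, the `min_share` cutoff parameter, and the tiny `trace` list are all hypothetical, and a real run would feed in the stream of mnemonics reported by the simulator.

```python
from collections import Counter


def mnemonic_shares(mnemonics, min_share=0.0):
    """Return {mnemonic: fraction of all executed instructions}.

    Entries are ordered from most to least common. Mnemonics whose
    share is below min_share are dropped from the result.
    """
    counts = Counter(mnemonics)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        name: count / total
        for name, count in counts.most_common()
        if count / total >= min_share
    }


# Hypothetical trace of 100 executed instructions (made-up numbers).
trace = ["MOV"] * 50 + ["CMP"] * 20 + ["JNZ"] * 15 + ["ADD"] * 14 + ["CMC"] * 1

# Keep only mnemonics that make up at least 5% of this toy trace.
shares = mnemonic_shares(trace, min_share=0.05)
for name, share in shares.items():
    print(f"{name:6s} {share:6.1%}")
```

Counting dynamic (executed) instructions rather than static ones means that a short loop executed millions of times dominates the result, which is the point of this kind of measurement.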
First, I looked at all instructions that occur as one percent or more of the dynamic instructions during the boot. The results are shown in the graph below:
What we see in the Windows data is very similar to what we saw for Ubuntu 16 in the previous blog post. There are some small changes between the processor generations, but mostly the same code is run regardless of the processor.
This is not all that unexpected for broadly used general-purpose operating system distributions like Ubuntu and Windows. The most dramatic difference seen in the previous blog post between the processor generations was for the Yocto* Linux build, which makes sense. Yocto lets you build a Linux for yourself, and its defaults can be more aggressive in terms of including code to use new instruction sets, since you don’t generally have to support a wide user base. For Ubuntu and Windows 10 with their broad user bases and common goal to work reliably for a very large number of users, having too many differences between hardware generations would make testing and quality control harder. It makes sense not to aggressively optimize for a single system, unlike what you can do when you roll your own Linux.
Anyway, if we look at the instructions used, the most common instructions are moves, compares, jumps, and basic arithmetic. This is very similar to what we saw on Linux, but the precise instructions used do differ a bit.
When changing operating system as we did here, the compiler used to build the code changes (from gcc on Linux* to Microsoft* compilers on Windows), along with conventions around how function calls and operating system calls are done. All of this impacts the instruction selection process in the compiler, and as a result the actual instructions used can be rather different between the workloads. Indeed, we even see some instructions that are uniquely used in just one workload. We have some examples of this in our current dataset.
Windows 10 never uses LEAVE or ENTER instructions. As discussed in my previous blog post, the old Linux 2.6 Busybox* setup did use LEAVE rather extensively, but more recent Linux distributions did not use it. Since Windows 10 is a rather recent software stack, it makes sense that it no longer uses even the LEAVE instruction.
There are some instructions that Windows 10 uses, but that none of the Linux setups did. The most significant one is the MOVNTI instruction from the Intel® Streaming SIMD Extensions 2 (SSE2) instruction set. It makes up more than three percent of the instructions on Windows! In addition, outside of the most common instructions shown above, Windows 10 uses several unique vector instructions that none of the Linux variants used: PADDW, PSRLW, PMOVZXBW, PUNPCKHQDQ, PSUBW, and PMADDWD. Given the richness of the vector instruction sets, this is not really that surprising.
The CMC (complement carry flag) instruction from the original 8086 instruction set is also used by Windows but not Linux. The same holds for BSR (bit scan reverse) from the 80386. They are not exactly common (measured at less than 0.01%), but it is still interesting to see that they are never used in the Linux boots.
Note that this is just about the operating system boot process; the picture is likely to be different for application software. Indeed, I did some other experiments with Linux that showed that rather starkly, as discussed below.
Vector instructions are not used all that much during the boot. The difference between the v1 and v6 processors is not particularly big for Windows – it was much more pronounced on Linux. However, it gets more interesting when Windows 10 is compared to Ubuntu 16:
Overall, Windows and Ubuntu use the same proportion of vector instructions during the boot (roughly 5%). These vector instructions are distributed rather differently, though. Windows uses more SSE2 instructions, while Ubuntu uses more MMX instructions. Windows also does not change the instructions used quite as much between generations, with a barely perceptible use of Intel® Advanced Vector Extensions (AVX) on the v6 processor.
This investigation of instruction mnemonics is really just a simple example of what you can observe in a software run using instrumentation in a virtual platform. It is rather informative, but there is a lot more that can be observed and counted. As a Simics user, you can program tools to collect pretty much anything you want to (as long as it is part of the virtual platform, of course).
As an example, here is the distribution of instruction sizes during the Windows 10 boot on the v6 processor:
This works out to an average size of about 3.73 bytes per executed instruction. Note that this does not really say anything about the size of the code. It is rather an indication of the pressure that the code puts on the cache system and processor decoders. Intel® Architecture (IA) is a classic variable-length instruction set, which is clearly seen here with instruction lengths varying all the way from 1 byte to 14 bytes. It is worth noting that really long instructions are also really rare.
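The average quoted above is a weighted mean over the instruction-length histogram. Here is a minimal sketch of that arithmetic, assuming the histogram is available as a mapping from length in bytes to the number of executed instructions of that length. The function name and the sample numbers are made up for illustration and are not the measured Windows 10 data.

```python
def average_instruction_length(length_counts):
    """Weighted mean of instruction length in bytes.

    length_counts maps an instruction length (bytes) to the number of
    executed instructions that had that length.
    """
    total_instructions = sum(length_counts.values())
    if total_instructions == 0:
        raise ValueError("histogram contains no instructions")
    total_bytes = sum(length * count for length, count in length_counts.items())
    return total_bytes / total_instructions


# Hypothetical histogram: 100 executed instructions, 325 bytes in total.
histogram = {1: 10, 2: 20, 3: 30, 4: 25, 5: 10, 7: 5}
average = average_instruction_length(histogram)
print(f"{average:.2f} bytes per executed instruction")
```

Because the weights are execution counts, rare long instructions barely move the average even when they are many bytes long.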
Another way to slice the instructions is to look at the operand types along with the instruction opcodes. This is a more fine-grained split than the mnemonics used above. For example, the 20 most common variants of the MOV instruction are the following:
And this is far from all of them: there are many other addressing modes in use. It is a classic long-tail distribution: the most common modes make up by far the majority of all MOV operations, while the more complex modes are used often enough to still matter.
Note that we are seeing moves of all sizes here: just because this is a 64-bit Windows operating system running on a 64-bit processor does not mean that all operations are actually 64 bits in size. Byte (8-bit), word (16-bit), and double-word (32-bit) operations are also being used. In fact, 32-bit operations are as common as 64-bit ones.
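One way to put a number on the long-tail shape of the MOV variants is to ask how much of the total the top few variants cover. The sketch below assumes the per-variant counts are available as a dictionary. The variant labels and counts are invented for the example and do not reproduce the measured data or any particular tool's output format.

```python
def top_k_coverage(variant_counts, k):
    """Fraction of all occurrences covered by the k most common variants."""
    ordered = sorted(variant_counts.values(), reverse=True)
    total = sum(ordered)
    if total == 0:
        return 0.0
    return sum(ordered[:k]) / total


# Hypothetical counts per MOV variant (labels and numbers are made up).
mov_variants = {
    "MOV r64, r64": 400,
    "MOV r32, r32": 300,
    "MOV r64, [mem]": 200,
    "MOV [mem], r8": 60,
    "MOV r16, [mem]": 40,
}

for k in (1, 2, 3):
    print(f"top {k}: {top_k_coverage(mov_variants, k):.0%}")
```

In a long-tail distribution the coverage climbs quickly for small k and then flattens, so the remaining variants each contribute little but together still add up.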
When I discussed these measurements with one of my colleagues, the question came up of how much the use of vector instructions in general, and AVX instructions in particular, depends on the workload. An operating-system boot is not likely to use them for more than a little crypto and possibly some highly optimized memory copy operations. But he had seen other behaviors when using a system interactively. Thus, I did one more experiment, where I took the v6 processor with Ubuntu and ran some interactive software after the boot: essentially, opening a terminal and starting a new Firefox process.
As can be seen from the diagram, the desktop activity makes extensive use of AVX instructions, including even the rather new AVX2 and FMA3 instructions. Vector instructions actually comprise more than 12% of all instructions executed, and remember that this includes all instructions in the whole machine, not just the user-level code or the code in the graphics subsystem.
This was a second blog post with graphs and numbers detailing the different types of instructions being executed in a number of different workloads across different processor types. It is an interesting set of data for a computer architecture nerd like me. However, the most interesting thing is how the numbers were collected: using Simics and its instrumentation capabilities. Simics can simulate pretty much any system, and allows for non-intrusive inspection and debugging. Collecting instruction statistics such as I did here offers useful insights for processor designers, software engineers, researchers, and students.
A short plug here: Simics is available for free to universities, and it is a very versatile tool for subjects including computer architecture, operating systems, networking, embedded systems, simulation, virtual platforms, and low-level programming.