How we get improved performance on a single core - Part 3

This post was originally published on blogs.rapidmind.com. RapidMind was acquired by Intel Corporation in August of 2009, and the RapidMind Multi-core Platform will merge with Intel Ct technology. Before joining Intel as part of the acquisition, Stefanus was a co-founder of RapidMind.

This is the third and last post in a series about how we can achieve improved performance over regular C++ code even when running on a single core. I’ve talked about our programming model and runtime program generation previously. In this post, I’ll discuss our runtime code generation mechanism.

As mentioned in my last post, RapidMind generates machine code at runtime. This is similar to just-in-time compilation, but happens at very specific (and controllable) points in an application’s lifetime – typically during application initialization. The responsibility of generating machine code for a specific hardware target belongs to RapidMind’s backends, each of which includes code generation support for the targets it handles. For example, the OpenGL backend for GPUs generates OpenGL shading language programs corresponding to a user’s computations, while the x86 and Cell backends generate machine code for those architectures using a custom code generation stack that includes a backend optimizer, scheduler, register allocator, and so on.
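To make the backend idea concrete, here is a minimal sketch of the general pattern: a target-independent description of a computation handed to a per-target code generator at initialization time. The types and names below (Computation, CompiledKernel, Backend, X86Backend) are purely illustrative and are not RapidMind’s actual API.

```cpp
// Hypothetical sketch only -- not RapidMind's actual API. It illustrates the
// idea of per-target backends that turn a captured, target-independent
// computation into native code once, at initialization time.
#include <memory>
#include <string>
#include <vector>

struct Computation {            // a recorded, target-independent program
    std::string ir;             // some intermediate representation
};

struct CompiledKernel {         // target-specific executable code
    std::vector<unsigned char> machine_code;
};

class Backend {                 // one per hardware target (x86, Cell, GPU, ...)
public:
    virtual ~Backend() = default;
    virtual CompiledKernel generate(const Computation& c) = 0;
};

class X86Backend : public Backend {
public:
    CompiledKernel generate(const Computation& c) override {
        CompiledKernel k;
        // run the backend optimizer, scheduler, and register allocator,
        // then emit x86 machine code for this particular computation
        (void)c;
        return k;
    }
};

// During application initialization: pick a backend for the hardware we are
// actually running on and generate code once, before the compute-heavy phase.
std::unique_ptr<Backend> make_backend_for_this_machine() {
    return std::make_unique<X86Backend>();   // e.g. an x86 CPU was detected
}
```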

The x86 backend is particularly interesting. It targets x86-based CPUs such as those from AMD and Intel. The x86 instruction set is very old, and a modern x86 processor is less a direct implementation of it than a translation layer from x86 instructions to an underlying, architecture-specific instruction set (the processor’s microarchitecture). As a result, one x86 processor is not like another: microarchitectural differences between vendors, and even between processor generations from the same vendor, are vast. This means that a single x86 binary compiled with one particular microarchitecture in mind may not perform optimally on another. Even though modern CPUs have features like out-of-order scheduling that help “generic” code execute well, we’ve found there is often still plenty of performance to be gained by performing microarchitecture-specific optimizations.

This problem (or opportunity?) is further compounded by extensions to the x86 instruction set. Starting with extensions like MMX and 3DNow!, processor vendors have been providing new instructions that are only implemented in newer hardware. These instruction set extensions are generally aimed at accelerating particular types of computation, e.g. by providing vector operations that compute multiple instances of the same operation at once. Today the prevalent family of extensions is the “SSE” (Streaming SIMD Extensions) family. Many different SSE extensions are implemented in hardware shipping today: SSE, SSE2, SSE3, SSSE3, SSE4a, SSE4.1, and SSE4.2. By 2010 processors will be shipping with support for SSE5 (from AMD) and AVX (a new instruction set extension from Intel). These extensions provide a lot of opportunity for improving performance, but code generated to target a particular extension will not run on older processors that do not support it. Traditional software development thus has to either use some lowest common denominator (e.g. SSE2, which has been supported by most processors shipping over the last 4 years or so) or provide many different binaries of the same code and pick one at runtime. These compatibility issues have really hampered adoption of these extensions.
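For readers less familiar with the “many binaries, pick one at runtime” approach, here is a small GCC/Clang-specific sketch of it. The kernel variants here are placeholders with identical bodies; in a real build each would live in its own translation unit compiled with different `-m` flags.

```cpp
// A minimal sketch of ahead-of-time variant dispatch (GCC/Clang builtins).
// The three kernels are stand-ins; in practice each would be compiled with
// -msse2, -msse4.1, or -mavx respectively.
#include <cstddef>

static void sum_sse2(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];   // baseline path
}
static void sum_sse41(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];   // would use SSE4.1
}
static void sum_avx(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];   // would use AVX
}

using SumFn = void (*)(const float*, const float*, float*, std::size_t);

SumFn pick_sum_kernel() {
    __builtin_cpu_init();                          // populate CPU feature flags
    if (__builtin_cpu_supports("avx"))    return sum_avx;
    if (__builtin_cpu_supports("sse4.1")) return sum_sse41;
    return sum_sse2;                               // lowest common denominator
}
```

This works, but every variant has to be written (or compiled) ahead of time, which is exactly the maintenance burden described above.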

Both of these issues - microarchitectural differences and new instruction sets - are addressed by our backend and code generation design. Since our platform generates code for performance-critical pieces of applications at runtime, we can check exactly which CPU we are running on and generate code optimized for that CPU. Even though we only require SSE2 support to run, we will generate code that makes use of other SSE extensions if they’re available. We schedule instructions very differently on AMD processors than we do on Intel processors, because of differences in how these processors execute code. Taking advantage of the specific microarchitecture we’re running on can yield anywhere from a 10% improvement to a doubling in performance!
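The detection side of this can be illustrated with standard x86 CPUID queries. The sketch below (GCC/Clang on x86, using `<cpuid.h>`) shows the kind of information a runtime code generator can gather before emitting code; the `TuningParams` struct is a hypothetical stand-in for whatever knobs the generator actually exposes.

```cpp
// A hedged sketch of querying the CPU we are actually running on and feeding
// the result into code generation decisions. TuningParams is illustrative only.
#include <cpuid.h>
#include <cstring>
#include <string>

struct TuningParams {          // hypothetical knobs for a code generator
    bool use_sse41;            // may we emit SSE4.1 instructions?
    bool schedule_for_amd;     // e.g. use a different instruction scheduling model
};

TuningParams detect_tuning() {
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    char vendor[13] = {};
    if (__get_cpuid(0, &eax, &ebx, &ecx, &edx)) {
        std::memcpy(vendor + 0, &ebx, 4);          // vendor string is EBX, EDX, ECX
        std::memcpy(vendor + 4, &edx, 4);
        std::memcpy(vendor + 8, &ecx, 4);
    }
    TuningParams t{};
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        t.use_sse41 = (ecx & (1u << 19)) != 0;     // CPUID leaf 1, ECX bit 19 = SSE4.1
    }
    t.schedule_for_amd = (std::string(vendor) == "AuthenticAMD");
    return t;
}
```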

We use the same mechanism to optimize generated code based on other factors known only at runtime, such as the alignment of arrays in memory, or values that are constant for the life of the application but unknown until it is actually running (e.g. constants read from a data file during application initialization). Unlike a traditional just-in-time (JIT) compiler, our code generation happens at very specific, predictable, and controllable times. We never interpret code, and we don’t have to profile at runtime to find hot spots. Portions of an application not expressed with RapidMind, such as UI or data-handling code, are not touched by this mechanism at all. The overhead of this runtime work is therefore minimal, and it pays for itself almost immediately.
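As a rough illustration of specializing on a value learned at startup (this is not RapidMind code, and `make_scale_kernel` is a made-up name), a captured constant stands in for what a runtime code generator would bake directly into the emitted machine code:

```cpp
// Illustrative only: specialize a kernel on a value first known at runtime.
#include <cstddef>
#include <functional>

std::function<void(float*, std::size_t)> make_scale_kernel(float scale) {
    // A real runtime code generator would emit `scale` as an immediate in the
    // machine code, and could likewise pick aligned vs. unaligned loads based
    // on the arrays it will actually see; a captured constant stands in here.
    return [scale](float* p, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) p[i] *= scale;
    };
}
```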

This concludes my set of articles on why RapidMind can often deliver not only improved scalability across cores, but also better per-core performance than code expressed without RapidMind. I hope you found it interesting! As always, feel free to leave comments if you have any further questions, and I’ll do my best to answer them.