by George Walsh
There's really no denying that application optimization yields performance benefits. The question in each case is whether the resulting performance gains are worth the development time, effort and cost. As part of his work with the Intel® software team, Eric Palmer works closely with ISVs, helping to boost the performance of their applications. Palmer has developed a four-step approach. It works for him, and it will work for your applications too.
Use Compiler Optimizations
How well your source code is turned into machine code is an obvious place to start when trying to reduce bottlenecks. Here, Palmer begins with a very simple technique before attempting anything fancy: he recompiles the code using the latest version of Intel's optimizing compiler. "We first go for the lowest hanging fruit and recompiling usually falls into that category," Palmer says. "Typically, we start by using the same compiler switches application developers would start with, and use the Intel compiler to see whether there's a speed difference." If a simple recompile on the Intel compiler leads to a speed increase, Palmer first recommends that the ISV switch compilers.
The next step is to start adding switches to the compiler's command line to see how they affect the compiled app's performance – starting with /G7, a switch that optimizes for the Intel NetBurst® microarchitecture.
"The code that you generate with that switch will work on any Intel-compatible processor clear back to a 386, but it's tuned for the Pentium® 4 processor," Palmer says. Next, he adds on /QxW, which generates code that will only run on the Pentium 4. "If that turns out to be beneficial, but the ISV wants to generate code that works on all platforms, we would switch that from /QxW to /QaxW." Adding the "a" increases the code size because it includes a version for the Pentium 4, along with different versions for various Pentium processors. If the compiler determines there's a benefit to making a machine-specific version of a function, it will generate one, and the application will automatically choose which version to use at runtime, based on the processor it is running on.
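The runtime selection described above can be pictured as a function pointer that is chosen once, on first use. The following is only a hand-written sketch of the idea, not what the Intel compiler actually emits. The names scale_generic, scale_tuned, cpu_has_sse2 and select_scale are hypothetical, and the processor check relies on a GCC/Clang builtin as a stand-in for whatever detection the compiler's own runtime performs.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical generic version: safe on any processor.
static void scale_generic(float* data, std::size_t n, float k) {
    for (std::size_t i = 0; i < n; ++i) data[i] *= k;
}

// Hypothetical stand-in for a processor-specific version. A real one
// would be compiled with processor-specific instructions; this one is
// merely unrolled by four so the sketch stays portable.
static void scale_tuned(float* data, std::size_t n, float k) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        data[i] *= k;
        data[i + 1] *= k;
        data[i + 2] *= k;
        data[i + 3] *= k;
    }
    for (; i < n; ++i) data[i] *= k;  // leftover elements
}

// Assumption: feature detection through a GCC/Clang builtin on x86.
// On other compilers or processors this reports "no".
static bool cpu_has_sse2() {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("sse2") != 0;
#else
    return false;
#endif
}

using ScaleFn = void (*)(float*, std::size_t, float);

// Pick the best available version for the processor we are running on.
static ScaleFn select_scale() {
    return cpu_has_sse2() ? scale_tuned : scale_generic;
}

// Public entry point: the choice is made once, on the first call.
void scale(float* data, std::size_t n, float k) {
    static const ScaleFn impl = select_scale();
    assert(data != nullptr || n == 0);
    impl(data, n, k);
}
```

Callers only ever see scale(); which version runs is decided on the machine where the binary executes, which is why the extra versions cost code size but not portability.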
Other compiler switches Palmer uses include /O3, which enables high-level optimizations; /Qipo, for inter-procedural optimization across all the files in the application; and a two-step "profile guided optimization" switch that optimizes all of the branches the application executes. The use of these switches is based on both experience and trial and error. "If the code speed is the same or faster after adding a switch, I'll typically leave the switch there. If it's slower, I'll take the switch away. Sometimes, if I see the same performance but the switch increases the size of the binaries, I might take it away to try to keep the binary small."
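Palmer's keep-or-drop rule for each switch is easy to mechanize. The sketch below is one possible reading of it, not a tool Palmer describes. The names time_seconds and judge_switch are hypothetical, and the 2% noise band used to decide what counts as "the same" speed is an assumption that does not come from the article.

```cpp
#include <cassert>
#include <chrono>

// Time a single run of a workload, in seconds.
template <typename F>
double time_seconds(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}

enum class Verdict { Keep, Drop, DropForSize };

// Compare a trial build (with the new switch) against the baseline.
// `noise` is an assumed tolerance: timings within +/- noise of the
// baseline are treated as "the same" speed.
Verdict judge_switch(double base_seconds, double trial_seconds,
                     long base_bytes, long trial_bytes,
                     double noise = 0.02) {
    assert(base_seconds > 0.0 && noise >= 0.0);
    const double change = (trial_seconds - base_seconds) / base_seconds;
    if (change > noise) return Verdict::Drop;          // slower: remove it
    if (change >= -noise && trial_bytes > base_bytes)  // same speed, bigger
        return Verdict::DropForSize;
    return Verdict::Keep;                              // same or faster
}
```

In practice the two timings would come from running the same representative workload against each build several times; a single run is rarely stable enough to separate a small gain from noise.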
Locate Hotspots with VTune™ Analyzer
Palmer's next weapon in the attack on slow code is the Intel® VTune™ Performance Analyzer. Here, the first step is to use its sampling feature to find those areas of the code that take the most time. "The application may have one function that eats up 50% of the application's execution time, and that's an obvious place to start looking for areas to optimize," Palmer says.
Unfortunately, finding a single function that slows down the entire application can turn out to be a best-case scenario.
Sometimes, an application has many functions with no one taking more than 5% of the total runtime. "To speed it up you'd have to go through each of these little functions and get a small speed increase each time," Palmer says. "We call that a 'flat' VTune analyzer profile. That's when you hope that the compiler will make a difference. If there's nothing that really lends itself well to manual optimization, the compiler can still make a difference by touching every part of the program at once." He also adds that the VTune analyzer can be used even when the application wasn't built using the Intel compiler.
When using the VTune analyzer to locate areas that are slowing an application, it's important to use a debug build of the app or to generate the debug info so that it can correlate locations in the binary with locations in the source code. If you use a release build without debug information, VTune analyzer looks at machine code, which isn't very convenient when you need to know where to tweak the source. Once the least efficient areas of an application are located, Palmer then attempts to speed them up. "I proceed in descending order of the number of clock ticks the functions take. I look at the one at the top of the list and go into it and see if it looks like something that I can optimize using things like Streaming SIMD Extensions 2 (SSE2) or SSE instructions."
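The triage Palmer describes, ranking functions by clock ticks and checking whether the profile is "flat", can be sketched in a few lines. This is an illustration only and does not use any VTune analyzer API. The Sample type and the function names are hypothetical, and the 5% threshold is taken from the flat-profile description above.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// One row of a sampling profile: a function and its clock ticks.
struct Sample {
    std::string function;
    unsigned long long ticks;
};

// Return the samples sorted in descending order of clock ticks,
// so the most expensive function comes first.
std::vector<Sample> rank_hotspots(std::vector<Sample> samples) {
    std::sort(samples.begin(), samples.end(),
              [](const Sample& a, const Sample& b) { return a.ticks > b.ticks; });
    return samples;
}

// Total ticks across the whole profile.
unsigned long long total_ticks(const std::vector<Sample>& samples) {
    unsigned long long total = 0;
    for (const Sample& s : samples) total += s.ticks;
    return total;
}

// A profile is "flat" when no single function exceeds the threshold
// share of total runtime (5% by default).
bool is_flat_profile(const std::vector<Sample>& samples,
                     double threshold_percent = 5.0) {
    const unsigned long long total = total_ticks(samples);
    assert(total > 0);
    for (const Sample& s : samples) {
        const double share = 100.0 * static_cast<double>(s.ticks) /
                             static_cast<double>(total);
        if (share > threshold_percent) return false;
    }
    return true;
}
```

With a profile that is not flat, the first entries of the ranked list are the candidates for hand tuning; with a flat one, compiler switches are the more promising lever.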
Tune by Hand
It's at this point that Palmer proceeds to manual optimizations, which are accomplished in three ways. The first is to use C++ vector classes, the second is to use compiler intrinsics, and the third is to write assembly code. Obviously, these approaches are listed in increasing order of difficulty. Having the appropriate documentation on hand can be valuable when using vector classes and intrinsics. Both are documented in the appendices of the IA-32 Intel® Architecture Software Developer's Manual, Volume 2 (intrinsics, by the way, are supported by the Intel® compiler as well as Microsoft* Visual C++* 6 with Processor Pack 5). See Intel's IA-32 manuals.
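To give a feel for the middle option, here is a minimal sketch of compiler intrinsics, assuming a function that adds two float arrays. The function name add_arrays and the HAVE_SSE macro are made up for illustration. The intrinsics themselves (_mm_loadu_ps, _mm_add_ps, _mm_storeu_ps from xmmintrin.h) are standard SSE intrinsics, and a plain loop is used when SSE is not available at compile time.

```cpp
#include <cassert>
#include <cstddef>

// Use SSE intrinsics only when the compiler targets a processor with SSE.
#if defined(__SSE__) || defined(_M_X64) || \
    (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_SSE 1
#else
#define HAVE_SSE 0
#endif

// out[i] = a[i] + b[i] for i in [0, n).
void add_arrays(const float* a, const float* b, float* out, std::size_t n) {
    assert((a != nullptr && b != nullptr && out != nullptr) || n == 0);
    std::size_t i = 0;
#if HAVE_SSE
    // Four floats per iteration; unaligned loads and stores keep the
    // sketch free of alignment requirements.
    for (; i + 4 <= n; i += 4) {
        const __m128 va = _mm_loadu_ps(a + i);
        const __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
#endif
    // Scalar loop for the remainder (or for everything without SSE).
    for (; i < n; ++i) out[i] = a[i] + b[i];
}
```

Each intrinsic maps closely to a single instruction, which is what places this approach between vector classes and hand-written assembly in difficulty.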
Manual optimizations like these require a fair amount of expertise in recognizing what approach should be taken to speed up the code. Here too, VTune™ Performance Analyzer can help. VTune 6.1 has a feature called the Intel® Tuning Assistant. After VTune performs its event sampling, the results can be fed into the Tuning Assistant, which parses them and creates an HTML document detailing the problems it has found.
You might learn that the application spends half of its time in a certain function because of L2 cache misses. Or you may discover that a large amount of speed is lost due to branch mispredictions. A tool like this is useful because otherwise you'd need to know the type of event to sample for. "The Tuning Assistant is like a knowledge base," Palmer says. "It knows how to interpret the results that otherwise you'd have to be an expert to understand." If you use the Intel Tuning Assistant, remember you must run it on a single-processor system. If you have a dual-processor system, be sure to reboot it with only one processor enabled.
Catch the Cache
It's a fact that it's impossible to know how much it will cost you in development time to achieve a given amount of speed. It's also a fact that finding problems after an application ships is often much more costly than finding them during the development process. Palmer offers up one particularly sobering example of how optimization techniques can save your bacon. "In the case of one ISV I went to visit, VTune™ Performance Analyzer revealed that L2 cache misses were eating up around 70% of their total execution time," he says. "Without the right tools, I would have had this long list of steps to go through and would have had to check a whole bunch of things individually. I might have missed something." It turns out that the ISV had to change their core algorithm to address the cache misses, but by addressing that one problem, they were able to speed up the whole application by a factor of four.
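The article does not say what the ISV's algorithm change was, so the following is only a generic illustration of how access order affects cache behavior, not a reconstruction of that fix. Both hypothetical functions sum the same matrix, stored row by row in one contiguous block; the first jumps a full row ahead on every access, while the second walks memory sequentially.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Matrix of rows x cols ints stored row by row in one contiguous vector.

// Cache-unfriendly: the inner loop strides `cols` elements per step,
// so on a large matrix nearly every access touches a new cache line.
long long sum_by_columns(const std::vector<int>& m,
                         std::size_t rows, std::size_t cols) {
    assert(m.size() == rows * cols);
    long long total = 0;
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            total += m[r * cols + c];
    return total;
}

// Cache-friendly: the inner loop reads consecutive elements, so each
// cache line fetched from memory is fully used before moving on.
long long sum_by_rows(const std::vector<int>& m,
                      std::size_t rows, std::size_t cols) {
    assert(m.size() == rows * cols);
    long long total = 0;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            total += m[r * cols + c];
    return total;
}
```

The two functions return the same result; any speed difference comes purely from how well the access pattern matches the memory layout, which is the kind of problem a cache-miss report points toward.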
Palmer concludes that there's no way to predict how much of a speed boost you'll get for every hour devoted to optimization. "It depends on the type of application and how well the application was written to begin with," he says. "If it's an image processing application, using SSE2 you can process 16 pixels at a time, instead of processing one at a time. That could speed up your code by anywhere from 3X up to 8X." While an 8X speed increase may not be the norm, the combination of any additional speed along with the quality assurance benefits offered by optimization makes it a valuable part of the development process. As the old adage says, "A stitch in time saves nine."
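Palmer's "16 pixels at a time" figure follows from SSE2's 128-bit registers, each of which holds sixteen 8-bit values. The sketch below assumes 8-bit grayscale pixels and a simple brightening operation; the function name brighten and the HAVE_SSE2 macro are hypothetical. The intrinsics are standard SSE2 ones from emmintrin.h, and a scalar loop handles leftover pixels or builds without SSE2.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Use SSE2 intrinsics only when the compiler targets a processor with SSE2.
#if defined(__SSE2__) || defined(_M_X64) || \
    (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define HAVE_SSE2 1
#else
#define HAVE_SSE2 0
#endif

// Add `amount` to every 8-bit pixel, clamping at 255 instead of wrapping.
void brighten(std::uint8_t* pixels, std::size_t n, std::uint8_t amount) {
    assert(pixels != nullptr || n == 0);
    std::size_t i = 0;
#if HAVE_SSE2
    // Sixteen copies of `amount` in one 128-bit register.
    const __m128i delta = _mm_set1_epi8(static_cast<char>(amount));
    for (; i + 16 <= n; i += 16) {
        __m128i* p = reinterpret_cast<__m128i*>(pixels + i);
        const __m128i v = _mm_loadu_si128(p);
        // Unsigned saturating add on all sixteen pixels at once.
        _mm_storeu_si128(p, _mm_adds_epu8(v, delta));
    }
#endif
    // Scalar loop for the remainder (or for everything without SSE2).
    for (; i < n; ++i) {
        const unsigned sum = static_cast<unsigned>(pixels[i]) + amount;
        pixels[i] = static_cast<std::uint8_t>(sum > 255u ? 255u : sum);
    }
}
```

Sixteen pixels per instruction does not translate into a 16X speedup, since loads, stores and the rest of the program still take time, which is consistent with the more modest range Palmer quotes.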
About the Author
George Walsh is a veteran tech editor and writer with experience in fields ranging from embedded systems programming to CAD. As a freelance researcher and writer he has provided his expertise to more than 30 clients in a wide variety of markets.