Improving the performance of parallel software requires a structured approach that makes good use of development resources and delivers good results quickly. This paper breaks down such an approach into three distinct stages:
Together, these three stages help software development organizations optimize their software efficiently, evolving it to take advantage of increasing numbers of processor cores.
The most important thing to keep in mind when optimizing an application is to create a systematic approach and to stay with it. That means adhering to the scientific method, as well as planning what you intend to do and then carrying through with your plan.
Starting Off on the Right Foot
A few general precepts will be valuable to you from the moment you begin your first conversations in the planning process until you decide your application has reached its ultimate goal:
Identifying the Steps
With these rules firmly in hand, it is time to create a coherent methodology that allows you to decide what changes to make, enact those changes, and measure their effect on the performance of the application. The first thing to understand here is that this is an iterative, closed-loop cycle, as shown in Figure 1. Each sequence of steps (one iteration around the circle) designs, implements, and verifies one (and only one) change to the application code:
Figure 1. The closed-loop cycle of performance optimization
Creating a methodology for your performance tuning project requires that you take this generic sequence of steps and determine how it should be manifested in your specific case. Taking the time to do so before the tuning phase of the project actually begins will give you a firm foundation that will prevent missteps later on.
What Not to Do
Before we leave this section, here are a few pitfalls to avoid:
The Software Optimization Cookbook, Second Edition is an updated classic that puts four Intel performance engineers at your disposal to help build a winning performance methodology.
The performance methodology planned in the previous section needs one more key ingredient before it can be put to use: a workload. The workload, in fact, lies at the heart of the tuning process, and choosing the right workload is vital to the success of your project. Its purpose is to give the application a set amount of work to do, so that the effects of performance tuning can be accurately measured during the tuning process.
Workloads Are Generally Home-Grown Affairs
When it comes to workloads, one size definitely does not fit all. Many development organizations ask Intel performance engineers to send them a workload that they can use in their tuning process. Unfortunately, this is typically not the appropriate course of action, because workloads are very application-specific. They have to be, in order to adequately exercise the software under test (see the characteristics of an effective workload in the following section).
One caveat is that industry-standard benchmarks are special workloads that may in some cases be appropriate to use in a performance tuning implementation. Benchmarks are typically developed by a consortium or other governing body. They are generally built to provide reproducible results for a specific, fairly generic usage, such as measuring the performance of an application server or database, so that competing hardware or software vendors can compare performance.
Workloads can range from the fairly simple to the very complex. For example, a workload to test the performance of a file-compression utility might be as simple as a set of files generated using popular office software. In such a case, the files should include different types and sizes, and the number of files should be sufficiently large to make the operating time long enough to easily reveal differences in successive test runs.
In other cases, a more complex workload might be a set of transactions to be applied to a business intelligence reporting engine, for example, which could involve research into the various data types and sources that should be included. Because some report types might require data mining activities, the variables associated with multiple data sources, as well as network transmission, are factors that would have to be accommodated in the workload as well as the corresponding methodological approach.
Four Characteristics of Effective Workloads
Appropriate workloads must have the following characteristics:
What Not to Do
Now that we have discussed the primary characteristics of a well-wrought workload, it is valuable to consider the opposite, in order to round out that understanding. Following are some examples of common pitfalls that development organizations may fall prey to:
The Server Room blog's Server Performance Tuning Habit #5: Know Your Workload demonstrates by analogy and example how to create and get the most out of a performance-tuning workload.
Once you have created a workload that meets the general criteria specified above, tailored to the specific needs of the application being optimized, it is time to establish the test environment in which you will perform the actual tuning.
Hardware Selection and Test-Bed Establishment
As mentioned before, a dedicated test machine is ideal for test purposes, since that practice minimizes the likelihood that uncontrolled changes will confound the test results. It is also better to select a current platform than an old machine that may be out of use precisely because it has been replaced by a newer model. Performance tuning is a high-value activity and should be regarded as such, including, where necessary and possible, the purchase of a machine that closely emulates a typical target system in use by the software product's customers.
Once the test system has been chosen, a test fixture must be created in many cases to simulate the users or external systems that would interact with the test system under real-world conditions. The design complexity of the test fixture can be significant, depending on the requirements of the system, and it is necessary for the test fixture to be stable and efficient enough not to artificially impact test outcomes.
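For systems that serve concurrent users, even a minimal fixture can apply a fixed, repeatable load. The sketch below is an assumed illustration, not a prescribed design (the function names, user count, and request count are invented for the example): each thread plays one simulated user issuing requests back-to-back, and the fixture collects per-request latencies for later analysis.

```python
import threading
import time

def simulated_user(do_request, requests, latencies):
    """One fixture thread: issue requests back-to-back, recording the
    latency of each. (list.append is atomic in CPython, which keeps
    this sketch simple enough not to perturb the measurement.)"""
    for _ in range(requests):
        start = time.perf_counter()
        do_request()
        latencies.append(time.perf_counter() - start)

def run_fixture(do_request, users=8, requests_per_user=100):
    """Drive the system under test with a fixed, repeatable load and
    return the collected per-request latencies."""
    latencies = []
    threads = [
        threading.Thread(
            target=simulated_user,
            args=(do_request, requests_per_user, latencies),
        )
        for _ in range(users)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return latencies
```

Keeping the fixture this lightweight helps satisfy the stability requirement above: the fixture itself must not become the bottleneck that distorts the results.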
Considering Test Automation Tools and Processes
The cyclical nature of the test process suggests that many repetitive tasks are involved, and in fact, there is a positive correlation between the number of iterations and the quality of the end result. In plain terms, that means software testing can become tedious, which both makes it unpleasant and allows errors to creep into the process. Automating test procedures where it makes sense is a worthwhile approach to improve the efficiency and accuracy of results, as well as to provide a smart way out of tedium.
Various automation tools such as AutoIt*, AutoMate*, and QuickMacros* can play a significant role in this effort. A great number of them are available, many free of charge. Your selection of tools in this area should allow maximum integration with the other processes and tools you already employ. While these tools are typically built with such integration in mind, custom scripting using approaches such as Perl* or Windows* PowerShell is another option that can provide additional flexibility.
Another worthwhile addition to your process is the free Intel Concurrency Checker, which helps to verify that application threads are running concurrently. You can use it to measure performance by running the application before and after you make specific code enhancements and comparing the measured results. The tool integrates into the automated test processes described here by running in a command-line/script mode, or alternately, you can choose to run it from a graphical user interface. One advantage of Intel Concurrency Checker is that it can be used without access to an application's source code, which makes it useful, for example, in testing the concurrency of another company's proprietary software that may impact the overall system.
Accommodating the Need to Test System Settings as Well as Code Changes
Note that, notwithstanding the earlier advice not to change any hardware components inadvertently once you have established a test system, performance tuning does sometimes involve intentional changes to the test system, such as a BIOS or server configuration setting. In such cases, the change made to the test system is the sole change for that test iteration (with no accompanying code changes), so this practice still honors the precept outlined above of changing only one factor in the test environment at a time.
Entrepreneur Network's Automation Software page rounds up a list of utilities that can be used to automate repetitive tasks like those involved in tuning.
Developing a tuning methodology and establishing a workload to take advantage of it are hard, exacting work, but they are the necessary foundations for improving software performance on parallel hardware. This advance work helps to ensure that your efforts will bear good results, with the eventual goal of releasing the best software possible and achieving the best competitive position for your product.
Once the tasks outlined in this paper are complete, you are ready to locate bottlenecks in your code. Turning those problems into opportunities is what performance tuning is all about.
Tell us about your efforts in performance tuning parallel applications, including what has worked and what hasn't, and connect with industry peers as well as Intel experts to help resolve any outstanding issues you have (or help someone else out):
Community Forum: Threading on Intel® Parallel Architectures
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804