Improving the performance of parallel software requires a structured approach that makes good use of development resources and obtains good results quickly. This paper breaks down such an approach into three distinct stages:
- Stage One: Establishing a Tuning Methodology. Best practices help you plan ahead and stay with the plan.
- Stage Two: Creating a Suitable Workload. A consistent amount of work for the application to do before and after tuning lets you measure progress.
- Stage Three: Building the Test Environment. A proper test environment accurately emulates production within your empirical process.
Together, these three stages help software development organizations optimize their software efficiently, evolving it to take advantage of increasing numbers of processor cores.
Stage One: Establishing a Tuning Methodology
The most important thing to keep in mind when optimizing an application is to create a systematic approach and to stay with it. That means adhering to the scientific method, as well as planning what you intend to do and then carrying through with your plan.
Starting Off on the Right Foot
A few general precepts will be valuable to you from the moment you begin your first conversations in the planning process until you decide your application has reached its ultimate goal:
- Establish goals and a coherent methodology to reach them. Optimization is hard, potentially costly work, and you need to know what you are trying to accomplish so you know when you've been successful. Example goals might be to reach a specific processor utilization on each core or to enhance the threading model to enable a new feature to operate in real time during execution.
- Identify the effect of every change you make along the way. Having a suitable workload (discussed below) allows you to measure results as you go, so that you know exactly what happens as a result of every change as you make it.
- Keep copious notes. Write down everything you do to the software and why. The notes you take during the present project are the basis for a knowledge base (even if it's an informal one) that can make your next project more successful. Those notes should also include findings such as shortcomings in workload development and how you overcame them.
Identifying the Steps
With these rules firmly in hand, it is time to create a coherent methodology that allows you to decide what changes to make, enact those changes, and measure their effect on the performance of the application. The first thing to understand here is that this is an iterative, closed-loop cycle, as shown in Figure 1. Each sequence of steps (one iteration around the circle) designs, implements, and verifies one (and only one) change to the application code:
- Gather Performance Data. The first step of each iteration is to apply the workload and measure performance using the appropriate metric (workloads and metrics are discussed later in this paper). Intel® Concurrency Checker, also discussed below, can help identify thread behavior in this step.
- Analyze Data and Identify Issues. Next, you must identify opportunities for improvement, often using tools such as Intel® VTune Performance Analyzer to see where time is being spent or Intel® Thread Profiler to discover threading inefficiencies.
- Generate Alternatives to Resolve an Issue. Most issues can be addressed in more than one way. This step lists the candidate fixes and selects the one that will be tried in the current iteration.
- Implement Enhancement. Once you have decided what you plan to change and taken note of it in your project record, make the appropriate change to the code.
- Test Results. Collect additional data, compare it to the baseline measurement, and take the time to understand the results, backtracking if necessary.
Figure 1. The closed-loop cycle of performance optimization
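The cycle in Figure 1 can be sketched in a few lines of Python. This is a minimal illustration rather than a prescribed implementation: the `Change` record, its `apply` and `revert` callables, and the `measure` function are hypothetical names introduced here, and the sketch assumes a metric where higher is better.

```python
from typing import Callable, List, NamedTuple, Tuple


class Change(NamedTuple):
    """One candidate change; all three fields are hypothetical placeholders."""
    name: str
    apply: Callable[[], None]
    revert: Callable[[], None]


def tuning_loop(measure: Callable[[], float],
                candidates: List[Change]) -> Tuple[float, List[str]]:
    """Try one change per iteration and keep it only if the metric improves.

    `measure` is assumed to run the workload and return a higher-is-better
    metric (for example, transactions per second).
    """
    baseline = measure()                 # gather performance data
    kept = []
    for change in candidates:            # exactly one change per iteration
        change.apply()                   # implement enhancement
        result = measure()               # test results against the baseline
        if result > baseline:
            baseline = result            # new measurement becomes the baseline
            kept.append(change.name)
        else:
            change.revert()              # backtrack if the change did not help
    return baseline, kept
```

In practice the analysis and selection steps are done by people with profiling tools, so a loop like this mainly serves to enforce the one-change-per-iteration discipline and to keep a running baseline.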
Creating a methodology for your performance tuning project requires that you take this generic sequence of steps and determine how it applies to your specific case. Doing so before the tuning phase begins gives you a firm foundation and helps prevent missteps later on.
What Not to Do
Before we leave this section, here are a few pitfalls to avoid:
- Don't neglect the full breadth of end-user systems. The platforms you test your application on should represent the end-user environment. Optimizing for only older systems on one hand or bleeding-edge ones on the other can be a poor approximation of what your customers can expect, so engineer for the future but deliver appropriate performance on legacy machines as well.
- Don't change more than one thing at a time. Particularly in optimizations for parallel platforms, the cause and effect of changes you make may not be what you expect. If you make more than one change between measurements, their effects may counteract one another, or if the results are negative, you won't know which change is responsible.
- Don't mess with the hardware. If at all possible, dedicate a computer system as the test bed, and don't make any changes to it during the tuning process. Even installing or updating other software can interfere with your test data. Another option is to use a system image or restore point to ensure that you are testing on the same system configuration each time.
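One inexpensive way to notice that the test bed has drifted between runs is to record a fingerprint of its configuration alongside each measurement. The sketch below is an assumption-laden illustration: the fields collected and the `extra` parameter (intended for site-specific items such as a BIOS version or installed package list) are hypothetical choices, not a complete inventory.

```python
import hashlib
import json
import platform


def environment_snapshot(extra=None):
    """Collect a few facts about the test bed; `extra` holds site-specific items."""
    snapshot = {
        "machine": platform.machine(),
        "system": platform.system(),
        "release": platform.release(),
        "python": platform.python_version(),
    }
    snapshot.update(extra or {})
    return snapshot


def fingerprint(snapshot):
    """Hash the snapshot so two runs can be compared with a single string."""
    canonical = json.dumps(snapshot, sort_keys=True)   # key order must not matter
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

If the fingerprint stored with the baseline differs from the one computed before a later run, something on the system changed and the two measurements may not be comparable.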
The Software Optimization Cookbook, Second Edition is an updated classic that puts four Intel performance engineers at your disposal to help build a winning performance methodology.
Stage Two: Creating a Suitable Workload
The performance methodology planned in the previous section needs one more key ingredient before it can be put to use: a workload. The workload, in fact, lies at the heart of the tuning process, and choosing the right workload is vital to the success of your project. Its purpose is to give the application a set amount of work to do, so that the effects of performance tuning can be accurately measured during the tuning process.
Workloads Are Generally Home-Grown Affairs
When it comes to workloads, one size definitely does not fit all. Many development organizations ask Intel performance engineers to send them a workload that they can use in their tuning process. Unfortunately, this is typically not the appropriate course of action, because workloads are very application-specific. They have to be, in order to adequately exercise the software under test (see the characteristics of an effective workload in the following section).
One exception is industry-standard benchmarks, which are special workloads that may in some cases be appropriate to use in a performance tuning implementation. Benchmarks are typically developed by a consortium or other governing body. They are generally built to provide reproducible results for a specific, fairly generic usage, such as measuring the performance of an application server or database, so that competing hardware or software vendors can compare performance.
Workloads can range from the fairly simple to the very complex. For example, a workload to test the performance of a file-compression utility might be as simple as a set of files generated using popular office software. In such a case, the files should include different types and sizes, and the number of files should be sufficiently large to make the operating time long enough to easily reveal differences in successive test runs.
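A simple compression workload of this kind might look like the following sketch. It is an illustration under stated assumptions: synthetic in-memory data stands in for real office documents, Python's `zlib` module stands in for the utility under test, and the function names, seed, and sizes are hypothetical.

```python
import random
import time
import zlib


def build_workload(seed=1234, sizes=(1_000, 20_000, 200_000)):
    """Create in-memory 'files' of several sizes and two content types."""
    rng = random.Random(seed)            # fixed seed keeps the workload repeatable
    files = []
    for size in sizes:
        text = bytes(rng.choice(b"abcdefgh \n") for _ in range(size))
        noise = bytes(rng.getrandbits(8) for _ in range(size))
        files.append(text)               # compressible, text-like data
        files.append(noise)              # nearly incompressible data
    return files


def run_workload(files, level=6):
    """Compress every file; return elapsed seconds and total output bytes."""
    start = time.perf_counter()
    total = sum(len(zlib.compress(data, level)) for data in files)
    return time.perf_counter() - start, total
```

Building the data once and reusing it for every run keeps the amount of work fixed, and the returned byte count offers a quick check that a code change has not altered the output.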
A more complex workload might be a set of transactions applied to a business intelligence reporting engine, which could require research into the various data types and sources that should be included. Because some report types might require data mining activities, the variables associated with multiple data sources and with network transmission would have to be accommodated in both the workload and the corresponding methodology.
Four Characteristics of Effective Workloads
Appropriate workloads must have the following characteristics:
- Measurable. There must be a consistent, reliable means of measuring the application's performance while running the workload. The metric used will vary by the type of software application under test; for example, a game might use frames per second, an e-commerce app might use transactions per second, and a network filter might use packets per second.
- Repeatable. Given the same set of circumstances, the workload should produce results that are as close as possible to identical from run to run. Small variations may occur due to uncontrollable effects such as cache state and operating system background tasks, but they must be small enough not to mask the effect of your changes. Turning off applications like firewalls and virus checkers, as well as increasing the size of the workload or the length of the test run, can help minimize extraneous effects.
- Static. The measurements associated with a workload must not vary with time. An example of a case where this issue would interfere with the test process is when the workload performs heavy file I/O, gradually filling the disk so that file read and write operations gradually take longer to perform, decreasing the performance of the test independently of changes made as part of the tuning methodology.
- Representative. The work being performed must be typical of the stress put on the system under normal operating conditions. It should exercise as much of the code base as possible while also emulating a normal usage scenario, as opposed to focusing, for example, on a specific part of the application that the team has decided is of interest beforehand.
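The repeatable and static characteristics can be checked numerically before tuning begins by running the unchanged workload several times. The sketch below is one possible approach, not a prescribed one: the function names are hypothetical, and the coefficient of variation and least-squares slope are swapped in here as convenient summary statistics.

```python
import statistics


def coefficient_of_variation(samples):
    """Run-to-run spread relative to the mean; smaller means more repeatable."""
    return statistics.stdev(samples) / statistics.mean(samples)


def drift_per_run(samples):
    """Least-squares slope of the metric against run number.

    A slope near zero suggests the workload is static; a steady trend
    (for example, run times creeping upward as a disk fills) does not.
    """
    n = len(samples)
    mean_x = (n - 1) / 2
    mean_y = statistics.mean(samples)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator
```

What counts as an acceptable spread or drift depends on the size of the improvements you hope to detect, so any threshold is a project-specific judgment.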
What Not to Do
Now that we have discussed the primary characteristics of a well-wrought workload, it is valuable to consider the opposite, in order to round out that understanding. Following are some examples of common pitfalls that development organizations may fall prey to:
- Don't choose too small a chunk of work. If the amount of work undertaken during the test is too small, changes in test results may not stand out. For example, if a workload runs in ten seconds in the initial test and 9.5 seconds in the next round, the difference would be difficult to detect using a stopwatch, even though it represents a five percent performance improvement. Note that in some cases, this issue can be solved simply by measuring how many times a workload can run in a given period, rather than how long it takes to run the workload once.
- Don't focus on just a subset of real-world data types. If a workload does not include the full range of possibilities, it may not accurately represent changes to performance that are most detectable using other data types. For example, if a business intelligence application has a bottleneck importing a certain data type, resolving that issue requires that data type to be considered in the design of the workload.
- Don't spend too much time trying to create the perfect workload. There is no perfect workload for all purposes, and the goal should be to create one that quantifies performance with good code coverage and follows the guidelines given above. Establish a timeline for creating the workload in advance and stay within the time constraints you have set for yourself.
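The arithmetic behind the first pitfall, and the alternative of counting runs in a fixed period, can be sketched as follows. The function names and the injectable `clock` parameter are hypothetical conveniences introduced for illustration.

```python
import time


def percent_improvement(before_seconds, after_seconds):
    """Reduction in run time, as a percentage of the original time."""
    return (before_seconds - after_seconds) / before_seconds * 100.0


def runs_in_period(workload, period_seconds, clock=time.perf_counter):
    """Count how many workload runs finish within a fixed period."""
    deadline = clock() + period_seconds
    completed = 0
    while True:
        workload()
        if clock() > deadline:           # this run overran the period
            break
        completed += 1
    return completed
```

For the example in the text, `percent_improvement(10.0, 9.5)` gives approximately 5. Counting completed runs over a longer period turns a half-second difference per run into a difference of several whole runs, which is much easier to observe.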
The Server Room Blog post "Server Performance Tuning Habit #5: Know Your Workload" demonstrates by analogy and example how to create and get the most out of a performance-tuning workload.
Stage Three: Building the Test Environment
Once you have created a workload that meets the general criteria specified above, tailored to the specific needs of the application being optimized, it is time to establish the test environment in which you will perform the actual tuning.
Hardware Selection and Test-Bed Establishment
As mentioned before, a dedicated test machine is ideal for test purposes, since that practice minimizes the likelihood that uncontrolled changes will confound the test results. It is also appropriate to select a current platform rather than an old machine that is available only because it has been replaced by a newer model. Performance tuning is a high-value activity and should be regarded as such. If necessary and possible, that includes purchasing a machine that closely emulates a typical target system in use by the software product's customers.
Once the test system has been chosen, a test fixture must be created in many cases to simulate the users or external systems that would interact with the test system under real-world conditions. The design complexity of the test fixture can be significant, depending on the requirements of the system, and it is necessary for the test fixture to be stable and efficient enough not to artificially impact test outcomes.
Considering Test Automation Tools and Processes
The cyclical nature of the test process suggests that many repetitive tasks are involved, and in fact, there is a positive correlation between the number of iterations and the quality of the end result. In plain terms, that means that software testing can become tedious, with the dual results that it can be unpleasant and that errors can creep into the process. Automating test procedures where it makes sense is a worthwhile approach to improve the efficiency and accuracy of results, as well as to provide a smart way out of tedium.
Automation tools such as AutoIt*, AutoMate*, and QuickMacros* can play a significant role in this effort. A great number of such tools exist, many of them free of charge. Select tools that integrate as fully as possible with the other processes and tools you already employ. While these tools are typically built with such integration in mind, custom scripting using languages such as Perl* or Windows* PowerShell is another option that can provide additional flexibility.
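As an illustration of the custom-scripting option, the following sketch times repeated runs of a command and appends the results to a CSV file. It uses Python rather than the scripting languages named above, and everything specific in it is an assumption: the function names, the CSV layout, and `EXAMPLE_COMMAND`, which is a trivial stand-in for the application under test.

```python
import csv
import subprocess
import sys
import time

# Hypothetical stand-in for the command that runs your application and workload.
EXAMPLE_COMMAND = [sys.executable, "-c", "pass"]


def time_command(command, repetitions=5):
    """Run `command` several times and return the wall-clock time of each run."""
    timings = []
    for _ in range(repetitions):
        start = time.perf_counter()
        subprocess.run(command, check=True, stdout=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return timings


def append_results(path, label, timings):
    """Append one row per run so results from every iteration accumulate."""
    with open(path, "a", newline="") as handle:
        writer = csv.writer(handle)
        for run_number, seconds in enumerate(timings, start=1):
            writer.writerow([label, run_number, f"{seconds:.6f}"])
```

A typical use would be `append_results("results.csv", "baseline", time_command(EXAMPLE_COMMAND))`, repeated with a new label after each change, so that the file doubles as part of the project record.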
Another worthwhile addition to your process is the free Intel Concurrency Checker, which helps to verify that application threads are running concurrently. You can use it to measure performance by running the application before and after you make specific code enhancements and comparing the measured results. The tool integrates into the automated test processes described here by running in a command-line/script mode, or, alternatively, you can run it from a graphical user interface. One advantage of Intel Concurrency Checker is that it can be used without access to an application's source code, which makes it useful, for example, in testing the concurrency of another company's proprietary software that may impact the overall system.
Accommodating the Need to Test System Settings as Well as Code Changes
The earlier advice was to leave the test system unchanged once it has been established. Performance tuning does, however, sometimes involve changing the test system intentionally, for example by adjusting a BIOS or server configuration setting. In such cases, the change to the test system is the sole change made for that test iteration, with no accompanying code changes, so the practice still meets the precept of changing only one factor in the test environment at a time.
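If each test iteration records its code revision and system settings together, the one-factor rule can be checked mechanically. The sketch below assumes those settings are kept as a simple dictionary with string keys. The function names and the example keys are hypothetical.

```python
def changed_factors(before, after):
    """Return the sorted names of settings whose values differ between two runs."""
    keys = set(before) | set(after)
    return sorted(k for k in keys if before.get(k) != after.get(k))


def assert_single_change(before, after):
    """Raise if an iteration altered anything other than exactly one factor."""
    changed = changed_factors(before, after)
    if len(changed) != 1:
        raise ValueError(f"expected exactly one change, found {changed}")
    return changed[0]
```

For example, comparing `{"code_rev": "a1", "bios_prefetch": "on"}` with `{"code_rev": "a1", "bios_prefetch": "off"}` reports `bios_prefetch` as the single changed factor, while changing both entries at once raises an error.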
Entrepreneur Network's Automation Software page rounds up a list of utilities that can be used to automate repetitive tasks like those involved in tuning.
Understanding the requirements around developing a tuning methodology and establishing a workload to take advantage of it are hard, exacting work, but they are the necessary foundations for improving software quality on parallel hardware. This advance work helps to ensure that your efforts will bear good results, with the eventual goal of releasing the best software possible and achieving the best competitive position for your product.
Once the tasks outlined in this paper are complete, you are ready to locate bottlenecks in your code. Turning those problems into opportunities is what performance tuning is all about.