by James Rose
Sr. Application Engineer, Intel Corporation
Core Software Division, Mobility Engineering
The introduction of the Intel® Pentium® M processor as part of Intel® Centrino® mobile technology adds powerful new capabilities to mobile platforms in terms of performance and great battery life. If you are a software engineer interested in ensuring the best performance and power consumption on Pentium M processor based systems, one of the most important tools to help you optimize your application is the Intel VTune™ Performance Analyzer. In this paper, I describe how you can use the VTune Analyzer to optimize your mobile-aware application for performance and great battery life. This includes an overview of new support specific to the Pentium M processor, new events, event groups and ratios, automatic generation of tuning advice with the tuning assistant, battery and other power optimization support. I also describe some of the ways you can use VTune to characterize Pentium M specific performance and power issues.
Of course, there are several steps that you can take while engineering your application to increase your chances of excellent performance before you begin optimizing it with VTune. For tips on how to get the most performance out of your application, please refer to the document Optimizing Software for Intel® Centrino® Mobile Technology and Intel NetBurst™ Microarchitecture.
New Events for the Pentium M Processor
Several important events were added to the Pentium M processor that help characterize performance metrics in addition to those provided on previous Pentium III processors. These new events allow measurement of the following performance metrics:
- System power state transitions (voltage, frequency, all), also known as Intel SpeedStep® Technology transitions, and the thermal trip event
- Many events relating to branch prediction (conditional and unconditional, direct and indirect, calls, returns)
- Micro-ops fusion effectiveness
- Partial stalls (cycles and events)
- Hardware data prefetcher and software prefetch instructions
Using these events together with existing events from the Pentium III family of processors allows significant characterization and tuning for Pentium M processors in VTune 7.1, for both performance issues and power consumption issues.
Automatic Generation of Tuning Advice with the Pentium M Processor Tuning Assistant
An important new capability added to VTune 7.1 is automatic generation of tuning advice with the VTune Tuning Assistant. There are three levels of advice offered in the wizard, depending on your desired level of characterization detail and the number of workload runs. These three levels are:
- Application level tuning – CPI only
- Application and basic microarchitectural tuning
- Advanced application and microarchitectural tuning with several workload runs
For many applications, best results are achieved by selecting “Many workload runs with more advice (application and microarchitecture-level tuning).” This level ensures a thorough baseline characterization of your application and identifies the most common potential performance issues for the Pentium M processor. These may include performance-limiting issues such as excessive partial stalls, excessive resource-related stalls, excessive branch misprediction, and memory performance issues, among others. Depending on these results, more investigation may be needed to drill down and identify the specific causes of performance problems.
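The first advice level above reports CPI (cycles per instruction) only. As a rough illustration of what that metric expresses, the sketch below divides clocktick counts by instructions-retired counts for two functions. The function names and all counts are hypothetical values chosen for illustration; they are not VTune output, and VTune computes this ratio for you.

```python
def cycles_per_instruction(clockticks, instructions_retired):
    """Return CPI: clock cycles divided by instructions retired."""
    if instructions_retired <= 0:
        raise ValueError("instructions_retired must be positive")
    return clockticks / instructions_retired


# Hypothetical (clockticks, instructions retired) totals per function.
samples = {
    "decode_frame": (4_000_000, 1_000_000),
    "update_ui": (1_200_000, 1_000_000),
}

cpi_by_function = {
    name: cycles_per_instruction(clk, inst)
    for name, (clk, inst) in samples.items()
}

# The function with the highest CPI is a reasonable first place to look.
worst = max(cpi_by_function, key=cpi_by_function.get)

for name, cpi in sorted(cpi_by_function.items(), key=lambda kv: -kv[1]):
    print(f"{name}: CPI = {cpi:.2f}")
```

A high CPI only says that cycles are being lost somewhere; the microarchitecture-level advice is what narrows down where.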
Creating a new VTune Project with Automatic Tuning Advice
It’s relatively straightforward to create a new project using VTune 7.1. The VTune documentation is well written and you should be able to quickly find answers if you need help. For most processor-based software performance tuning, the Sampling Wizard is the best choice. After you have created a new project with the Sampling Wizard or the Quick Performance Analysis Wizard, the Sampling Configuration Wizard appears as shown in figure 1. As I mentioned earlier, a good recommended baseline for performance characterization is to select the Automatically generate tuning advice option with Many workload runs with more advice (application and microarchitecture-level tuning). This causes 12 different runs of your application, each time measuring different events for performance characterization.
Figure 1. Sampling Configuration Wizard pg. 1
Once you have selected options from the first page, you are directed to set up other important characteristics for sampling. In general, the default settings work well for applications that can be started and run with command line arguments, but other configuration settings may be required if your application needs GUI-based setup or manipulation to run properly. In many cases, a GUI-based application is sampled by specifying no application to launch and allowing VTune to cycle through event sampling at a fixed interval (typically 20 seconds). These kinds of sampling scenarios work best if the application workload is fairly consistent throughout the sampling run in terms of CPU load and data set. Please consult the VTune documentation to help determine the best way to sample your application.
If you selected automatic generation of Tuning Advice with many workload runs in the first dialog, the best choice in the Event Groups field is the group “Events for Tuning Assistant Advice,” as shown in figure 2.
Figure 2: Sampling Configuration Wizard for Windows*/Linux*
After the configuration is complete with these settings, your application is sampled and characterized for the most significant Pentium M processor related performance issues and a report containing specific tuning advice is generated.
Interpreting the Tuning Assistant Report
The tuning assistant report has been slightly restructured for version 7.1 and is now divided into 5 insight categories as shown in figure 3:
- Top Insights
- Workload Insights
- Module Insights
- Hotspot Insights
- System Info
Top Insights includes the most significant performance issues based on potential performance impact, regardless of module or application. Workload Insights include possible performance issues listed for all processes and modules. The Module insights section includes insights based on modules such as executables and libraries. Hotspot Insights includes insights on a per function basis sorted by percentage of CPU time regardless of module. System Info provides a summary of important system features such as processor speed, overall application runtime, operating system, etc. For a more detailed explanation of these insight categories, please refer to the VTune 7.1 documentation.
Figure 3: Tuning Assistant Report
For baseline performance characterization, the most important insights are included in the Top Insights section of the report, although other significant issues may appear in other sections as well. Focusing on the issues in the Top Insights section helps ensure that you have addressed the most important performance issues or limitations that may be present in your application. A sample insight window is shown in Figure 4. Where possible, each insight includes links to more details about the insight itself, such as the specific event counters that were used to calculate it and how relevant the issue is to the performance of your application. Some of these insights may include specific information to correct potential performance problems, while others may require more investigation, using additional event sampling, before the root cause of the performance issue is determined.
The tuning assistant and sampling results view in VTune also contain important information about the impact of performance issues. VTune computes three kinds of performance impact, each expressed as a percentage of execution, to quantify the effect of a few important performance issues. Performance impacts can help you quantify the maximum possible improvement from optimizations with respect to specific performance problems. For example, branch misprediction performance impact refers to the percentage of cycles that are lost to mispredicted branches. If your application had a branch misprediction performance impact of 10%, the best possible improvement attainable by removing all branch mispredictions would be 10%. Even though removing all mispredicted branches would be impossible, you still get an upper bound on potential performance improvements. VTune supports three main Pentium M processor performance impacts: partial stalls ratio, branch misprediction performance impact, and resource stalls ratio. Resource stalls ratio represents several kinds of potential performance stalls including memory renaming buffer, memory buffer, branch misprediction recovery, delay in retiring mispredicted branches, etc. It may be difficult to resolve some of these performance problems or even to identify which specific issue is causing the stall, but it may also be possible to resolve some of them and increase the performance of your application. Please refer to the VTune documentation and the Intel® 64 and IA-32 Architectures Software Developer's Manual for details on how to reduce the impact of these problems.
Figure 4: Tuning Assistant Insight View
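To make the upper-bound reasoning about performance impacts concrete, here is a minimal sketch of the arithmetic, assuming an impact is the percentage of total cycles lost to one issue. The function name, the cycle count, and the impact values are hypothetical; only the 10% branch misprediction figure mirrors the example in the text.

```python
def best_case_after_fix(total_cycles, impact_pct):
    """Return (remaining_cycles, speedup) if the whole impact were removed.

    impact_pct is the percentage of total cycles attributed to one issue.
    This is an upper bound: real optimizations recover only part of it.
    """
    if not 0 <= impact_pct < 100:
        raise ValueError("impact_pct must be in [0, 100)")
    remaining = total_cycles * (1 - impact_pct / 100)
    return remaining, total_cycles / remaining


# Hypothetical impacts (percent of cycles) for one workload.
impacts = {
    "branch misprediction": 10.0,
    "partial stalls": 4.0,
    "resource stalls": 15.0,
}

total_cycles = 1_000_000  # hypothetical clocktick total

# Rank the issues by how many cycles each could save at most.
for issue, pct in sorted(impacts.items(), key=lambda kv: -kv[1]):
    remaining, speedup = best_case_after_fix(total_cycles, pct)
    print(f"{issue}: at most {pct:.0f}% fewer cycles "
          f"({remaining:,.0f} left, {speedup:.2f}x)")
```

The sketch treats each impact independently. Because the stall categories can overlap, adding the percentages together would likely overstate the total gain.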
Using Event Groups and Ratios for More Detailed Performance Analysis
As I mentioned previously, it may be necessary to sample additional events to determine the root causes of some performance issues. Event Groups and Ratios are predefined groups of events that help you to further characterize identified problems or evaluate how effectively your application uses advanced features of the Pentium M processor such as the MMX, SSE and SSE2 SIMD instruction sets. Event Groups and Ratios help to ensure that you are including the most important events for detailed root causing of performance issues.
Some of the most important groups include:
- Events for Detailed Tuning Assistant Advice
- Events significant to Pentium-M processor development
- Power Management Events (SpeedStep Technology)
- Prefetch Events
Event Ratios can also provide more specific detail to diagnose real issues.
Some of the most useful ratio groups include:
- Primary Performance Tuning
- Memory Statistics
For example, if an application was experiencing an increased number of 1st or 2nd level cache misses, you could select the “events used to monitor memory and cache activities” group for more detailed analysis. See the VTune documentation for details on how to create or modify an activity to include new event groups or event ratios.
With mobile computing, power consumption and its effect on battery life is a very important issue. Small reductions in power usage can help in obtaining great battery life and allow the user more time to do work. One of the best ways to obtain great battery life is to make sure that code is optimized for performance and that tasks are finished as quickly and efficiently as possible. The Pentium M processor is designed with SpeedStep® technology, which allows it to run at reduced voltage and frequency when CPU utilization isn’t high. One of the key ways to reduce power consumption is to enable your application to run at lower power states. It is almost always beneficial for applications to finish processing at higher power states as quickly as possible so that the processor and system can enter a lower power state. There are a few groups of events and techniques you can use with the VTune Analyzer to understand the most important power usage characteristics of your application and to help ensure that it uses power more efficiently.
Using VTune to Help Characterize Power State Transitions
While it may be possible to optimize applications for performance so that they can finish faster, what about long-running or background applications that run for an extended time? Applications that run for extended amounts of time perform better by running at lower power states (using reduced frequency and voltage) when that is possible. With the new events added to the Pentium M processor to measure these transitions, coupled with a new feature added in VTune 7.1 called “sampling over time views,” you can better understand the SpeedStep frequency and voltage changes in the processor while running your application. Using this information, you may find situations which may indicate decreased power efficiency and find ways to optimize for reduced power consumption by running the processor at lower power states. In the case of an application with an excessive number of SpeedStep transitions, it may be beneficial to spread the work out over time so that the processor can remain in a lower power state, rather than repeatedly switching between high and low power states.
Using VTune to View SpeedStep Transitions
An important new capability added in VTune 7.1 is the ability to view events as they are sampled over time in an application. This differs from the normal view of events in VTune that shows the cumulative events for a section of code during the entire workload runtime. In sampling over time view, VTune shows the relative counts of events as they were sampled over time. Using this capability, you can determine the points in time when your application is causing power state transitions and potentially optimize your application for more efficient use of resources. As I mentioned previously, it may be beneficial to engineer your application to reduce excessive SpeedStep power state transitions so that your application can run at a lower power state.
It is fairly straightforward to use the sampling over time view in VTune 7.1 for SpeedStep transitions. When you create a new activity for SpeedStep Technology Transitions (All Transitions), just ensure that clockticks is also included in the sample. After you have sampled your application, you can access sampling over time views by clicking the hourglass-shaped icon on the toolbar. Figure 5 shows sampling results for SpeedStep transitions in sampling over time view.
Figure 5: SpeedStep Transitions in Sampling over Time View
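The idea behind reading a sampling over time view can be sketched as counting transition events per time window and flagging the windows where they cluster. The timestamps, window size, and threshold below are hypothetical, and the sketch assumes the transition times have already been extracted as plain seconds; it does not read VTune data files.

```python
from collections import Counter


def transitions_per_window(timestamps_s, window_s=1.0):
    """Count transition events in consecutive windows of window_s seconds."""
    counts = Counter(int(t // window_s) for t in timestamps_s)
    if not counts:
        return []
    # Include empty windows so list indices map directly to time.
    return [counts.get(i, 0) for i in range(max(counts) + 1)]


def busy_windows(counts, threshold):
    """Return indices of windows whose transition count exceeds threshold."""
    return [i for i, c in enumerate(counts) if c > threshold]


# Hypothetical times (seconds since start) of SpeedStep transitions.
timestamps = [0.2, 0.9, 2.1, 2.2, 2.4, 2.6, 2.9, 5.5]

counts = transitions_per_window(timestamps, window_s=1.0)
flagged = busy_windows(counts, threshold=3)

print(counts)   # transitions in each one-second window
print(flagged)  # windows worth inspecting in the application
```

Windows flagged this way correspond to points in the run where the work might be restructured so that the processor can stay in one power state longer.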
Using VTune to Characterize Battery Status and Discharge Rate
Measuring power state transitions can help to reduce inefficiencies caused by excessive power state transitions. But how do you know if your optimizations have an overall beneficial effect on battery life? Aside from running your application over and over again on a mobile system on battery power until the battery runs out, you can use the BatteryStatus performance counter with VTune. Using VTune Counter Monitor with the BatteryStatus performance object allows you to measure the battery discharge rate of your application over the time of your workload. In order to measure the Battery Status discharge rate, your mobile system must be running on battery power without an AC power connection. The default Counter Monitor view in VTune shows values in a time-based sampling interval so that you can see changes in the discharge rate of the battery as your application runs through its workload. For information about how to create an activity using Counter Monitor, please refer to the VTune documentation. Figure 6 shows the Counter Monitor Configuration dialog with BatteryStatus and DischargeRate. Figure 7 shows a sampling run of Counter Monitor using the BatteryStatus Discharge Rate. Note in this example that you can see an upward change in the discharge rate at a specific point in the sampling period. Time intervals with abrupt changes in discharge rate may indicate areas to focus on in terms of obtaining great battery life. For more details on optimizing applications for great battery life, see the white paper Application Power Enhancements for Mobility.
Figure 6. Battery Status Counter Monitor Configuration
Figure 7. Battery Discharge Rate using Counter Monitor.
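One way to look for the abrupt changes described above is to compare consecutive discharge rate readings and report the points where the change exceeds a chosen threshold. The sketch below assumes the readings have been exported as (time, rate) pairs; the readings, the milliwatt unit, and the threshold are hypothetical and are not taken from Counter Monitor.

```python
def abrupt_changes(samples, min_jump_mw):
    """Find large jumps between consecutive discharge rate readings.

    samples: list of (time_s, discharge_rate_mw) pairs in time order.
    Returns (time_s, delta_mw) for every reading that differs from the
    previous one by at least min_jump_mw in either direction.
    """
    changes = []
    for (_, r0), (t1, r1) in zip(samples, samples[1:]):
        delta = r1 - r0
        if abs(delta) >= min_jump_mw:
            changes.append((t1, delta))
    return changes


# Hypothetical readings: (seconds since start, discharge rate in mW).
readings = [
    (0, 9000), (5, 9100), (10, 9050),
    (15, 14500), (20, 14600), (25, 9200),
]

jumps = abrupt_changes(readings, min_jump_mw=2000)

for time_s, delta in jumps:
    direction = "up" if delta > 0 else "down"
    print(f"t={time_s}s: discharge rate moved {direction} by {abs(delta)} mW")
```

Matching the reported times against what the application was doing at that moment is the step that turns a jump into an optimization candidate.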
VTune 7.1 adds some great new capabilities that can help you ensure that your mobile application doesn’t suffer from some of the most common performance problems and limitations. Automatic generation of Tuning Advice can help you to focus on the most significant performance problems and understand steps to take to fix those problems, or how to investigate further to identify the root cause of performance issues.
VTune 7.1 also enables you to measure SpeedStep Technology transitions and identify when excessive power state transitions may be occurring, alerting you to potential power optimization opportunities. Using Counter Monitor, you can also measure the battery discharge rate while running your application. This allows you to quantify the impact of any optimizations that you may include to reduce power consumption. These features can help you ensure that users of your mobile-aware application have optimized performance and great battery life allowing them to be more productive and get more work done.
- Applications Power Management for Mobility
- Power Management: Designing Applications to Conserve Battery Life
- Optimizing Software for Intel® Centrino® Mobile Technology and the Intel NetBurst Microarchitecture