by Gary Carleton, Intel Corporation
Intel® Software Development Products help Java* and .NET* programmers to write more efficient, powerful and elegant code.
There seem to be features of the VTune™ Performance Analyzer that many users don't know exist. Those of us in the VTune Analyzer group often get questions or suggestions for new capabilities that are already in the tool. With this white paper we hope to let you know about some of the lesser-known features of the analyzer, and perhaps give you some new ideas on ways to use it.
The user can change the default settings of the various displays of the VTune analyzer. Most of these settings are available from the Configure > Options menu, where they are organized by display view. Perhaps the most commonly used (or controversial) category is the sampling view. Parameters that can be changed are:
- Linear vs. logarithmic display
- Bar chart vs. table data display
- "Fit-in-window" that compresses all the data into one non-scrollable view
- The ability for the VTune analyzer to remember the settings from the previous view.
Call Graph Features
There are a number of special capabilities supported by the Call Graph mode of the VTune analyzer. The most important is probably Wait Time. This data item shows the total time a method spent blocked on some type of synchronization event, such as WaitForMultipleObjects. The wait time data can be displayed and sorted in a spreadsheet format that allows the user to easily identify the methods that spent the most time blocked. The data is also totaled for each individual thread and calling sequence, which again helps identify the areas where performance problems may occur due to synchronization events.
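The kind of sorting and per-thread totaling described above can be sketched in a few lines of Python. This is only an illustration of the idea, not VTune analyzer output or its API: the record layout, method names, and wait times below are all hypothetical.

```python
from collections import defaultdict

# Hypothetical records: (thread id, method name, wait time in microseconds).
# These values are made up for illustration; they are not VTune analyzer data.
records = [
    (1, "Worker.run", 1200),
    (1, "Queue.take", 5400),
    (2, "Queue.take", 4100),
    (2, "Logger.flush", 300),
]

def wait_by_method(rows):
    """Sum wait time per method; return (method, total) pairs, largest first."""
    totals = defaultdict(int)
    for _thread, method, wait_us in rows:
        totals[method] += wait_us
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

def wait_by_thread(rows):
    """Sum wait time per thread."""
    totals = defaultdict(int)
    for thread, _method, wait_us in rows:
        totals[thread] += wait_us
    return dict(totals)

if __name__ == "__main__":
    for method, total in wait_by_method(records):
        print(f"{method:15s} {total:>8d} us")
```

Sorting by total wait time puts the most-blocked method at the top of the list, which is the same question the spreadsheet view answers.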
Another key feature recently added to the VTune analyzer is the ability to instrument binary images prior to executing the binary so that Call Graph data can be gathered without the VTune analyzer having to invoke the application explicitly. This can be particularly useful when analyzing server binaries that are loaded at boot time or in response to receiving an asynchronous event.
Since the Call Graph data gathering methodology is based on binary instrumentation, there is some execution time overhead involved in running the instrumented application. The VTune Analyzer allows the user to control the amount of instrumentation to a very fine degree. The user can specify how much instrumentation (all functions, only exports, or nothing) to include for each different binary image (or DLL) used in running the application. The user can identify individual functions that should or should not be included in the Call Graph. These features allow the user to control the amount of overhead in the application by enabling or disabling instrumentation all the way down to the level of individual methods.
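One way to picture this two-level control is as a per-module level plus per-function overrides. The Python sketch below is a mental model only. The module names, function names, and the rule that a per-function choice takes precedence over the module level are assumptions made for illustration, not a description of how the VTune analyzer stores or applies its settings.

```python
from enum import Enum

class Level(Enum):
    """The three per-module instrumentation levels named in the text."""
    ALL_FUNCTIONS = "all functions"
    EXPORTS_ONLY = "only exports"
    NONE = "nothing"

# Hypothetical per-module levels.
module_levels = {
    "app.exe": Level.ALL_FUNCTIONS,
    "codec.dll": Level.EXPORTS_ONLY,
    "vendor.dll": Level.NONE,
}

# Hypothetical per-function overrides: True = include, False = exclude.
function_overrides = {
    ("app.exe", "tiny_getter"): False,
    ("vendor.dll", "decode_frame"): True,
}

def is_instrumented(module, function, exported):
    """Decide whether a function is instrumented.

    Assumption: an explicit per-function choice wins over the module level,
    and modules with no configured level are not instrumented.
    """
    override = function_overrides.get((module, function))
    if override is not None:
        return override
    level = module_levels.get(module, Level.NONE)
    if level is Level.ALL_FUNCTIONS:
        return True
    if level is Level.EXPORTS_ONLY:
        return exported
    return False
```

Excluding a small, frequently called function such as the hypothetical `tiny_getter` is the typical use: it removes overhead where instrumentation costs the most and tells the least.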
In the past we have received feedback related to trying to navigate the graphical part of the output display for large applications. As a result, we added an Overview pane that can be invoked by right-clicking the call graph output. It displays the entire Call Graph with the current view outlined. This view gives the user the high-level picture for the complete Call Graph relative to the currently displayed portion.
Result Comparison and Merging
When a change is made to a program, how did the performance change? To answer this question we implemented the ability to visually compare the results from multiple performance runs. To compare two results, use the Project Navigator panel (from the View > ProjectNavigator menu) to display the initial set of results. Then click and drag the other set of results into the viewing pane, and the data for the second set of results will be interleaved with the original set. The user can now easily compare each software component (for example, each source code function in the Hotspot view) across the two runs to see whether any performance improvements or regressions occurred.
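The interleaved comparison amounts to lining up the two runs by component and looking at the difference. A minimal Python sketch of that idea follows. The function names and sample counts are hypothetical, and the code does not read VTune analyzer result files.

```python
def compare(before, after):
    """Return (name, before, after, delta) for every function seen in either run.

    A function missing from one run is treated as having zero samples there.
    """
    rows = []
    for name in sorted(set(before) | set(after)):
        b = before.get(name, 0)
        a = after.get(name, 0)
        rows.append((name, b, a, a - b))
    return rows

# Hypothetical sample counts per function for two runs.
baseline = {"parse": 400, "render": 900, "save": 100}
candidate = {"parse": 250, "render": 950, "load": 50}

if __name__ == "__main__":
    for name, b, a, delta in compare(baseline, candidate):
        print(f"{name:8s} {b:6d} {a:6d} {delta:+6d}")
```

A negative delta means fewer samples in the second run (an improvement for that function), and a positive delta points at a possible regression.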
The results of multiple sessions can be merged together to form one set of viewable results. This can be useful if there is more than one set of input data for the program being measured. Simply select the first set of performance results in the Project Navigator, then right-click the second set, and select Merge Results from the menu. The merged data appears as a new set of activity results in the Project Navigator window.
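Conceptually, merging combines the per-component data from several runs into one result. The Python sketch below assumes that merging simply adds sample counts per function. The text does not say exactly how the VTune analyzer combines the data, so treat this as an illustration with hypothetical names and numbers.

```python
from collections import Counter

def merge_results(*runs):
    """Add the per-function sample counts of several runs into one result."""
    merged = Counter()
    for run in runs:
        merged.update(run)  # Counter.update adds counts rather than replacing them
    return dict(merged)

# Hypothetical runs of the same program with two different input data sets.
run_small_input = {"parse": 400, "render": 900}
run_large_input = {"parse": 100, "save": 50}

merged = merge_results(run_small_input, run_large_input)
```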
The VTune Performance Analyzer has the ability to measure performance-sensitive CPU events using a technique we call Event Based Sampling (EBS). While we found it useful to measure things like cache misses and other CPU activities, determining whether there are so many of them that software performance is affected can sometimes be tricky. To help with this, the VTune analyzer has Event Ratios that display rates of events, not just the number of occurrences. This gives the user a more intuitive feeling as to whether a performance-sensitive event is likely to be slowing down the program.
To enable Event Ratios prior to measuring performance, use the Event Ratios tab when selecting which CPU events to measure, and select the type of ratio you are interested in, such as Memory Statistics for cache operations.
The user can also define new event ratios or edit existing ones by using the Configure > Ratios menu command. Then select the category for the ratio and select Edit_Ratio or New_Ratio as appropriate. You will then be able to enter the formula for the ratio.
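An event ratio is one event count divided by another, so that the result is a rate rather than a raw total. The Python sketch below shows the arithmetic with hypothetical event names and counts. It is not the formula syntax used in the Configure > Ratios dialog.

```python
def event_ratio(numerator, denominator):
    """Return numerator / denominator, or None if the denominator is zero."""
    if denominator == 0:
        return None
    return numerator / denominator

# Hypothetical event counts from one sampling run.
counts = {"l2_misses": 12_000, "instructions_retired": 4_000_000}

misses_per_instruction = event_ratio(
    counts["l2_misses"], counts["instructions_retired"]
)
```

A raw count of 12,000 misses says little on its own. Dividing by the number of instructions retired shows how often a miss occurs relative to the work done, which is easier to compare between runs or between functions.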
.NET* and Java* Support
Support for .NET and Java* is similar. In both cases the VTune analyzer's normal non-intrusive sampling and instrumented Call Graph are available. The .NET and Java environments can also make use of the operating system's performance counters through the VTune analyzer's Counter Monitor feature. All three of these features are used as they normally are with native code: Sampling can be used to identify CPU bottlenecks, Call Graph can be used to identify the critical path and blocked code in a program, and Counter Monitor can be used to sample OS or .NET counters to identify particular time slices during a performance run in which anomalous counter values may appear.
For details on which .NET and Java environments are supported see the Release Notes at: /en-us/.
When analyzing Sampling data, especially during the process of drilling down to source code, the user can select multiple items, not just one, for further detailed analysis. This causes the VTune analyzer to process and display the performance data for the selected items as if they were all one item. There are situations where this is very useful:
- Analyzing multiple DLLs - Many applications are implemented using multiple DLLs. In the Module View, which shows CPU usage for each individual executable image, multiple selection allows the user to analyze more than one DLL in the hotspot view. With the Hotspot view grouped by function, the graph shows all the functions in the selected DLLs sorted by CPU time.
- Analyzing a Thread Pool - This same technique can be used in the Thread View to select all the member threads in a thread pool, causing the VTune analyzer to treat the pool as if it were one thread.
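Both cases come down to treating the selected items as one combined item. The Python sketch below illustrates the multiple-DLL case with hypothetical module names, function names, and sample counts. It does not use any VTune analyzer interface.

```python
def hotspots_for_selection(samples, selected_modules):
    """Return the functions of the selected modules, sorted by samples, largest first."""
    combined = [row for row in samples if row[0] in selected_modules]
    return sorted(combined, key=lambda row: row[2], reverse=True)

# Hypothetical (module, function, sample count) rows.
samples = [
    ("a.dll", "f", 300),
    ("b.dll", "g", 500),
    ("c.dll", "h", 700),
    ("a.dll", "k", 100),
]

selection = {"a.dll", "b.dll"}
hotspots = hotspots_for_selection(samples, selection)

# Treating the selection as one item also gives a single combined total.
selection_total = sum(count for _module, _func, count in hotspots)
```

The same filter-then-aggregate step applies to a thread pool: select the member threads instead of modules, and the combined total stands in for the pool as a whole.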
The VTune analyzer allows the user to gather performance data programmatically without having to manually invoke the VTune analyzer. This is useful in cases where it makes sense to gather performance data automatically, for example, after every new build. The user can create a batch file that directs the VTune analyzer as to how the data is to be gathered. The batch file can be written in either a Windows* scripting language that supports Microsoft ActiveX* scripting (such as VBScript*, JScript*, PerlScript*, Pscript*, Python*) or a programming language that supports COM (such as Microsoft Visual C++* or Borland Delphi*). For more details, check the VTune analyzer Help index for Batch Mode.
The VTune analyzer has many capabilities, and it is sometimes difficult to learn everything it has to offer. Some of the less well-known features have been described here, but probably the best ways to learn more are to use the Getting Started Tutorial (available in the VTune Analyzer under Help > GettingStartedTutorial, or on the Web) or simply to use the VTune analyzer's extensive Help system.
About the Author
Gary Carleton works at Intel on software performance tools, including the Intel VTune Performance Analyzer and C++ Compiler. He has been an engineering manager and software engineer for Intel, Cadre Technologies and Kaiser Engineers. He has a B.S. in electrical engineering and computer sciences from the University of California at Berkeley.
The information, opinions, and recommendations in this column are provided by the Author, Gary Carleton. Intel and its subsidiaries do not necessarily endorse or represent the accuracy of the Author's information, opinions, or recommendations, and any reliance upon the Author's statements is solely at your own risk.