by Padma Apparao
Performance optimization and profiling tools for Java applications must operate at the system level, the application level, and the micro-architectural level.
The rapid adoption of Managed Runtime Environments (MRTEs) by ISVs and developers represents a profound industry change, and managed runtimes pose unique problems to performance optimization and profiling tools. For example, most runtime applications do not have a steady state, which would be characterized by uniform application behavior. This non-uniformity is due mainly to garbage collection and secondarily to just-in-time (JIT) compilation that occurs while the application is running.
Garbage collection significantly impacts cache performance and bus traffic. Thus, it is necessary to segregate application behavior into two or more phases for analysis, which necessitates special tools that can delineate this behavior.
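As a rough illustration of separating garbage-collection time from application time, the following sketch reads the cumulative collection time that the JVM reports through the standard java.lang.management API before and after a workload. This API is not one of the tools discussed in this article; the class name GcPhaseTimer and the allocation-heavy workload are hypothetical, and for collectors that run concurrently with the application the reported time may not correspond to time the application was actually stopped.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper, not part of any tool named in the article.
class GcPhaseTimer {
    /** Cumulative GC time in ms over all collectors; collectors reporting -1 (undefined) are skipped. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime();
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /** Runs the workload and returns {elapsed wall-clock ms, GC ms reported during the run}. */
    static long[] measure(Runnable workload) {
        long gcBefore = totalGcMillis();
        long start = System.nanoTime();
        workload.run();
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        long gcMillis = totalGcMillis() - gcBefore;
        return new long[] { elapsedMillis, gcMillis };
    }

    public static void main(String[] args) {
        long[] result = measure(() -> {
            // Assumed workload: allocate many short-lived arrays to provoke collections.
            for (int i = 0; i < 200_000; i++) {
                byte[] garbage = new byte[1024];
                garbage[0] = (byte) i;
            }
        });
        System.out.println("elapsed ms = " + result[0] + ", GC ms = " + result[1]);
    }
}
```

The difference between the two numbers gives only a first approximation of the non-GC phase; a dedicated tool is still needed to attribute cache and bus effects to each phase.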
Another characteristic of runtime applications that complicates analysis is the fine granularity of their object-oriented code, where one does not see hotspots that take a large fraction of the application time. Instead, one sees many smaller hotspots, which complicate optimization, since it is not immediately clear which methods deserve the optimization effort.
JIT compilation adds further complexity to the program behavior. Application behavior may vary significantly over the course of a run, and the runtime behavior may change from run to run. Understanding the role of the JIT is therefore crucial when developing tools for profiling runtime applications.
Run-Time Applications Place Specific Demands on Analysis Tools
Applications that exhibit non-steady-state behavior pose difficulties during characterization, since a transaction may not behave the same way throughout the run. This makes understanding the behavior of the program more difficult and adds certain constraints to the analysis tools.
If one is tracking the lifetime of a transaction, but the transaction is changing its behavior dynamically, any analysis derived from that individual transaction is of limited authority. One therefore needs to understand the impact of this changing behavior on the rest of the application. Static applications may also have issues associated with steady-state behavior, but they are more pronounced in run-time applications because of dynamic method compilation and garbage collection.
Since garbage collection occurs during certain intervals of time, it is quite easy to demarcate these phases during the run and to study the distinctive behavior of garbage collection. For the rest of the run, however, one must first determine the homogeneity of the application. Figure 1 shows an example of two different applications, one of which is steady and the other of which is non-steady.
The steady-state application has a uniform profile of Instructions Retired events over the entire measurement interval. One can safely say that this application falls within the category of a homogeneous application. The non-steady-state application, however, exhibits spiky behavior in its Instructions Retired event profile. When analyzing this application, therefore, one must be aware of the pitfalls in correlating performance metrics between different phases of the application.
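One simple way to put a number on homogeneity is the coefficient of variation (standard deviation divided by mean) of the event counts sampled per interval. The sketch below is only an illustration: the sample values are invented rather than taken from Figure 1, and the 0.1 threshold is an arbitrary assumption that would need tuning for a real workload.

```java
// Hypothetical helper for judging workload homogeneity from per-interval event counts.
class HomogeneityCheck {
    /** Coefficient of variation: population standard deviation divided by the mean. */
    static double coefficientOfVariation(long[] samples) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        double mean = 0;
        for (long s : samples) {
            mean += s;
        }
        mean /= samples.length;
        if (mean == 0) {
            return 0; // all-zero profile: treat as perfectly uniform
        }
        double variance = 0;
        for (long s : samples) {
            variance += (s - mean) * (s - mean);
        }
        variance /= samples.length;
        return Math.sqrt(variance) / mean;
    }

    /** True when the profile varies no more than the (assumed) threshold. */
    static boolean looksSteady(long[] samples, double threshold) {
        return coefficientOfVariation(samples) <= threshold;
    }

    public static void main(String[] args) {
        // Invented Instructions Retired counts per sampling interval.
        long[] steady = { 100, 102, 98, 101, 99 };
        long[] spiky = { 100, 20, 180, 15, 185 };
        System.out.println("steady profile steady? " + looksSteady(steady, 0.1));
        System.out.println("spiky profile steady?  " + looksSteady(spiky, 0.1));
    }
}
```

A single summary number hides where the spikes occur, so it complements rather than replaces a time-based view like the one in Figure 1.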
Figure 1: Workload Homogeneity
Java applications tend to be overly synchronized, and synchronization is expensive, even when there is no contention. We have observed that locks are typically contended for less than 10% of the time, and that about 10% of the time is spent dealing with lock contention.
Profiling tools should monitor lock contention, in terms of how many locks are in the system at any time, how often a lock is contended, how long a lock is held, how many methods are usually contending for the lock at any time, and so on. Certain tools may also provide coaching assistance to improve synchronization constructs based on run-time behaviors.
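To make these metrics concrete, here is a minimal sketch of a lock wrapper that records how often the lock is acquired, how often an acquisition found the lock already held, and how long the lock is held in total. It is an assumption-laden illustration built on java.util.concurrent.locks.ReentrantLock, not the mechanism any of the profilers in this article uses; the class and method names are hypothetical.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical instrumented lock; a real profiler gathers this without source changes.
class ContentionCountingLock {
    private final ReentrantLock lock = new ReentrantLock();
    private final AtomicLong acquisitions = new AtomicLong();
    private final AtomicLong contended = new AtomicLong();
    private final AtomicLong heldNanos = new AtomicLong();

    /** Runs the critical section under the lock and updates the counters. */
    void runLocked(Runnable criticalSection) {
        if (!lock.tryLock()) {
            // Another thread holds the lock: count the contention, then wait for it.
            contended.incrementAndGet();
            lock.lock();
        }
        long start = System.nanoTime();
        try {
            acquisitions.incrementAndGet();
            criticalSection.run();
        } finally {
            heldNanos.addAndGet(System.nanoTime() - start);
            lock.unlock();
        }
    }

    long acquisitions() { return acquisitions.get(); }

    long contendedAcquisitions() { return contended.get(); }

    /** Total time the lock was held, in nanoseconds (nested re-entrant holds are counted twice). */
    long heldNanos() { return heldNanos.get(); }

    public static void main(String[] args) throws InterruptedException {
        ContentionCountingLock lock = new ContentionCountingLock();
        Runnable work = () -> {
            for (int i = 0; i < 10_000; i++) {
                lock.runLocked(() -> { });
            }
        };
        Thread a = new Thread(work);
        Thread b = new Thread(work);
        a.start();
        b.start();
        a.join();
        b.join();
        System.out.println("acquisitions = " + lock.acquisitions()
                + ", contended = " + lock.contendedAcquisitions()
                + ", held ns = " + lock.heldNanos());
    }
}
```

The ratio of contended to total acquisitions gives the contention rate; the counters themselves add overhead to every acquisition, which is exactly the kind of intrusiveness one has to weigh.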
Workloads in run-time environments tend to be object-oriented with many small methods. This characteristic introduces many calls to and returns from the methods, which leads to branchy code. When optimizing these applications, one should identify opportunities to in-line these short methods, making sure not to cause code-bloat issues.
To glean this knowledge from the application, the tools must provide a means of tracking method calls and returns, recording the number of times a method is called, and tracking the method to its caller ancestry, without being intrusive on performance.
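The bookkeeping involved can be sketched as follows: a shadow call stack records, for every method entry, how many times the method has been called and which caller invoked it. This is a single-threaded illustration with explicit enter and exit calls and hypothetical method names; real tools insert equivalent hooks through instrumentation or a profiling interface, and the per-call map updates shown here would be far too intrusive for production measurement.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical single-threaded tracker of call counts and caller -> callee edges.
class CallTracker {
    private static final String ROOT = "<root>";
    private final Deque<String> stack = new ArrayDeque<>();
    private final Map<String, Integer> callCounts = new HashMap<>();
    private final Map<String, Integer> edgeCounts = new HashMap<>();

    /** Records entry into a method, attributing the call to the method currently on top of the stack. */
    void enter(String method) {
        callCounts.merge(method, 1, Integer::sum);
        String caller = stack.isEmpty() ? ROOT : stack.peek();
        edgeCounts.merge(edgeKey(caller, method), 1, Integer::sum);
        stack.push(method);
    }

    /** Records return from the method most recently entered. */
    void exit() {
        stack.pop();
    }

    /** Number of times the method was entered. */
    int calls(String method) {
        return callCounts.getOrDefault(method, 0);
    }

    /** Number of times caller invoked callee directly. */
    int calls(String caller, String callee) {
        return edgeCounts.getOrDefault(edgeKey(caller, callee), 0);
    }

    private static String edgeKey(String caller, String callee) {
        return caller + " -> " + callee;
    }

    public static void main(String[] args) {
        CallTracker tracker = new CallTracker();
        tracker.enter("handleRequest"); // hypothetical method names
        tracker.enter("parse");
        tracker.exit();
        tracker.enter("parse");
        tracker.exit();
        tracker.exit();
        System.out.println("parse calls: " + tracker.calls("parse"));
        System.out.println("handleRequest -> parse: " + tracker.calls("handleRequest", "parse"));
    }
}
```

Walking the edge counts upward from a hot method reconstructs its caller ancestry, which is the information needed to decide where in-lining a short method would pay off.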
Multiple Levels of Tools are Necessary to Provide Complete Analysis
In order to look at hardware-performance counters, one must have non-intrusive tools so that they do not perturb the system under test. At the application level, one needs profiling tools to look at the behavior of the application non-intrusively. System-level performance monitoring tools provide information about the overall system that is running the application. At every level, the tools must provide relevant, accurate, and precise data.
System-level tools provide data regarding the various hardware resources in the system, such as memory, disks, network bandwidth, and processors. Developers should use system-level profiling first to identify bottlenecks in the system and to remove them. The next step is to do application profiling, whereby one can discover any lock/thread contentions in the application. Finally, one should use architectural-level tools that look at hardware counters to uncover hardware-related issues such as excessive branch mispredictions and cache misses.
Windows* and Linux* Both Support Comprehensive Profiling Tools
At the system level, tools like Perfmon* for Windows and IOStat* and Sar* for Linux are available. All these tools are quite powerful and give sufficient information to identify system bottlenecks. APImon* for Windows and strace and ltrace for Linux identify system calls. Note that Linux tools are specific to individual kernel builds.
At the application level, a number of tools are available, including Intel® VTune™ Performance Analyzer, JProbe*, Optimizeit*, HProf*, HAT*, etc. These tools are largely neutral to the environment they run in; they help to identify potential code optimizations in either Linux or Windows. Some of these tools, including the VTune analyzer, can pinpoint precise pieces of code where performance issues occur.
Figure 2: Analyzing Object Allocation using HProf
Figure 2 shows data collected by HProf and analyzed using the HPJmeter* HProf analyzer tool. The sample includes about 2200 objects of the msjava/hdom/element class, and these objects are garbage collected about 98% of the time, indicating that these objects are short-lived. Some objects like the java/util/HashMap objects are found at levels of less than 250, but they are never garbage collected, which indicates that these objects have a longer lifetime and tend to linger until the end of the application.
Using this analysis, the developer can determine where the objects are being allocated – for example, which method is allocating the element objects – and then determine whether they can make optimizations to the method such as creating fewer objects or more efficiently reusing them.
Figure 3 shows the residual objects left in the heap after the application has terminated, as well as the methods that allocated these objects. One can use this analysis to identify memory leaks in an application.
Figure 3: Residual Objects
Figure 4 shows the call graph with respect to the time spent in the CPU. One can look at the time spent in the different methods and the calling sequence for each method. For example, the method java.lang.ThreadLocal.get is invoked by the dispatch method, which accounts for 6% of the time spent in java.lang.ThreadLocal.get. That method has five other callers; double-clicking on the method lists them and shows that the BTCConverter actually contributes more to the time spent in java.lang.ThreadLocal.get.
Figure 4: Call Graph and Caller List
One can also study the effect of garbage collection on application performance. For example, when tuning the garbage collection options, using the "-verbose:gc" statistics and an analyzer tool like Tagtraum GCViewer* can reveal the effects of changing various garbage-collection parameters. Figure 5 shows that garbage collection takes 29 seconds out of the total 230 seconds of application time, corresponding to a throughput of 87%, that is, the share of the run spent in the application rather than in garbage collection. Similarly, Figure 6 shows that changing the garbage-collection parameters reduces the garbage-collection cycle time by 44%, which corresponds to a throughput increase of 6%.
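The throughput figure follows directly from the two times reported for Figure 5: throughput is the fraction of the run not spent in garbage collection. The short sketch below works through that arithmetic; the class and method names are hypothetical, and the formula is the usual definition of GC throughput, assumed here to be the one the analyzer applies.

```java
// Hypothetical helper reproducing the throughput arithmetic for Figure 5.
class GcThroughput {
    /** Percentage of total run time spent outside garbage collection. */
    static double throughputPercent(double gcSeconds, double totalSeconds) {
        if (totalSeconds <= 0 || gcSeconds < 0 || gcSeconds > totalSeconds) {
            throw new IllegalArgumentException("need 0 <= gcSeconds <= totalSeconds and totalSeconds > 0");
        }
        return 100.0 * (totalSeconds - gcSeconds) / totalSeconds;
    }

    public static void main(String[] args) {
        // Figure 5: 29 s of garbage collection in a 230 s run.
        // (230 - 29) / 230 is about 0.874, i.e. roughly 87%.
        double throughput = throughputPercent(29, 230);
        System.out.println("throughput percent = " + throughput);
    }
}
```

The same function can be applied to the timings of a tuned run to compare throughput before and after a parameter change.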
Figure 5: Garbage Collection Statistics
Figure 6: Effect of Tuning Garbage Collection Threads
Lock profiling, another aspect of application profiling, identifies how many locks are in the system at any given time, how many are being held and contended for, and the maximum and average contention times. This information is particularly useful in improving application scalability.
JIT profiling focuses on understanding what methods are JIT compiled and how often. One can use this profile to decide whether it is better to in-line the methods or to do function splitting.
Thread profiling detects race conditions between threads, identifying and predicting deadlocks. JProbe*, Optimizeit* and VTune Performance Analyzer are some of the well-known application profilers available. Developers should try all of these tools to determine which suits a particular application. Of course, one must also validate the tools and consider their intrusiveness before doing any analysis.
The tools at the architectural level are those that read hardware performance-monitoring counters exposed by the processor architecture. The VTune analyzer and Emon provide the means to analyze applications with respect to the processor architecture. One can drill down into the application with regard to methods exhibiting more cache misses than expected, high numbers of branch mispredictions, and so on. More details on obtaining the VTune analyzer are available from Intel® VTune™ Performance Analyzer.
Different kinds of profiling can be done on your application; this paper has discussed heap profiling, garbage collection, and some of the key aspects of runtime applications in detail. Other aspects of profiling include thread analyzers, JITted code analyzers, and lock contention analyzers. All of these tools use the Java Virtual Machine Profiling Interface (JVMPI)* to perform sampling and to gather other information about the application at runtime.
This profiling interface is expensive when it comes to heap profiling, and it is extremely intrusive. One can expect the throughput of an application to drop by a factor of anywhere from 10 to 100 when using the JVMPI interface. JITted code profiling is not that intrusive, but one should still expect a performance drop of 2x or more. Another mechanism used by the tools is instrumentation, which gives accurate results but can also be intrusive.
A talk-back forum on the topic of MRTEs was hosted by Intel® Developer Services. Developers were welcomed to weigh in on how runtimes have changed the way they write code and to read what others have to say.
About the Author
Padma Apparao is a Senior Performance Architect working in the Managed Runtime Environments group within the Software Solutions Group at Intel. Padma has been with Intel for seven years, working on performance analysis and optimizations on several workloads and industry standard benchmarks like TPC-C, TPC-H, and SPECjbb2000. Her focus is currently on XML processing; she is involved in understanding the evolution of XML and how processor architecture can influence the future of XML performance. She obtained a PhD in Computer Science on Distributed Algorithms and Systems from the University of Florida, Gainesville in 1995.