Java support is back in VTune™ Amplifier XE

Dedicated users of the previous-generation VTune™ Performance Analyzer remember that the tool supported Java application profiling. Over time this feature disappeared from the radar, but since then customers have clamored for Java support in the current VTune Amplifier XE. Profiling pure Java applications, and more importantly mixed Java and native C/C++ applications, is becoming necessary again. In response to this request, Java profiling has been added to the new Intel® VTune™ Amplifier XE 2013, in addition to the existing support for profiling JITed applications.

Why does someone need Java application profiling? The main purpose of performance profiling is to identify the functions or code locations that take up most of the CPU’s time, and to find out how effectively they use this computing resource. Even though Java code execution is handled by a managed runtime environment, it can be just as ineffective in terms of data management as a program written in a native language. For example, if you care about the performance of your data-mining Java application, you need to take into consideration your target platform’s memory architecture, cache hierarchy, and the latency of access to each memory level. From the platform microarchitecture point of view, profiling a Java application is similar to profiling a native application, but with one major difference: since users want to see timing metrics against their program source code, the profiling tool must be able to map performance metrics of the binary code, whether compiled or interpreted by the JVM, back to the original source code in Java or C/C++.
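To make the memory-architecture point concrete, here is a minimal, self-contained sketch (the class name and array size are my own, not from the article) of how data-access patterns can show up in a Java hotspot profile: both loops compute the same sum, but the column-major traversal strides across rows and defeats the cache.

```java
// A minimal illustration of why memory layout matters even in Java:
// summing a 2D array row-by-row touches memory sequentially, while
// column-by-column jumps to a different row on every step.
public class Traversal {
    static long sumRowMajor(int[][] m) {
        long s = 0;
        for (int i = 0; i < m.length; i++)
            for (int j = 0; j < m[i].length; j++)
                s += m[i][j];          // sequential access within each row
        return s;
    }

    static long sumColMajor(int[][] m) {
        long s = 0;
        for (int j = 0; j < m[0].length; j++)
            for (int i = 0; i < m.length; i++)
                s += m[i][j];          // strided access: a new row each step
        return s;
    }

    public static void main(String[] args) {
        int n = 2048;
        int[][] m = new int[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                m[i][j] = 1;
        // Both return the same value; a hotspot profile would typically show
        // the column-major loop consuming noticeably more CPU time.
        System.out.println(sumRowMajor(m) == sumColMajor(m)); // prints: true
    }
}
```

In a Hotspots result, the extra time attributed to the second loop's source lines is exactly the kind of metric-to-source mapping described above.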

With VTune Amplifier XE Hotspot analysis you get a list of the hottest methods along with their timing metrics and call stacks. The workload distribution over threads is also displayed in the timeline view of the results, and thread naming helps to identify where exactly the most resource-consuming code was executed.

Those who are pursuing maximum performance on a platform may apply some tricks, like writing and compiling performance-critical modules of their Java project in native languages such as C or even assembly. This way of programming helps to employ powerful CPU resources like vector computing (implemented through SIMD units and instruction sets). In this case, the heavy calculating functions become hotspots in the profiling results, which is expected as they do most of the job. However, you might be interested not only in the hotspot functions themselves, but also in identifying the locations in Java code from which those functions were called through the JNI interface. Tracing such cross-runtime calls in mixed-language algorithm implementations can be a challenge.
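The Java side of such a mixed-language design might look like the sketch below. The library name ("kernels") and class layout are illustrative assumptions, not from the article; the native method would be implemented in C, possibly with SIMD intrinsics, and a pure-Java fallback keeps the class runnable when the library is absent.

```java
// Sketch of the mixed-stack pattern: a hot numeric kernel declared native
// and implemented in C behind JNI, with a pure-Java fallback.
public class DotProduct {
    private static boolean nativeLoaded = false;

    static {
        try {
            System.loadLibrary("kernels"); // hypothetical C/SIMD library, built separately
            nativeLoaded = true;
        } catch (UnsatisfiedLinkError ignored) {
            // No native library on this machine: use the Java loop below.
        }
    }

    // Implemented in C as Java_DotProduct_dotNative; could use SSE/AVX.
    private static native double dotNative(double[] a, double[] b);

    public static double dot(double[] a, double[] b) {
        if (nativeLoaded) return dotNative(a, b);
        double s = 0.0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    public static void main(String[] args) {
        double[] a = {1, 2, 3}, b = {4, 5, 6};
        System.out.println(DotProduct.dot(a, b)); // prints: 32.0
    }
}
```

In a profile of this code, dotNative (or the C function behind it) shows up as the hotspot, and the question becomes which Java call site reached it, which is exactly what the stack stitching described next answers.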

To help with the analysis of mixed-code profiling results, VTune Amplifier XE “stitches” the Java call stack together with the subsequent native call stack of C/C++ functions. Stitching of reverse call stacks works as well.

The most advanced usage of the tool is profiling and optimizing Java applications for the microarchitecture of the CPU in your platform. Although this may sound paradoxical, because Java and JVM technology are intended to free a programmer from machine-specific coding, once Java code is optimized for current Intel microarchitectures it will most probably keep this advantage on future generations of CPUs. VTune Amplifier XE provides a state-of-the-art Hardware Event-based profiling technology, which monitors hardware events in the CPU’s pipeline and can identify code pitfalls that limit the most effective execution of instructions in the CPU. The hardware performance metrics are available and can be displayed against the application’s modules, functions, and Java source lines. Hardware Event-based sampling collection with stacks is also available; it is useful when you need to find the call path for a function called in a driver or middleware layer in your system.

It’s fairly easy to configure your performance analysis using either the VTune Amplifier GUI or command line tool. One way is to embed your java command in a batch file or executable script. 
For example, in my run.cmd file I have the following command:
java.exe -Xcomp -Djava.library.path=mixed_dll\ia32 -cp C:\Design\Java\mixed_stacks MixedStacksTest 3 2
I just need to put the path to the run.cmd file in the Application field of the Launch Application configuration in the Project Configuration of my VTune Amplifier XE project. In addition, I select “Auto” as the managed code profiling mode and keep the analysis of child processes enabled with that specific switch. That’s it. Now I can start an analysis.

Similarly, you can configure an analysis in the command line tool. For example, with Hotspots analysis you can use the following command:
amplxe-cl -collect hotspots -- run.bat
or directly:
amplxe-cl -collect hotspots -- java.exe -Xcomp -Djava.library.path=mixed_dll\ia32 -cp C:\Design\Java\mixed_stacks MixedStacksTest 3 2
If your Java application needs to run for some time first, or cannot be launched at the start of the analysis, on Windows* you may attach the tool to the running Java process. Change the Target type selector to “Attach to Process” and enter your process name or PID.

You may face some obstacles while profiling Java applications. A JVM does funny tricks with binary code, and in some cases the exact correspondence between executed instruction addresses and source line numbers may be distorted. As a result, you may observe timing results slipping slightly down to the next source code lines. If it’s a loop, the time metric may slip upward. Keep this in mind and be attentive to unlikely results.

You should expect that a JVM will interpret some rarely called methods instead of compiling them, for the sake of performance. The tool marks such calls as “!Interpreter” in the restored call stack; identifying the name of an interpreted call may become a feature in future product updates. If you would like such functions to be displayed in stacks with their names, force the JVM to compile them by using the “-Xcomp” option. However, the timing characteristics may change noticeably if many small or rarely used functions are called during execution. Note that, due to inlining during the compilation stage, some functions might not appear in the stack.

The following are some limitations:

  • It’s difficult to support all Java Runtime Environments (JREs) available on the market, so at the moment we support Oracle* Java 6 and 7.
  • Java application profiling is supported for Hotspots analysis and Hardware Event-based analysis (e.g. Lightweight Hotspots), but Concurrency analysis is limited as some embedded Java synchronization primitives (which do not call operating system synchronization objects) cannot be recognized by the tool. As a result, some of the timing metrics may be distorted for Concurrency as well as for Locks & Waits analysis.
  • The tool cannot attach to a Java process on Linux; attach is supported on Windows only at the moment.
  • There are no dedicated libraries supplying a user API for collection control in Java source code. However, you may try applying the native API by wrapping the __itt calls with JNI calls.
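Such a JNI wrapper might look like the sketch below. The library name ("itt_jni"), the class layout, and the method names are illustrative assumptions, not a shipped API; the C shim behind it would forward the calls to the native __itt_pause/__itt_resume functions. The fallback in the static initializer keeps the application runnable when the shim is absent.

```java
// Hypothetical JNI wrapper for native collection control (__itt_pause /
// __itt_resume). Nothing here is an official VTune Java API; it only
// sketches the wrapping approach suggested in the text.
public class IttControl {
    private static boolean shimLoaded = false;

    static {
        try {
            // A small C shim forwarding these calls to __itt_pause/__itt_resume.
            System.loadLibrary("itt_jni");
            shimLoaded = true;
        } catch (UnsatisfiedLinkError e) {
            // No shim on this machine: pause/resume degrade to safe no-ops.
        }
    }

    private static native void ittPause();
    private static native void ittResume();

    public static void pause()  { if (shimLoaded) ittPause(); }
    public static void resume() { if (shimLoaded) ittResume(); }

    public static void main(String[] args) {
        IttControl.pause();   // exclude the following region from collection
        // ... initialization code you do not want profiled ...
        IttControl.resume();  // collection continues from here
        System.out.println("itt shim loaded: " + shimLoaded);
    }
}
```

With this in place, the region between pause() and resume() would be excluded from collection whenever the shim is present, and the code still runs unmodified without it.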
The Java support feature is still developing in the product. Java runtimes and Java virtual machines are also changing, with new JDK updates coming out every few months, so you may face some problems with stack unwinding or symbol retrieval from the JVM. In those cases, ask the Intel support team for help.
Below is a detailed list of tricks you might want to consider when profiling Java or mixed applications on different platforms.

Additional command line Oracle JDK Java VM options that change the behavior of the Java VM

  • On Linux x86 use client Oracle JDK Java VM instead of the server Java VM, i.e. either explicitly specify “-client” or simply do not specify “-server” as an Oracle JDK Java VM command line option.
  • On Linux x64 try specifying the ‘-XX:-UseLoopCounter’ command line option which switches off on-the-fly substitution of the interpreted method with the compiled version.
  • On Windows try specifying '-Xcomp' that forces JIT compilation for better quality of stack walking.

Note: when you force the JVM to compile initially interpreted functions, the timing of your application may change, and for small and rarely called functions compilation will be less effective than interpretation.

On Linux try to change stack unwinding mode to "After collection"
  • Click the New Analysis button in the VTune Amplifier XE tool bar
  • Choose the ‘Hotspots’ analysis type and right-click
  • Select ‘Copy from current’ in the context menu
  • In the opened ‘Custom Analysis’ dialog select ‘After collection’ in ‘Stack unwinding mode’ drop-down list and press ‘OK’ button
  • Start collection using this new analysis type.
*Other names and brands may be declared as the property of others.
For more complete information about compiler optimizations, see our Optimization Notice.