Achieving the best performance for an HPC application requires a careful balance of Message Passing Interface (MPI) parallelism, threading, vectorization, memory access, and more. Intel® VTune™ Amplifier provides specialized HPC analyses that let developers start with a quick snapshot and then, if needed, get more details. Software architects tuning the performance of cloud applications will appreciate the ability to profile a running Java* process in a container.

Figure 1. Snapshot of application performance in Intel VTune Amplifier

Get a Quick Performance Snapshot

Analyze MPI and non-MPI applications (Linux* only).

The application performance snapshot features:

  • Lightweight, low overhead profiling
  • Scalable profiling that detects performance variation across a large number of ranks
  • Key metrics, such as MPI and OpenMP* imbalance, low floating-point utilization, and memory stalls
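
To make the imbalance metric concrete, here is a minimal sketch of one common way to quantify it from per-rank (or per-thread) busy times. The formula and the function name `imbalance_percent` are assumptions chosen for illustration; they are not taken from the tool, which may define the metric differently.

```python
def imbalance_percent(busy_times):
    """Return load imbalance as a percentage of the slowest worker's time.

    Assumed definition (illustrative only): (max - mean) / max * 100.
    A value of 0 means every worker was busy for the same amount of time.
    """
    if not busy_times:
        raise ValueError("busy_times must not be empty")
    slowest = max(busy_times)
    if slowest == 0:
        return 0.0
    mean = sum(busy_times) / len(busy_times)
    return (slowest - mean) / slowest * 100.0


# Hypothetical busy times in seconds for four MPI ranks.
rank_times = [8.0, 10.0, 6.0, 8.0]
print(f"Imbalance: {imbalance_percent(rank_times):.1f}%")
```

With these made-up numbers the slowest rank takes 10 s while the average is 8 s, so on average one fifth of the slowest rank's time is spent by the other ranks waiting.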

Determine whether your workload will benefit from tuning by viewing all the data in one place (see Fig. 1).

Deeper Analysis with Actionable Detail

See a summary of key HPC performance attributes: MPI efficiency, threading efficiency, memory access efficiency, and floating-point utilization. Then dive into the details and optimize the highest impact items first.

Use the HPC analysis to get a fast overview of critical metrics for modern hardware performance or get a more in-depth analysis for each one (see Fig. 2).

Figure 2. HPC performance in Intel VTune Amplifier

Figure 3. Interface for MPI and OpenMP multirank analysis in Intel VTune Amplifier

Easier Multirank Analysis of MPI and OpenMP*

For hybrid MPI and OpenMP applications, it is important to explore OpenMP inefficiency along with MPI communication between ranks. The less time a rank spends spinning while it waits on MPI communication, the more of its time goes to executing application code, and the greater the impact of OpenMP tuning on that rank.
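
The relationship between spin time and tuning impact can be sketched with a few lines of arithmetic. The record type `RankProfile`, its field names, and the sample numbers below are hypothetical and do not reflect the tool's data model; the sketch only shows how ranks might be ordered by the share of time they spend outside MPI spinning.

```python
from dataclasses import dataclass


@dataclass
class RankProfile:
    """Hypothetical per-rank summary; field names are illustrative."""
    rank: int
    elapsed_s: float   # total wall-clock time of the rank
    mpi_spin_s: float  # time spent spinning in MPI communication

    @property
    def compute_fraction(self):
        # Share of elapsed time spent outside MPI spinning.
        return (self.elapsed_s - self.mpi_spin_s) / self.elapsed_s


def ranks_by_compute_share(profiles):
    """Order ranks so those with the least spin time come first."""
    return sorted(profiles, key=lambda p: p.compute_fraction, reverse=True)


# Made-up measurements for three ranks.
profiles = [
    RankProfile(rank=0, elapsed_s=100.0, mpi_spin_s=40.0),
    RankProfile(rank=1, elapsed_s=100.0, mpi_spin_s=5.0),
    RankProfile(rank=2, elapsed_s=100.0, mpi_spin_s=25.0),
]

for p in ranks_by_compute_share(profiles):
    print(f"rank {p.rank}: {p.compute_fraction:.0%} of time outside MPI spin")
```

In this example rank 1 spends almost all of its time executing, so it is the rank on the critical path and the most natural candidate for OpenMP analysis.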

Intel VTune Amplifier can be installed on a cluster. For further tuning of MPI, use Intel® Trace Analyzer and Collector.

The list shows OpenMP regions where performance tuning can significantly reduce execution time, with the highest impact regions shown first (see Fig. 3).
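
As a rough illustration of that ordering, the sketch below ranks regions by an assumed "potential gain", taken here as measured elapsed time minus an idealized time with no imbalance or overhead. The region names, the numbers, and the gain definition are all assumptions for the example, not output or formulas from the tool.

```python
def rank_regions_by_gain(regions):
    """Return (name, potential_gain_s) pairs, largest gain first.

    `regions` is a list of (name, elapsed_s, ideal_s) tuples, where
    ideal_s is a hypothetical estimate of the region's time without
    imbalance or threading overhead.
    """
    gains = [(name, elapsed - ideal) for name, elapsed, ideal in regions]
    return sorted(gains, key=lambda item: item[1], reverse=True)


# Made-up OpenMP regions: (name, elapsed seconds, idealized seconds).
regions = [
    ("assemble_matrix", 4.0, 3.5),
    ("solve_system", 12.0, 7.5),
    ("update_fields", 9.0, 8.0),
]

for name, gain in rank_regions_by_gain(regions):
    print(f"{name}: up to {gain:.1f} s could be saved")
```

Sorting by absolute gain instead of by elapsed time keeps attention on regions where tuning can actually shorten the run, even if a longer region is already close to its ideal time.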

Optimize Private Cloud-Based Applications

Profile enterprise applications written in Java* or in native languages like C, C++, and Fortran. Profile running Java services (such as mail services and daemons) without restarting the application. Popular container technologies, including Docker*, Mesos*, and LXC*, are supported.

Intel VTune Amplifier can easily attach to an application running in a container to collect profiling data (see Fig. 4).

Figure 4. Diagram of Intel VTune Amplifier running an application in a container