Running Intel® Parallel Studio XE Analysis Tools on Clusters with Slurm* / srun

Because HPC applications target high performance, users want to analyze how those applications behave at runtime. To get a representative picture of that behavior, it can be important to gather analysis data at the same scale as regular production runs. Doing so, however, means that shared-memory-focused analysis types would run on every node of the job in parallel. This might not be in the user’s best interest, especially since the behavior of a well-balanced MPI application should be very similar across all nodes. Therefore, users need the ability to run individual shared-memory-focused analysis types on subsets of MPI ranks or compute nodes.

 

There are multiple ways to achieve this, e.g. through

  1. Separating environments for different ranks through the MPI runtime arguments
  2. MPI-library-specific environments for attaching analysis tools, such as “gtool” for the Intel® MPI Library
  3. Batch scheduler parameters that allow separating the environments for different MPI ranks

 

This article focuses on the third option, using the Slurm* workload manager, which keeps the approach independent of the MPI library implementation in use.

The Slurm batch scheduler comes with a job submission utility called srun. A very simple srun job submission could look like the following:

$ srun ./my_application

Attaching an analysis tool from the Intel Parallel Studio XE suite, such as Intel® VTune Amplifier XE, Intel® Inspector XE, or Intel® Advisor XE, could look like the following:

$ srun amplxe-cl -c hotspots -r my_result_1 -- ./my_application

The downside of this approach, however, is that the analysis tool (VTune in this case) is attached to every MPI rank. The user therefore gets at least as many result directories as there are shared-memory nodes in the run.

If the user is only interested in analyzing a subset of MPI ranks or shared-memory nodes, they can use srun's multiple program configuration. To do so, the user creates a separate configuration file that defines which MPI ranks will be analyzed:

$ cat > srun_config.conf << EOF
0-98    ./my_application
99      amplxe-cl -c hotspots -r my_result_2 -- ./my_application
100-255 ./my_application
EOF

In this example configuration, the target application runs across 256 MPI ranks, and only the 100th MPI process (i.e., rank #99) is analyzed with VTune; all other ranks remain unaffected.
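
For other job sizes or a different rank of interest, the same three-part layout can be generated rather than typed by hand. The following is a minimal sketch, not part of Slurm or VTune: the function names make_srun_config and rank_range and their argument order are assumptions made for illustration only.

```shell
#!/bin/sh
# Hypothetical helpers (illustration only, not shipped with Slurm or VTune).

# Print "lo" when both ends are equal, otherwise "lo-hi".
rank_range() {
    if [ "$1" -eq "$2" ]; then
        echo "$1"
    else
        echo "$1-$2"
    fi
}

# Print a multi-prog configuration in which exactly one rank is wrapped
# by an analysis tool.
#   $1 = total number of MPI ranks   $2 = rank to analyze
#   $3 = application command         $4 = analysis tool command prefix
make_srun_config() {
    ntasks=$1
    target=$2
    app=$3
    tool=$4
    last=$((ntasks - 1))

    if [ "$target" -lt 0 ] || [ "$target" -gt "$last" ]; then
        echo "rank $target is outside 0-$last" >&2
        return 1
    fi

    # Ranks below the analyzed one run the plain application.
    if [ "$target" -gt 0 ]; then
        echo "$(rank_range 0 $((target - 1))) $app"
    fi

    # The analyzed rank runs the application under the tool.
    echo "$target $tool -- $app"

    # Ranks above the analyzed one run the plain application.
    if [ "$target" -lt "$last" ]; then
        echo "$(rank_range $((target + 1)) "$last") $app"
    fi
}
```

Calling make_srun_config 256 99 ./my_application "amplxe-cl -c hotspots -r my_result_2" and redirecting the output to srun_config.conf should reproduce the layout shown above, with single spaces between the columns.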

The user can then pass the configuration file to srun with the following command:

$ srun --multi-prog ./srun_config.conf
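
Before submitting, it can help to confirm that the configuration file assigns every rank exactly once, since a gap or overlap in the rank column is easy to introduce when editing ranges by hand. The check below is a sketch under stated assumptions: check_srun_config is a hypothetical helper, not a Slurm utility, and it understands only numeric ranks, comma-separated lists, and lo-hi ranges in the first column (not the * wildcard).

```shell
#!/bin/sh
# Hypothetical pre-flight check (illustration only, not a Slurm tool).
#   $1 = multi-prog configuration file   $2 = number of MPI ranks in the job
# Returns 0 if ranks 0..ntasks-1 each appear exactly once, 1 otherwise.
check_srun_config() {
    conf=$1
    ntasks=$2
    awk -v n="$ntasks" '
        /^[ \t]*(#|$)/ { next }   # skip comment and blank lines
        {
            # The first column may hold a comma-separated list of ranks/ranges.
            count = split($1, parts, ",")
            for (i = 1; i <= count; i++) {
                lo = hi = parts[i]
                if (split(parts[i], ends, "-") == 2) {
                    lo = ends[1]
                    hi = ends[2]
                }
                for (r = lo + 0; r <= hi + 0; r++)
                    seen[r]++
            }
        }
        END {
            bad = 0
            # Every rank in the job must be listed exactly once.
            for (r = 0; r < n; r++)
                if (seen[r] != 1) {
                    printf "rank %d is listed %d times\n", r, seen[r]
                    bad = 1
                }
            # No listed rank may lie beyond the job size.
            for (r in seen)
                if (r + 0 >= n) {
                    printf "rank %d exceeds the task count %d\n", r, n
                    bad = 1
                }
            exit bad
        }
    ' "$conf"
}
```

For the example in this article, check_srun_config ./srun_config.conf 256 would be the matching call; the task count has to agree with the number of tasks the job actually requests from Slurm.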

This way, only one result directory for rank #99 will be created.

*Other names and brands may be claimed as the property of others.
