Enable Intel® Software Development Tools for HPC Applications Running on Amazon EC2* Cluster

By Sunny L Gogar, Published: 04/02/2018, Last Updated: 04/02/2018

1. Introduction

This article demonstrates how to scale out a high performance computing (HPC) application compiled with Intel® Software Development Tools to leverage Intel® Xeon® Scalable processors hosted in the Amazon Elastic Compute Cloud* (Amazon EC2*) environment. We use CfnCluster (CloudFormation cluster), an open source tool published by Amazon Web Services* (AWS*), to deploy a fully elastic HPC cluster in the cloud in less than 15 minutes. Once created, the cluster provisions standard HPC tools such as schedulers, a Message Passing Interface (MPI) environment, and shared storage.

The tutorial presented in this article targets two audiences: application developers who use the Intel® C++ Compiler and/or Intel® Fortran Compiler with the Intel® MPI Library to build HPC applications and want to test how those applications scale across multiple HPC nodes, and application users who want to execute binaries precompiled with Intel® Software Development Tools in an HPC environment running in the cloud and thereby increase application throughput.

2. CfnCluster

CfnCluster is a framework that deploys and maintains HPC clusters on AWS.

In order to use the CfnCluster tool to set up an HPC cluster in AWS, you'll need an AWS account and an Amazon EC2 key pair. On the local workstation, install and configure the AWS Command Line Interface (AWS CLI) and a recent version of Python* (Python 2 >= 2.7.9 or Python 3 >= 3.4).

The process to sign up for AWS and access the Amazon EC2 key pair is outside the scope of this article. The remainder of the article assumes the user has created an AWS account and has access to the Amazon EC2 key pair. For more information on creating an AWS account, refer to the AWS website. For additional information on creating an Amazon EC2 key pair, refer to these steps.

The following steps show you how to install and configure both AWS CLI and CfnCluster.

2.1 Install and configure AWS CLI

Assuming you have a recent version of Python installed on your workstation, one way to install AWS CLI is with Python's package manager, pip.

$ pip install awscli --upgrade --user

Once AWS CLI is installed, configure it using the following steps. To do this, you'll need your AWS access key ID and secret access key. You can also select a preferred region for launching your AWS instances. More information about AWS CLI configuration can be found here.

$ aws configure
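The configure step prompts interactively for credentials and defaults. A sketch of the session is shown below; the key values are the same placeholder examples used in Figure 1, not real credentials:

```
$ aws configure
AWS Access Key ID [None]: AAWSACCESSKEYEXAMPLE
AWS Secret Access Key [None]: uwjaueu3EXAMPLEKEYozExamplekeybuJuth
Default region name [None]: us-east-1
Default output format [None]: json
```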

2.2 Install and configure CfnCluster

To install CfnCluster, use pip as follows:

$ sudo pip install --upgrade cfncluster

Once CfnCluster is installed successfully, configure it using the following command:

$ cfncluster configure

The default location of the CfnCluster configuration file is ~/.cfncluster/config. You can customize this file with the editor of your choice to set the cluster size, scheduler type, base operating system, and other parameters. Figure 1 shows an example of a CfnCluster configuration file.

[aws]
aws_access_key_id = AAWSACCESSKEYEXAMPLE
aws_secret_access_key = uwjaueu3EXAMPLEKEYozExamplekeybuJuth
aws_region_name = us-east-1

[cluster Cluster4c4]
key_name = AWS_PLSE_KEY
vpc_settings = public
initial_queue_size = 4
max_queue_size = 4
compute_instance_type = c5.9xlarge
master_instance_type = c5.large
maintain_initial_size = true
scheduler = sge
placement_group = DYNAMIC
base_os = centos7

[vpc public]
vpc_id = vpc-68df1310
master_subnet_id = subnet-ba37e8f1

[global]
cluster_template = Cluster4c4
update_check = true
sanity_check = true

Figure 1. Sample CfnCluster configuration file

The configuration file shown in Figure 1 includes the following information about the cluster, which will be launched using the CfnCluster tool:

  • AWS access keys and region name for the HPC cluster
  • Cluster parameters
    • Initial queue size and max queue size. The initial queue size is the number of compute nodes made available when the cluster is first launched. Because CfnCluster provisions an elastic cluster, the number of compute nodes can vary as needed, up to the max queue size, which is the maximum number of compute nodes allowed for the cluster. For this tutorial we launch a four-compute-node cluster.
    • Master and compute instance types. The AWS instance types launched for the master and compute nodes; currently available instance types are listed on the AWS instance types webpage. For this article we select Intel Xeon Scalable processor-based Amazon EC2 C5 instances: a c5.large master node (2 virtual CPUs, or vCPUs) for cluster management and the job scheduler, and c5.9xlarge compute nodes (36 vCPUs each).
    • Scheduler. The type of job scheduler to be configured and installed for this cluster. By default, CfnCluster launches a cluster with the Sun Grid Engine (SGE) scheduler. Other available options are the Slurm Workload Manager (Slurm) and the Torque Resource Manager (Torque).
    • Placement group. Determines how instances are placed on the underlying hardware, which matters for HPC applications that need the lowest possible latency between nodes. The placement group can be NONE (the default), DYNAMIC, or a custom placement group created by the user; DYNAMIC creates a unique placement group that is destroyed along with the cluster.
  • Virtual Private Cloud (VPC) and subnets. A VPC is a virtual network dedicated to an AWS account that is logically isolated from other accounts or users. The VPC ID and master subnet ID used to launch a CfnCluster can be found in the user's AWS console.

Additional information about other available options for CfnCluster parameters can be found on the CfnCluster configuration webpage.

3. Creating CfnCluster

Once the CfnCluster configuration file is verified, launch the HPC cluster as follows:

$ cfncluster create Cluster4c4

Upon successful cluster creation, the CfnCluster tool prints information such as the master node's public and private IP addresses, which you can use to access the just-launched HPC cluster. It also provides a URL for monitoring cluster utilization with the Ganglia Monitoring System, an open source tool.

Status: cfncluster-Cluster4c4 - CREATE_COMPLETE
Output: "MasterPublicIP"=""

3.1 CfnCluster login

SSH into the master node using the public IP address. For CentOS* as the base operating system, the default user name is centos, and for Amazon Linux AMI* (Amazon Machine Image*), the default user name is ec2-user.

$ ssh centos@<MasterPublicIP>
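If the key pair is not loaded into your SSH agent, pass the private key file explicitly. The key file name below is a placeholder for the Amazon EC2 key pair you created earlier:

```
$ ssh -i ~/.ssh/my-ec2-key.pem centos@<MasterPublicIP>
```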

4. Executing an HPC Job on an HPC Cluster

As the performance chart on the Intel® MPI Library product page shows, the Intel MPI Library delivers significant performance improvements over open source MPI libraries such as Open MPI* and MVAPICH*. This is because the Intel MPI Library, which implements the high-performance MPI-3.1 standard, focuses on making MPI applications perform better on clusters based on Intel® architecture. By default, however, CfnCluster configures and installs only the open source Open MPI library. To get increased application performance on Amazon EC2 C5 instances based on Intel Xeon Scalable processors, we must install the stand-alone runtime packages required to execute applications compiled with the Intel compilers and the Intel MPI Library: the Intel MPI Library runtime, the Intel compiler runtimes, and/or the Intel® Math Kernel Library (Intel® MKL).

4.1 Intel® MPI Library runtime installation

The Intel MPI Library runtime package includes everything you need to run applications based on the Intel MPI Library. It is free of charge and available to customers who already have applications enabled with the Intel MPI Library, and it includes the full install and runtime scripts. The runtime package can be downloaded from the Intel Registration Center.

Once the Intel MPI runtime package is downloaded, extract and install it with the following commands. Here we assume the downloaded version of the Intel MPI Library is 2018 Update 2. By default, the Intel MPI runtime library is installed in the /opt/intel directory; because we want the library to be shared by all nodes in the cluster, we customize the install to change the install directory to /shared/opt/intel.

With the provisioned cluster, a shared NFS mount is created for the user and is available at /shared. This mount is backed by an Amazon Elastic Block Store* (Amazon EBS*) volume whose contents are, by default, deleted when the cluster is torn down. However, it is common to install frequently used HPC software such as the Intel MPI runtime library to /shared and snapshot the /shared EBS volume so that the same preconfigured software can be deployed on future clusters.

$ tar -xvzf l_mpi-rt_2018.2.199.tgz
$ cd l_mpi-rt_2018.2.199
$ sudo ./install.sh
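The interactive install.sh lets you change the install directory when prompted. To script the change instead, the Intel installers also accept a silent configuration file; the sketch below follows the Parallel Studio XE silent-install convention, and the option names should be checked against the silent.cfg shipped inside the extracted package:

```
$ cat silent.cfg
ACCEPT_EULA=accept
PSET_INSTALL_DIR=/shared/opt/intel
PSET_MODE=install
$ sudo ./install.sh -s silent.cfg
```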

The next step is to copy or download your precompiled MPI application package to the HPC cluster. For this tutorial we use one of the precompiled Intel MKL benchmarks, the High Performance Conjugate Gradients (HPCG) benchmark, as an example to show how to run an MPI application on the HPC cluster. A free stand-alone version of Intel MKL can be downloaded from the product page.

The HPCG Benchmark project provides a new metric for ranking HPC systems. It is intended as a complement to the High Performance LINPACK (HPL) benchmark, which is used to rank the TOP500 supercomputing systems. HPCG differs from HPL in that it exercises not only the computational power of the system but also its data access patterns; as a result, HPCG is representative of a broad set of important applications.

4.2 Intel® Math Kernel Library - installation

$ tar -xvzf l_mkl_2018.2.199.tgz
$ cd l_mkl_2018.2.199
$ sudo ./install.sh
$ cp -r /shared/opt/intel/mkl/benchmarks/hpcg /home/centos/hpcg

If your application has no Intel MKL dependencies, you can skip the Intel MKL installation and simply copy your application package to the master node. In that case, you may still need to install other runtime dependencies, for example the Intel compiler runtime library.

$ scp -r /opt/intel/mkl/benchmarks/hpcg  centos@

4.3 Intel compiler runtime library - installation

Redistributable libraries for the 2018 Intel® C++ Compiler and Intel® Fortran Compiler for Linux* can be downloaded for free from the product page. If you have already installed Intel MKL, you can skip installing Intel compiler runtime library.

$ tar -xvzf l_comp_lib_2018.2.199_comp.cpp_redist.tgz
$ cd  l_comp_lib_2018.2.199_comp.cpp_redist
$ sudo ./install.sh

4.4 List of compute nodes

In order to run an MPI job across the cluster, we need a list of compute nodes (a hostfile or machinefile) on which the HPC job will be executed. Depending on the type of scheduler, generate this hostfile as follows:

# For the SGE scheduler
$ qconf -sel | awk -F. '{print $1}' > hostfile

# For the Slurm workload manager
$ sinfo -N | grep compute | awk '{print $1}' > hostfile
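To sanity-check the SGE variant offline, the same awk filter can be exercised on sample qconf-style output; the EC2-internal hostnames below are hypothetical:

```shell
# Strip the domain suffix from fully qualified node names,
# exactly as the qconf pipeline above does:
printf 'ip-10-0-1-10.ec2.internal\nip-10-0-1-11.ec2.internal\n' \
  | awk -F. '{print $1}' > hostfile
cat hostfile
```

Each line of the resulting hostfile holds one short node name, which is the format mpirun's -machinefile option expects.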

4.5 Job submittal scripts

We also need to create a job submittal file for launching MPI jobs using the scheduler. Figure 2 shows a sample job submittal script for the SGE scheduler, launching HPCG across Intel Xeon Scalable processor-based compute instances (Amazon EC2 C5 instances).

#!/bin/bash
#$ -cwd
#$ -j y

# Executable (example path; substitute your own MPI binary,
# e.g. the Intel MKL HPCG binary copied to /home/centos/hpcg)
EXE=/home/centos/hpcg/bin/xhpcg_skx

# MPI Settings
source /shared/opt/intel/compilers_and_libraries/linux/bin/compilervars.sh intel64
source /shared/opt/intel/impi/2018.2.199/bin64/mpivars.sh intel64

# Fabrics Settings
export I_MPI_FABRICS=shm:tcp

# Launch MPI application
mpirun -np 4 -ppn 1 -machinefile hostfile -genv KMP_AFFINITY="granularity=fine,compact,1,0" $EXE

Figure 2. Job submittal script – SGE scheduler (job.sh).

For the Slurm workload manager, the sample job submittal script is as follows:

#!/bin/bash
#SBATCH -t 00:02:00   # wall time limit

# Executable (example path; substitute your own MPI binary,
# e.g. the Intel MKL HPCG binary copied to /home/centos/hpcg)
EXE=/home/centos/hpcg/bin/xhpcg_skx

# MPI Settings
source /shared/opt/intel/compilers_and_libraries/linux/bin/compilervars.sh intel64
source /shared/opt/intel/impi/2018.2.199/bin64/mpivars.sh intel64
# Fabrics Settings
export I_MPI_FABRICS=shm:tcp

# Launch MPI application
mpirun -np 4 -ppn 1 -machinefile hostfile -genv KMP_AFFINITY="granularity=fine,compact,1,0" $EXE

Figure 3. Job submittal script – Slurm scheduler (job.sh).

4.6 Launching the HPC job

Once the modifications to the job script are completed, execute the MPI application as follows:

# For the SGE scheduler
$ qsub job.sh

# For the Slurm workload manager
$ sbatch job.sh
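After submission, each scheduler's own status command shows whether the job is queued or running:

```
# For the SGE scheduler
$ qstat

# For the Slurm workload manager
$ squeue
```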

4.7 Monitoring the HPC job

The HPC cluster provisioned using the CfnCluster tool also provides access to the Ganglia Monitoring System via the URL specific to your cluster. Using the Ganglia Public URL (http://<PublicIPAddress>/ganglia/), we can access the console view of the cluster to monitor utilization. Figure 4 shows the average cluster utilization for a one-hour time window.


Figure 4. Cluster utilization using Ganglia.

Additional information about the Ganglia Monitoring System can be found here.

5. Deleting CfnCluster

Once you log off the master instance, delete CfnCluster as follows:

$ cfncluster list             # List clusters that are online
$ cfncluster delete Cluster4c4

However, if you would like to reuse the cluster and avoid installing the runtime libraries every time you create one, it may be useful to create a snapshot of the shared drive /shared, which resides on an Amazon EBS volume, before deleting the cluster. More information on creating an EBS volume snapshot for cluster reusability can be found in this document published by Amazon Web Services.
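As a sketch, the snapshot can be taken with AWS CLI before deletion. The volume ID below is a placeholder; look up the real ID of the /shared volume in the EC2 console or with `aws ec2 describe-volumes`:

```
$ aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 \
    --description "CfnCluster /shared with Intel runtimes"
```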

6. Conclusion

This introductory article on HPC in the cloud demonstrates how the capabilities of Intel Xeon Scalable processors can be leveraged in an AWS cloud environment. We focused on how the CfnCluster tool from AWS can be configured and used to launch a four-compute-node HPC cluster. Using the steps provided in the article, HPC application developers or users can compile their applications using Intel® Parallel Studio XE on their workstations and then deploy and test the scalability of their application in the cloud using stand-alone runtime libraries (free for registered customers) and available job schedulers like SGE or Slurm. However, the application performance will vary depending on the type of compute instances selected, memory size, and the available network throughput between the compute nodes.


Product and Performance Information


Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804