Using Intel® MPI Library on Intel® Xeon Phi™ Product Family

Introduction

The Message Passing Interface (MPI) standard defines a message-passing library interface: a collection of routines used in distributed-memory parallel programming. This document is designed to help users get started writing code and running MPI applications using the Intel® MPI Library on a development platform that includes the Intel® Xeon Phi™ processor or coprocessor. The Intel MPI Library is a multi-fabric message passing library that implements the MPI-3.1 specification (see Table 1).

In this document, the Intel MPI Library 2017 and 2018 Beta for Linux* OS are used.

Table 1. Intel® MPI Library at a glance

Processors

Intel® processors, coprocessors, and compatibles

Languages

Natively supports C, C++, and Fortran development

Development Environments

Microsoft Visual Studio* (Windows*), Eclipse*/CDT* (Linux*)

Operating Systems

Linux and Windows

Interconnect Fabric Support

Shared memory
RDMA-capable network fabrics through DAPL* (for example, InfiniBand*, Myrinet*)
Intel® Omni-Path Architecture
Sockets (for example, TCP/IP over Ethernet, Gigabit Ethernet*) and others.

This document summarizes the steps to build and run an MPI application, natively or symmetrically, on an Intel® Xeon Phi™ processor x200, an Intel® Xeon Phi™ coprocessor x200, and an Intel® Xeon Phi™ coprocessor x100. First, we introduce the Intel Xeon Phi processor x200 and x100 product families and the MPI programming models.

Intel® Xeon Phi™ Processor Architecture

Intel Xeon Phi processor x200 product family architecture: There are two versions of this product. The processor version is the host processor and the coprocessor version requires an Intel® Xeon® processor host. Both versions share the architecture below (see Figure 1):

  • Intel® Advanced Vector Extensions 512 (Intel® AVX-512)
  • Up to 72 cores with 2D mesh architecture
  • Each core has two 512-bit vector processing units (VPUs) and four hardware threads
  • Each pair of cores (tile) shares 1 MB L2 cache
  • 8 or 16 GB high-bandwidth on package memory (MCDRAM)
  • 6 channels DDR4, up to 384 GB (available in the processor version only)
  • For the coprocessor, the third-generation PCIe* is connected to the host


Figure 1. Intel® Xeon Phi™ processor x200 architecture.

To enable the functionalities of the Intel Xeon Phi processor x200, you need to download and install the Intel Xeon Phi processor software available here.

The Intel Xeon Phi coprocessor x200 attaches to an Intel Xeon processor-based host via a third-generation PCIe interface. The coprocessor runs on a standard Linux OS. It can be used as an extension to the host (so the host can offload the workload) or as an independent compute node. The first step to bring an Intel Xeon Phi coprocessor x200 into service is to install the Intel® Manycore Platform Software Stack (Intel® MPSS) 4.x on the host, which is available here. The Intel MPSS is a collection of software including device drivers, coprocessor management utilities, and the Linux OS for the coprocessor.

Intel Xeon Phi coprocessor x100 architecture: the Intel Xeon Phi coprocessor x100 is the first generation of the Intel Xeon Phi product family. The coprocessor attaches to an Intel Xeon processor-based host via a second-generation PCIe interface. It runs an OS separate from the host and has the following architecture (see Figure 2):

  • Intel® Initial Many Core Instructions
  • Up to 61 cores with high-bandwidth, bidirectional ring interconnect architecture
  • Each core has a 512-bit wide VPU and four hardware threads
  • Each core has a private 512-KB L2 cache
  • 16 GB GDDR5 memory
  • The second-generation PCIe is connected to the host


Figure 2. Intel® Xeon Phi™ coprocessor x100 architecture.

To bring the Intel Xeon Phi coprocessor x100 into service, you must install the Intel MPSS 3.x on the host, which can be downloaded here.

MPI Programming Models

The Intel MPI Library supports the following MPI programming models (see Figure 3):

  • Host-only model (Intel Xeon processor or Intel Xeon Phi processor): In this mode, all MPI ranks reside and execute the workload on the host CPU only (or Intel Xeon Phi processor only).
  • Offload model: In this mode, the MPI ranks reside solely on the Intel Xeon processor host. The MPI ranks use offload capabilities of the Intel® C/C++ Compiler or Intel® Fortran Compiler to offload some workloads to the coprocessors. Typically, one MPI rank is used per host, and the MPI rank offloads to the coprocessor(s).
  • Coprocessor-only model: In this native mode, the MPI ranks reside solely inside the coprocessor. The application can be launched from the coprocessor.
  • Symmetric model: In this mode, the MPI ranks reside on the host and the coprocessors. The application can be launched from the host.


Figure 3. MPI programming models.

Using the Intel® MPI Library

This section shows how to build and run an MPI application in the following configurations: on an Intel Xeon Phi processor x200, on a system with one or more Intel Xeon Phi coprocessor x200, and on a system with one or more Intel Xeon Phi coprocessor x100 (see Figure 4).


Figure 4. Different configurations: (a) standalone Intel® Xeon Phi™ processor x200, (b) Intel Xeon Phi coprocessor x200 connected to a system with an Intel® Xeon® processor, and (c) Intel® Xeon Phi™ coprocessor x100 connected to a system with an Intel Xeon processor.

Installing the Intel® MPI Library

The Intel MPI Library is packaged as a standalone product or as a part of the Intel® Parallel Studio XE Cluster Edition.

By default, the Intel MPI Library will be installed in the path /opt/intel/impi on the host or the Intel Xeon Phi processor. To start, follow the appropriate directions to install the latest versions of the Intel C/C++ Compiler and the Intel Fortran Compiler.

You can purchase the Intel Parallel Studio XE, or try the free 30-day evaluation, from https://software.intel.com/en-us/intel-parallel-studio-xe. These instructions assume that you have the Intel MPI Library tar file l_mpi_<version>.<package_num>.tgz, the latest stable release of the library at the time of writing this article. To check whether a newer version exists, log in to the Intel® Registration Center. The instructions below are valid for all current and subsequent releases.

As root user, untar the tar file l_mpi_<version>.<package_num>.tgz:

# tar -xzvf l_mpi_<version>.<package_num>.tgz
# cd l_mpi_<version>.<package_num>

Execute the install script on the host and follow the instructions. The installation will be placed in the default installation directory /opt/intel/impi/<version>.<package_num> assuming you are installing the library with root permission.

# ./install.sh

Compiling an MPI program

To compile an MPI program on the host or on an Intel Xeon Phi processor x200, first establish the proper environment settings for the compiler and for the Intel MPI Library:

$ source /opt/intel/compilers_and_libraries_<version>/linux/bin/compilervars.sh intel64
$ source /opt/intel/impi/<version>.<package_num>/bin64/mpivars.sh

or if you installed the Intel® Parallel Studio XE Cluster Edition, you can simply source the configuration script:

$ source /opt/intel/parallel_studio_xe_<version>/psxevars.sh intel64

Then compile and link your MPI program using the appropriate wrapper command from Table 2.

Table 2. MPI compilation Linux* commands.

Programming Language    MPI Compilation Linux* Command
C                       mpiicc
C++                     mpiicpc
Fortran 77 / 95         mpiifort

For example, to compile the C program for the host, you can use the wrapper mpiicc:

$ mpiicc ./myprogram.c -o myprogram

To compile the program for the Intel Xeon Phi processor x200 and Intel Xeon Phi coprocessor x200, add the flag -xMIC-AVX512 to take advantage of the Intel AVX-512 instruction set architecture (ISA) available on this architecture. For example, the following command compiles a C program for the Intel Xeon Phi product family x200 using the Intel AVX-512 ISA:

$ mpiicc -xMIC-AVX512 ./myprogram.c -o myprogram.knl

To compile the program for the Intel Xeon Phi coprocessor x100, add the flag -mmic. The following command shows how to compile a C program for the Intel Xeon Phi coprocessor x100:

$ mpiicc -mmic ./myprogram.c -o myprogram.knc

Running an MPI program on the Intel Xeon Phi processor x200

To run the application on the Intel Xeon Phi processor x200, use the script mpirun:

$ mpirun -n <# of processes> ./myprogram.knl

where n is the number of MPI processes to launch on the processor.

Running an MPI program on the Intel Xeon Phi coprocessor x200 and Intel Xeon Phi coprocessor x100

To run an application on the coprocessors, the following steps are needed:

  • Start the MPSS service if it was stopped previously:

    $ sudo systemctl start mpss

  • Transfer the MPI executable from the host to the coprocessor. For example, use the scp utility to transfer the executable (for the Intel Xeon Phi coprocessor x100) to the coprocessor named mic0:

    $ scp myprogram.knc mic0:~/myprogram.knc

  • Transfer the MPI libraries and compiler libraries to the coprocessors: before the first run of an MPI application on the Intel Xeon Phi coprocessors, copy the appropriate MPI libraries and compiler libraries to each coprocessor in the system. For the coprocessor x200, transfer the libraries under the /lib64 directory; for the coprocessor x100, transfer the libraries under the /mic directory.

For example, we issue the copy to the first coprocessor x100, called mic0, which is accessible via the IP address 172.31.1.1. Note that each coprocessor has a unique IP address, since it is treated as just another uniquely addressable machine; you can refer to the first coprocessor as mic0 or by its IP address.

# sudo scp /opt/intel/impi/2017.3.196/mic/bin/* mic0:/bin/
# sudo scp /opt/intel/impi/2017.3.196/mic/lib/* mic0:/lib64/
# sudo scp /opt/intel/composer_xe_2017.3.196/compiler/lib/mic/* mic0:/lib64/

Instead of copying the MPI and compiler libraries manually, you can run the script shown below to transfer them to the two coprocessors mic0 and mic1:

#!/bin/sh

export COPROCESSORS="mic0 mic1"
export BINDIR="/opt/intel/impi/2017.3.196/mic/bin"
export LIBDIR="/opt/intel/impi/2017.3.196/mic/lib"
export COMPILERLIB="/opt/intel/compilers_and_libraries_2017/linux/lib/mic"

for coprocessor in `echo $COPROCESSORS`
do
   for prog in mpiexec mpiexec.hydra pmi_proxy mpirun
   do
      sudo scp $BINDIR/$prog $coprocessor:/bin/$prog
   done

   for lib in libmpi.so.12 libmpifort.so.12 libmpicxx.so.12
   do
      sudo scp $LIBDIR/$lib $coprocessor:/lib64/$lib
   done

   for lib in libimf.so libsvml.so libintlc.so.5
   do
      sudo scp $COMPILERLIB/$lib $coprocessor:/lib64/$lib
   done
done

Script used for transferring MPI libraries to two coprocessors.

Another approach is to NFS mount the coprocessors’ file system from the host so that the coprocessors can have access to their MPI libraries from there. One advantage of using NFS mounts is that it saves RAM space on the coprocessors. The details on how to set up NFS mounts can be found in the first example in this document.

To run the application natively on the coprocessor, log in to the coprocessor and then run the mpirun script:

$ ssh mic0
$ mpirun -n <# of processes> ./myprogram.knc

where n is the number of MPI processes to launch on the coprocessor.

Finally, to run an MPI program from the host (symmetrically), additional steps are needed:

Set the Intel MPI environment variable I_MPI_MIC to let the Intel MPI Library recognize the coprocessors:

$ export I_MPI_MIC=enable

Disable the firewall in the host:

$ systemctl status firewalld
$ sudo systemctl stop firewalld

For multi-card use, configure Intel MPSS peer-to-peer so that each card can ping others:

$ sudo /sbin/sysctl -w net.ipv4.ip_forward=1

If you want to get debug information, include the flags -verbose and -genv I_MPI_DEBUG=n when running the application.

The following sections include sample MPI programs written in C. The first example shows how to compile and run a program for Intel Xeon Phi processor x200 and for Intel Xeon Phi coprocessor x200. The second example shows how to compile and run a program for Intel Xeon Phi coprocessor x100.

Example 1

For illustration purposes, this example shows how to build and run an Intel MPI application in symmetric mode on a host that connects to two Intel Xeon Phi coprocessors x200. Note that the driver Intel MPSS 4.x should be installed on the host to enable the Intel Xeon Phi coprocessor x200.

In this example, we use the integral representation below to calculate Pi (π):

π = ∫₀¹ 4 / (1 + x²) dx

Appendix A includes the implementation program. The workload is divided among the MPI ranks. Each rank spawns a team of OpenMP* threads, and each thread works on a chunk of the workload to take advantage of vectorization. First, compile and run this application on the Intel Xeon processor host. Since this program uses OpenMP, you need to compile the program with OpenMP libraries. Note that the Intel Parallel Studio XE 2018 is used in this example.

Set the environment variables, compile the application for the host, and then generate the optimization report on vectorization and OpenMP:

$ source /opt/intel/compilers_and_libraries_2018/linux/bin/compilervars.sh intel64
$ mpiicc mpitest.c -qopenmp -O3 -qopt-report=5 -qopt-report-phase:vec,openmp -o mpitest

To run two ranks on the host:

$ mpirun -host localhost -n 2 ./mpitest
Hello world: rank 0 of 2 running on knl-lb0.jf.intel.com
Hello world: rank 1 of 2 running on knl-lb0.jf.intel.com
FROM RANK 1 - numthreads = 32
FROM RANK 0 - numthreads = 32

Elapsed time from rank 0:    8246.90 (usec)
Elapsed time from rank 1:    8423.09 (usec)
rank 0 pi=   3.141613006592

Next, compile the application for the Intel Xeon Phi coprocessor x200 and transfer the executable to the coprocessors mic0 and mic1 (this assumes you have already set up passwordless SSH to the coprocessors).

$ mpiicc mpitest.c -qopenmp -O3 -qopt-report=5 -qopt-report-phase:vec,openmp -xMIC-AVX512 -o mpitest.knl
$ scp mpitest.knl mic0:~/.
$ scp mpitest.knl mic1:~/.

Enable MPI for the coprocessors and disable the firewall in the host:

$ export I_MPI_MIC=enable
$ sudo systemctl stop firewalld

This example also shows how to mount a shared directory using the Network File System (NFS). As root, mount the /opt/intel directory, where the Intel C++ Compiler and the Intel MPI Library are installed. First, add descriptors in the /etc/exports configuration file on the host to share the directory /opt/intel with the coprocessors, whose IP addresses are 172.31.1.1 and 172.31.2.1, with read-only (ro) privilege.

[host~]# cat /etc/exports
/opt/intel 172.31.1.1(ro,async,no_root_squash)
/opt/intel 172.31.2.1(ro,async,no_root_squash)

Update the NFS export table and restart the NFS server in the host:

[host~]# exportfs -a
[host~]# service nfs restart

Next, log in on the coprocessors and create the mount point /opt/intel:

[host~]# ssh mic0
mic0:~# mkdir /opt
mic0:~# mkdir /opt/intel

 

Insert the descriptor “172.31.1.254:/opt/intel /opt/intel nfs defaults 1 1” into the /etc/fstab file in mic0:

mic0:~# cat /etc/fstab
/dev/root            /                    auto       defaults              1  1
proc                 /proc                proc       defaults              0  0
devpts               /dev/pts             devpts     mode=0620,gid=5       0  0
tmpfs                /run                 tmpfs      mode=0755,nodev,nosuid,strictatime 0  0
tmpfs                /var/volatile        tmpfs      defaults,size=85%     0  0
172.31.1.254:/opt/intel /opt/intel nfs defaults                            1  1	

Finally, mount the shared directory /opt/intel on the coprocessor:

mic0:~# mount -a

Repeat this procedure for mic1, adding the descriptor “172.31.2.254:/opt/intel /opt/intel nfs defaults 1 1” to the /etc/fstab file in mic1.

Make sure that mic0 and mic1 are included in the /etc/hosts file:

$ cat /etc/hosts 
127.0.0.1       localhost
::1             localhost
172.31.1.1      mic0
172.31.2.1      mic1

$ mpirun -host localhost -n 1 ./mpitest : -host mic0 -n 1 ~/mpitest.knl : -host mic1 -n 1 ~/mpitest.knl
Hello world: rank 0 of 3 running on knl-lb0
Hello world: rank 1 of 3 running on mic0
Hello world: rank 2 of 3 running on mic1
FROM RANK 0 - numthreads = 64
FROM RANK 2 - numthreads = 272
FROM RANK 1 - numthreads = 272
Elapsed time from rank 0:   12114.05 (usec)
Elapsed time from rank 1:  136089.09 (usec)
Elapsed time from rank 2:  125049.11 (usec)
rank 0 pi=   3.141597270966

By default, the maximum number of hardware threads available on each compute node is used. You can change this default behavior with the -env option, which sets an environment variable locally for one compute node. For example, to set the number of OpenMP threads on mic0 to 68 with compact affinity, use the command:

$ mpirun -host localhost -n 1 ./mpitest : -host mic0 -n 1 -env OMP_NUM_THREADS=68 -env KMP_AFFINITY=compact ~/mpitest.knl : -host mic1 -n 1 ~/mpitest.knl
Hello world: rank 0 of 3 running on knl-lb0.jf.intel.com
Hello world: rank 1 of 3 running on mic0
Hello world: rank 2 of 3 running on mic1
FROM RANK 0 - numthreads = 64
FROM RANK 1 - numthreads = 68
FROM RANK 2 - numthreads = 272
Elapsed time from rank 0:   11068.11 (usec)
Elapsed time from rank 1:   57780.98 (usec)
Elapsed time from rank 2:  133417.13 (usec)
rank 0 pi=   3.141597270966

To simplify the launch process, define a machine file listing all the node names, give all the executables the same name, and move them to a predefined directory. In this example, all executables are named mpitest and are located in the user home directories:

$ cat hosts_file
knl-lb0:1
mic0:2
mic1:2

$ mpirun -machinefile hosts_file -n 5 ~/mpitest
Hello world: rank 0 of 5 running on knl-lb0
Hello world: rank 1 of 5 running on mic0
Hello world: rank 2 of 5 running on mic0
Hello world: rank 3 of 5 running on mic1
Hello world: rank 4 of 5 running on mic1
FROM RANK 0 - numthreads = 64
FROM RANK 1 - numthreads = 136
FROM RANK 3 - numthreads = 136
FROM RANK 2 - numthreads = 136
FROM RANK 4 - numthreads = 136
Elapsed time from rank 0:   11260.03 (usec)
Elapsed time from rank 1:   71480.04 (usec)
Elapsed time from rank 2:   69352.15 (usec)
Elapsed time from rank 3:   74187.99 (usec)
Elapsed time from rank 4:   67718.98 (usec)
rank 0 pi=   3.141598224640

 

Example 2

Example 2 shows how to build and run an MPI application in symmetric mode on a host connected to two Intel Xeon Phi coprocessors x100. Note that the Intel MPSS 3.x driver must be installed on the host for the Intel Xeon Phi coprocessor x100.

The sample program estimates Pi (π) using a Monte Carlo method. Consider a sphere centered at the origin and circumscribed by a cube; the sphere’s radius is r and the cube edge length is 2r. The volumes of the sphere and the cube are given by

Vsphere = (4/3) π r³        Vcube = (2r)³ = 8 r³

The first octant of the coordinate system contains one eighth of the volumes of both the sphere and the cube; the volumes in that octant are given by:

Vsphere / 8 = (1/6) π r³        Vcube / 8 = r³

If we generate Nc points uniformly and randomly in the cube within this octant, we expect that about Ns points will be inside the sphere’s volume according to the following ratio:

Ns / Nc ≈ (Vsphere / 8) / (Vcube / 8) = ((1/6) π r³) / r³ = π / 6

Therefore, the estimated Pi (π) is calculated by

π ≈ 6 × Ns / Nc

where Nc is the number of points generated in the portion of the cube residing in the first octant, and Ns is the total number of points found inside the portion of the sphere residing in the first octant.

In the implementation, rank 0 is responsible for dividing the work among the other n ranks. Each rank is assigned a chunk of work, and the results are summed to estimate Pi. Rank 0 divides the x-axis into n equal segments. Each rank generates (Nc / n) points in its assigned segment, and then counts the points that fall inside the first octant of the sphere (see Figure 5).


Figure 5. Each MPI rank handles a different portion in the first octant.

The pseudo code is shown below:

Rank 0 generates n random seeds
Rank 0 broadcasts the random seeds to the n ranks
For each rank i in [0, n-1]:
    receive the corresponding seed
    set num_inside = 0
    For j = 0 to Nc / n:
        generate a point with coordinates
            x in [i/n, (i+1)/n]
            y in [0, 1]
            z in [0, 1]
        compute the distance d = x^2 + y^2 + z^2
        if d <= 1, increment num_inside
    Send num_inside back to rank 0
Rank 0 sets Ns to the sum of all num_inside
Rank 0 computes Pi = 6 * Ns / Nc

To build the application montecarlo.knc for the Intel Xeon Phi coprocessor x100, the Intel C++ Compiler 2017 is used. Appendix B includes the implementation program. Note that this example simply shows how to run the code on an Intel Xeon Phi coprocessor x100; you can optimize the sample code further.

$ source /opt/intel/compilers_and_libraries_2017/linux/bin/compilervars.sh intel64
$ mpiicc -mmic montecarlo.c -o montecarlo.knc

Build the application for the host:

$ mpiicc montecarlo.c -o montecarlo

Transfer the application montecarlo.knc to the /tmp directory on the coprocessors using the scp utility. In this example, we issue the copy to two Intel Xeon Phi coprocessors x100.

$ scp ./montecarlo.knc mic0:/tmp/montecarlo.knc
montecarlo.knc     100% 17KB 16.9KB/s 00:00
$ scp ./montecarlo.knc mic1:/tmp/montecarlo.knc
montecarlo.knc     100% 17KB 16.9KB/s 00:00

Transfer the MPI libraries and compiler libraries to the coprocessors using the transfer script shown earlier. Then enable MPI communication between the host and the Intel Xeon Phi coprocessors x100:

$ export I_MPI_MIC=enable

Run the mpirun script to start the application. The flag -n specifies the number of MPI processes, and the flag -host specifies the machine name:

$ mpirun -n <# of processes> -host <hostname> <application>

We can run the application on multiple hosts by separating their specifications with “:”. The first MPI rank (rank 0) is always placed on the host named in the first part of the command:

$ mpirun -n <# of processes> -host <hostname1> <application> : -n <# of processes> -host <hostname2> <application>

This starts rank 0 on hostname1 and the other ranks on hostname2.

Now run the application on the host. The mpirun command shown below starts the application with 2 ranks on the host, 3 ranks on the coprocessor mic0, and 5 ranks on coprocessor mic1:

$ mpirun -n 2 -host localhost ./montecarlo : -n 3 -host mic0 /tmp/montecarlo.knc \
: -n 5 -host mic1 /tmp/montecarlo.knc

Hello world: rank 0 of 10 running on knc0
Hello world: rank 1 of 10 running on knc0
Hello world: rank 2 of 10 running on knc0-mic0
Hello world: rank 3 of 10 running on knc0-mic0
Hello world: rank 4 of 10 running on knc0-mic0
Hello world: rank 5 of 10 running on knc0-mic1
Hello world: rank 6 of 10 running on knc0-mic1
Hello world: rank 7 of 10 running on knc0-mic1
Hello world: rank 8 of 10 running on knc0-mic1
Hello world: rank 9 of 10 running on knc0-mic1
Elapsed time from rank 0:      13.87 (sec)
Elapsed time from rank 1:      14.01 (sec)
Elapsed time from rank 2:     195.16 (sec)
Elapsed time from rank 3:     195.17 (sec)
Elapsed time from rank 4:     195.39 (sec)
Elapsed time from rank 5:     195.07 (sec)
Elapsed time from rank 6:     194.98 (sec)
Elapsed time from rank 7:     223.32 (sec)
Elapsed time from rank 8:     194.22 (sec)
Elapsed time from rank 9:     193.70 (sec)
Out of 4294967295 points, there are 2248849344 points inside the sphere => pi=  3.141606330872

A shorthand way of doing this in symmetric mode is to use the -machinefile option of the mpirun command together with the I_MPI_MIC_POSTFIX environment variable. In this case, make sure all executables are in the same location on the host and on the mic0 and mic1 cards.

The I_MPI_MIC_POSTFIX environment variable simply tells the library to append the .knc postfix to the executable name when running on the cards (since the executables there are called montecarlo.knc):

$ export I_MPI_MIC_POSTFIX=.knc

Now set the rank mapping in your hosts file (by using the <host>:<#_ranks> format):

$ cat hosts_file
localhost:2
mic0:3
mic1:5

And run your executable:

$ mpirun -machinefile hosts_file /tmp/montecarlo

The nice thing about this syntax is that you only have to edit hosts_file when you change the number of ranks or add more cards.

As an alternative, you can ssh to a coprocessor and launch the application from there:

$ ssh mic0
$ mpirun -n 3 /tmp/montecarlo.knc
Hello world: rank 0 of 3 running on knc0-mic0
Hello world: rank 1 of 3 running on knc0-mic0
Hello world: rank 2 of 3 running on knc0-mic0
Elapsed time from rank 0:     650.47 (sec)
Elapsed time from rank 1:     650.61 (sec)
Elapsed time from rank 2:     648.01 (sec)
Out of 4294967295 points, there are 2248795855 points inside the sphere => pi=  3.141531467438

 

Summary

This document showed you how to compile and run simple MPI applications in symmetric mode. In a heterogeneous computing system, the performance of each computational unit differs, and this behavior leads to load imbalance. The Intel® Trace Analyzer and Collector can be used to analyze and understand the behavior of a complex MPI program running on a heterogeneous system: it lets you quickly identify bottlenecks, evaluate load balancing, analyze performance, and identify communication hotspots. This tool is essential for debugging and improving the performance of an MPI program running on a cluster with multiple computational units. For more details on using the Intel Trace Analyzer and Collector, read the whitepaper “Understanding MPI Load Imbalance with Intel® Trace Analyzer and Collector,” available on /mic-developer. For more details, tips and tricks, and known workarounds, visit the Intel® Cluster Tools and the Intel® Xeon Phi™ Coprocessors page.


Appendix A

The code of the first sample program is shown below.

/*
 *  Copyright (c) 2017 Intel Corporation. All Rights Reserved.
 *
 *  Portions of the source code contained or described herein and all documents related
 *  to portions of the source code ("Material") are owned by Intel Corporation or its
 *  suppliers or licensors.  Title to the Material remains with Intel
 *  Corporation or its suppliers and licensors.  The Material contains trade
 *  secrets and proprietary and confidential information of Intel or its
 *  suppliers and licensors.  The Material is protected by worldwide copyright
 *  and trade secret laws and treaty provisions.  No part of the Material may
 *  be used, copied, reproduced, modified, published, uploaded, posted,
 *  transmitted, distributed, or disclosed in any way without Intel's prior
 *  express written permission.
 *
 *  No license under any patent, copyright, trade secret or other intellectual
 *  property right is granted to or conferred upon you by disclosure or
 *  delivery of the Materials, either expressly, by implication, inducement,
 *  estoppel or otherwise. Any license under such intellectual property rights
 *  must be express and approved by Intel in writing.
 */
//******************************************************************************
// Content: (version 1.0)
//      Calculate the number PI using its integral representation.
//
//******************************************************************************
#include <stdio.h>
#include <omp.h>
#include "mpi.h"

#define MASTER 0
#define TAG_HELLO 1 
#define TAG_TIME 2 

const long ITER = 1024 * 1024;
const long SCALE = 16;
const long NUM_STEP = ITER * SCALE;

float calculate_partialPI(int n, int num) {
   unsigned long i;
   int  numthreads;
   float x, dx, pi = 0.0f;

   #pragma omp parallel
   #pragma omp master
   {
      numthreads = omp_get_num_threads();
      printf("FROM RANK %d - numthreads = %d\n", n, numthreads);
   }

   dx = 1.0 / NUM_STEP;

   unsigned long NUM_STEP1 = NUM_STEP / num;
   unsigned long begin = n * NUM_STEP1;
   unsigned long end = (n + 1) * NUM_STEP1;
   #pragma omp parallel for reduction(+:pi)   
   for (i = begin; i < end; i++)
   {
      x = (i + 0.5f) / NUM_STEP;
      pi += (4.0f * dx) / (1.0f + x*x);
   }

   return pi;
}

int main(int argc, char **argv)
{
   float pi1, total_pi;
   double startprocess;
   int i, id, remote_id, num_procs, namelen; 
   char name[MPI_MAX_PROCESSOR_NAME]; 
   MPI_Status stat; 

   if (MPI_Init (&argc, &argv) != MPI_SUCCESS) 
   { 
      printf ("Failed to initialize MPI\n"); 
      return (-1); 
   } 
   
   // Create the communicator, and retrieve the number of processes. 
   MPI_Comm_size (MPI_COMM_WORLD, &num_procs); 
  
   // Determine the rank of the process. 
   MPI_Comm_rank (MPI_COMM_WORLD, &id); 
   
   // Get machine name 
   MPI_Get_processor_name (name, &namelen); 

   if (id == MASTER) 
   { 
      printf ("Hello world: rank %d of %d running on %s\n", id, num_procs, name); 
   
      for (i = 1; i<num_procs; i++)  
      {    
         MPI_Recv (&remote_id, 1, MPI_INT, i, TAG_HELLO, MPI_COMM_WORLD, &stat);    
         MPI_Recv (&num_procs, 1, MPI_INT, i, TAG_HELLO, MPI_COMM_WORLD, &stat);        
         MPI_Recv (&namelen, 1, MPI_INT, i, TAG_HELLO, MPI_COMM_WORLD, &stat);          
         MPI_Recv (name, namelen+1, MPI_CHAR, i, TAG_HELLO, MPI_COMM_WORLD, &stat); 
               
         printf ("Hello world: rank %d of %d running on %s\n", remote_id, num_procs, name); 
      }    
   } 
   else   
   {        
      MPI_Send (&id, 1, MPI_INT, MASTER, TAG_HELLO, MPI_COMM_WORLD); 
      MPI_Send (&num_procs, 1, MPI_INT, MASTER, TAG_HELLO, MPI_COMM_WORLD); 
      MPI_Send (&namelen, 1, MPI_INT, MASTER, TAG_HELLO, MPI_COMM_WORLD); 
      MPI_Send (name, namelen+1, MPI_CHAR, MASTER, TAG_HELLO, MPI_COMM_WORLD); 
   }

   startprocess = MPI_Wtime();

   pi1 = calculate_partialPI(id, num_procs);

   double elapsed = MPI_Wtime() - startprocess;

   MPI_Reduce (&pi1, &total_pi, 1, MPI_FLOAT, MPI_SUM, MASTER, MPI_COMM_WORLD);
   if (id == MASTER) 
   { 
      double timeprocess[num_procs]; 
   
      timeprocess[MASTER] = elapsed; 
      printf("Elapsed time from rank %d: %10.2f (usec)\n", MASTER, 1000000 * timeprocess[MASTER]); 
      
      for (i = 1; i < num_procs; i++) 
      { 
         // Rank 0 waits for elapsed time value  
         MPI_Recv (&timeprocess[i], 1, MPI_DOUBLE, i, TAG_TIME, MPI_COMM_WORLD, &stat);  
         printf("Elapsed time from rank %d: %10.2f (usec)\n", i, 1000000 *timeprocess[i]); 
      }

      printf("rank %d pi= %16.12f\n", id, total_pi); 
   } 
   else 
   { 
      // Send back the processing time (in second) 
      MPI_Send (&elapsed, 1, MPI_DOUBLE, MASTER, TAG_TIME, MPI_COMM_WORLD); 
   } 

   // Terminate MPI. 
   MPI_Finalize(); 
   return 0; 
}

 

Appendix B

The code of the second sample program is shown below.

/*
 *  Copyright (c) 2017 Intel Corporation. All Rights Reserved.
 *
 *  Portions of the source code contained or described herein and all documents related
 *  to portions of the source code ("Material") are owned by Intel Corporation or its
 *  suppliers or licensors.  Title to the Material remains with Intel
 *  Corporation or its suppliers and licensors.  The Material contains trade
 *  secrets and proprietary and confidential information of Intel or its
 *  suppliers and licensors.  The Material is protected by worldwide copyright
 *  and trade secret laws and treaty provisions.  No part of the Material may
 *  be used, copied, reproduced, modified, published, uploaded, posted,
 *  transmitted, distributed, or disclosed in any way without Intel's prior
 *  express written permission.
 *
 *  No license under any patent, copyright, trade secret or other intellectual
 *  property right is granted to or conferred upon you by disclosure or
 *  delivery of the Materials, either expressly, by implication, inducement,
 *  estoppel or otherwise. Any license under such intellectual property rights
 *  must be express and approved by Intel in writing.
 */
//******************************************************************************
// Content: (version 0.5)
//      Based on a Monte Carlo method, this MPI sample code uses volumes to
//      estimate the value of pi.
//      
//******************************************************************************
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <time.h>

#include "mpi.h"

#define MASTER 0
#define TAG_HELLO 4
#define TAG_TEST 5
#define TAG_TIME 6

int main(int argc, char *argv[])
{
  int i, id, remote_id, num_procs;
   
  MPI_Status stat;
  int namelen;
  char name[MPI_MAX_PROCESSOR_NAME];

  // Start MPI.
  if (MPI_Init (&argc, &argv) != MPI_SUCCESS)
    {
      printf ("Failed to initialize MPI\n");
      return (-1);
    }

  // Create the communicator, and retrieve the number of processes.
  MPI_Comm_size (MPI_COMM_WORLD, &num_procs);

  // Determine the rank of the process.
  MPI_Comm_rank (MPI_COMM_WORLD, &id);
    // Get machine name
  MPI_Get_processor_name (name, &namelen);
  
  if (id == MASTER)
    {
      printf ("Hello world: rank %d of %d running on %s\n", id, num_procs, name);

      for (i = 1; i<num_procs; i++) 
	{	
	  MPI_Recv (&remote_id, 1, MPI_INT, i, TAG_HELLO, MPI_COMM_WORLD, &stat);	
	  MPI_Recv (&num_procs, 1, MPI_INT, i, TAG_HELLO, MPI_COMM_WORLD, &stat);  		
	  MPI_Recv (&namelen, 1, MPI_INT, i, TAG_HELLO, MPI_COMM_WORLD, &stat);			
	  MPI_Recv (name, namelen+1, MPI_CHAR, i, TAG_HELLO, MPI_COMM_WORLD, &stat);
			
	  printf ("Hello world: rank %d of %d running on %s\n", remote_id, num_procs, name);
	}
    }
  else   
    {	    
      MPI_Send (&id, 1, MPI_INT, MASTER, TAG_HELLO, MPI_COMM_WORLD);
      MPI_Send (&num_procs, 1, MPI_INT, MASTER, TAG_HELLO, MPI_COMM_WORLD);
      MPI_Send (&namelen, 1, MPI_INT, MASTER, TAG_HELLO, MPI_COMM_WORLD);
      MPI_Send (name, namelen+1, MPI_CHAR, MASTER, TAG_HELLO, MPI_COMM_WORLD);
    }
  
  // Rank 0 distributes seeds randomly to all processes.
  double startprocess, endprocess;

  int distributed_seed = 0;
  int *buff;

  buff = (int *)malloc(num_procs * sizeof(int));
	
  unsigned int MAX_NUM_POINTS = 4294967295u;   // 2^32 - 1
  unsigned int num_local_points = MAX_NUM_POINTS / num_procs;

  if (id == MASTER)
    {		  
      srand (time(NULL));
  
      for (i=0; i<num_procs; i++)    
	{           
	  distributed_seed = rand();
	  buff[i] = distributed_seed;
	}
    }

  // Broadcast the seeds to all processes
  MPI_Bcast(buff, num_procs, MPI_INT, MASTER, MPI_COMM_WORLD);

  // At this point, every process (including rank 0) has a different seed. Using its seed,
  // each process generates points randomly in its slab [id/n, (id+1)/n] x [0, 1] x [0, 1] of the unit cube
  startprocess = MPI_Wtime();

  srand (buff[id]);

  unsigned int point = 0;
  unsigned int rand_MAX = 128000;
  float p_x, p_y, p_z;
  float temp, temp2, pi;
  double result;
  unsigned int inside = 0, total_inside = 0;
    for (point=0; point<num_local_points; point++)
    {
      temp = (rand() % (rand_MAX+1));
      p_x = temp / rand_MAX;
      p_x = p_x / num_procs;
      
      temp2 = (float)id / num_procs;	// id is in [0, num_procs-1]
      p_x += temp2;
      
      temp = (rand() % (rand_MAX+1));
      p_y = temp / rand_MAX;
      
      temp = (rand() % (rand_MAX+1));
      p_z = temp / rand_MAX;

      // Count the points that fall inside the 1/8 unit sphere (x^2 + y^2 + z^2 <= 1)
      result = p_x * p_x + p_y * p_y + p_z * p_z;

      if (result <= 1)
	  {
		inside++;
	  }
    }

  double elapsed = MPI_Wtime() - startprocess;

  MPI_Reduce (&inside, &total_inside, 1, MPI_UNSIGNED, MPI_SUM, MASTER, MPI_COMM_WORLD);

#if DEBUG 
  printf ("rank %d counts %u points inside the sphere\n", id, inside);
#endif

  if (id == MASTER)
    {
      double timeprocess[num_procs];

      timeprocess[MASTER] = elapsed;
      printf("Elapsed time from rank %d: %10.2f (sec) \n", MASTER, timeprocess[MASTER]);

      for (i=1; i<num_procs; i++)
	{
	  // Rank 0 waits for elapsed time value 
	  MPI_Recv (&timeprocess[i], 1, MPI_DOUBLE, i, TAG_TIME, MPI_COMM_WORLD, &stat); 
	  printf("Elapsed time from rank %d: %10.2f (sec) \n", i, timeprocess[i]);
	}

      temp = 6 * (float)total_inside;   // octant volume is pi/6, so pi = 6 * inside / total
      pi = temp / MAX_NUM_POINTS;   
      printf ( "Out of %u points, there are %u points inside the sphere => pi=%16.12f\n", MAX_NUM_POINTS, total_inside, pi);
    }
  else
    {
      // Send back the processing time (in seconds)
      MPI_Send (&elapsed, 1, MPI_DOUBLE, MASTER, TAG_TIME, MPI_COMM_WORLD);
    }

  free(buff);

  // Terminate MPI.
  MPI_Finalize();
  
  return 0;
}
For more complete information about compiler optimizations, see our Optimization Notice.