Finite Differences on Heterogeneous Distributed Systems

Download Zip Source Code

Here we exemplify how to expand Finite Difference (FD) computational kernels to run on distributed systems. Additionally, we describe a technique that shows how to deal with the load imbalance of heterogeneous distributed systems where different nodes or compute devices may provide distinct compute speeds. A sample source code is provided to exemplify our implementation.

Our building block is the FD compute kernels that are typically used for RTM (reverse time migration) algorithms for seismic imaging. The computations performed by the ISO-3DFD (Isotropic 3-dimensional finite difference) stencils play a major role in accurate imaging of complex subsurface structures in oil and gas surveys and exploration. Here we leverage the ISO-3DFD discussed in [1] and [2] and illustrate a simple MPI-based distributed implementation that enables a distributed ISO-3DFD compute kernel to run on a hybrid hardware configuration consisting of host Intel® Xeon® processors and attached Intel® Xeon Phi™ coprocessors. We also explore Intel® software tools that help to analyze the load balance to improve performance and scalability.

The Distributed ISO-3DFD

Our illustrative case is a 1D decomposition of the ISO-3DFD compute stencil. We set up a computational domain that is split across MPI processes. For this example, we set up one MPI process per compute device (an Intel Xeon processor or an Intel Xeon Phi coprocessor). This implementation includes the halo exchanges required between MPI processes, that is, between processors and coprocessors. When domain decomposition is applied to an FD stencil as described in [1, 2], it is necessary to implement halo exchanges between subdomains at each time-step of the algorithm. This is because the value updates at domain points close to a border of the subdomain require values computed on an adjacent subdomain:

for(int k=1; k<=HL; k++)    //Stencil Half-Length HL
          u_0 += W[k]*( 
            U0(ix+k,iy  ,iz  ) + U0(ix-k,iy  ,iz  ) +
            U0(ix  ,iy+k,iz  ) + U0(ix  ,iy-k,iz  ) +
            U0(ix  ,iy  ,iz+k) + U0(ix  ,iy  ,iz-k));

The order of the 3D stencil is defined by its half-length (HL) value: a stencil of 8th order has a half-length HL=8/2=4, for example. The “width” of the halos to be swapped between adjacent subdomains is also equal to HL.

This sample code uses a symmetric execution model: the code runs on host processors and coprocessors. This can be accomplished via fully symmetric MPI execution with distinct processes running on Intel Xeon processors and on Intel Xeon Phi coprocessors. For example, assume a single two-socket system named hostname1 with two Intel Xeon Phi coprocessor cards (named hostname1-mic0, and hostname1-mic1) attached to the system’s x16 PCIe* slots. Also assume two executable binaries rtm.cpu (complied for the processor architecture, such as Intel® Advanced Vector Extensions 2 (Intel® AVX2), and rtm.phi (compiled for the architecture of the Intel Xeon Phi coprocessor). Using the Intel® MPI Library, one can leverage both executables in a MPI+OpenMP* symmetric mode execution:

mpirun \
-n 1 -host  hostname1  -env I_MPI_PIN_DOMAIN=socket  -env OMP_NUM_THREADS=14 ./rtm.cpu : \
-n 1 -host  hostname1  -env I_MPI_PIN_DOMAIN=socket  -env OMP_NUM_THREADS=14 ./rtm.cpu : \
-n 1 -host  hostname1-mic0 –env OMP_NUM_THREADS=244 ./rtm.phi  : \
-n 1 -host  hostname1-mic1 –env OMP_NUM_THREADS=244 ./rtm.phi

The simplified single node example above assumes that both rtm.cpu and rtm.phi are parallelized via OpenMP threading as described in [1, 2]. MPI is used for data exchanges and synchronization between nodes, processors, and coprocessors. OpenMP is used to divide MPI process compute work through the cores of a given processor or coprocessor. The above example can also be expanded for multiple nodes with processors and coprocessors. See [3] for more details on MPI symmetric mode execution.

The simplified MPI harness presented here: 1) assumes that each Intel Xeon Phi coprocessor behaves like an independent compute node—no offloading programming syntax is used, and 2) allows asynchronous halo exchanges via non-blocking MPI calls. The overlapping of the compute and halo exchange with adjacent subdomains is accomplished by considering two types of compute regions on each subdomain:

  1. Local compute: Points with distance > HL from the adjacent borders. That is, points where the stencil computation relies only on values previously computed in the same subdomain.
  2. Halo compute: Points with distance <= HL from the adjacent borders. That is, points where the stencil computation relies only on values previously computed in the same subdomain.

Schematically we have:

The MPI implementation is in the style of a straightforward nearest-neighbor halo swap. First, buffers BufferPrev and BufferNext are used to post asynchronous receives for the halos needed from the neighbor subdomains:

Next, points in the Halo compute region are updated first, because these will be needed by the adjacent subdomains:

Updated values in the halo compute region are sent asynchronously to the adjacent domains: 

As the asynchronous halo exchange happens, the points in the local compute region can be updated because these computations do not depend on values from adjacent subdomains:

An MPI_Waitall synchronization call is then used to check for completion of the asynchronous halo exchanges. Finally, the values received in transfer buffers BufferPrev and BufferNext are copied to the subdomain:

The actual implementation can be found in the sample source code package attached to this article. It contains an MPI layer running on top of the ISO-3DFD code previously published in [1, 2]. Each MPI process can be tuned for performance either via hardware settings such as turbo mode, hyperthreading, and ECC mode (Intel Xeon Phi coprocessor) or tuned through software optimization and tuning such as cache blocking, thread affinitization, and data prefetching distances. Refer to the original articles for details on single process optimization and tuning.

Workload Balancing

Workload balancing is a critical part of heterogeneous systems. The distributed ISO-3DFD exemplified here has tuning parameters that permit statically balancing the amount of computation work assigned to processor sockets and Intel Xeon Phi coprocessors. It is accomplished by two command-line parameters accepted by the executables:
factor_speed_phi: Integer value representing how fast the FD code can run on the Intel Xeon Phi coprocessor.
factor_speed_xeon: Integer value representing how fast the FD code can run on the Intel Xeon processor.

The work-balance coefficient is the ratio factor_speed_phi / factor_speed_xeon that can be used to define how much work is assigned to an Intel Xeon processor and how much to an Intel Xeon Phi coprocessor. The balance between an Intel Xeon processor and an Intel Xeon Phi coprocessor can be defined at MPI launch time, allowing flexibility to support distinct Intel Xeon processors and Intel Xeon Phi coprocessor models.

The static load balancing can be easily obtained with the help of an MPI message-profiling tool. Here we use Intel® Trace Analyzer and Collector (ITAC) [4] as the message-profiling tool. One way to accomplish this is by collecting an MPI trace and analyzing the load balance:

  1. Before running the experiment, source ITAC to the runtime environment:
    . /opt/intel/itac_latest/bin/
  2. Run the MPI job with the option “-t” added to the mpirun command. This causes the Intel® MPI Library runtime to collect MPI message traces that are saved on files with extensions *.stf and *.prot.
  3. To visualize the results, use the ITAC GUI by launching the application traceanalyzer:
    Select open to access the *.stf file, and then select Continue to skip the Summary Page.
  4. In the next window, select  Charts -> Event Timeline, and then use the mouse to zoom in to the time line to be better able to visualize the communication pattern between processors and coprocessors.

The following figures show an example of load balancing on a system with two processor sockets (MPI processes P2 and P3) and four Intel Xeon Phi coprocessors (MPI processes P0, P1, P4 and P5). Each MPI process is represented by a horizontal bar in the time-line graph. The time region marked in red represents the process potentially blocked by communication/synchronization; the time region marked in blue represents computing without communication. The goal of static load balancing is to minimize or eliminate the red regions, demonstrating good balance and processor unit utilization.

For this synthetic example, the experiments are run on a dual-socket system with two Intel® Xeon® processor E5-2697 v2 and 64 GB of DDR3-1866 MHz, populated with four Intel® Xeon Phi™ coprocessor 7120 PCIe x16, each with 61 cores at 1.3 GHz, and 16 GB DDR5. We consider three sets of values for the pair factor_speed_phi and factor_speed_xeon. For the first case we set factor_speed_phi=10 and factor_speed_xeon=5, representing the assumption of ratio 10/5= 2 where one Intel Xeon Phi coprocessor computes 2x faster than one single processor socket. Assume the MPI tracing of the experiment resulted in:

The above event time line suggests that processors (MPI process ranks P2 and P3) took longer to complete their work and the coprocessors (P0, P1, P4, and P5) were blocked idle (red regions) waiting for the work to be completed on the processors. In the second case we set factor_speed_phi=15 and factor_speed_xeon=5, representing the assumption of ratio 15/5= 3 where one Intel Xeon Phi coprocessor computes 3x faster than one single processor socket. The respective MPI tracing shows:

The above event time line suggests that the coprocessors (P0, P1, P4, and P5) took longer to complete their work now 3x larger than the amount assigned to the processors (P2 and P3), causing the processors to block idle (red regions) waiting for the work to be completed.

Finally, we set factor_speed_phi=13 and factor_speed_xeon=5, representing the assumption of ratio 13/5= 2.6 where one Intel Xeon Phi coprocessor computes 2.6x faster than one single processor socket:

For this case, there is basically no noticable idle time due to load imbalance. This case represents an optimal load-balancing configuration.

Note that the above static analysis applies to the specific hardware configuration (processor model, Intel Xeon Phi coprocessor model, number of processors, memory speed, and so on). Any change in hardware configuration or algorithm implementation would require a new static analysis.

Sample Code

To exemplify the above concepts, we provide sample source code in the attached sample source code package. The reader should be able to build, run, and analyze the example. The MPI job can be set to either run mpi ranks (processes) on both Intel Xeon processor sockets and on Intel Xeon Phi coprocessor; or only on processor sockets; or only on Intel Xeon Phi coprocessor cards.

For the sake of simplicity, we do not provide or discuss the single process version of ISO-3DFD already presented and released in [1, 2]. Instead, we provide an additional MPI harness that extends the ISO-3DFD to a distributed ISO-3DFD that supports halo exchanges and load balancing between processors and coprocessors. In this way, future versions of ISO-3DFD can be replaced into this same sample code.

The only dependencies to build this simple example are the original ISO-3DFD single process compute kernel that can be found in [1,2]; and Intel® software development tools (Intel® C and C++ compilers, Intel MPI Library, and ITAC).

The README file contains intructions on how to build and run the sample code. The Makefile creates two executables: dist-iso-3dfd that runs on the processors and dist-iso-3dfd.MIC that runs on the Intel Xeon Phi coprocessor. Use the variables VERSION_MIC and VERSION_CPU to choose which version of the ISO-3DFD to use.

The script suggests how to launch both executables using the Intel MPI Library. Refer to [3] for additonal information. The script exemplifies a way to run on a system populated with two processors and two Intel Xeon Phi coprocessor cards. It can be easily expanded to run on a cluster and on different node configurations. When the script variable TRACE_OPTION=-t is set, traces of the MPI communication are collected so that one can perform the static load balancing  as described in the previous section. The static load balancing is possible because the main source code accepts the parameters factor_speed_phi and factor_speed_xeon as command-line options. Use these as described in the Workload Balancing section.

The script also supports all the command-line options required by the main executable:
nx  ny  nz  #threads  #iterations  x-blocking  y-blocking  z-blocking  factor_speed_phi  factor_speed_xeon
and additional MPI and OpenMP settings.

Note this is only an example. For optimal compiling options and runtime parameters for the ISO-3DFD compute kernel, refer to [1,2]. For performance results on single and multiple nodes refer to [5].


This article described a sample implementation for distributed FD methods, a 1D  decomposition of a ISO-3DFD stencil compute kernel. The implementation supports heterogenous systems with processors and coprocessors by supporting static loading balancing to deal with the different compute speeds of each device.

Using an MPI message-profiling tool, one is able to analyze if at each time step either 1) the Intel Xeon Phi coprocessors are waiting for processors to be completed, or 2) Intel Xeon processors are waiting for the Intel Xeon Phi coprocessors to complete.


[1] “Optimizing and Running ISO 3DFD with Support for Intel® Xeon Phi™ Coprocessor.”

[2] “Characterization and Optimization Methodology Applied to Stencil Computations,” in the book High Performance Parallelism Pearls (J. Reinders and J. Jeffers, editors).

[3] “How to run Intel MPI on Xeon Phi”

[4] Tutorial: Analyzing MPI Applications with Intel® Trace Analyzer and Intel® VTune™ Amplifier XE.

[5] “Intel® Xeon Phi™ Coprocessor Energy Application Benchmarks.”

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. Check with your system manufacturer or retailer or learn more at

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata are available on request.

Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-4725 or by visiting

This sample source code is released under the Intel Sample Source Code License Agreement.

Intel, the Intel logo, Intel Xeon Phi, and Xeon are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others

For more complete information about compiler optimizations, see our Optimization Notice.