The purpose of this blog (from: Phil Thierry and Leo Borges  part of the Intel SSG Energy Engineering Team and covering Oil and Gas technical activities ) is to start a discussion about implementation for 3D seismic wave propagation and. more specifically, for the Reverse Time Migration (RTM) algorithm as a whole on one, two or several Intel® Xeon Phi™ coprocessors in a full hybrid mode, i.e. including full utilization of the CPU cores.
We will start by raising a number of questions around this topic, which are important to consider in the context of what we should approach, and how we approach these problems from a software perspective. Beyond the theoretical context, we share some thoughts specific to Fine Different Stencils and include code samples.
Premise
The subject of Finite Difference Stencils is quite large and the implementation choices are almost infinite. We may open some subblogs later as the audience grows interest on a few of these topics and new subject are raised, we invite the discussion to also expand to topics beyond specific architectures. Things we are thinking about are these....what is of key interest to you?
1. What is the impact of the wave equation approximation on the application behavior, from isotropic constant density, variable density, vertical transverse isotropy, titled transverse isotropy, elastic waves, etc…
2. Which i/o scheme do we consider for snapshot? Do we really need explicit i/o?
3. If we consider forward / backward propagation on the Intel Xeon Phi coprocessor, how can we handle the snapshot? Checkpointing in memory won’t definitively be a solution but how to achieve fast asynchronous i/o? Do we need local storage connection to PCIe or global storage on PCIe/IB?
4. The choice between time or frequency domain: what should be considered from the Intel® Xeon Phi(tm) Coprocessor point of view with respect to numerical scheme?
5. Should we use domain decomposition or can we handle one or more shot per SMP node even when thinking about elastic modeling or frequency content increase? What will be the resulting impact on the numerical scheme?
6. Will we have enough memory to send one or more shot to Intel Xeon Phi coprocessor to compute the whole RTM? Or do we need to stick to a traditional offload model?
7. What should be the stencil length from a computing point of view regarding? Could we automatically find an optimal stencil order satisfying both geophysical and computer criteria?
8. What are the reasons to implement higher orders in time?
9. Is there any benefit to consider new programming model as Cilk Plus and TBB compared to standard C, FORTRAN, MPI and OMP? Would it be beneficial to expand the similarities and contrasts analysis in studies like here?
10. Will the full hybrid heterogeneous port absolutely need a dynamic load balancing to simultaneously handle differences in device performance and different shot sizes?
11. Will the behavior observed on CPU cores remain the same on the coprocessor?
12. What will be the impact of (no) vectorization, data alignment on the coprocessor compared to the impact on CPU counterpart implementations? What techniques, and under which scenarios would they have the highest impact?
13. Is there any advantage or need for hardware support of transcendental functions on the coprocessor?
14. How can we characterize and model the impact of FMA instructions?
15. What are the best known techniques to maximize overlapping of computing and data transfer? How to maximize utilize data persistency and coprocessortocoprocessor transfers?
16. Should we also open another post(s) about other seismic algorithms such as as wavefield extrapolation migration, Kirchhoff PsDM, Tomography, and Reservoir simulation?
17. Power measurement is a real issue. TTI 3DFD , 8th order exhibits about 1000+ pJ/Ops , while the Exaflops target is about 20 pJ/Ops. Of course a large part of the needed 60x factor will come from the hardware, but getting the software as close as possible from the achievable peak will greatly help. So what can we expect? Standard implementation without any intrinsic programming can reach 40% of the CPU peak of a fully loaded node but what can be achieved by the counterpart on the coprocessor? Can we still solely rely on the compiler to get such performance and maintain the ease of use of high level programming models?
18. ….
We should of course take advantage of this blog to add more bullets to this list
We invite the readers to help narrow down the subject to initiate the discussion:
Thanks to time domain implementation of RTM, we propose starting with the FD kernel even without halo exchange or boundary conditions.
We can consider an offload mode to simulate offload of the space derivative for every time step or we can play with a native mode to simulate a “one or more shot per coprocessor”.
In terms of approximation of the wave equation, we can talk about isotropic case which is known to be CPU bandwidth bounded (without cache blocking) or we can directly consider a TTI implementation which is not CPU bandwidth bounded (even without cache blocking).
In terms of performance results, since Megapoint or Megacell per seconds do not really give any concrete information when comparing distinct implementations, we propose adopting either Flops/sec or % of peak as the performance metric. Another advantage of such metric is that number of Flops per grid points is only need to calculate the overall performance. The final aggregated Flops/sec performance metric maintains the Flops/point number confidential.
The dialogue
Here is a capture from a dialog we had with a partner company when we posed our questions initially. They answered, and we commented:
Q1: from isotropic constant density, variable density, vertical transverse isotropy, titled transverse isotropy, elastic waves, etc…
A1 (Partner): Increasing the complexity of the media being modeled (from isotropic constant density up to elastic anisotropic) a) increases the memory requirements due to extra Earth models and more wavefield volumes (depending on the formulation) b) increases the computations per grid point c) changes the compute / memory ratio, again depending on formulation. These changes have cascading effects – for example increasing the memory requirement may increase the number of nodes required to solve a given problem, thus increasing the communication bandwidth required. The choice of which media type to model is ultimately a user choice, and dependent on the geological situation they are working in. For example, the presence of TTI anisotropy can result in 100s of meters difference in imaging position compared to Isotropic.
A1.2 (Intel): You're right. The most important is probably the balance between MUL and ADD which leads to the level of efficiency of such codes. We already demonstrated that viscoelastic and TTI formulations are more efficient than isotropic due to the MUL/ADD balance and to the lower bytes/flops requirement, especially when using cache blocking. Memory footprint and can be partly handled by domain decomposition for production runs.
Then remains the question of Intel Xeon Phi coprocessor implementation: "being offloaded or not". Offloading the hot spot for every domain computed on the host looks like some kind of standard "accelerated" procedure once we can totally overlap the data transfer.
Q2: Which i/o scheme do we consider for snapshot? Do we really need explicit i/o?
A2 (Partner): I am not too sure what you mean by explicit I/O. I assume that you are comparing accessing the file system from offloaded code/native execution to transferring the data back to the host and have it write to disk. This will depend on performance. If one is much more convenient but slightly slower, it might still be good enough. This will depend on propagation vs I/O performance.
A2.1 (Intel) I mentioned i/o regarding the way we may handle the snapshots to achieve the time reversal procedure. We need to get back to the difficult choice for the best i:o scheme that we all had on standard cluster. The question of storing on disk all or part of the wavefields, storing only boundaries or keeping in memory using checkpointing techniques is now coming back on a IA device which can share the file system, which has a limited amount of memory, and which can achieve asynchronous transfer. Then we're back to the Shakespeare' question : "being offloaded or not".
Q5: Should we use domain decomposition or can we handle one or more shot per SMP node even when thinking about elastic modeling or frequency content increase? What will be the resulting impact on the numerical scheme?
A5 (Partner): Domain decomposition is necessary for production 3D because of the size of the datasets and desired frequency content (especially with TTI).
Q6: Will we have enough memory to send one or more shot to Intel Xeon Phi coprocessor to compute the whole RTM? Or do we need to stick to a traditional offload model?
A6 (Partner): See A5. Only low frequency and/or 2D migrations are small enough so that more than one can fit on a card.
A5.2 & A6.2 (Intel): We need to consider both the "dataset" as prestack seismic input data and the "image size" which is a 3D table to be multiplied by the number of parameters and fields (velocities, density, aniso, wavefields, ..) The Dataset size is only related to the spacing between shot and receivers. The number of seismic traces will continue to grow but will the acquisition surface follow this trend ? Anyhow these input data will soon reached the 50 PB.
To achieve better frequency content recovery, and apart from the input signal and the receiver capability, we will see a decrease of the space and time sampling, leading to 3D field size increase together with an increase of the number of fields when going to fully elastic propagation.
But following what has been done is seismology (thanks to lower frequency and larger space sampling), don't you think that variable grid spacing will allow SMP implementation on the node even in case of elastic RTM ? Okay .. here i need to provide you with numbers ;) .
A many core handling its own shot gather would be nice to have with the same code running on any host or Intel Xeon Phi coprocessor with an adequate dynamic load balancing. Are we able to evaluate the needed memory size and bdw to achieve this in the coming years ?
The question may be totally different for FWI where each shot will contribute to gradient update for each iteration, gradient that will need to be available for every other shots.
Q13: Is there any advantage or need for hardware support of transcendental functions on the coprocessor?
A13 (Partner): Yes, TTI is compute bound and it has a lot of trigonometry.
A13.1 (Intel): Correct but Intel Xeon Phi coprocessor doesn't include sin/cos at the moment , only exp. We're trying to characterize the balance between "sin/cos on the fly with or without VML ; precomputed table requiring data transfert ; use of exp to get transcendental sin/cos". TTI RTM needs sin/cos of anisotropy parameters which are 3D table. Within WEM we also need to compute a lot of sin/cos for phase velocity that didn't require 3D storage which leads to totally different strategy.
Q14: How can we characterize and model the impact of FMA instructions?
A14 (Partner): This question will likely benefit from having detailed knowledge of the arithmetic pipeline, and which Intel is in the best position to provide answers.
A first evaluation can be done using the compiler 12.1 and the SDE freely available. You can generate AVX2/FMA instructions mix by emulate your application running on Haswell. The point is then to be able to formalize the problem and to predict future performance based on this instructions count !

Finite Difference Stencils  Part 1
Here we talk about Finite Difference stencils and invite you to reply with your own observations and experiences on this topic.
More specifically, we hope to use this article to start discussion on the implementation of multi dimensional finite difference stencils with focus on both incore optimizations and parallel programming. An ISO3DFD implementation is used as example.
The computation to be used as example is a simple constant symmetric 8thorder 3Dimensional Finite Difference (3DFD) with an Isotropic (ISO) update on time. This 25point stencil computation can schematically be written as a time step outer loop t=1..nt where each iteration updates the n1Xn2Xn3 3dimensional array next( , , ) using constant coefficients coeff(0), coeff(1), …, coeff(4) and values from the n1Xn2Xn3 arrays prev( , , ) and vel( , , ). In a pseudocode and simple implementation:
for t=1..nt // nt time steps for i3=4..n34 // Third dimension in space for i2=4..n24 // Second dimension in space for i1=4..n14 // First dimension in space div = coeff(0)*prev(i1,i2,i3) for r=1..4 // 8th order stencil div += coeff(r)*( prev(i1+r,i2 ,i3 ) + prev(i1−r,i2 ,i3 ) // Space first dimension +prev(i1 ,i2+r,i3 ) + prev(i1 ,i2−r,i3 ) // Space second dimension +prev(i1 ,i2 ,i3+r) + prev(i1 ,i2 ,i3r) // Space third dimension done next(i1,i2,i3) = 2*prev(i1,i2,i3)  next(i1,i2,i3) + div*vel(i1,i2,i3) // time update done done done swap prev <> next done 
Here n1 corresponds to the fastest dimension (unity stride) and n3 is the slowest dimension (stride length n1*n2). After (loosely) defining the problem, let’s also choose our performance metrics. Finite difference schemes are typically measured in terms of either walltime per iteration (sec), number of cells next(i1,i2,i3) processed per second (Cell/s), or floatingpoint operations per second (Flops). Here we give preference to use Flops as the metric since it allows comparing different implementations of more sophisticated stencil schemes. In the ISO3DFD example, the number of floatingpoint operations per grid cell at a given time iteration is 7*R+5 = R+3 (multiplications) plus (6*R+2) additions, where R=4 is called the stencil half length. Converting between Flops and Cell/s is straightforward: Flops = (7*R+5) * Cell/s.
The focus on this initial post will be on parallelizing the workload. For additional background information, refer to <here> for example. On this first post, we omitted any discussion about techniques to improve vectorization of the finite differences.
The parallelization approach is driven by our choice for data blocking: data is partitioned across cores/threads available in the system by defining thread blocks of size n1_Tblock x n2_Tblock x n3_Tblock. We break the n1 X n2 X n3 domain down to a list of indexes describing n1_Tblock X n2_Tblock X n3_Tblock blocks. The list has length num_blocks=num_n1_blocks * num_n2_blocks * num_n3_blocks = ceiling((n12*R)/n1_Tblock) * ceiling((n22*R)/n2_Tblock) * ceiling((n32*R)/n3_Tblock) and is padded so that each block description is aligned at cacheline boundaries:
struct block_struct{ int i1_idx; int i2_idx; int i3_idx; int padding [10]; // 64B cacheline padding };
__declspec(align(64)) block_struct blocking[num_blocks+1]; // one entry per cacheline int index=0; for (int i3b=4; i3b<n34; i3b+=n3_Tblock) for(int i2b=4; i2b<n24; i2b+=n2_Tblock) for(int i1b=4; i1b<n14; i1b+=n1_Tblock) { blocking[index].i1_idx = i1b; blocking[index].i2_idx = i2b; blocking[index].i3_idx = i3b; index++; } 
First note that the explicit list of blocks allows the programmer to define any order in which the threading blocks are listed and assigned to threads  by changing the order of the loops, for example. We will not explore this reordering here. Second, the same framework can be used in different threading models (OpenMP, Cilk Plus, TBB, Pthreads). Here we use OpenMP:
#pragma omp parallel for num_threads(num_threads) schedule(dynamic) \ firstprivate (n1, n2, n3, num_blocks, n1_Tblock, n2_Tblock, n3_Tblock) \ shared( coeff, prev, next, vel, blocking) for (int i=0; i<num_blocks; i++) { int i1b = blocking[i].i1_idx; int i2b = blocking[i].i2_idx; int i3b = blocking[i].i3_idx; apply_stencil(next, prev, vel, coeff, i1b, i2b, i3b, n1, n2, n3, n1_Tblock, n2_Tblock, n3_Tblock); } 
We use dynamic scheduling to guarantee throughput of the work (routine calls) given that num_blocks can be considerably larger than num_threads when using small values for n*_Tblock blocks.
Adopting good block sizes requires testing. Our choice reflects estimates in the literature for stencil loop tiling, like in Leopold’s “Tight Bounds on Capacity Misses for 3D Stencil Codes” (2002) and Datta et all “Optimization and Performance Modeling of Stencil Computations on Modern Microprocessors” (2009). Leopold’s work suggests rectangular tiles of shape (N2) x s x (s*L/2) for an NxNxN domain, where L is the cacheline size, and s is the blocking factor. Assuming this tiling scheme, we can expand the idea and make an initial estimate for the blocking factor. The limiting factor in our stencil computation is to keep data locality on memory reads of offset entries prev(i1,i2±r,i3±r). To keep locality in the KNC architecture with 512KB L2 cache:
n1 * s * s * (16 (SP per cacheline) / 2) * 4 (SP bytes on array prev) <= 512 KB (L2 cache)
Assuming the array size n1 in unit stride dimension in the order of n1=1K:
s^2 <= 512K/(1K * 8 * 4) , which gives: s <= 4
That is, an initial estimate for our tiling is n1_Tblock=n1, n2_Tblock=4, and n3_Tblock=4*16/2=32. This also implies no blocking in the unitstride direction, and thus preserving long contiguous memory read streams on this particular choice. At first it might seem like a too ‘narrow’ blocking size n2_Tblock=4 in the space second dimension. But recall that computations will also load halo cells prev(i1,i2±r,i3), r<= R=4, so the effective blocking size in the second dimension is n2_Tblock+2*R.
Note that on this initial version: 1) We don’t implement any time blocking/time skewing scheme. There is no code dedicated to exploit either interior cell property or the fact that the entire problem might be solved on a single SMP system. 2) Also there is no looplevel blocking: the tiling happens only at ‘threadblocking’ level when work is assigned to threads via the blocking[*] list of indexes. One may consider implementing multilevel blocking to attempt finer control of datatocore assignment: a threadblocking using slightly larger blocks and then loop tiling when applying the actual stencil computation within each thread. 3) We don’t take into consideration hosttocoprocessor and coprocessortohost transfer times. Only stencil compute time on the Intel® Xeon Phi™ coprocessor.
Additionally, for this first post we refrain from extensive code changes to experiment with vectorization of the finite differences. A C/C++ version of the routine without specific optimizations may basically look like:
// apply 8th order ISO stencil on block [i1b..i1b+n1_Tb]X[i2b..i2b+n2_Tb]X[i3b..i3b+n3_Tb] void apply_stencil(float *ptr_next, float *ptr_prev, float *ptr_vel, float *coeff, // arrays const int i1b, const int i2b, const int i3b, // block indexes const int n1, const int n2, const int n3, // full domain const int n1_Tb, const int n2_Tb, const int n3_Tb) { // block sizes
const int n1_end = n14, n2_end = n24, n3_end = n34; const float c0 = coeff[0], c1=coeff[1], c2=coeff[2], c3=coeff[3], c4=coeff[4]; const int n1n2 = n1*n2; const int n1_2 = 2*n1, n1n2_2 = 2*n1n2; const int n1_3 = 3*n1, n1n2_3 = 3*n1n2; const int n1_4 = 4*n1, n1n2_4 = 4*n1n2;
const int n3b_end = MIN(i3b+n3_Tb, n3_end); const int n2b_end = MIN(i2b+n2_Tb, n2_end); const int n1b_end = MIN(i1b+n1_Tb, n1_end);
for (int i3=i3b; i3< n3b_end; i3++) { float *prev = &ptr_prev[i3*n1n2+i2b*n1]; float *next = &ptr_next[i3*n1n2+i2b*n1]; float *vel = &ptr_vel [i3*n1n2+i2b*n1]; for(int i2=i2b; i2< n2b_end; i2++, prev+=n1, next+=n1, vel+=n1) {
#pragma vector always #pragma ivdep for(int i1=i1b; i1<n1b_end; i1++) { float tmp = c0* prev[i1] + c1*( prev[i1 +1] + prev[i1 1] + prev[i1 +n1] + prev[i1 n1] + prev[i1 +n1n2] + prev[i1 n1n2] ) + c2*( prev[i1 +2] + prev[i1 2] + prev[i1 +n1_2] + prev[i1 n1_2] + prev[i1+n1n2_2] + prev[i1n1n2_2] ) + c3*( prev[i1 +3] + prev[i1 3] + prev[i1 +n1_3] + prev[i1 n1_3] + prev[i1+n1n2_3] + prev[i1n1n2_3] ) + c4*( prev[i1 +4] + prev[i1 4] + prev[i1 +n1_4] + prev[i1 n1_4] + prev[i1+n1n2_4] + prev[i1n1n2_4] ); next[i1] = 2.0f*prev[i1] next[i1] +tmp*vel[i1]; } } } } 
So, it’s a simple implementation that could alternatively be in Fortran, for example. Here we do not explore data alignment properties at source code level, neither experiment with code changes that aim vectorization, more efficient memory access, etc…
With the above test harness, we can run our first experiments. We process an n1=928 x n2=448 x n3=840 problem using 60 cores of an Intel Xeon Phi coprocessor. Three threads per code (180 threads) and KMP_AFFINITY=balanced will be used for the initial testing. At this point, tests will not keep track of datatransfer overhead: the implementation offloads the work to the coprocessor, but the test requests 800 iterations of the algorithm to be performed within the Xeon Phi card. This makes the transfer time due to initializing and finalizing the offload to be negligible. Only on this first post performance will be reported in both Flops and Cell/s. Here Cells/s = (n12*4)*(n22*8)*(n32*8)/timeforoneiteration; and Flops = (Cell/s) * (7*4+5).
First, the estimates for tile sizes in our datatothread assignment approach are tested by fixing the values n1_Tblock=n1 and n3_Tblock=32, and applying variations around the value n2_Tblock=4 which should be the dimension most sensitive to block/tiling size:

n1_Tblock=n1, n3_Tblock=32 





n2_Tblock 
1 
2 
3 
4 
5 
6 
7 
GFlops 
118.4 
98.9 
95.4 
98.5 
88.8 
86.2 
83.4 
GCell/s 
3.6 
3.0 
2.9 
3.0 
2.7 
2.6 
2.5 
At this configuration the ISO3DFD stencil peaks at 118.4GFlops (3.6GCell/s) with n2_Tblock=1, and performance is fairly sensitive to distinct values of the second dimension blocking factor. On a simple stencil like this one, the sensibility to variations in the slowest dimension blocking n3_Tblock can be very low when compared with sensitivity to the second dimension:

n1_Tblock=n1, n2_Tblock=1 




n3_Tblock 
6 
12 
16 
32 
48 
160 
GFlops 
110.2 
115.2 
117.3 
117.9 
118.9 
120.3 
GCell/s 
3.3 
3.5 
3.6 
3.6 
3.6 
3.6 
More complex stencils can show higher variations in performance for distinct blocking factors in the slowest dimension.
Obviously this post is just to exemplify use of tiling and heuristics for the blocking factors. The parallelization scheme itself was only one example among many options and possible implementations. The example did not attempt to explore multilevel partitioning of the data: assume the parallelization scheme assigning larger blocks to each thread and then a narrow tiling implemented directly at stencil loop computation. Neither the example tested explicit thread pinning techniques with the aim of increasing data locality.
Expect follow up discussion with code changes for data alignment and vectorization. Bandwidth analysis of the implementation might also be a helpful topic.

Finite Difference Stencils  Part 2
Here we provide sample source code for the experiments, posted here.
The ISO3DFD (Version 1) tarball should allow to build and run either an offload test (make clean; make arch=offload; ./offload.sh) or a native test (make clean; make arch=mic; mic_native.sh).
iso3dfd_V1/src/iso3dfd_stencil.cc is the actual stencil computation. Note that this version has no specific optimizations for vectorization and data alignment. It is a plain implementation of the stencil.
iso3dfd_V1/src/iso3dfd_parallel.cc is the actual task parallelization where the blocking[*] array is built and the data is assigned to the threads, as discussed in the previous blog posting.
iso3dfd_V1/src/iso3dfd_main.cc is the main driver. To simplify the code, the driver does not have specific data initialization  like random data or a Ricker wavelet source function, for example. There is no error check either. The current version supports only a single MIC card: you may consider editing the offload statements with a specific MIC card ID when testing on a multiplecard system.
The default values run an n1=928 x n2=448 x n3=840 problem size with thread blocking n1_Tblock=n1=928 (no blocking), n2_Tblock=1 and n3_Tblock=124. The OpenMP affinity is not set within the source code, so the developer should be to test different affinity schemes via the KMP_AFFINITY environment variable. The driver supports the following command line arguments:
[n1] [n2] [n3] [# threads] [# iterations] [thread block n1] [thread block n2] [thread block n3]
As a reference, script ./offload.sh exemplifies how run the sample code with KMP_AFFINITY= "granularity=thread,balanced" and 3 threads per code using 60 cores on an Intel Xeon Phi coprocessor:
$ make clean; make arch=offload; ./offload.sh
using 180 threads
allocating prev, next and vel: total 3996.56 Mbytes
n1=928 n2=448 n3=840 nreps=800 num_threads=180
n1_Tblock=928 n2_Tblock=1 n3_Tblock=124

With data transfer IN and OUT
time: 71.47 sec
throughput: 3770.01 MPoints/s
flops: 124.41 GFlops
Script ./mic_native.sh exemplifies how to run the sample code with KMP_AFFINITY= "granularity=thread,balanced" and 3 threads per code using 61 cores natively:
$ make clean ; make arch=mic ; ./mic_native.sh
allocating prev, next and vel: total 3996.56 Mbytes
n1=928 n2=448 n3=840 nreps=20 num_threads=183
n1_Tblock=928 n2_Tblock=1 n3_Tblock=124

time: 1.75 sec
throughput: 3850.36 MPoints/s
flops: 127.06 GFlops
Looking forward to hear about your experiences and experiments on this topic!