Optimize Seismic Image Processing on Intel® Many Integrated Core Architecture (Intel® MIC Architecture)

Introduction

The main goal of seismic imaging is to provide detailed knowledge of the subsurface parameters in order to localize a given target. Any seismic acquisition starts with the emission of waves that propagate into the ground and reflect (among other wave phenomena). The wave reflections are generated by the strong parameter contrasts between layers, e.g., in density and compressional wave velocity. These reflected waves are then recorded by hydrophones (marine seismic) or geophones (land seismic) located on the earth’s surface or within a dedicated well.

The initial frequency of the source signal and its associated wavelength strongly determine both the possible depth of investigation, since the ground attenuates the highest frequencies first, and the size of “visible” objects, since spatial resolution is strongly correlated with frequency content. This leads to different uses of seismic imaging, e.g., civil engineering, where the target is located within the first meters to a few hundred meters using a spatial sampling varying from centimeters to a few meters. The best-known use of seismic imaging is, of course, in the oil and gas industry, where the target can lie as deep as ten to twelve kilometers below ground with a spatial sampling of a few meters to 25 meters.

At a larger scale, wave propagation is used in seismology, from global earth imaging to local seismic hazard analysis, where the source signal is created by an earthquake instead of being manufactured by an air gun or a vibrating truck, as used in marine and land seismic surveying, respectively. Many other differences follow, since neither the source location nor the initial time of the wave emission is known. The source signal is also much more complex, because an earthquake ruptures along a fault, whereas a seismic survey uses what is effectively a point source.

One of the most important differences between earthquake science and seismic exploration is the amount of seismic data recorded. Seismologists can take advantage of several networks of geophones spread over the planet, giving a few records (seismograms) of a given earthquake. The oil industry, on the other hand, conducts dedicated surveys (over, say, 1,000 km²) in which one or several seismic vessels tow streamers 8 to 10 kilometers long carrying a hydrophone every 25 meters. Air guns then emit a seismic signal every 25 to 50 meters, i.e., every 10 to 20 seconds, and every hydrophone records the reflected waves at a typical sampling of 2 milliseconds.

A detailed calculation of the dataset size can easily be done and results in petabytes of data entering the famous “seismic processing sequence” (Figure 1), which produces the seismic image, before any further geological interpretation, reservoir characterization, or reservoir simulation is conducted.


Figure 1. Summary of the seismic processing sequence. The first step after acquisition in the field is to apply several signal-processing filters to the observed data to remove any unwanted noise. The second step consists of defining the initial models in which the seismic waves will be propagated (velocities, density, and anisotropy) before the seismic imaging step can be done using both the pre-stack data and the initial models.

Seismic Processing

Seismic data has to be sorted and cleaned during a so-called pre-processing sequence.

This cleaning of the observed data acquired in the field strongly depends on our ability to mimic the wave propagation and to simulate, or “explain,” every phenomenon we see beyond the standard seismic wave reflection, such as multiple reflections, noise, refracted waves, and so on.

During the 1980s, seismic imaging was described and reformulated as an inverse problem involving the minimization of a cost function between the observed data (recorded in the field) and the calculated data (simulated on a computer) for a given subsurface physical description (or model).

In other words, the pre-processing has to remove any wave phenomena recorded in the field that cannot easily be reproduced by computer simulation.

Seismic imaging, or so-called “migration” (as we try to migrate the events recorded at the surface back to their true depth), is then extremely dependent on the wave propagation technique.

The classification of the migration techniques is then a function of the input data (post-stack or pre-stack), the domain of application of the migration operator (time or depth), and the approximation of the wave equation being used for wave simulation.

The input data can be considered “pre-stack” in order to extract all the information from every seismic trace (i.e., each hydrophone record) independently, which is intuitively best for complex geological areas. Stacking the input data into “post-stack” data has the great advantages of reducing the dataset size and increasing the signal-to-noise ratio, but it destroys the specific information recorded by each independent hydrophone.

In terms of domains of application, time migration usually leads to a fast method that poorly represents complicated structures, due to strong approximations of the propagation media. Depth migration, reformulated as an inverse problem as already mentioned, uses the full physical, heterogeneous description of the subsurface and leads to better results.

Several combinations exist, giving methods ranging from post-stack time migration to pre-stack depth migration that handle anything from very simple to extremely complicated areas.

In terms of numerical schemes, the migration technique is related to the method used to solve the forward problem. One can use integral solutions of the wave equation based on asymptotic theory, leading to migration techniques known as pre-stack depth Kirchhoff migration and beam migration.

Other implementations are based on numerical integration of the partial differential wave equation, such as the one-way propagation approximation used in this work, which leads to a wave-field extrapolation migration, also known as wave equation migration (WEM).

For the most complicated acquisition and propagation areas, the best way to handle the full wavefield is still to use the two-way differential equation, which leads to the well-known reverse time migration (RTM) and full waveform inversion (FWI) algorithms.

Split Step Fourier Pre-stack Depth Migration (PsDM)

In this work, we consider one expression of the one-way approximation where wave propagation is done in the frequency domain. This is known as the Split Step Fourier PsDM (SSF PsDM) algorithm. The technique is highly efficient and extremely suitable for mild lateral velocity variations, and it is one of the key modules in the Sinopec iCluster* seismic imaging system, widely applied in China’s petroleum industry.

We ported the SSF PsDM algorithm to the Intel® Xeon Phi™ coprocessor, the first product based on the Intel® Many Integrated Core architecture announced in May 2010.

3D SSF PsDM is a common shot-based migration, meaning that each shot-gather (one wave emission and the associated reflected waves recorded by every receiver) can be processed independently. The waves are propagated from the source position and back-propagated from every receiver position, but only in one direction due to the one-way approximation, as already mentioned. A cross-correlation of the two wave fields is conducted at the same depth to get the image. This process is repeated down to the maximum depth. Finally, every migrated shot image is stacked at the same imaging point to form the final image.
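
For context, the split-step extrapolation of a single frequency component from depth z to z + Δz is commonly written in two stages. This standard form is supplied here as background consistent with the phase-shift and time-shift steps of the kernel pseudocode in Figure 3; it is not quoted from the iCluster code:

% Stage 1: phase shift in the wavenumber domain, using a reference velocity \bar{c}(z)
\tilde{P}(k_x, k_y, z+\Delta z; \omega) = \tilde{P}(k_x, k_y, z; \omega)\,
    \exp\!\left( i\,\Delta z \sqrt{ \omega^2 / \bar{c}^{\,2} - k_x^2 - k_y^2 } \right)

% Stage 2: after the inverse DFT, a space-domain "time shift" corrects for
% lateral velocity variations c(x, y, z) around the reference \bar{c}
P(x, y, z+\Delta z; \omega) = \mathcal{F}^{-1}\!\left[ \tilde{P} \right]
    \exp\!\left( i\,\omega\,\Delta z \left( \frac{1}{c(x,y,z)} - \frac{1}{\bar{c}} \right) \right)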



Figure 2. Cartoon representation of the Split Step Fourier PsDM process. Waves are propagated from the source position and back-propagated from the receiver positions. An imaging condition is applied depth slice by depth slice by cross-correlating the two wave fields.

For a given shot, a partial image is obtained by applying the imaging filter equation in Figure 2, where Pimg is the migrated contribution of the shot-gather, Ps*(Xs, ω) is the conjugated source wave field, Pr(Xr, ω) is the downward-propagated recorded wave field, and Xs and Xr are the spatial coordinates of the source and receivers. Since the propagations are done in the frequency-space domain, we denote by nw the number of frequencies involved. The final image is then obtained by the summation of each shot-gather’s contribution.
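
Written out, the cross-correlation imaging condition described above takes the standard form below. This is the textbook expression that matches the accumulation in the baseline imaging code of Figure 8 (note the conjugation on the source wave field); it is a reconstruction, not a copy of the equation in Figure 2:

P_{\mathrm{img}}(x, y, z) = \sum_{\omega = 1}^{n_w} P_r(x, y, z; \omega)\, P_s^{*}(x, y, z; \omega),
\qquad
\text{final image} = \sum_{\text{shots}} P_{\mathrm{img}}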

Implementation on Intel® Xeon Phi™ coprocessor

Baseline Code and Workload Introduction

Figure 3 shows the SSF PsDM algorithm in pseudo code. Notice that the SSF PsDM kernel uses a four-deep loop nest consisting of one depth-domain loop, one frequency-domain loop, and two spatial loops. Our application employs distributed parallelism (implemented with the Intel® MPI library), using the master-slave paradigm, to process multiple shot-gather datasets simultaneously on multiple Intel® Xeon Phi™ coprocessors. On the Intel® Xeon Phi™ coprocessor or the CPUs of a single node, we use OpenMP* to parallelize the computation of each shot-gather, and the code is unified for both architectures.


! SSF PsDM kernel
!$omp parallel
IZ loop: depth domain loop

!$omp do
   IW loop: frequency domain loop
      2D DFT forward for source wavefield
      2D DFT forward for recorded wavefield
      phase-shift calculation
      2D DFT backward for source wavefield
      2D DFT backward for recorded wavefield
      time-shift calculation, heavy sin/cos computation
   End of IW loop

   Imaging: wavefield imaging and stack up
End of IZ loop

Figure 3. SSF PsDM kernel pseudo-code

In this study, we used a Negative Structure workload with a total size of 73 GB, which has 4,164 shots, 6 s recording time, and a 2 ms sampling interval. Since the shots have the same characteristics (in terms of number of receivers and recording length), we only calculated a few of them in the following performance evaluation.

We used the Discrete Fourier Transform (DFT) routines in the Intel® Math Kernel Library (Intel® MKL) in the baseline code. As shown in Figure 3, there are four 2D DFT calls in a single iteration, leading to numerous (IZ*IW*4) DFT calls in a real workload. For example, in the Negative Structure workload, the 2D DFT is called about 1,250,000 times per shot-gather; this is a compute-intensive and highly parallel task for which the time to offload and transfer the data becomes negligible.
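
As an illustration, a 2D 512x256 complex-to-complex transform (the size reported later in Table 2) would be set up with the Intel MKL DFT interface roughly as in the following sketch. The single precision and the 1-D storage of the frequency slice are our assumptions, not details taken from the iCluster code:

program dfti_sketch
  use MKL_DFTI                       ! Intel MKL DFT interface module
  implicit none
  type(DFTI_DESCRIPTOR), pointer :: handle
  integer :: status, lengths(2)
  complex :: slice(512*256)          ! one frequency slice, stored as 1-D
  lengths = (/ 512, 256 /)
  slice = (0.0, 0.0)
  status = DftiCreateDescriptor(handle, DFTI_SINGLE, DFTI_COMPLEX, 2, lengths)
  status = DftiCommitDescriptor(handle)
  status = DftiComputeForward(handle, slice)    ! to the wavenumber domain
  ! the phase-shift calculation of Figure 3 would be applied here
  status = DftiComputeBackward(handle, slice)   ! back to the space domain
  status = DftiFreeDescriptor(handle)
end program dfti_sketch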

Parallelize and offload SSF PsDM kernel

The first-generation Intel® Xeon Phi™ coprocessor is built on a 22 nm process and is delivered as a PCIe card with up to 61 cores clocked at 1.1 GHz. The Intel® Many Integrated Core architecture is an x86-based many-core processor architecture using small, in-order cores that uniquely combines the full programmability of today’s general-purpose CPU architectures with the compute throughput and memory bandwidth capabilities of modern GPU architectures.

A key advantage for developers using Intel® Xeon Phi™ coprocessors is the ability to use standard, existing programming tools and methods, such as C, C++, and Fortran, and corresponding parallelization techniques such as OpenMP*, MPI, and Intel® Threading Building Blocks (Intel® TBB).

Intel® Composer XE 2013 includes language extensions that can offload sections of code to run on the Intel MIC architecture. The language extensions for offloading computation (including the related code and data) to the Intel® Xeon Phi™ coprocessor consist of pragmas in C/C++ and directives in Fortran.

The same source code written for Intel® Xeon Phi™ Coprocessors can be compiled and run on a standard Intel® Xeon® processor. The familiar programming model removes developer training barriers, allowing developers to focus on the problems rather than on software engineering.

For example, standard OpenMP* can be used to parallelize the kernel in Figure 3, where the OpenMP* parallelization is applied to the inner IW loop. The outer IZ loop was not parallelized because it carries a data dependence: each depth step depends on the wavefields extrapolated at the previous depth.
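
A minimal Fortran sketch of this loop structure is shown below. The subroutine names are hypothetical placeholders for the kernel steps of Figure 3, not identifiers from the iCluster code:

subroutine ssf_psdm_sketch(nz, nw)
  implicit none
  integer, intent(in) :: nz, nw
  integer :: iz, iw
!$omp parallel private(iw)
  do iz = 1, nz                    ! depth loop: serial, carried dependence
!$omp do schedule(static)
    do iw = 1, nw                  ! frequency loop: independent iterations
      call propagate_one_frequency(iz, iw)   ! DFTs, phase shift, time shift
    enddo
!$omp end do                       ! implicit barrier before imaging
!$omp single
    call image_and_stack(iz)       ! imaging per depth slice; serialized here
!$omp end single                   ! for brevity, but parallelized in Figure 8
  enddo
!$omp end parallel
end subroutine ssf_psdm_sketch

Keeping the parallel region outside the depth loop, as in Figure 3, avoids paying the fork/join cost once per depth step.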

We then use Fortran directives to offload the whole OpenMP* region (Figure 4) to the Intel® Xeon Phi™ coprocessor. (The function SSF_PSDM_MIC_WRAPPER() shown in Figure 4 is the wrapper containing the parallelized kernel from Figure 3.)


! in(...):    data that needs to be copied in
! inout(...): data that needs to be copied in and out
!dec$ offload target(mic)
!dec$+   in (…)
!dec$+   inout (…)

        CALL SSF_PSDM_MIC_WRAPPER (…)

Figure 4. Offloading kernel to Intel® MIC Architecture.

Multi-Coprocessor Support

In real workload scenarios, the use of multiple Intel Xeon Phi coprocessors can further boost performance. The iCluster SSF PsDM code already implements shot-level parallelization with the Intel MPI library in a master-slave paradigm, so using two or more Intel® Xeon Phi™ coprocessors is easy to implement thanks to the multi-coprocessor support included in Intel® Composer XE 2013. In Figure 5, we show how a simple expression selects a specific coprocessor: (myid-1)%M + 1, where M is the total number of coprocessors and myid is the MPI process number.

Furthermore, to develop heterogeneous code that uses both the CPU and the Intel® Xeon Phi™ coprocessor resources, we created a Boolean variable use_mic to control whether the kernel called in a specific MPI process should be offloaded or run on the CPU instead.


! use_mic selects offload vs. host execution; myid selects the coprocessor
!dec$ offload if (use_mic) target(mic: myid)
!dec$+   in (…)
!dec$+   inout (…)
        CALL SSF_PSDM_MIC_WRAPPER (…)

Figure 5. Multi-coprocessor support in SSF PsDM code.
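
The surrounding rank-mapping logic might look like the sketch below; apart from myid and the (myid-1)%M + 1 expression quoted above, every name here is a hypothetical reconstruction:

subroutine choose_target(use_mic, mic_id)
  implicit none
  include 'mpif.h'
  logical, intent(out) :: use_mic
  integer, intent(out) :: mic_id
  integer :: myid, ierr
  integer, parameter :: num_mics = 2     ! assumed: 2 coprocessors per node
  ! assumes MPI_INIT was already called by the master-slave framework
  call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
  ! rank 0 is the master; the first num_mics workers offload, and any
  ! remaining workers stay on the CPU (the heterogeneous setup below)
  use_mic = (myid >= 1 .and. myid <= num_mics)
  mic_id  = mod(myid - 1, num_mics) + 1  ! the (myid-1)%M + 1 mapping
end subroutine choose_target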

In the following test, we set up a dual-socket Intel® Xeon® processor E5-2680-based server with two Intel® Xeon Phi™ coprocessors. We implemented a fully hybrid, heterogeneous model for the SSF PsDM code to run on both the CPUs and the coprocessors simultaneously. More specifically, we created 4 MPI processes: 1 master process and 3 associated worker processes. One worker process is bound to the 2 CPU sockets, while the other 2 offload to the 2 coprocessors, respectively. We created 12 threads for the MPI process running on the CPUs and 93 threads on each coprocessor.

The Intel® Xeon Phi™ coprocessor provides up to 61 cores, each supporting 4 hardware threads. In a preliminary study we found that using 3 threads per core gives the best performance (60 cores x 3 threads gives the 180 threads used below, with one core typically left to the coprocessor OS and the offload daemon).

Thread Affinity

Thread affinity is, of course, useful so that thread migrations from one processor to another are minimized, improving cache efficiency. On the CPU side, we used “KMP_AFFINITY=compact” to bind the OpenMP* threads of a process to the same socket when launching 2 processes on the dual-socket Intel Xeon processor server. On the Intel® Xeon Phi™ coprocessor side, we used the “balanced” mode (see Figure 6) instead of “compact” to make sure all the threads were spread evenly over all the coprocessor cores; this mode is recognized by the OpenMP* runtime only on the Intel® Many Integrated Core architecture.

In practice, the “scatter” mode also works perfectly well here, since the data associated with each computation thread is independent. When adjacent threads do work on the same data, “balanced” gives better performance because neighboring threads can share cache. Alternatively, KMP_AFFINITY=proclist=[…] can be used to pin threads explicitly. For more details on processor affinity for OpenMP* threads, please refer to the OpenMP* implementation document [3].


call kmp_set_defaults("KMP_AFFINITY=balanced")

Figure 6. Thread affinity on Intel MIC Architecture

Performance Evaluation and Characterization

Performance Comparison

As shown in Table 1, a single Intel® Xeon Phi™ coprocessor obtained a 2.2X speedup over the dual-socket, 16-core Intel Xeon processor server running the baseline code. In the fully heterogeneous mode (the best mode, of course, for taking advantage of both the CPUs and the coprocessors simultaneously), with two nodes each configured with two Intel® Xeon® processors E5-2680 and two Intel® Xeon Phi™ coprocessors, we achieved a 10.3X speedup over the Intel Xeon processor-based server baseline performance.

Table 1. Performance Results Using Intel® Xeon® Processor and Intel® MIC Architecture

Configuration (Negative Structure workload)                              Elapsed time   Speedup
Intel® Xeon® processor E5-2680 2.7 GHz, baseline (2 CPUs, 16 threads)    2802 s         1
Intel Xeon processor E5-2680 2.7 GHz, optimized (2 CPUs, 16 threads)     1342 s         2.08
1 Intel® Xeon Phi™ coprocessor, offloading (180 threads)                 1273 s         2.2
Two nodes (4 Intel Xeon processors E5-2680 + 4 coprocessors)              272 s         10.3

Note: In the heterogeneous test, we used the unified, optimized code.

We also evaluated the core scaling (pinning three threads on each core) on the Intel® Xeon Phi™ coprocessor, as shown in Figure 7. Overall scalability is very good: we obtained a 42X speedup on 60 cores (about 70% parallel efficiency) in this workload.


Figure 7. Threading scalability on Intel® Xeon Phi™ coprocessor

Performance Profiling

To better understand the performance characteristics, we used VTune™ Amplifier XE 2013 to profile the code on the Intel Xeon Phi coprocessor side. The hotspots for the Intel® Xeon processor E5-2680 and for the Intel® Xeon Phi™ coprocessor are shown in Table 2:

Table 2. Performance profile—hot functions

Module      Function        Intel Xeon E5-2680 baseline   Intel Xeon Phi coprocessor   Notes
Intel MKL   DFT             31.4%                         37.3%                        2D C2C with size of 512x256
Intel MKL   sin/cos/vccis   11.7%                         21.5%                        Vectorized implementation in VML
PsDM        phase_shift      5.3%                         11.7%                        Not vectorized on either platform; heavy branching
PsDM        ssf_test         2.9%                          1.7%                        Complex operations, vectorized
PsDM        imaging         32.1%                          3.2%                        Hand-coded optimization on Intel MIC Architecture
libiomp     (runtime)        6.8%                         16.2%                        Threading overhead
Total                       90.2%                         91.6%

First, in order to use the 512-bit SIMD engine on the Intel® Xeon Phi™ coprocessor and get the best performance, we vectorized most of the computation in the hot loop in three ways: 1) by calling optimized Intel® MKL functions (for example, the Intel® MKL DFT and the sin/cos/vccis routines in Intel® MKL VML); 2) by using Intel® Composer XE 2013 auto-vectorization; in the ssf_test function, for instance, we used the “-fno-alias” option to tell the compiler that there is no aliasing, which let it vectorize the function successfully; and 3) by hand-coded optimization, for example in the imaging function for wavefield imaging and stack-up. There we restructured the loop nest in order to vectorize the inner loop, which originally contained many random memory reads and many complex data type operations. Through profiling, we found that this function consumed 32.1% of the compute time in the baseline code, which was greatly reduced after optimization. Figure 8 shows the original version of the baseline code.


! baseline code
!$omp parallel do private(ix,iy,iw)
do iy = 1, NY
 do ix = 1, NX
  do iw = 1, NW    ! innermost IW loop: stride of NX*NY elements, cache-hostile
    rimag(ix,iy)=rimag(ix,iy)+dat_r(ix,iy,iw)*CONJG(dat_s(ix,iy,iw))
  enddo
 enddo
enddo

Figure 8. Baseline imaging code

The original code has very poor performance due to frequent cache misses: with the IW loop innermost, successive iterations access dat_r and dat_s with a stride of NX*NY elements. The optimization strategy is quite simple in this case: interchange the loops so that the memory access of the inner loop (the IX loop) is contiguous. This also allows Intel® Composer XE 2013 to auto-vectorize the loop, as confirmed by the compiler’s vectorization report. Figure 9 shows the optimized code.


! optimized code
!$omp parallel do private(ix,iy,iw)
do iy = 1, NY
 do iw = 1, NW
  do ix = 1, NX    ! innermost IX loop: unit-stride, vectorizable
    rimag(ix,iy)=rimag(ix,iy)+dat_r(ix,iy,iw)*CONJG(dat_s(ix,iy,iw))
  enddo
 enddo
enddo

Figure 9. Optimized imaging code

Making the IW loop the outermost loop and parallelizing it would not be better, because we would also need an atomic operation to protect the update of rimag(ix,iy), which is much more expensive than the version introduced in Figure 9.
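
For illustration, a sketch of that rejected variant follows, with a critical section standing in for the atomic protection mentioned above (an !$omp atomic update behaves similarly for scalar types). The serialization of every update to the shared image is exactly why this ordering loses:

! IW-outermost variant (rejected): all threads update the shared image
!$omp parallel do private(ix,iy,iw)
do iw = 1, NW
 do iy = 1, NY
  do ix = 1, NX
!$omp critical
    rimag(ix,iy)=rimag(ix,iy)+dat_r(ix,iy,iw)*CONJG(dat_s(ix,iy,iw))
!$omp end critical
  enddo
 enddo
enddo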

Second, as we saw in Table 2, the 2D complex-to-complex (C2C) DFT and the sin/cos/vccis functions in Intel MKL are the hottest functions in this application, accounting for 58.8% of the compute time on the Intel® Xeon Phi™ coprocessor. The phase_shift function accounts for 11.7% of the compute time and hasn’t been vectorized on either platform, due to heavy branching and indirect memory accesses.

Table 4. Threading overhead on Intel MIC Architecture

Module Name    60 threads   120 threads   180 threads   240 threads
libiomp5.so    12.1%        12.2%         16.2%         26.36%

Finally, we also found a load-balancing issue as the thread count increases, as illustrated in Table 4. According to the VTune™ Amplifier XE 2013 profiling results, the time is spent in the __kmp_wait_sleep() function. Specifically, it is caused by the implicit barrier at the end of the work-sharing construct, since the trip count of the frequency-domain loop we parallelized cannot be divided evenly among every thread count tested, and the share of time lost to this imbalance grows with more threads. (For example, if the loop had 1,000 iterations, 240 threads would receive 4 or 5 iterations each, leaving the 4-iteration threads idle for roughly 20% of each pass.) A scheduling strategy such as dynamic or guided in OpenMP* may help in other cases, but here static scheduling performed best in our experiments because of the overhead involved in dynamic and guided scheduling.
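
In OpenMP* terms, the choice reduces to the schedule clause on the frequency loop, sketched below with the hypothetical helper from the earlier kernel sketch:

! alternatives tried: schedule(dynamic) and schedule(guided); their runtime
! bookkeeping cost more than the small imbalance they removed
!$omp do schedule(static)          ! best here: near-uniform work per iteration
do iw = 1, nw
  call propagate_one_frequency(iz, iw)
enddo
!$omp end do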

Summary

In this report, we evaluated Split Step Fourier PsDM on the first-generation Intel Xeon Phi coprocessor, which obtained a 2.2X speedup over a dual-socket Intel® Xeon® processor E5-2680-based server running at 2.7 GHz. We also evaluated the multi-coprocessor and heterogeneous computing models on Intel® Xeon processors and Intel® Xeon Phi™ coprocessors and obtained a 10.3X speedup over the baseline performance on two nodes of the dual-socket Xeon processor E5-2680-based server, each with dual Intel® Xeon Phi™ coprocessors. In this case, we only added 8 lines of Fortran code to offload the PsDM kernel and manage the data flow for the Intel® Xeon Phi™ coprocessor. More importantly, the optimizations done for the Intel® Xeon Phi™ coprocessor (e.g., interchanging the nested loops in the imaging() function and calling the sin/cos/vccis routines from the Intel® MKL library) also work for the CPU, boosting PsDM performance on the CPU by 2.08X as well.

Meanwhile, we also conducted extensive performance analyses, including scalability testing and performance profiling. We see performance scaling that is nearly linear in the number of cores (42X speedup on 60 cores with 3 threads per core).

References

[1] http://software.intel.com/mic-developer
[2] http://www.openmp.org/
[3] IDF 2011 Beijing, session STFS009: A Unified Approach to Heterogeneous programming with Intel® Multicore Processors and the Intel® Many Integrated Core Architecture
[4] Virieux, J., Calandra, H., Plessix, R.-E., A review of the spectral, pseudo-spectral, finite-difference and finite-element modelling techniques for geophysical imaging. Geophysical Prospecting, 2011, 59, 794–813

System configuration

  • Intel MIC Architecture Software Development Platform (Shady Cove)
  • Intel Beta Level Software (Intel MIC Compilers, Drivers, etc.)

HW specifications
2 x Intel® Xeon Phi™ coprocessors (ES2 silicon) @ 1.1 GHz, 8 GB GDDR5 @ 5.5 GT/s
Host: a 2-socket platform with 2 Intel® Xeon® processors E5-2680 (2.7 GHz, 8 cores, 20 MB L3 cache) and 48 GB DDR3 @ 1333 MHz. OS: RHEL 6.3
Intel MIC Architecture SW Stack
Intel MIC Architecture kernel driver Ver. 2.1.3653-8 (Beta Release)
Flash Image/uOS: 2.1.02.0314/2.6.34.11-g4af9302

Intel® Composer XE 2013 Beta, VTune™ Amplifier XE 2013 Beta, MPICH 1.2.7

Notices

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm

Intel, the Intel logo, VTune, Phi and Xeon are trademarks of Intel Corporation in the U.S. and other countries.
*Other names and brands may be claimed as the property of others
Copyright © 2012 Intel Corporation. All rights reserved.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families: Go to: Learn About Intel® Processor Numbers

For more information regarding compiler optimizations, see the Optimization Notice.