Stencil: RTM Stencil

Stencil computation is the basis for the Reverse Time Migration (RTM) algorithm in seismic computing. The underlying mathematical problem is to solve the wave equation with a finite difference method. This sample computes a 3-D 25-point stencil; the computation consists of four nested loops, one over the time steps and one over each spatial dimension. Four implementations are provided: a straight scalar version, one that uses Intel® Cilk™ Plus #pragma simd so the code vectorizes, one that uses cilk_spawn to add parallelization, and one that combines both.
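For reference, all four versions implement the same explicit second-order-in-time scheme for the wave equation ∂ₜₜu = v²∇²u; a sketch (in the code below, the two time levels share a pair of buffers toggled by t & 1, and g_vsq together with the c_coef weights absorbs the v², Δt², and grid-spacing factors):

    u^{t+1}_{x,y,z} \;=\; 2\,u^{t}_{x,y,z} \;-\; u^{t-1}_{x,y,z} \;+\; v^{2}_{x,y,z}\,\Delta t^{2}\,\nabla_h^{2}\,u^{t}_{x,y,z}

where ∇ₕ² is the 25-point discrete Laplacian: the center cell plus 4 neighbors on each side along each of the three axes, weighted by c_coef[0..4].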


Code Change Highlights:

The simplest strategy for parallelizing this stencil is a parallel loop; however, memory bandwidth can then become the major bottleneck to getting good speedup on multi-core processors. We developed cache-oblivious implementations to optimize cache usage and program performance. For a description and analysis of the cache-oblivious algorithm, please refer to the paper "The cache complexity of multithreaded cache oblivious algorithms" by Matteo Frigo and Volker Strumpen. This strategy uses the cilk_spawn keyword.
Moreover, to improve the performance of the base cases (the innermost loop), the code applies #pragma simd for vectorization.
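As a minimal 1-D sketch of the cache-oblivious recursion (serial C; the walk/kernel names, the toy 3-point stencil, and the cut formulas follow Frigo and Strumpen's walk1 rather than the sample's exact code), wide regions are cut in space along lines sloped by the stencil's dependence distance, and deep regions are cut in time:

    #include <stdio.h>

    #define N      64   /* grid points (toy size) */
    #define SLOPE  1    /* the stencil reaches 1 neighbor per time step */

    static float u[2][N];   /* two time levels, toggled by t & 1 */

    /* Toy base case: one time step of a 3-point stencil on [x0, x1). */
    static void kernel(int t, int x0, int x1)
    {
        for (int x = x0; x < x1; ++x)
            u[(t + 1) & 1][x] =
                (u[t & 1][x - 1] + u[t & 1][x] + u[t & 1][x + 1]) / 3.0f;
    }

    /* Walk the space-time trapezoid with base [x0, x1) at time t0, top at
       time t1, left edge moving dx0 and right edge dx1 cells per step. */
    static void walk(int t0, int t1, int x0, int dx0, int x1, int dx1)
    {
        int dt = t1 - t0;
        if (dt == 1) {
            kernel(t0, x0, x1);
        } else if (dt > 1) {
            if (2 * (x1 - x0) + (dx1 - dx0) * dt >= 4 * SLOPE * dt) {
                /* Space cut: the left piece, whose right edge slopes inward,
                   contains all dependencies of the right piece, so it runs
                   first (independent pieces are what co_cilk spawns). */
                int xm = (2 * (x0 + x1) + (2 * SLOPE + dx0 + dx1) * dt) / 4;
                walk(t0, t1, x0, dx0, xm, -SLOPE);
                walk(t0, t1, xm, -SLOPE, x1, dx1);
            } else {
                /* Time cut: recurse on the lower and upper halves in time. */
                int s = dt / 2;
                walk(t0, t0 + s, x0, dx0, x1, dx1);
                walk(t0 + s, t1, x0 + dx0 * s, dx0, x1 + dx1 * s, dx1);
            }
        }
    }

    int main(void)
    {
        for (int x = 0; x < N; ++x) u[0][x] = (x == N / 2) ? 1.0f : 0.0f;
        walk(0, 8, 1, 0, N - 1, 0);   /* 8 steps over the interior points */
        printf("u[%d] = %f\n", N / 2, u[0][N / 2]);
        return 0;
    }

The sample's co_cilk below applies the same cutting rules to the 3-D grid and turns the independent pieces into cilk_spawn calls.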
  • cilk_spawn
    linear version:
    // March forward in time
    for (int t = t0; t < t1; ++t) {
        // March over 3D Cartesian grid
        for (int z = z0; z < z1; ++z) {
            for (int y = y0; y < y1; ++y) {
                for (int x = x0; x < x1; ++x) {
                    // 25-point stencil applied to g_grid3D, centered at point x,y,z
                    int point_xyz = z * num_xy + y * c_num_x + x;
                    float *grid3D_cur = &g_grid3D[t & 1][point_xyz];
                    float *grid3D_next = &g_grid3D[(t + 1) & 1][point_xyz];
                    float div = c_coef[0] * grid3D_cur[0]
                        + c_coef[1] * ((grid3D_cur[0 + 1] + grid3D_cur[0 - 1])
                                     + (grid3D_cur[0 + c_num_x] + grid3D_cur[0 - c_num_x])
                                     + (grid3D_cur[0 + num_xy] + grid3D_cur[0 - num_xy]))
                        + c_coef[2] * ((grid3D_cur[0 + 2] + grid3D_cur[0 - 2])
                                     + (grid3D_cur[0 + num_x2] + grid3D_cur[0 - num_x2])
                                     + (grid3D_cur[0 + num_xy2] + grid3D_cur[0 - num_xy2]))
                        + c_coef[3] * ((grid3D_cur[0 + 3] + grid3D_cur[0 - 3])
                                     + (grid3D_cur[0 + num_x3] + grid3D_cur[0 - num_x3])
                                     + (grid3D_cur[0 + num_xy3] + grid3D_cur[0 - num_xy3]))
                        + c_coef[4] * ((grid3D_cur[0 + 4] + grid3D_cur[0 - 4])
                                     + (grid3D_cur[0 + num_x4] + grid3D_cur[0 - num_x4])
                                     + (grid3D_cur[0 + num_xy4] + grid3D_cur[0 - num_xy4]));
                    grid3D_next[0] = 2 * grid3D_cur[0] - grid3D_next[0] + g_vsq[point_xyz] * div;
                }
            }
        }
    }
    cilk_spawn version:
    void co_cilk(int t0, int t1, int x0, int dx0, int x1, int dx1,
                 int y0, int dy0, int y1, int dy1, int z0, int dz0, int z1, int dz1)
    {
        int dt = t1 - t0, dx = x1 - x0, dy = y1 - y0, dz = z1 - z0;
        int i;

        // Divide 3D Cartesian grid into chunk size and time period
        if (dx >= c_dx_threshold && dx >= dy && dx >= dz && dt >= 1 &&
            dx >= 2 * c_distance * dt * c_NPIECES) {
            // divide and conquer along x direction
            int chunk = dx / c_NPIECES;

            for (i = 0; i < c_NPIECES - 1; ++i)
                cilk_spawn co_cilk(t0, t1, x0 + i * chunk, c_distance,
                                   x0 + (i + 1) * chunk, -c_distance,
                                   y0, dy0, y1, dy1, z0, dz0, z1, dz1);
            /*nospawn*/co_cilk(t0, t1, x0 + i * chunk, c_distance, x1, -c_distance,
                               y0, dy0, y1, dy1, z0, dz0, z1, dz1);
            cilk_sync;
            cilk_spawn co_cilk(t0, t1, x0, dx0, x0, c_distance,
                               y0, dy0, y1, dy1, z0, dz0, z1, dz1);
            for (i = 1; i < c_NPIECES; ++i)
                cilk_spawn co_cilk(t0, t1, x0 + i * chunk, -c_distance,
                                   x0 + i * chunk, c_distance,
                                   y0, dy0, y1, dy1, z0, dz0, z1, dz1);
            /*nospawn*/co_cilk(t0, t1, x1, -c_distance, x1, dx1,
                               y0, dy0, y1, dy1, z0, dz0, z1, dz1);
        } else if (dy >= c_dyz_threshold && dy >= dz && dt >= 1 &&
                   dy >= 2 * c_distance * dt * c_NPIECES) {
            // similarly divide and conquer along y direction
            ...
        }
    }
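One way to read the guard dx >= 2 * c_distance * dt * c_NPIECES: each of the c_NPIECES chunks is a trapezoid whose sides slope inward by c_distance cells per time step, so its width shrinks by 2·c_distance each step, and the guard keeps every chunk's top edge from degenerating:

    \frac{dx}{\texttt{c\_NPIECES}} - 2\,\texttt{c\_distance}\cdot dt \;\ge\; 0
    \quad\Longleftrightarrow\quad
    dx \;\ge\; 2\,\texttt{c\_distance}\cdot dt\cdot \texttt{c\_NPIECES}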
  • pragma simd
  • The autovectorizer reports an assumed vector dependence in the inner loop of loop_stencil because the compiler cannot prove that the output array grid3D_next and the input array grid3D_cur do not overlap. #pragma simd lets the programmer supply that knowledge when no dependence actually exists (an alternative using restrict follows the code below).
    scalar version:
    for (int x = x0; x < x1; ++x) {
        // 25-point stencil applied to g_grid3D, centered at point x,y,z
        int point_xyz = z * num_xy + y * c_num_x + x;
        float *grid3D_cur = &g_grid3D[t & 1][point_xyz];
        float *grid3D_next = &g_grid3D[(t + 1) & 1][point_xyz];
        float div = c_coef[0] * grid3D_cur[0]
            + c_coef[1] * ((grid3D_cur[0 + 1] + grid3D_cur[0 - 1])
                         + (grid3D_cur[0 + c_num_x] + grid3D_cur[0 - c_num_x])
                         + (grid3D_cur[0 + num_xy] + grid3D_cur[0 - num_xy]))
            + c_coef[2] * ((grid3D_cur[0 + 2] + grid3D_cur[0 - 2])
                         + (grid3D_cur[0 + num_x2] + grid3D_cur[0 - num_x2])
                         + (grid3D_cur[0 + num_xy2] + grid3D_cur[0 - num_xy2]))
            + c_coef[3] * ((grid3D_cur[0 + 3] + grid3D_cur[0 - 3])
                         + (grid3D_cur[0 + num_x3] + grid3D_cur[0 - num_x3])
                         + (grid3D_cur[0 + num_xy3] + grid3D_cur[0 - num_xy3]))
            + c_coef[4] * ((grid3D_cur[0 + 4] + grid3D_cur[0 - 4])
                         + (grid3D_cur[0 + num_x4] + grid3D_cur[0 - num_x4])
                         + (grid3D_cur[0 + num_xy4] + grid3D_cur[0 - num_xy4]));
        grid3D_next[0] = 2 * grid3D_cur[0] - grid3D_next[0] + g_vsq[point_xyz] * div;
    }
    pragma simd version:
    int point_yz = z * num_xy + y * c_num_x;
    float *grid3D_cur = &g_grid3D[t & 1][point_yz];
    float *grid3D_next = &g_grid3D[(t + 1) & 1][point_yz];
    #pragma simd
    for (int x = x0; x < x1; ++x) {
        // 25-point stencil applied to g_grid3D, centered at point x,y,z
        float div = c_coef[0] * grid3D_cur[x]
            + c_coef[1] * ((grid3D_cur[x + 1] + grid3D_cur[x - 1])
                         + (grid3D_cur[x + c_num_x] + grid3D_cur[x - c_num_x])
                         + (grid3D_cur[x + num_xy] + grid3D_cur[x - num_xy]))
            + c_coef[2] * ((grid3D_cur[x + 2] + grid3D_cur[x - 2])
                         + (grid3D_cur[x + num_x2] + grid3D_cur[x - num_x2])
                         + (grid3D_cur[x + num_xy2] + grid3D_cur[x - num_xy2]))
            + c_coef[3] * ((grid3D_cur[x + 3] + grid3D_cur[x - 3])
                         + (grid3D_cur[x + num_x3] + grid3D_cur[x - num_x3])
                         + (grid3D_cur[x + num_xy3] + grid3D_cur[x - num_xy3]))
            + c_coef[4] * ((grid3D_cur[x + 4] + grid3D_cur[x - 4])
                         + (grid3D_cur[x + num_x4] + grid3D_cur[x - num_x4])
                         + (grid3D_cur[x + num_xy4] + grid3D_cur[x - num_xy4]));
        grid3D_next[x] = 2 * grid3D_cur[x] - grid3D_next[x] + g_vsq[point_yz + x] * div;
    }
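An alternative way to convey the same no-overlap guarantee is the restrict qualifier (a simplified 1-D sketch, not code from the sample; C++ compilers spell it __restrict):

    /* Illustrative only: restrict promises the compiler that next, cur, and
       vsq never alias, removing the assumed dependence that blocked
       autovectorization (simplified 1-D, 3-point version of the update). */
    void stencil_row(float *restrict next, const float *restrict cur,
                     const float *restrict vsq, int x0, int x1)
    {
        for (int x = x0; x < x1; ++x) {
            float div = cur[x + 1] + cur[x - 1] - 2.0f * cur[x];
            next[x] = 2.0f * cur[x] - next[x] + vsq[x] * div;
        }
    }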
  • cilk_spawn + pragma simd
  • To combine cilk_spawn and #pragma simd, the cilk_spawn version of the code simply replaces its base case (co_basecase_nv) with one whose inner loop is the #pragma simd version shown above.
    cilk_spawn + pragma simd version:
    void co_cilksimd(int t0, int t1, int x0, int dx0, int x1, int dx1,
                     int y0, int dy0, int y1, int dy1, int z0, int dz0, int z1, int dz1)
    {
        int dt = t1 - t0, dx = x1 - x0, dy = y1 - y0, dz = z1 - z0;
        int i;

        // Divide 3D Cartesian grid into chunk size and time period
        if (dx >= c_dx_threshold && dx >= dy && dx >= dz && dt >= 1 &&
            dx >= 2 * c_distance * dt * c_NPIECES) {
            // divide and conquer along x direction
            int chunk = dx / c_NPIECES;

            for (i = 0; i < c_NPIECES - 1; ++i)
                cilk_spawn co_cilksimd(t0, t1, x0 + i * chunk, c_distance,
                                       x0 + (i + 1) * chunk, -c_distance,
                                       y0, dy0, y1, dy1, z0, dz0, z1, dz1);
            /*nospawn*/co_cilksimd(t0, t1, x0 + i * chunk, c_distance, x1, -c_distance,
                                   y0, dy0, y1, dy1, z0, dz0, z1, dz1);
            cilk_sync;
            cilk_spawn co_cilksimd(t0, t1, x0, dx0, x0, c_distance,
                                   y0, dy0, y1, dy1, z0, dz0, z1, dz1);
            for (i = 1; i < c_NPIECES; ++i)
                cilk_spawn co_cilksimd(t0, t1, x0 + i * chunk, -c_distance,
                                       x0 + i * chunk, c_distance,
                                       y0, dy0, y1, dy1, z0, dz0, z1, dz1);
            /*nospawn*/co_cilksimd(t0, t1, x1, -c_distance, x1, dx1,
                                   y0, dy0, y1, dy1, z0, dz0, z1, dz1);
        } else if (dy >= c_dyz_threshold && dy >= dz && dt >= 1 &&
                   dy >= 2 * c_distance * dt * c_NPIECES) {
            // similarly divide and conquer along y direction
            ...
        } else if (dz >= c_dyz_threshold && dt >= 1 &&
                   dz >= 2 * c_distance * dt * c_NPIECES) {
            // similarly divide and conquer along z direction
            ...
        } else if (dt > c_dt_threshold) {
            // split the time range
            ...
        }
    }
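For orientation, a hypothetical top-level call over the whole grid could look like the following; c_nt, c_num_y, and c_num_z are assumed names mirroring c_num_x from the excerpts above, and the edge slopes are zero because the physical domain is fixed:

    /* Hypothetical driver (names c_nt, c_num_y, c_num_z are assumptions):
       sweep the interior of the grid for c_nt time steps, excluding a halo
       of c_distance ghost cells on each face for the 25-point stencil. */
    co_cilksimd(0, c_nt,
                c_distance, 0, c_num_x - c_distance, 0,   /* x extent */
                c_distance, 0, c_num_y - c_distance, 0,   /* y extent */
                c_distance, 0, c_num_z - c_distance, 0);  /* z extent */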

Performance Data:

Note: Modified Speedup shows the performance speedup with respect to the serial implementation.

Compiler (Intel® 64): Intel C++ Compiler 15.0 for Windows
Compiler options: /O2 /Oi /fp:fast /QxHost /Qipo /MD
System specifications: Windows 7 SP1* (x64); 4th Generation Intel® Core™ i5-4670T CPU @ 2.30GHz; 16GB memory
Modified Speedup: simd: 5.3x, cilk_spawn: 3.5x, Both: 10.8x

Compiler (Intel® 64): Intel C++ Compiler 15.0 for Linux
Compiler options: -O2 -fp-model fast -xHOST -ipo
System specifications: Linux 2.6.32-220.el6.x86_64; 4th Generation Intel® Core™ i5-4670T CPU @ 2.30GHz; 8GB memory
Modified Speedup: simd: 5.2x, cilk_spawn: 3.3x, Both: 11.7x
¹ Performance is measured in throughput: Mcells/sec means millions of grid cells processed per second, M-FAdd/s means millions of floating-point add operations per second, and M-FMul/s means millions of floating-point multiply operations per second.
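As a sketch of how such throughput numbers are computed (N_x, N_y, N_z are the grid dimensions and N_t the number of time steps; these symbols are generic, not the sample's exact configuration):

    \text{Mcells/sec} \;=\; \frac{N_x \cdot N_y \cdot N_z \cdot N_t}{10^{6} \cdot t_{\mathrm{elapsed}}}

M-FAdd/s and M-FMul/s follow by multiplying the cell rate by the number of floating-point adds or multiplies the 25-point stencil performs per cell.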

Build Instructions:

  • For Microsoft Visual Studio* 2010 and 2012 users:
  • Open the solution .sln file
    [Optional] To collect performance numbers (runs the example 5 times and reports the average time):
    • Project Properties -> C/C++ -> Preprocessor -> Preprocessor Definitions: add PERF_NUM
    Choose a configuration (for best performance, choose a release configuration):
    • Intel-debug and Intel-release: use the Intel® C++ compiler
    • VSC-debug and VSC-release: use the Visual C++ compiler (only the linear/scalar version will run)
  • For Windows* Command Line users:
  • Enable your particular compiler environment
    For Intel® C++ Compiler:
    • Open the appropriate Intel C++ compiler command prompt
    • Navigate to project folder
    • To compile: Build.bat [perf_num]
      • perf_num: collect performance numbers (runs the example 5 times and reports the average time)
    • To run: Build.bat run
    For Visual C++ Compiler (only the linear/scalar version will run):
    • Open the appropriate Microsoft Visual Studio* 2010 or 2012 command prompt
    • Navigate to project folder
    • To compile: Build.bat [perf_num]
      • perf_num: collect performance numbers (runs the example 5 times and reports the average time)
    • To run: Build.bat run
  • For Linux* or OS X* users:
  • Set the icc environment: source <icc-install-dir>/bin/compilervars.sh {ia32|intel64} (a complete example session follows this list)
    Navigate to project folder
    For Intel® C++ compiler:
    • To compile: make [icpc] [perf_num=1]
      • perf_num=1: collect performance numbers (runs the example 5 times and reports the average time)
    • To run: make run
    For gcc (only linear/scalar will run):
    • To compile: make gcc [perf_num=1]
      • perf_num=1: collect performance numbers (runs the example 5 times and reports the average time)
    • To run: make run
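For example, a typical Linux build-and-run session with the Intel® C++ compiler (the install path and project folder name are placeholders):

    source /opt/intel/bin/compilervars.sh intel64   # adjust to your install path
    cd rtm_stencil                                  # the sample's project folder
    make icpc perf_num=1                            # build with performance runs enabled
    make run                                        # run the sample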
Last Updated: Saturday, September 28, 2013
For more complete information about compiler optimizations, see our Optimization Notice.