Parallelism Patterns in Seismic Duck - Part 1 (Background)

This blog is the first of four about applying parallel patterns to a real program.  I'll point out some excellent parallel pattern documents on the Web and an unusual video game of mine.  This blog covers background material and the overall parallelization strategy.  Part 2 covers threading.  Part 3 covers vectorization and cache optimizations.  Part 4 covers bookkeeping details in the real code.

The game is Seismic Duck.  I recently rewrote it for Windows; it’s a complete rewrite of a game I originally wrote for Macs in the mid 1990s.  It's a freeware game about reflection seismology, which is imaging underground structures by sending sound waves into the ground and interpreting the echoes. The program teaches the topic by letting the user experiment interactively. It sounds technical, but even children "get it" after playing with it a while.  I’m fond of writing games based on peculiar subjects. For example, Frequon Invaders is a Fourier transform game.  Reflection seismology is a subject I worked on at Shell in the early 1990s.

This series of blogs is about how and why Seismic Duck was parallelized with Intel® TBB differently than a related demo in the TBB distribution. Seismic Duck runs three independent core computations:

    • Seismic wave propagation and rendering.

    • Gas/oil/water flow through a reservoir and rendering.

    • Seismogram rendering.



I run all three in parallel using tbb::parallel_invoke (sketched below), and vectorize the code where I can. At a high level, the program follows the Three Layer Cake pattern. (I'm one of the authors of that paper.) However, that level of parallelism was not enough to get a good animation speed on my machine.  The bottleneck was the wave propagation simulation, which is the focus of these blogs.
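
As a concrete illustration of the top-level structure, here is a minimal sketch of launching the three computations with tbb::parallel_invoke. The function names are placeholders of mine, not the game's actual code:

    #include <tbb/parallel_invoke.h>

    // Hypothetical per-frame updates; the real function names in Seismic Duck differ.
    void updateWavefield()  { /* advance and render the seismic wavefield */ }
    void updateReservoir()  { /* advance and render gas/oil/water flow */ }
    void updateSeismogram() { /* render the seismogram */ }

    // The three computations are independent, so one animation frame can run them concurrently.
    void advanceOneFrame() {
        tbb::parallel_invoke( updateWavefield, updateReservoir, updateSeismogram );
    }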

Some background on the numerical simulation of waves is necessary.  I used the popular "staggered grid" finite-difference time-domain (FDTD) method.  There are five 2D arrays, each representing a scalar field over a 2D grid.  Three of the arrays represent variables that step through time.  The other two represent rock properties.  The arrays are:

    • B[i][j]: A constant array with vertex-centered values. It’s constant in the sense that it changes only when the subsurface geology changes.

    • A[i][j]: A constant array with vertex-centered values related to rock properties. The FDTD method actually requires edge-centered values, but storing them would double the memory bandwidth for reading them, because for each interior grid point there is a horizontal edge and a vertical edge, i.e. twice as many edges as vertices. As shown later, memory bandwidth, not computation, is the bottleneck. So to reduce memory references, each edge-centered value is computed by averaging the two nearest vertex-centered values. The vertices actually store ½ their physical value, so that each edge value can be computed as the sum of its two endpoint values (see the sketch after this list).

    • U[i][j]: A variable array with vertex-centered values representing how much the rock is compressed at that point. It is variable in the sense that it changes for every time step.

    • Vx[i][j], Vy[i][j]: Two variable arrays that represent the x and y components of the point’s velocity. This should not be confused with the velocity of sound through the rock. These two grids are edge-centered: Vx[i][j] represents a physical value halfway between grid points [i][j] and [i][j+1], and Vy[i][j] represents a physical value halfway between grid points [i][j] and [i+1][j].
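
To make the storage scheme concrete, here is a minimal sketch of how the five grids might be declared. The names M and N and the fixed-size global arrays are my own placeholders, not the game's actual data structures:

    const int M = 531, N = 1520;   // grid dimensions quoted later in this post

    // Vertex-centered fields.
    float U[M][N];    // compression of the rock at each vertex
    float A[M][N];    // rock property, stored as 1/2 of its physical value
    float B[M][N];    // rock property

    // Edge-centered fields, staggered half a cell in x and y respectively.
    float Vx[M][N];   // lives halfway between vertices [i][j] and [i][j+1]
    float Vy[M][N];   // lives halfway between vertices [i][j] and [i+1][j]

    // Because A stores half its physical value, the edge-centered value between
    // two neighboring vertices is just the sum of the two vertex values, e.g.
    // A[i][j] + A[i][j+1] for the horizontal edge in row i.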



The picture below shows one square of the grid and how the array elements relate to the scalar fields. Notice how Vx and Vy are edge-centered.

Arrays Vx and Vy are staggered not only halfway in space but also halfway in time.  At the beginning of a time step, Vx and Vy are half a time step behind U.  The simulation advances Vx and Vy a full step, so that they become half a time step ahead of U. Then the simulation advances U a full time step. The algorithm is sometimes called “leap frog” because of how the arrays advance over each other in time.

The advantage of staggering and leap frogging is that it delivers results accurate to 2nd order for the cost of a 1st order approach.  The update operations are beautifully simple:

    forall i, j {
        Vx[i][j] += (A[i][j+1]+A[i][j])*(U[i][j+1]-U[i][j]);
        Vy[i][j] += (A[i+1][j]+A[i][j])*(U[i+1][j]-U[i][j]);
    }
    forall i, j {
        U[i][j] += B[i][j]*((Vx[i][j]-Vx[i][j-1]) + (Vy[i][j]-Vy[i-1][j]));
    }
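
For readers who prefer plain code to pseudocode, here is a serial sketch of one full time step over the placeholder arrays declared earlier. Boundary handling, including the perfectly matched layers mentioned below, is omitted:

    // One leap-frog time step: advance Vx and Vy a full step, then advance U.
    void serialTimeStep() {
        for( int i=1; i<M-1; ++i )
            for( int j=1; j<N-1; ++j ) {
                Vx[i][j] += (A[i][j+1]+A[i][j])*(U[i][j+1]-U[i][j]);
                Vy[i][j] += (A[i+1][j]+A[i][j])*(U[i+1][j]-U[i][j]);
            }
        // At this point Vx and Vy are half a time step ahead of U.
        for( int i=1; i<M-1; ++i )
            for( int j=1; j<N-1; ++j )
                U[i][j] += B[i][j]*((Vx[i][j]-Vx[i][j-1]) + (Vy[i][j]-Vy[i-1][j]));
    }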


The TBB demo "seismic" and Seismic Duck use two very different approaches to parallelizing these updates.  I wrote the original versions of both.  (Anton Malakhov has since made many improvements to my original TBB version.)  The reason for the different approaches is that they serve different purposes:

    • The TBB demo is supposed to show how to use a TBB feature ("parallel_for") and a basic parallel pattern. I kept it as simple as possible.

    • Seismic Duck is written for high performance, at the expense of increased complexity. It uses several parallel patterns. It also has more ambitious numerical modeling, notably the use of perfectly matched layers to reduce artificial reflections from the simulation boundaries. (See the URL for a description of this fascinating technique that involves using complex-valued materials that do not exist in the real universe, but can exist in software!)



The pattern behind the TBB demo is "Odd-Even Communication Group" in time and geometric decomposition in space. The code flip-flops between updating the pair of arrays Vx and Vy and updating array U. (Note to readers of that code: the variable names are S, T, and V instead of Vx, Vy, and U.) Each forall can be performed with a tbb::parallel_for loop, which provides the geometric decomposition; a sketch follows below. These patterns make a good introduction to parallel programming and require minimal changes for parallelization.
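
For concreteness, here is roughly how the first forall maps onto tbb::parallel_for over blocks of rows. This mirrors the structure of the TBB demo but is my own sketch, reusing the placeholder arrays from earlier:

    #include <tbb/parallel_for.h>
    #include <tbb/blocked_range.h>

    // Odd phase of one time step: update Vx and Vy in parallel over rows.
    // A second parallel_for (not shown) updates U afterwards; the two sweeps
    // cannot be fused, because U needs fully updated Vx and Vy.
    void parallelUpdateV() {
        tbb::parallel_for( tbb::blocked_range<int>(1, M-1),
            []( const tbb::blocked_range<int>& r ) {
                for( int i=r.begin(); i!=r.end(); ++i )
                    for( int j=1; j<N-1; ++j ) {
                        Vx[i][j] += (A[i][j+1]+A[i][j])*(U[i][j+1]-U[i][j]);
                        Vy[i][j] += (A[i+1][j]+A[i][j])*(U[i+1][j]-U[i][j]);
                    }
            } );
    }

The implicit join at the end of each parallel_for call is what separates the odd (Vx, Vy) phase from the even (U) phase.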

The big drawback of the Odd-Even pattern is memory bandwidth. When Seismic Duck is played on a typical 21” widescreen monitor, each grid has dimensions 531x1520. (You do not see the entire grid in the game; there is a hidden 16-pixel-thick border for the perfectly matched layers.) That works out to about 16 MByte for all 5 grids (531×1520 single-precision values per grid × 4 bytes × 5 grids ≈ 16.1 MByte). High-end servers have caches that large, but I’m targeting current desktop machines, which have caches in the 1-4 MByte range. Fortunately, a few consecutive rows of each array do fit in cache.

Thus, using the Odd-Even pattern, each grid point is loaded once from main memory per time step. Here is a breakdown of the work per grid point for each sweep (the annotated loops after the list show where these counts come from):

    • First sweep (update Vx and Vy):

        • 6 memory references (read A, U, Vx, Vy; write Vx, Vy).

        • 6 floating-point adds

        • 2 floating-point multiplications


    • Second sweep (update U):

        • 5 memory references (read B, Vx, Vy, U; write U)

        • 4 floating-point adds

        • 1 floating-point multiplication
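
These counts can be read directly off the update statements. Here are the two sweeps from the serial sketch again, annotated per grid point, assuming each array is streamed through cache once per sweep (so neighboring elements are cache hits rather than extra traffic, since a few consecutive rows fit in cache as noted above):

    void firstSweepAnnotated() {
        for( int i=1; i<M-1; ++i )
            for( int j=1; j<N-1; ++j ) {
                // 4 streamed reads (A, U, Vx, Vy) + 2 writes (Vx, Vy) = 6 memory references.
                // 4 adds/subtracts inside the parentheses + 2 for the "+=" = 6 additions.
                // 2 multiplications.
                Vx[i][j] += (A[i][j+1]+A[i][j])*(U[i][j+1]-U[i][j]);
                Vy[i][j] += (A[i+1][j]+A[i][j])*(U[i+1][j]-U[i][j]);
            }
    }

    void secondSweepAnnotated() {
        for( int i=1; i<M-1; ++i )
            for( int j=1; j<N-1; ++j ) {
                // 4 streamed reads (B, Vx, Vy, U) + 1 write (U) = 5 memory references.
                // 3 adds/subtracts inside the parentheses + 1 for the "+=" = 4 additions.
                // 1 multiplication.
                U[i][j] += B[i][j]*((Vx[i][j]-Vx[i][j-1]) + (Vy[i][j]-Vy[i-1][j]));
            }
    }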




The multiplications are insignificant because the hardware can overlap them with the additions. The key consideration is the (6+4) floating-point additions per (6+5) memory references. For brevity, from now on I'll call this ratio C (for Compute density). A ratio of C=(6+4)/(6+5)=10/11≈0.91 is a serious bottleneck. My Core-2 Duo system can deliver 8 floating-point additions per clock (two cores each using 4-wide SIMD). But the memory bandwidth limits my system to 2 single-precision memory references every 3 clocks. (I used the STREAM benchmark to determine this.) So C for the hardware is 8/(2/3) = 12.

To summarize, C=12 for the hardware but C≈0.91 for the Odd-Even code. Thus the Odd-Even version delivers only a small fraction of my machine’s theoretical peak floating-point performance. Phrased another way: forget the floating point; it's the memory references that matter. Another approach was needed. Part 2 shows how I raised C to 1.25 and multi-threaded the wave simulation using TBB. Part 3 describes how I raised C much higher and vectorized the code.
