# Parallel Patterns in Seismic Duck – Part 2 (Thread Parallelism)

By Arch D. Robison (Intel), published on May 9, 2010

This blog is part two of four about how I applied parallel patterns to Seismic Duck, the most exciting game you will ever find about reflection seismology. I admit it’s probably the only game about reflection seismology. Part 1 covered background material and overall parallelization strategy for the game. This part covers threading. Part 3 covers vectorization and cache optimizations. Part 4 covers bookkeeping details in the real code. The game is a bit peculiar, but the parallel patterns are generally useful things to know.

The big advantage of learning parallel patterns is twofold. First, you learn a way to solve a class of parallel programming problems. Second, you gain a concise way to describe the solution, because the pattern has a name. Like a standard recipe, you can point someone unfamiliar with the pattern to paper describing the pattern, instead of having to explain it from scratch each time.

Part 1 described the update calculations as arranged in the Odd-Even pattern:

```
forall i, j {
    Vx[i][j] += (A[i][j+1]+A[i][j])*(U[i][j+1]-U[i][j]);
    Vy[i][j] += (A[i+1][j]+A[i][j])*(U[i+1][j]-U[i][j]);
}

forall i, j {
    U[i][j] += B[i][j]*((Vx[i][j]-Vx[i][j-1]) + (Vy[i][j]-Vy[i-1][j]));
}
```

Part 1 described how this form is trivial to parallelize, but suffers from doing little computation per memory reference. It defined the ratio C (compute density) to be the number of floating-point additions per memory reference. Using the Odd-Even pattern for the wave simulation yields C = 10/11 ≈ 0.91; the hardware can deliver C = 12. In other words, memory bandwidth limits the floating-point unit to about 1/13th of its capability. (Whence the 13th? 12/(10/11) ≈ 13.2.)

Updating Vx, Vy, and U in a single sweep improves C. Here is a breakdown of the work per grid point for the single sweep method:

- 8 memory references (read A, B, U, Vx, Vy; write Vx, Vy, U)
- 10 floating-point additions
- 3 floating-point multiplications
Now C = 10/8 = 1.25. That's about a 37% improvement over the C ≈ 0.91 of the original code. Part 3 will show how to *triple* that value to reach C = 3.75, and then raise it even higher. However, the improvement in C complicates parallelism. The single-sweep version fuses the original loops to look like this:

```
for( i=1; i<m; ++i )
    for( j=1; j<n; ++j ) {
        Vx[i][j] += (A[i][j+1]+A[i][j])*(U[i][j+1]-U[i][j]);
        Vy[i][j] += (A[i+1][j]+A[i][j])*(U[i+1][j]-U[i][j]);
        U[i][j] += B[i][j]*((Vx[i][j]-Vx[i][j-1]) + (Vy[i][j]-Vy[i-1][j]));
    }
```

The fused code has a *sequential* loop nest, because it must preserve the following constraints:

- The update of Vx[i][j] must use the value of U[i][j+1] from the *previous* sweep.
- The update of Vy[i][j] must use the value of U[i+1][j] from the *previous* sweep.
- The update of U[i][j] must use the values of Vx[i][j-1] and Vy[i-1][j] from the *current* sweep.

Treating the grids as having map coordinates, a grid point must be updated *after* the points north and west of it are updated, but *before* the points south and east of it are updated.

One way to parallelize the loop nest is the Wavefront Pattern. In that pattern, execution sweeps diagonally from the northwest corner to the southeast corner. But that pattern has relatively high synchronization costs. Furthermore, in this context it has poor cache locality, because it tends to schedule adjacent grid points on different processors. Some of these inefficiencies can be ameliorated by aggregating grid points into chunks, but it is nonetheless less attractive than the alternative described next.

The alternative is geometric decomposition and the Ghost Cell Pattern. Here is a picture of the geometric decomposition in Seismic Duck:

The picture is approximately to scale. Part 3 explains the rationale for choosing such skinny chunks. (Hint: they are *not* skinny when you use the right measurement units.) Each chunk can be updated independently by a different thread, except around its border. To see the exception, consider the interaction of chunk 0 and chunk 1. Let i₀ be the index of the *last* row of chunk 0 and let i₁ be the index of the *first* row of chunk 1.

- The update of Vy[i₀][j] must use the value of U[i₁][j] from the *previous* sweep.
- The update of U[i₁][j] must use the value of Vy[i₀][j] from the *current* sweep.

The Ghost Cell Pattern enables the chunks to be updated in parallel. Each chunk becomes a separate grid with an extra row of grid points added above and below it. These extra rows replicate information from the neighboring chunks, so that each chunk has a copy of the grid points just beyond it. The copy lets the thread working on chunk 0 read the value of U[i₁][j] from the previous sweep even after the thread working on chunk 1 updates its own copy of U[i₁][j]. Likewise, the thread working on chunk 1 can update its copy of Vy[i₀][j] without waiting for the thread working on chunk 0 to update it.

The update logic ends up looking like this:

```
// Replicate borders
for( k=0; k<number_of_chunks-1; ++k ) {
    copy bottom border of chunk k to top of chunk k+1
    copy top border of chunk k+1 to bottom of chunk k
}

// Update chunks in parallel
forall k in 0..number_of_chunks-1 {
    for( i=start[k]; i<finish[k]; ++i )    // Loop over rows in chunk k
        for( j=1; j<n; ++j ) {
            Vx[i][j] += (A[i][j+1]+A[i][j])*(U[i][j+1]-U[i][j]);
            Vy[i][j] += (A[i+1][j]+A[i][j])*(U[i+1][j]-U[i][j]);
            U[i][j] += B[i][j]*((Vx[i][j]-Vx[i][j-1]) + (Vy[i][j]-Vy[i-1][j]));
        }
}
```

For more insight into this pattern, see the Ghost Cell Pattern paper. It’s a polished paper on an important pattern for grid-based simulations.

Part 3 of these blogs will cover the cache optimization patterns, which are critical because memory bandwidth, not computation, is the bottleneck.

## 4 comments

Arch D. Robison (Intel) said on Aug 7, 2010

I posted part 3. See first paragraph of part 1 or 2 for link to it. Part 3 got long enough that I deferred some details to a part 4 that was not in my original plan.

Arch D. Robison (Intel) said on Aug 3, 2010

I've started working on part 3. It may be a few weeks before it is ready.

vanroose said on Jul 7, 2010

Thanks for the post. I am waiting for part 3. Any chance of having this soon?

Srinivas.Kulkarni said on May 17, 2010

Pretty neat article... Thanks
