# Parallel Patterns in Seismic Duck – Part 2 (Thread Parallelism)

This blog is part two of four about how I applied parallel patterns to Seismic Duck, the most exciting game you will ever find about reflection seismology. I admit it’s probably the only game about reflection seismology. Part 1 covered background material and overall parallelization strategy for the game. This part covers threading. Part 3 covers vectorization and cache optimizations. Part 4 covers bookkeeping details in the real code. The game is a bit peculiar, but the parallel patterns are generally useful things to know.

The advantage of learning parallel patterns is twofold. First, you learn a way to solve a class of parallel programming problems. Second, you gain a concise way to describe the solution, because the pattern has a name. Like a standard recipe, you can point someone unfamiliar with the pattern to a paper describing it, instead of having to explain it from scratch each time.

Part 1 described the update calculations as arranged in the Odd-Even pattern:

```
forall i, j {
    Vx[i][j] += (A[i][j+1]+A[i][j])*(U[i][j+1]-U[i][j]);
    Vy[i][j] += (A[i+1][j]+A[i][j])*(U[i+1][j]-U[i][j]);
}
forall i, j {
    U[i][j] += B[i][j]*((Vx[i][j]-Vx[i][j-1]) + (Vy[i][j]-Vy[i-1][j]));
}
```

Part 1 described how this form is trivial to parallelize, but suffers from doing little computation per memory reference. It defined the ratio C (compute density) to be the number of floating-point additions per memory reference. Using the Odd-Even pattern for the wave simulation yields C=10/11≈0.91, whereas the hardware can deliver C=12. In other words, memory bandwidth limits the floating-point units to about 1/13th of their capability. [Whence the 13th? 12/(10/11)≈13.2]

Updating Vx, Vy, and U in a single sweep improves C. Here is a breakdown of the work per grid point for the single sweep method:

- Single sweep (update Vx, Vy, and U):

  - 8 memory references (read A, B, U, Vx, Vy; write Vx, Vy, U)

  - 10 floating-point additions (counting subtractions as additions)

  - 3 floating-point multiplications

Now C=10/8=1.25. That’s about a 37% improvement over the C=0.91 of the original code. Part 3 will show how to triple that value to C=3.75, and then raise it even higher. However, the improvement in C complicates parallelism. The single-sweep version fuses the original loops to look like this:

```
for( i=1; i<m; ++i )
    for( j=1; j<n; ++j ) {
        Vx[i][j] += (A[i][j+1]+A[i][j])*(U[i][j+1]-U[i][j]);
        Vy[i][j] += (A[i+1][j]+A[i][j])*(U[i+1][j]-U[i][j]);
        U[i][j] += B[i][j]*((Vx[i][j]-Vx[i][j-1]) + (Vy[i][j]-Vy[i-1][j]));
    }
```

The fused code has a sequential loop nest, because it must preserve the following constraints:

- The update of Vx[i][j] must use the value of U[i][j+1] from the previous sweep.

- The update of Vy[i][j] must use the value of U[i+1][j] from the previous sweep.

- The update of U[i][j] must use the values of Vx[i][j-1] and Vy[i-1][j] from the current sweep.

Treating the grids as having map coordinates, a grid point must be updated after the points north and west of it are updated, but before the points south and east of it are updated.

One way to parallelize the loop nest is the Wavefront Pattern. In that pattern, execution sweeps diagonally from the northwest corner to the southeast corner. But the pattern has relatively high synchronization costs. Furthermore, in this context it has poor cache locality, because it tends to schedule adjacent grid points on different processors. Some of these inefficiencies can be ameliorated by aggregating grid points into chunks, but the pattern is nonetheless less attractive than the following alternative.

The alternative is geometric decomposition and the Ghost Cell Pattern. Here is a picture of the geometric decomposition in Seismic Duck:

The picture is approximately to scale. Part 3 explains the rationale for choosing such skinny chunks. (Hint: they are not skinny when you use the right measurement units.) Each chunk can be updated independently by a different thread, except around its border. To see the exception, consider the interaction of chunk 0 and chunk 1. Let i0 be the index of the last row of chunk 0 and let i1 be the index of the first row of chunk 1.

1. The update of Vy[i0][j] must use the value of U[i1][j] from the previous sweep.

2. The update of U[i1][j] must use the value of Vy[i0][j] from the current sweep.

The Ghost Cell pattern enables the chunks to be updated in parallel. Each chunk becomes a separate grid with an extra row of grid points added above and below it. These extra rows replicate information from the neighboring chunks, so that each chunk has a copy of the grid points just beyond its border. The copies let the thread working on chunk 0 read the previous-sweep value of U[i1][j] even after the thread working on chunk 1 has updated the real U[i1][j]. Likewise, the thread working on chunk 1 can update its own copy of Vy[i0][j] without waiting for the thread working on chunk 0 to update the real one.

The update logic ends up looking like this:

```
// Replicate borders
for( k=0; k<number_of_chunks-1; ++k ) {
    copy bottom border of chunk k to top of chunk k+1
    copy top border of chunk k+1 to bottom of chunk k
}
// Update chunks in parallel
forall k in 0..number_of_chunks-1 {
    for( i=start[k]; i<finish[k]; ++i )     // Loop over rows in chunk k
        for( j=1; j<n; ++j ) {
            Vx[i][j] += (A[i][j+1]+A[i][j])*(U[i][j+1]-U[i][j]);
            Vy[i][j] += (A[i+1][j]+A[i][j])*(U[i+1][j]-U[i][j]);
            U[i][j] += B[i][j]*((Vx[i][j]-Vx[i][j-1]) + (Vy[i][j]-Vy[i-1][j]));
        }
}
```

For more insight into this pattern, see the Ghost Cell Pattern paper.  It’s a polished paper on an important pattern for grid-based simulations.

Part 3 of this series will cover the vectorization and cache optimization patterns, which are critical because memory bandwidth, not computation, is the bottleneck.
