Parallel Patterns in Seismic Duck – Part 3 (Vectorization and Tiling)

This blog is part three of three about how I applied parallel patterns to Seismic Duck. Part 1 covered background material and the overall parallelization strategy for the game. Part 2 covered threading. This part covers vectorization, tiling, and the wide halo.

Vectorization

I'll describe vectorization first.  The kernel of interest is:

        for( i=start[k]; i<finish[k]; ++i )     // Loop over rows in chunk k
            for( j=1; j<n; ++j ) {
                Vx[i][j] += (A[i][j+1]+A[i][j])*(U[i][j+1]-U[i][j]);
                Vy[i][j] += (A[i+1][j]+A[i][j])*(U[i+1][j]-U[i][j]);
                U[i][j] += B[i][j]*((Vx[i][j]-Vx[i][j-1]) + (Vy[i][j]-Vy[i-1][j]));
            }

A Core-2 core can generally operate on 4-element vectors just as fast as it can operate on scalars. Thus scalar code wastes 3/4 of the machine's capability.  [2015-Mar-23 update: Intel(R) AVX processors can likewise operate on 8-element vectors, in which case scalar code wastes 7/8 of the machine's capability.]  The kernel is trivial to vectorize.  Let X[i:4] denote a four-element SSE vector starting at subscript i.  In other words, X[i:4] is the quadruple (X[i], X[i+1], X[i+2], X[i+3]). To vectorize this kernel:

  • Replace the scalar operations with SSE vector operations.
  • Adjust the stride of the j loop index to loop over SSE vectors instead of scalars.

The vectorized code looks like this conceptually:

        for( i=start[k]; i<finish[k]; ++i )     // Loop over rows in chunk k
            for( j=1; j<n; j+=4 ) {
                Vx[i][j:4] += (A[i][j+1:4]+A[i][j:4])*(U[i][j+1:4]-U[i][j:4]);
                Vy[i][j:4] += (A[i+1][j:4]+A[i][j:4])*(U[i+1][j:4]-U[i][j:4]);
                U[i][j:4] += B[i][j:4]*((Vx[i][j:4]-Vx[i][j-1:4]) + (Vy[i][j:4]-Vy[i-1][j:4]));
            }

The vectorization is legal because, after the transformation, the updates of Vx and Vy still use the old values of U, and the update of U still uses the new values of Vx and Vy. Note that allowing multiple iterations to run concurrently using threading would not be safe, as explained in Part 2. The loop thus demonstrates the point that not all vectorizable loops can be multithreaded, at least not without further transformations such as the ghost cell pattern covered in Part 2.

The Intel 12.0 C++ compiler introduced support for the [j:4] array syntax, which is really convenient to use. Alas, I was developing Seismic Duck without that support, so I resorted to SSE intrinsics and macros. The macros look like this:

        #define CAST(x) (*(__m128*)&(x))      /* for aligned load or store */
        #define LOAD(x) _mm_loadu_ps(&(x))    /* for unaligned load */
        #define ADD _mm_add_ps
        #define MUL _mm_mul_ps
        #define SUB _mm_sub_ps


With these macros the loop body looks like:

        __m128 u = CAST(U[i][j]);    // U[i][j:4]
        __m128 a = CAST(A[i][j]);    // A[i][j:4]
        CAST(Vx[i][j]) = ADD(CAST(Vx[i][j]),MUL(ADD(LOAD(A[i][j+1]),a),SUB(LOAD(U[i][j+1]),u)));
        CAST(Vy[i][j]) = ADD(CAST(Vy[i][j]),MUL(ADD(CAST(A[i+1][j]),a),SUB(CAST(U[i+1][j]),u)));
        CAST(U[i][j]) = ADD(u,MUL(CAST(B[i][j]),ADD(SUB(CAST(Vx[i][j]),LOAD(Vx[i][j-1])),SUB(CAST(Vy[i][j]),CAST(Vy[i-1][j])))));


It's efficient, ugly, non-portable vector code.  I'm hoping that other compilers adopt the array syntax so I can write efficient, beautiful, portable vector code.

Tiling

Vectorization alone does not improve the performance of the kernel; it just slams the kernel harder against the "memory wall", that is, the limit on memory bandwidth.  Reducing consumption of memory bandwidth therefore becomes critical.  Parts 1 and 2 discussed the importance of the compute density C, defined as the number of floating-point additions per memory reference.  Now I'll describe how to raise the value of C multifold with tiling.

The kernel as shown so far updates the wavefield for a single time step. Now it's time to bring in more context. The kernel occurs inside a loop that steps over time, conceptually like this:

    for( t=0; t<T; ++t ) {  // T = number of time steps per video frame (typically 3)
        replicate borders
        for each chunk k in parallel {
            for( i=start[k]; i<finish[k]; ++i ) // Loop over rows in chunk k
                for( j=1; j<n; j+=4 ) {
                    Vx[i][j:4] += (A[i][j+1:4]+A[i][j:4])*(U[i][j+1:4]-U[i][j:4]);
                    Vy[i][j:4] += (A[i+1][j:4]+A[i][j:4])*(U[i+1][j:4]-U[i][j:4]);
                    U[i][j:4] += B[i][j:4]*((Vx[i][j:4]-Vx[i][j-1:4]) + (Vy[i][j:4]-Vy[i-1][j:4]));
                }
        }
    }


With the loop structure above, each iteration of the t loop has to reload Vx, Vy, and U from memory.  If the t loop could be changed from being the outermost loop to being the innermost loop, the reloads could be avoided, because each iteration would reuse values cached by the previous iteration. But naively permuting the loops delivers wrong answers, because updates would use incorrect values.

Ignore the "for each chunk" loop for now.  I'll come back to that in the next section.  For now, consider the other three loops. They are an abstract loop over points in space-time with coordinates (t,i,j).  The constraints are that point (t,i,j) must be updated:

  • after the update of point (t-1,i,j) (required to preserve dependencies across time)
  • after the updates of points (t,i-1,j) and (t,i,j-1) (required to update U correctly)
  • before the updates of points (t,i+1,j) and (t,i,j+1) (required to update Vx and Vy correctly)

The tiling will update the points in groups. Each group is a rectangular subset of points, called a tile.  A tile is updated using two nested loops as shown in the vectorized code, but i and j range over the tile instead of the entire spatial grid. An additional outer loop sequences the tiles.
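To make the tile update concrete, here is a minimal sketch of a routine that advances one tile by one time step. It is my own illustration rather than the game's code: the grid dimensions and arrays are placeholders, and the body is the scalar kernel for clarity, whereas the real code uses the SSE version shown above.

    enum { M = 512, N = 1024 };                    // illustrative grid dimensions
    float U[M][N], Vx[M][N], Vy[M][N], A[M][N], B[M][N];

    // Advance rows [iLo,iHi) and columns [jLo,jHi) by one time step.
    // Same kernel as before, but i and j range over the tile instead of the whole grid.
    void updateTile( int iLo, int iHi, int jLo, int jHi ) {
        for( int i=iLo; i<iHi; ++i )
            for( int j=jLo; j<jHi; ++j ) {
                Vx[i][j] += (A[i][j+1]+A[i][j])*(U[i][j+1]-U[i][j]);
                Vy[i][j] += (A[i+1][j]+A[i][j])*(U[i+1][j]-U[i][j]);
                U[i][j]  += B[i][j]*((Vx[i][j]-Vx[i][j-1]) + (Vy[i][j]-Vy[i-1][j]));
            }
    }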

Picture the kernel as a problem in tiling a floor. The floor runs along the i and j axes.  The t axis is perpendicular to the floor.  The floor must be tiled T levels deep for each video frame; T is typically 3 in Seismic Duck. Each tile has size 1×7×112.  That is, each tile is:

  • 1 unit thick along the t axis,
  • 7 units thick along the i axis, and
  • 112 units long along the j axis.

Why such a long, skinny tile, and where did the 112 come from? The answer is that the tiles are actually square when measured in cache lines. It's cache lines that count, because they are the quantum of information transferred between memory and the CPU. A cache line is typically 64 bytes on my intended target hardware, and I'm using single-precision floating-point, so that's 16 floats per line.  Thus 112 floats span 112/16 = 7 cache lines, and each tile is 7×7 cache lines, which works out to be about the right size.  I chose the size by a back-of-the-envelope calculation (based on the size of the L1 cache) and then did some experiments.
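Stated as code, the sizing arithmetic is just the following. The names are my own shorthand for the numbers above, not identifiers from the game.

    const int BytesPerCacheLine  = 64;                                // on my intended target hardware
    const int FloatsPerCacheLine = BytesPerCacheLine/sizeof(float);   // = 16
    const int TileI = 7;                                              // tile height in rows: 7 cache lines tall
    const int TileJ = 7*FloatsPerCacheLine;                           // = 112 floats: 7 cache lines wide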

Tiles can be cut into smaller tiles if necessary, particularly when working around the edges of the grid.  Part 4 will say more about this.  The previous constraints on the order of updating points apply similarly to tiles. A tile can be laid at level t only if:

  • all the area under it is tiled to level t-1,
  • its north and west sides will not be exposed,
  • its south and east sides will be exposed.

Another way to word the constraints is from the viewpoint of an ant crawling over the tiles:

  • No overhanging tiles allowed.
  • If an ant crawls north or west, it must never go down.
  • If an ant crawls south or east, it must never go up.

Below is a picture showing both legal and illegal partial tilings. The three levels of tiles are distinguished by different colors.


The red X marks tiles that violate one or more of the constraints.
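One way to make the constraints concrete is a predicate over a hypothetical "height map" that records how many layers (time steps) have been laid at each point. The sketch below is only a model for checking legality, with names of my own choosing, not code from the game; in practice the tiling sequence described next satisfies the constraints by construction.

    #include <vector>

    // height[i][j] = number of layers already laid at point (i,j).
    // A tile covering rows [i0,i1) and columns [j0,j1) may be laid at level t
    // only if the constraints above hold.  Smaller i is "north", smaller j is "west".
    bool canLayTile( const std::vector<std::vector<int>>& height,
                     int i0, int i1, int j0, int j1, int t ) {
        int m = (int)height.size(), n = (int)height[0].size();
        // All the area under the tile must be tiled to exactly level t-1 (no overhangs).
        for( int i=i0; i<i1; ++i )
            for( int j=j0; j<j1; ++j )
                if( height[i][j]!=t-1 ) return false;
        // North and west sides must not be exposed: those neighbors are already at level >= t.
        for( int j=j0; j<j1; ++j )
            if( i0>0 && height[i0-1][j]<t ) return false;
        for( int i=i0; i<i1; ++i )
            if( j0>0 && height[i][j0-1]<t ) return false;
        // South and east sides must be exposed: those neighbors are still below level t.
        for( int j=j0; j<j1; ++j )
            if( i1<m && height[i1][j]>=t ) return false;
        for( int i=i0; i<i1; ++i )
            if( j1<n && height[i][j1]>=t ) return false;
        return true;
    }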

There is one more thing to model: the cache hardware. Laying down a tile corresponds to performing a calculation on data that must be in cache.  Each tile is a 7x7 grid of cache lines.  Call a line hot if it is in cache, and cold if it is out of cache.  If a cache line is not touched, it cools until it becomes cold. A cache line can be reheated (brought back into cache), but that takes precious time. These effects add a constraint to the tiling exercise, because the computation of points in a tile depends on nearby points in space. When laying a tile, the area under and close to the new tile must be hot.

Laying a complete layer before doing the next layer is inefficient, because each time a tile is laid, the area under it has grown cold.  It's far better to lay a tile on top of a hot tile, ideally the one laid immediately earlier.  A good pattern is to lay each tile almost on top of the previous tile, but shifted north and west one unit to meet the aforementioned "ant constraints", until the stack of offset tiles is T levels deep. Then lay another stack immediately east of this stack. Once a row of stacks is completed, do the next row.

The pattern looks tricky around the edges of the floor, but there is an easy way to deal with that. Imagine that we do not care if a tile goes beyond the floor boundary, and that the overhang constraint does not apply beyond the floor. Then it's just a matter of starting the pattern beyond the floor and cutting off the portions of a tile that extend beyond the floor.
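Putting the pattern together, the sequencing for one chunk might look like the sketch below. The names, the one-grid-point shift per level, and the clipping details are my own reconstruction of the description above, not the game's actual code (Part 4 covers the real bookkeeping), and the sketch ignores the wide-halo and physical-boundary details discussed later. updateTile is the per-tile kernel sketched earlier; TileI and TileJ are the tile sizes from the sizing sketch above.

    #include <algorithm>

    const int T = 3;        // time steps per video frame
    const int TileI = 7;    // tile size along i, in rows (same values as the sizing sketch)
    const int TileJ = 112;  // tile size along j, in floats (7 cache lines)

    void updateTile( int iLo, int iHi, int jLo, int jHi );  // per-tile kernel sketched earlier

    // Lay stacks of T offset tiles over rows [rowBegin,rowEnd) and columns
    // [colBegin,colEnd) of one chunk.  Each level is shifted one unit north
    // and west of the level below it, and any part of a tile that falls
    // outside the region is simply cut off.
    void tileChunk( int rowBegin, int rowEnd, int colBegin, int colEnd ) {
        for( int i0=rowBegin; i0<rowEnd+T; i0+=TileI )          // rows of stacks
            for( int j0=colBegin; j0<colEnd+T; j0+=TileJ )      // stacks within a row
                for( int t=0; t<T; ++t ) {                      // levels within a stack
                    int iLo = std::max( i0-t, rowBegin ), iHi = std::min( i0-t+TileI, rowEnd );
                    int jLo = std::max( j0-t, colBegin ), jHi = std::min( j0-t+TileJ, colEnd );
                    if( iLo<iHi && jLo<jHi )
                        updateTile( iLo, iHi, jLo, jHi );       // advance the clipped tile one time step
                }
    }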

Below is an animation of the tiling sequence for a small 23x16 floor with 7x7 tiles.

Note: I drew the animation by hand in GIMP, frame by frame. I now greatly appreciate the effort artists put into hand-drawn animations!

With the typical depth of 3, the tiling pattern almost triples the compute density.  I say "almost" because I may not be getting perfect reuse around the edges, but it's close enough.   Some programs sequence the stacks more elaborately, such as zig-zagging diagonally or recursively, in order to exploit multiple levels of cache. The simple pattern I used works well for Seismic Duck because the outer-level cache (e.g. the L2 cache on my system) is big enough to hold the cache lines that are reused between rows of stacks.

The final loop structure looks like:

    for each video frame do {
        replicate chunk borders widely // See next section
        for each chunk k in parallel {
            for each tile z in chunk k, in tiling sequence order, do
                for each i in z do
                    for each j in z do {
                        Vx[i][j:4] += (A[i][j+1:4]+A[i][j:4])*(U[i][j+1:4]-U[i][j:4]);
                        Vy[i][j:4] += (A[i+1][j:4]+A[i][j:4])*(U[i+1][j:4]-U[i][j:4]);
                        U[i][j:4] += B[i][j:4]*((Vx[i][j:4]-Vx[i][j-1:4]) + (Vy[i][j:4]-Vy[i-1][j:4]));
                    }
        }
    }

Part 4 will say more about implementing the tiling sequence.

Wide Halo

Note that I added the word "widely" to the replication step. Because the tiling sequence advances a chunk T time steps instead of one, it is not enough to copy just the immediate border points. Instead, a T-point-wide border has to be copied, and the space to be tiled is not a rectangular prism in (t,i,j) space, but a frustum with a rectangular base in the (0,*,*) plane and walls that slope 45 degrees. The top surface of the frustum lies in the (T-1,*,*) plane, and that top surface is the chunk's contribution to the final computation. The bases of neighboring chunks' frustums overlap by a region T points wide, and the overlapping portions correspond to redundant calculations.  See the section titled "Wide Halo" of the Ghost Cell Pattern paper for more on this technique.
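To make the replication step concrete, here is a minimal sketch under assumptions of my own (the game's actual data layout may differ): each chunk keeps a private copy of the three fields, indexed by global row number, covering the rows it owns plus T ghost rows on each side, with rows stored contiguously. Before each video frame, the T boundary rows of each field are copied from the neighbor that owns them.

    #include <cstring>

    enum { N = 1024 };   // columns per row (illustrative value)
    const int T = 3;     // time steps per video frame, and halo width

    // Copy 'count' contiguous rows of one field from src to dst.
    void copyRows( float (*dst)[N], int dstRow, float (*src)[N], int srcRow, int count ) {
        std::memcpy( dst[dstRow], src[srcRow], count*N*sizeof(float) );
    }

    // Refresh the wide halo between an upper chunk owning rows [..,upperEnd)
    // and a lower chunk owning rows [lowerBegin,..), for one field.
    // Call it once per video frame for each of U, Vx, and Vy.
    void replicateWideBorder( float (*upper)[N], int upperEnd,
                              float (*lower)[N], int lowerBegin ) {
        copyRows( lower, lowerBegin-T, upper, upperEnd-T, T );  // upper's last T owned rows -> lower's ghost rows
        copyRows( upper, upperEnd,     lower, lowerBegin, T );  // lower's first T owned rows -> upper's ghost rows
    }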

Summary of Vectorization and Cache Optimizations

Vectorization can quadruple or octuple the calculations per instruction, making memory bandwidth the bottleneck. Tiling in space-time reduces consumption of memory bandwidth. Doing so requires the wide halo pattern.

The tiling involves some complex bookkeeping.  Boundary condition physics makes it even more complicated.  Part 4 will describe how I dealt with the bookkeeping without going mad.
