The Chronicles of Phi - part 4 - Hyper-Thread Phalanx – tiled_HT2

The prior part (3) of this blog showed the effects of the first-level implementation of the Hyper-Thread Phalanx. The change in programming yielded a 9.7% improvement in performance for the small model, and little to no improvement for the large model. Part 3 closed with two questions:

What is non-optimal about this strategy?
And: What can be improved?

There are two things to improve: one is obvious, the other not so obvious.

Data alignment

The obvious thing, which I will now show you, is that vectorization improves with aligned data. An optimizing compiler will examine the loop and, when necessary, insert preamble code that tests for alignment and executes scalar iterations until an aligned boundary is reached (this is called peeling), then emit code that runs more efficiently on the now-aligned data. Finally, post-amble code is inserted to complete any remainder that may be present.
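To make the peel/body/remainder structure concrete, here is a hand-written sketch of what the compiler effectively generates for a simple loop. This is illustrative only; the function name and the 16-floats-per-cache-line assumption are mine, not part of the diffusion code:

```c
#include <stdint.h>
#include <stddef.h>

// Illustrative only: hand-written peel/body/remainder structure for a
// trivially vectorizable loop over floats (assumes a 64-byte cache line,
// i.e. 16 floats). A vectorizing compiler generates the equivalent of
// this automatically.
void scale(float *a, size_t n, float s) {
    size_t i = 0;

    // Peel (preamble): scalar iterations until &a[i] is 64-byte aligned.
    while (i < n && ((uintptr_t)&a[i] % 64) != 0) {
        a[i] *= s;
        ++i;
    }

    // Body: the aligned region, a whole number of 16-float vectors.
    // This loop can use full-width aligned vector loads and stores.
    size_t body_end = i + ((n - i) / 16) * 16;
    for (; i < body_end; ++i)
        a[i] *= s;

    // Remainder (post-amble): scalar tail.
    for (; i < n; ++i)
        a[i] *= s;
}
```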

This sounds rather straightforward until you look at the inner loop:

#pragma simd
          for (x = 1; x < nx-1; x++) {
            ++c;
            ++n;
            ++s;
            ++b;
            ++t;
            f2_t[c] = cc * f1_t[c] + cw * f1_t[c-1] + ce * f1_t[c+1]
                + cs * f1_t[s] + cn * f1_t[n] + cb * f1_t[b] + ct * f1_t[t];
          }

In the preceding code, the two terms f1_t[c-1] and f1_t[c+1] will muck up the vector alignment tests, since [c-1], [c], and [c+1] can never all be aligned at the same time.
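The claim is easy to verify: consecutive floats are only 4 bytes apart, so at most one of any three adjacent elements can land on a 64-byte boundary. A quick sanity check (the helper name is mine):

```c
#include <stdint.h>

// At most one of a[c-1], a[c], a[c+1] can be 64-byte aligned, because
// adjacent floats are only sizeof(float) = 4 bytes apart.
int count_aligned64(const float *a, long c) {
    int k = 0;
    if ((uintptr_t)&a[c - 1] % 64 == 0) ++k;
    if ((uintptr_t)&a[c]     % 64 == 0) ++k;
    if ((uintptr_t)&a[c + 1] % 64 == 0) ++k;
    return k;
}
```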

Are the compiler writers smart enough to offer some measure of optimization for such a loop?

As it turns out, they are, though at a cost.

Because of the initial unknowns, the generated code has to do more work in the preamble and post-amble sections, and fewer iterations execute in the fastest interior loop body.

Take particular note that the input array f1_t is indexed in seven different ways. This means that the preamble code that determines alignment may have to work through permutations of the seven references in an attempt to find the point where the largest number of them are vector aligned. This is non-trivial for compiler code generation, and a potential source of additional overhead.

What can be improved?

Use aligned data when possible

This is addressed in an additional improvement to the coding of the tiled_HT2 program.

First, we require that the dimension NX be a multiple of the number of REALs that fill a cache line. This is not an unreasonable requirement: the original example code used NX = 256, so requiring that NX be a multiple of 16 for floats (or 8 for doubles) is not much of a restriction.
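That restriction can be captured at compile time. A sketch, with REAL as float for illustration; the constant names mirror those used in the allocation code later in this post:

```c
// Sketch: express the "NX is a whole number of cache lines" requirement
// as a compile-time check. REAL is float here for illustration; the
// names mirror the constants used in the allocation code.
typedef float REAL;

#define CACHE_LINE_SIZE 64
#define N_REALS_PER_CACHE_LINE (CACHE_LINE_SIZE / (int)sizeof(REAL))
#define NX 256

// Fails the build if NX is not a multiple of REALs per cache line.
_Static_assert(NX % N_REALS_PER_CACHE_LINE == 0,
               "NX must be a multiple of the REALs in one cache line");
```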

To assure alignment, I changed the malloc calls that allocate the arrays to use _mm_malloc with an alignment of the cache line size (64 bytes). This is a relatively simple change. (It is shown later, after the next optimization tip, which also affects allocation.)

Next, knowing that NX is an even multiple of cache lines and that the arrays are cache-line aligned, I can construct a function to process the innermost loop with the foreknowledge that six of the eight array references are cache-line aligned and two are not (the eighth reference is the output array). The two unaligned references are [c-1] and [c+1]. Since the compiler knows beforehand what is and is not aligned, it does not have to insert code to make this determination at run time; that is, it can reduce or completely remove the preamble and post-amble code.

The second (non-obvious) improvement:

Redundancy can be good for you

Additional optimization can be achieved by redundantly processing x=0 and x=nx-1 as if these cells were in the interior of the parallel pipe being processed. This means the preamble and post-amble code for unaligned loops can be bypassed, and the elements x=0:15 can be processed directly as one aligned vector (as opposed to one-by-one computation or unaligned vector computation). The same is done for the last 16 elements, where x=nx-1 computes differently from the other elements of the vector. This does mean that after calculating the incorrect values (essentially for free) for x=0 and x=nx-1, we must then perform a scalar calculation to insert the correct values into those cells. In effect, you exchange two scalar loops of 16 iterations each for two pairs of (one 16-wide vector operation + one scalar operation), where the scalar operations hit L1 cache.

The redundancy change necessitated allocating the arrays two vectors' worth of elements larger than the actual requirement, and returning the address of the second vector as the array pointer. Additionally, it requires zeroing the one element immediately preceding and the one immediately following the working array. The padding then provides one vector of addressable memory on each end (not used as valid data). Without it, the redundant reads could touch an unmapped page and fault, depending on the location and extent of the allocation.

Change to allocations:

  // align the allocations to cache line
  // increase allocation size by 2 cache lines
  REAL *f1_padded = (REAL *)_mm_malloc(
    sizeof(REAL)*(nx*ny*nz + N_REALS_PER_CACHE_LINE*2),
    CACHE_LINE_SIZE);

  // assure allocation succeeded
  assert(f1_padded != NULL);
  
  // advance one cache line into buffer
  REAL *f1 = f1_padded + N_REALS_PER_CACHE_LINE;
  
  f1[-1] = 0.0;       // assure cell prior to array not Signaling NaN
  f1[nx*ny*nz] = 0.0; // assure cell following array not Signaling NaN

  // align the allocations to cache line
  // increase allocation size by 2 cache lines
  REAL *f2_padded = (REAL *)_mm_malloc(
    sizeof(REAL)*(nx*ny*nz + N_REALS_PER_CACHE_LINE*2),
    CACHE_LINE_SIZE);

  // assure allocation succeeded
  assert(f2_padded != NULL);
  
  // advance one cache line into buffer
  REAL *f2 = f2_padded + N_REALS_PER_CACHE_LINE;
  
  f2[-1] = 0.0;       // assure cell prior to array not Signaling NaN
  f2[nx*ny*nz] = 0.0; // assure cell following array not Signaling NaN

As an additional benefit, the compiler can now generate more code using Fused Multiply-Add (FMA) instructions.

The tiled_HT2 code follows:

void diffusion_tiled_aligned(
                REAL*restrict f2_t_c, // aligned
                REAL*restrict f1_t_c, // aligned
                REAL*restrict f1_t_w, // not aligned
                REAL*restrict f1_t_e, // not aligned
                REAL*restrict f1_t_s, // aligned
                REAL*restrict f1_t_n, // aligned
                REAL*restrict f1_t_b, // aligned
                REAL*restrict f1_t_t, // aligned
                REAL ce, REAL cw, REAL cn, REAL cs, REAL ct,
                REAL cb, REAL cc, int countX, int countY) {

  __assume_aligned(f2_t_c, CACHE_LINE_SIZE);
  __assume_aligned(f1_t_c, CACHE_LINE_SIZE);
  __assume_aligned(f1_t_s, CACHE_LINE_SIZE);
  __assume_aligned(f1_t_n, CACHE_LINE_SIZE);
  __assume_aligned(f1_t_b, CACHE_LINE_SIZE);
  __assume_aligned(f1_t_t, CACHE_LINE_SIZE);
  // countY is number of squads along Y axis
  for(int iY = 0; iY < countY; ++iY) {
    // perform the x=0:N_REALS_PER_CACHE_LINE-1 as one cache line operation
    // On Phi, the following reduces to vector with one iteration
    // On AVX two iterations
    // On SSE four iterations
    #pragma noprefetch
    #pragma simd  
    for (int i = 0; i < N_REALS_PER_CACHE_LINE; i++) {
      f2_t_c[i] = cc * f1_t_c[i] + cw * f1_t_w[i] + ce * f1_t_e[i]
                   + cs * f1_t_s[i] + cn * f1_t_n[i] + cb * f1_t_b[i] + ct * f1_t_t[i];
    } // for (int i = 0; i < N_REALS_PER_CACHE_LINE; i++)
    
    // now overstrike x=0 with correct value
    // x=0 special (no f1_t[c-1])
    f2_t_c[0] = cc * f1_t_c[0] + cw * f1_t_w[1] + ce * f1_t_e[0]
                + cs * f1_t_s[0] + cn * f1_t_n[0] + cb * f1_t_b[0] + ct * f1_t_t[0];
    // Note, while we could overstrike x=[0] and [nx-1] after processing the entire depth of nx
    // doing so will result in the x=0th cell being evicted from L1 cache.

    // do remainder of countX run including incorrect value for i=nx-1 (countX-1)
    #pragma vector nontemporal
    #pragma noprefetch
    #pragma simd  
    for (int i = N_REALS_PER_CACHE_LINE; i < countX; i++) {
        f2_t_c[i] = cc * f1_t_c[i] + cw * f1_t_w[i] + ce * f1_t_e[i]
                 + cs * f1_t_s[i] + cn * f1_t_n[i] + cb * f1_t_b[i] + ct * f1_t_t[i];
    } // for (int i = N_REALS_PER_CACHE_LINE; i < countX; i++)

    // now overstrike x=nx-1 with correct value
    // x=nx-1 special (no f1_t[c+1])
    int i = countX-1;
    f2_t_c[i] = cc * f1_t_c[i] + cw * f1_t_w[i] + ce * f1_t_e[i-1]
                   + cs * f1_t_s[i] + cn * f1_t_n[i] + cb * f1_t_b[i] + ct * f1_t_t[i];

    // advance one step along Y
    f2_t_c += countX;
    f1_t_c += countX;
    f1_t_w += countX;
    f1_t_e += countX;
    f1_t_s += countX;
    f1_t_n += countX;
    f1_t_b += countX;
    f1_t_t += countX;
  } // for(int iY = 0; iY < countY; ++iY)
} // void diffusion_tiled_aligned(

void diffusion_tiled(REAL *restrict f1, REAL *restrict f2, int nx, int ny, int nz,
              REAL ce, REAL cw, REAL cn, REAL cs, REAL ct,
              REAL cb, REAL cc, REAL dt, int count) {
              REAL ce, REAL cw, REAL cn, REAL cs, REAL ct,
              REAL cb, REAL cc, REAL dt, int count) {

#pragma omp parallel
  {

    REAL *f1_t = f1;
    REAL *f2_t = f2;

    int nSquadsZ = (nz + nHTs - 1) / nHTs; // place squads across z dimension
    int nSquadsZY = nSquadsZ * ny;  // number of full (and partial) squads on z-y face
    int nSquadsZYPerCore = (nSquadsZY + nCores - 1) / nCores;

    // Determine this thread's squads
    int SquadBegin = nSquadsZYPerCore * myCore;
    int SquadEnd = SquadBegin + nSquadsZYPerCore; // 1 after last squad for core
    if(SquadEnd > nSquadsZY) SquadEnd = nSquadsZY;
    for (int i = 0; i < count; ++i) {
      int nSquads;
      // restrict current thread to its subset of squads on the Z/Y face.
      for(int iSquad = SquadBegin; iSquad < SquadEnd; iSquad += nSquads) {
        // determine nSquads for this pass
        if(iSquad % ny == 0)
          nSquads = 1; // at y==0 boundary
        else
        if(iSquad % ny == ny - 1)
          nSquads = 1;  // at y==ny-1 boundary
        else
        if(iSquad / ny == (SquadEnd - 1) / ny)
          nSquads = SquadEnd - iSquad;  // within (inclusive) 1:ny-1
        else
          nSquads = ny - (iSquad % ny) - 1; // restrict from iSquad%ny to ny-1
        int z0 = (iSquad / ny) * nHTs; // home z for 0'th team member of Squad
        int z = z0 + myHT;  // z for this team member
        int y = iSquad % ny;
        // last squad along z may be partially filled
        // assure we are within z
        if(z < nz)
        {
          int x = 0;
          int c, n, s, b, t;
          c =  x + y * nx + z * nx * ny;
          n = (y == 0)    ? c : c - nx;
          s = (y == ny-1) ? c : c + nx;
          b = (z == 0)    ? c : c - nx * ny;
          t = (z == nz-1) ? c : c + nx * ny;
          diffusion_tiled_aligned(
   &f2_t[c], // aligned
   &f1_t[c], // aligned
   &f1_t[c-1], // unaligned
   &f1_t[c+1], // unaligned
   &f1_t[s], // aligned
   &f1_t[n], // aligned
   &f1_t[b], // aligned
   &f1_t[t], // aligned
                        ce, cw, cn, cs, ct, cb, cc, nx, nSquads);
        } // if(z < nz)
      } // for(int iSquad = SquadBegin; iSquad < SquadEnd; iSquad += nSquads)
// barrier required because we removed implicit barrier of #pragma omp for collapse(2)
      #pragma omp barrier
      // swap buffer pointers
      REAL *t = f1_t;
      f1_t = f2_t;
      f2_t = t;
    } // count
  } // parallel
  return;
}
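The squad-partitioning arithmetic at the top of the parallel region is worth checking in isolation. Below is a standalone sketch with example sizes (the sizes and the helper function are mine; in the real code nHTs, nCores, and myCore come from the thread-team setup of part 3) that verifies the per-core ranges tile the z-y face exactly:

```c
// Standalone sketch of the squad partitioning in diffusion_tiled.
// Example sizes are mine; in the real code nHTs/nCores/myCore come from
// the thread-team setup described in part 3 of this series.
enum { NZ = 256, NY = 256, NHTS = 4, NCORES = 60 };

// Compute the [begin, end) squad range owned by one core.
void squad_range(int myCore, int *begin, int *end) {
    int nSquadsZ  = (NZ + NHTS - 1) / NHTS;   // squads along z
    int nSquadsZY = nSquadsZ * NY;            // squads on the z-y face
    int perCore   = (nSquadsZY + NCORES - 1) / NCORES;
    *begin = perCore * myCore;
    *end   = *begin + perCore;
    if (*end > nSquadsZY) *end = nSquadsZY;
    if (*begin > nSquadsZY) *begin = nSquadsZY; // defensive: idle core
}
```

Walking all cores shows the ranges are contiguous and cover every squad exactly once, which is why no squad is skipped or processed twice.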

The performance chart below incorporates the two new programs tiled_HT and tiled_HT2.

The above chart clearly illustrates that tiled_HT2 is starting to make some real progress, at least for the small model, with another 9.5% improvement. Be mindful that code alignment may still be an issue, and the above chart does not take this into consideration.

What else can be improved?

Think about it while you await part 5.

Jim Dempsey
Consultant
QuickThread Programming, LLC


For more complete information about compiler optimizations, see our Optimization Notice.