The Chronicles of Phi - part 1 The Hyper-Thread Phalanx

Recently I have been blessed with the opportunity to have a workstation with dual Intel® Xeon Phi™ 5110P coprocessors.  One of the design goals of the Xeon Phi™ was to provide a programming environment that is nearly the same as that on the host system. The Intel engineers based the design on an extension of their existing architecture and did a marvelous job of integrating this into a package complete with drivers and language extensions.

As much as there is a desire to make programming the coprocessor the same as programming the host, the practical reality is that it is a different “beastie” with respect to code optimization. Several factors play into optimization, principal among them:

o In-order core execution
o Distributed L1/L2 caches
o Large number of memory banks
o Substantial number of pending hardware prefetches
o Four-wide Hyper-Threading

Other than a “look-see” at two past IDF programming labs, I have to admit that I have had no prior programming experience with the Xeon Phi™. Ordinarily this lack of experience would be a disadvantage for someone writing a technical article. However, the intent of this article is to present a learning experience. The journey is more important to the programmer than the destination, for the journey provides a means to the end, the end being a better program.

In search of a relatively simple but functional application, I chose to borrow someone else’s program that is representative of real work: an example from chapter 4 of Intel® Xeon Phi™ Coprocessor High-Performance Programming, by Jim Jeffers and James Reinders (Morgan Kaufmann). You can find it at book outlets.

Since the hardware on my system is slightly different from that in the book, I reran the test programs to get my own baseline results.  My host system has one Xeon E5-2620 in a motherboard with the x79 chipset. Two Xeon Phi™ 5110P coprocessors are installed; one was used for the test. The book's Xeon Phi™ coprocessor was a pre-production version that had 61 cores, whereas 5110P’s have 60 cores each. Therefore the tests run will use slightly different thread and core counts from those used in the examples in the book.

This article will not duplicate the content of chapter 4 of the book; rather I take the end of chapter 4 as a starting point.

The program under investigation simulates the diffusion of a solute through a volume of liquid over time within a 3D container such as a cube. The original sample code is copyright © Naoya Maruyama; the full copyright can be found both in the aforementioned book and in the sample code. This blog is written with the assumption that you have a Xeon Phi™ coprocessor and have already downloaded the sample programs.

Starting from the end of the chapter with the results data:

This chart is not a cut and paste from the book. Instead, it is derived from running the sample programs configured for my system (60 cores not 61 cores).  The chart data represents the average of three runs of the program. This was done to smooth out any adverse interaction caused by the system O/S.

Note: the numbers represent the ratio of the Mega-Flops of the book's various threaded programs to the Mega-Flops of a single-thread run of the base program. This is not a scaling chart, because we are comparing performance as we tune the implementation, not as we change the number of cores and/or threads.

o base is the single-thread version of the program
o omp is the simplified conversion to a parallel program
o ompvect adds simd vectorization directives to the inner loop
o peel removes unneeded code from the inner loop
o tiled partitions the work to improve the L1 and L2 cache hit ratios

The chart above illustrates the typical progress of the optimization process:

o Start with a functioning algorithm
o Introduce parallel code
o Improve vectorization (interchangeable with adding parallel code)
o Remove unnecessary code (improve efficiency)
o Improve cache hit ratios

After you finish the above optimization process there is usually not much left to optimize. That is, unless you’ve overlooked something important.

N.B. In the book, the results for “tiled” showed approximately 426x the performance of the baseline program using 61 cores and 183 threads. However, the book had a typo indicating a 435x improvement; the reported numbers, 106752.742 MFlops vs. 250.763 MFlops, yield 425.71x. Setting the typo aside, my system performed better, showing a 455.52x improvement of tiled over base.

At the end of the chapter, the authors suggest that additional performance could be achieved with additional effort. That additional effort will be the subject of this blog.

Prior to listing the code changes, I think it is important to emphasize that, given the nature of the Xeon Phi™, you would not use it to run a single-threaded application. Therefore a better frame of reference is a chart that illustrates the speedup against the simple parallel OpenMP implementation (omp on the charts).

The above chart clearly illustrates the benefits you get from following the advice in chapter 4 of Intel® Xeon Phi™ Coprocessor High-Performance Programming. Now let’s get down to business to see where we can go from here.

First, let me state that minor edits to the original code were made, none of which affects the intent of the code. The changes were made to make it easier to run the tests.

The first change was to permit an optional -DNX=nnn on the C++ compiler command line to define the NX macro. This macro specifies the dimensions of a cube (nx=NX, ny=NX, nz=NX) that sets the problem size. When NX is not defined on the compiler command line, the original value of 256 is used. This permits the Makefile to build a “small” model (256x256x256) and a large model (512x512x512) without having to edit the sources.

The compute section of the book's tiled code is as follows:

#pragma omp parallel 
  { 
    REAL *f1_t = f1; 
    REAL *f2_t = f2; 
    int mythread; 
    for (int i = 0; i < count; ++i) { 
#define YBF 16 
#pragma omp for collapse(2) 
      for (int yy = 0; yy < ny; yy += YBF) { 
      for (int z = 0; z < nz; z++) { 
        int ymax = yy + YBF; 
        if (ymax >= ny) ymax = ny; 
        for (int y = yy; y < ymax; y++) { 
          int x; 
          int c, n, s, b, t; 
          x = 0; 
          c =  x + y * NXP + z * NXP * ny; 
          n = (y == 0)    ? c : c - NXP; 
          s = (y == ny-1) ? c : c + NXP; 
          b = (z == 0)    ? c : c - NXP * ny; 
          t = (z == nz-1) ? c : c + NXP * ny; 
          // peel x = 0 (x-1 clamps to x) 
          f2_t[c] = cc * f1_t[c] + cw * f1_t[c] + ce * f1_t[c+1] 
              + cs * f1_t[s] + cn * f1_t[n] + cb * f1_t[b] + ct * f1_t[t]; 
#pragma simd   
          for (x = 1; x < nx-1; x++) { 
            ++c; ++n; ++s; ++b; ++t; 
            f2_t[c] = cc * f1_t[c] + cw * f1_t[c-1] + ce * f1_t[c+1] 
                + cs * f1_t[s] + cn * f1_t[n] + cb * f1_t[b] + ct * f1_t[t]; 
          } 
          // peel x = nx-1 (x+1 clamps to x) 
          ++c; ++n; ++s; ++b; ++t; 
          f2_t[c] = cc * f1_t[c] + cw * f1_t[c-1] + ce * f1_t[c] 
              + cs * f1_t[s] + cn * f1_t[n] + cb * f1_t[b] + ct * f1_t[t]; 
        } // tile ny 
      } // tile nz 
      } // block ny 
      REAL *t = f1_t; 
      f1_t = f2_t; 
      f2_t = t; 
    } // count 
  } // parallel 

What can be done to improve the above code?

The compute-intensive loop contains a single floating-point expression plus a handful of integer increments. The compiler optimization -O3 is going to unroll this loop, reduce the number of (x < nx-1) tests, and reduce the number of ++ operations through the use of offsets. More importantly, the compiler will vectorize this loop, and the strength of the Xeon Phi™ is in its wide vector unit.

This loop handles nx-2 indices of the x dimension (254 in the small-problem case), whereas the expressions preceding and following the loop handle the remaining two indices (x=0 and x=nx-1). With NX=256, the inner loop accounts for 254/256 = 99.22% of the work.

The compiler does a good job of deciding where and when to insert prefetches; you can look at the disassembly code if you wish. I experimented with different #pragma prefetch options, all of which produced lower performance than the compiler-generated prefetches. Curiously, I also noted that for this example #pragma noprefetch produced marginally faster code (<1% faster). This is likely because the loop is simple enough that the hardware prefetcher performs the prefetching for you, and each unnecessary prefetch instruction eliminated from the loop gains back at least one clock cycle.

The authors of the book found the best performance was achieved using KMP_AFFINITY=scatter and OMP_NUM_THREADS=183 (three threads from each core). On my machine, with one fewer core, 180 threads were used. The tiling (YBF=16) was used to improve cache hit probability, and we assume the authors of the book tuned this factor to yield best performance.

I ask again:

What can be done to improve the above code?

Well, there are some problems with the code. These problems are not obvious when reading the code and relating it to the computational characteristics of the Intel® Xeon Phi™: many in-order cores, each with four hardware threads. The book’s authors found that using three of the four hardware threads per core gave optimal performance (for this program) with the tiled algorithm. My test runs produced comparable numbers.

The first optimization oversight made by the authors of the book – The Hyper-Thread Phalanx

The term phalanx is derived from a military formation used by the ancient Greeks and Romans. The formation generally involved soldiers lining up shoulder to shoulder, shield to shield multiple rows deep. The formation would advance in unison becoming “an irresistible force”. I use the term Hyper-Thread Phalanx to refer to the Hyper-Thread siblings of a core being aligned shoulder-to-shoulder and advancing forward.

I feel I must preface this section with the presumption that the authors of the book had a publishing deadline as well as had additional programming examples to write, test, edit and insert into the manuscript, and therefore had little time to explore this method.

The first optimization technique I employed was to gang the hardware threads of each core into a cohesive working unit (phalanx) thus enhancing the L1 and/or L2 cache hit ratio. In the book’s code, each thread worked in a different section of the 3D problem space. Looking down the X direction (view from Z/Y plane), we can illustrate each thread’s L1 and/or L2 cache activity as it processes each column of X.

For illustrative purposes, the left image represents a view of the Z/Y plane, with X going into the page. Also for illustrative purposes, blue indicates a cold cache miss and yellow a partial cache hit on the first traversal of X. As computation progresses to x=nx-1, the c column has three reads (c+1, c, c-1); with c+1 taking the memory hit, the c and c-1 reads benefit from cache hits. At the end of the loop, all five columns c, n, s, t, b will be located in L1 and/or L2 cache (as will the f1_t[c+1] and f1_t[c-1] entries, these being in the same column as f1_t[c]).

The right image illustrates the next traversal along x (++y, x=0:n-1). Note that two of the five columns of x (c and t) are now hot (red) in L1 and/or L2 cache, while three are cold (blue), meaning not in cache. Additionally, the [c-1] and [c+1] references will be in cache. This same activity is progressing for all the threads, including the HT siblings of each core. Counting 3 hits for the center column, we can estimate the cache hit ratio as 3/7 (42.86%). This hit ratio is estimated without prefetches, and assuming the x depth does not exceed the cache capacity.

What is non-optimal about this strategy?

First, all the HT siblings of a core share the L1 and L2 cache of that core. Each core has separate L1 and L2 caches. Therefore, each of the three HT siblings of the original tiled code (which used three out of the four available HT siblings) could effectively only use 1/3rd of the core’s shared L1 and/or L2 cache. Additionally, depending on the start points, the HT siblings may experience false-sharing and evict each other’s hot cache-lines. This is not an efficient situation.

Moving on from the best settings for the original tiled code, this author chose a different tactic to perform tiling.

Hyper-Thread Phalanx

The Hyper-Thread Phalanx is introduced by changing the outer computation loop such that the HT threads within each core process neighboring cells in the y direction and advance down x (vector by vector). The technique for doing so will be discussed later, but first let’s examine the reasoning behind this choice of access pattern. For this specific computational loop, the Hyper-Thread Phalanx design permits higher L1/L2 cache hit ratios.

Expanding upon the prior illustration of the activity of a single thread we will look at an illustration of 2-wide, 3-wide, and 4-wide Hyper-Thread Phalanxes:

The left image of each pair illustrates the cold-cache penetration of x. Yellow indicates that one of the threads incurs a cache miss while the adjacent threads accessing the same cell experience a cache hit. More importantly, moving on to the next drill down of x (right illustration of each pair), we can now estimate the cache hit ratios: 10/14 (71.43%), 16/21 (76.19%), and 22/28 (78.57%). These are significant improvements over the single-thread layout value of 3/7 (42.86%).

A second benefit is that the HT siblings within a core now share all of that core’s L1 and L2 cache. This effectively triples each thread’s working set in these caches compared with the prior technique of maintaining spatial separation of the work performed by each thread within a core.

A third benefit is elimination of intra-core thread cache evictions due to false sharing.

The programming changes to implement this were a little out of the ordinary for the OpenMP programmer. To provide for the use of Hyper-Thread Phalanx you remove the #pragma omp for collapse(2) code structure, and insert hand partitioning of the iteration space by core and by HT sibling within core.  This amounted to removing 5 lines of code and inserting 13 lines of code.

The question then becomes:

How to determine thread binding to core and HT within core?

I will let you think about this while you wait for part 2 of this blog.

Jim Dempsey
QuickThread Programming, LLC





Comments

Roger P.:

Hi Jim,

I was wondering if you would clarify the 'Throughput' calculation:

  double thput = (nx * ny * nz) * sizeof(REAL) * 3.0 * count
      / elapsed_time * 1.0e-09;

In Naoya's diffusion calculation I understand the flopage calc but not the throughput part.

Jim Dempsey:

The throughput expression is the one provided in the original source code of Jeffers and Reinders. I left that calculation as-was so that the throughput presented by my revisions would be representative of the throughput presented by the original code; had I changed the expression for my code, the reader would have cause for claiming 'foul'. I brought this issue to the authors. I am a little fuzzy about the answer after this long a time, but I think the "*3.0" reflected the clock speed of the particular CPU, possibly 3 GHz for the host and 1 GHz for the MIC, and the result was to be construed as bytes of answer per second (final results written).

The metric of throughput is subject to misinterpretation and should be used carefully in your analysis. A program can exhibit higher throughput yet take longer to produce results, reading/writing more data at a faster rate per byte than an alternate algorithm that reads/writes less data at a slower rate per byte while producing the same (equivalent) results.
