The Chronicles of Phi - part 5 - Plesiochronous phasing barrier – tiled_HT3

For the next optimization, I knew what I wanted to do; I just didn’t know what to call it. In looking for words that describes loosely-synchronous, I came across plesiochronous:

In telecommunications, a plesiochronous system is one where different parts of the system are almost, but not quite, perfectly synchronized.

In any large threaded application, synchronizing all threads by way of thread pool wide barriers results in significant waste of processing power. In the diffusion simulation program, as well as possibly your application, the simulation integrates multiple time intervals before doing something with the results. In the example program, it is the entire simulation run time. In a typical simulation program, the simulator would be configured to advance N intervals, then log or display the current state. And then proceed for another N intervals, etc.

In the tiled_HT2 code, small model (256x256x256), and 4 threads per phalanx, we have 64 squad slices across z by 256 steps along y. This represents 16384 squad “drill holes” along x. Each core (60), partitions out: 16384/60 (273.0666…) squad “drill holes” along x. Not being a whole number, all cores are not assigned the same number of “drill holes”. 56 of the cores are scheduled 273 “drill holes” and 4 cores are scheduled 274 “drill holes”.  The work disparity is small, only 1/273. “The cost of doing business” so to say.

If only “The cost of doing business” were that simple. In practice, with 240 threads in 60 cores, you will not get all threads to start a parallel region at the same time, nor will all threads finish the region at the same time even if they are notionally performing the same amount of work. As mentioned earlier in this article, adding instrumentation to measure the total thread time in do-work section and at-barrier section, for the small problem (256x256x256) indicated approximately 25% of the computation time was spent at the barrier. As a conscientious programmer, the “The cost of doing business” as too high

Now then, what to do about lost time at barrier?

Constructing a simplified diagram for a two core, 4 threads/core, 16x16x16 array (ignore vectors for the purpose of this illustration). In the tiled_HT2 perfect world with zero skew at the start of a parallel region and identical compute times (resulting in zero time at the barrier), the time when the first thread reaches the barrier would look like this:

Blue is core 0 (threads 0:3 across width of each stripe) and green is core 1 (threads 4:7 across width of each stripe). Ideally with both core starting at the same time and finishing at the same time.

The real picture is more like this:

Where the white cells are the yet to be processed cells.

In the above depiction, the first thread to reach the barrier (green lower right corner) waits for the remaining threads. The second thread reaching the barrier, additionally wastes time waiting for the other threads, and so-on. The more out of synch the threads get, the longer the wasted burn time at the barrier.

In an extreme case, the situation can even look like:

Where one of the threads (one of the blue core threads) has yet to begin performing its work when the first thread (one of the green core) reaches the barrier. This situation is usually due to some other process running on that hardware thread. The O/S may be doing some housekeeping too.

In either the normal case, or unusual case, significant time is wasted at the barrier.

Let’s do something to recover this wasted time.

The technique used is to assign the cores to a slice that is one count of z (in phalanx-wide groupings of 4 in this case), and ny counts of y (as well as nx of x).

The blue and green represents the work area assigned for a first pass by each core (of the simplified 2-core 4-HT/core setup).

When the first thread finishing work state may look like:

Now, by using a core barrier, in place of thread pool barrier, when the sibling threads of the core who’s thread finished first, finish, the state may look like:

At this point, instead of waiting for all the threads of the thread pool to reach the barrier (all 4 threads of blue in this simplified diagram), the first core completing the core barrier can pick the next available z, and can begin processing before the other core(s) reach the barrier. At the time core 0 completes the core barrier, the state may be:

The process of: As a core finishes, pick next z, continues in a modulus counter fashion (0 follows nz-1). The modulus counter contains the number of overflows. In actuality the next z picks the next nHT's per core of z.

The main benefit of the plesiochronous barrier is realized at the moment when, and where in the code, the traditional thread pool barrier appears. This is at the end of the whole array iteration (often called frame).

In the original OpenMP program this was where the end of the #pragma omp parallel for implicit barrier was located, and in our revised tiled_HT1 and tiled_HT2 program where the explicit #pragma omp barrier was located.

The plesiochronous barrier technique permits the cores completing one frame, to enter the next frame provided that the dependencies are satisfied:

In the above, grey indicates dependencies are fulfilled (prior iteration completed from perspective of picking thread). The right most stripe of blue illustrates a core completing a frame (no more un-chosen stripes to right of green), then blue core passing through frame barrier while green core has yet to reach the frame barrier. Due to the next column (being first column) having reached the state of frame-1 (for core), and the column to the right (2nd column) having reached frame-1 or frame of blue core's current frame (left column time state), the blue core was permitted to enter the column.

CAUTION: A caveat to this is a thread cannot progress unless the prior phase of the picked z, and its adjacent z's (z-1, z, and z+1) are at least current phase-1. If (when) this is not true, then progressing might potentially use data that is not up-to date. This is where the plesiochronous barrier is placed.

Had the core processing the second column in the prior chart not finished, then the core (upper left blue) would have had to wait at the plesiochronous barrier.

In the above, the blue core has allocated the first column for the next frame, yet has to wait at the plesiochronous barrier for the magenta core to finish its column. The magenta core being a hypothetical 3rd core injected into the diagram in order to present the plesiochronous barrier wait scenario.

The plesiochronous barrier technique should exhibit the best improvement when the number of core columns (Hyper-Thread Phalanx columns) exceed the number of cores (Phalanx’s) by a factor of two or more. In the large model it does (512 / 4 = 128 phalanx columns / 60 cores = 2.1333). This suggest that the large model will benefit more than the small model. In observations of test runs, substantial benefit is observed under all circumstances.

Code exemplifying the plesiochronous barrier technique:

#if defined(__MIC__)
#define WAIT_A_BIT _mm_delay_32(10)
#define WAIT_A_BIT _mm_pause();
void diffusion_tiled_aligned(
... (same as for tiled_HT2)

diffusion_tiled(REAL *restrict f1, REAL *restrict f2, int nx, int ny, int nz,
              REAL ce, REAL cw, REAL cn, REAL cs, REAL ct,
              REAL cb, REAL cc, REAL dt, int count) {

  // zCountCompleted[nz] is a shared array indicating the iteration counts
  // completed for the z index. N.B. Each thead processes all [x,y]'s for given z
  volatile int zCountCompleted[nz];
  for(int i = 0; i < nz; ++i)
    zCountCompleted[i] = -1;  // "completed" one before first (0)

  // shared next Phalanx number
  volatile int NextPick = 0;

  // CorePick[nCores] stores the NextPicked'd Phalanx number for core
  volatile int CorePick[nCores];
  for(int i = 0; i < nCores; ++i)
    CorePick[i] = -1;  // initialize to a value known to be less than our next pick

#pragma omp parallel
    REAL *f1_t;
    REAL *f2_t;

    int priorCount = -1;
    int myCount = -1;
    int myPick = -1; // initialize myPick (prior pick for 1st iteration of loop)
    int nSquadsZ = (nz + nHTs - 1) / nHTs; // place squads across z dimension
    for(;;) {
      if(myHT == 0) {
        // team member 0 picks the next Squad
        CorePick[myCore] = myPick = __sync_fetch_and_add(&NextPick, 1);
      } else {
        // other team members wait until pick made by member 0
        while(CorePick[myCore] == myPick)
        myPick = CorePick[myCore]; // pick up new pick
      } // myHT != 0

      myCount = myPick / nSquadsZ; // determine count interval for myPick
      // see if iteration count reached
      if(myCount >= count)
        break; // exit for(;;) loop

      // determine which buffers are in and out
      if(myCount & 1)
        f1_t = f2;
        f2_t = f1;
        f1_t = f1;
        f2_t = f2;

      int z0 = (myPick % nSquadsZ) * nHTs;// home z for 0'th team member for next squad
      int z = z0 + myHT;          // z for this team member
      int y = 0;
      // assure we are within z
      if(z < nz)
        // perform plesiochronous barrier
        priorCount = myCount - 1;
        if(z) // then there is a z-1
          while(zCountCompleted[z-1] < priorCount) // wait for z-1
        while(zCountCompleted[z] < priorCount)     // wait for z
        if(z + 1 < nz) // then there is a z+1
          while(zCountCompleted[z+1] < priorCount) // wait for z+1
        int x = 0;
        int c, n, s, b, t;
        c =  x + y * nx + z * nx * ny;
        n = (y == 0)    ? c : c - nx;
        s = (y == ny-1) ? c : c + nx;
        b = (z == 0)    ? c : c - nx * ny;
        t = (z == nz-1) ? c : c + nx * ny;
        // compute nx by ny cells for picked z
   &f2_t[c], // aligned
   &f1_t[c], // aligned
   &f1_t[c-1], // unaligned
   &f1_t[c+1], // unaligned
   &f1_t[s], // aligned
   &f1_t[n], // aligned
   &f1_t[b], // aligned
   &f1_t[t], // aligned
                        ce, cw, cn, cs, ct, cb, cc, nx, ny);
        // Inform other threads that this [z] column is complete
        zCountCompleted[z] = myCount;

        // perform equivilent of Core barrier
        int zEnd = (z0 + nHTs < nz) ? z0 + nHTs : nz;
        for(int i = z0; i < zEnd; ++i)
          while(zCountCompleted[i] < myCount)

      } // if(z < nz)
    } // for(;;)
  } // parallel

Now let’s see what happened:

This shows an additional 20% gained for small solution. For the large solution it recovered the expected gains missing from the HT1 and HT2 code changes.

At this point of optimization, we are now attaining 14x to 15x the performance of a simplified OpenMP approach to the problem. And 45% faster than that attained in the example from chapter 4 of Intel® Xeon Phi™ Coprocessor High-Performance Programming, by Jim Jeffers, and James Reinders - Morgan Kaufman publications.

What else can be done to improve performance?

This article does not have a next part to cover the next phase of the programming enhancements such as guided prefetch and clevict #pragmas, as well as inserting compiler directives to hand align code. I will leave this up to the reader.

At this point, it would be appropriate to consider what happens when problem size is increased to the point where the memory capacity of a single Xeon Phi is insufficient. The plesiochronous barrier technique you learned here can equally apply to larger problems that do not fit within the memory constraints of a single coprocessor.

If there is sufficient interest in having me continue this column to include dual coprocessors (same system) I will do so. Keep in mind that this would be a fair amount of effort.

Please post your comments.

The source code and make file can be found attached.

Jim Dempsey
QuickThread Programming, LLC



File ChroniclesOfPhi.tar111 KB
For more complete information about compiler optimizations, see our Optimization Notice.


jimdempseyatthecove's picture

Is the above function a practical inside an application .OR. is it for benchmarking purposes?

If for benchmarking purposes, then you would spend no more effort on this than "rudimentary" #pragmas.

On the other hand, if this is for production code, then the effort in time to optimize will pay off.

Assuming this is production code.

How large are M and N on average? (#pragma loop count agv=80 is not necessarily a true indication of the average size of N).

If N is larger than the L1 cache requirements for x[] and z[], then I would experiment with teaming the 4 HTs within a core, together with the (used) threads(cores) within the MIC. In the HT3 example you see how an array of volatile int's are used for each thread to observe the progress of the other threads. I suggest experimenting with a similar approach in your test program.

Note, the init code in the example assigns logical core numbers 0:nAvailableCores-1 which need not necessarily map to the physical cores 1:1 and need not use all cores (reserving 1 for MPSS).

HT0 of logical core 0 would start the first pass at x[0]. It would perform the first pass in a nested loop, where the inner loop performs one cache line or a to be determined smaller set of cache lines of work. After it completes the one (or small set) of work in the inner most loop, it publishes its progress into the volatile variable observable by the other threads. The next level of the nested loop then advances towards N by the length of the inner most loop. IOW HT0 of logical core 0 processes all of [0:N) with periodically publishing its progress.

HT1 of logical core 0 performs what would have been the 2nd pass of the for(m= loop. However, prior to proceeding it observe the progress variable for the worker of the prior iteration. In this case the volatile progress variable published by HT0 of logical core 0. When the prior worker (HT0 of logical core 0) has not progressed pass the output position for its (HT1 logical core 0) iteration, then it stalls (use _mm_dekat_32(10) on MIC or _mm_pause() on host).

Note, you can also aggregate the logical HT and logical core (as produced by the init code) into a logical processor. This may simplify the loop control variables and index into the logical processor table,

*** which is not necessarily the same as the O/S logical processor *** e.g. if the O/S is using logical processor 0 for MPSS then the application would (may) want to avoid using the O/S's interpretation of a logical core 0, and instead use the example's init function mapping of a logical core 0 (which in this case is really core 1).

Effectively what you are doing is producing a pipeline where HT siblings are adjacent filters.

Note, after each thread completes the [0:N) pass, it atomically obtains the next m (see HT3 source for example).

I hope this explains the objective sufficient enough for you to work up a test program.

Jim Dempsey

Vladimir G.'s picture

Thanks Jim, will try the altered makefile, taking a break till Monday.

Took a look at the botlenecks again, and now the algorithm reduces to calls to someting like this in the more critical part:

void func(int M, int NN, int N, float a0, float a1, float *__restrict x, float *__restrict z)
	// M, NN, N and all other variables NOT known at compile time, and can be safely assumed as external
// input within this function
	__assume_aligned(x, 64);
	__assume_aligned(z, 64);

	int m, i;

	for (m = 0; m < M; m++)
			--N; //experimenting with different options here - one of them is making N a multiple of 16 and using __assume(N%16==0) to avoid residual loop, padding array appropriately
#pragma loop count avg=80 //presumably, does not help Phi, since it has only zmm instructions, but on a Xeon it //helps to generate 256-bit AVX instead of 128-bit SSE instructions (without #pragma), since N is not known
 //at compile time
#pragma simd
			for (i = 0; i < N; i++)
				x[i] = fmax(a0 * x[i] + a1 * x[1 + i], z[m + i]); //didn't try assigning x to 
                                                  //itself with SIMD yet - hopefully, no issues

And a bunch of other vectorizable places where just a linear combination of some x_k[i] and x_k[1+i] are assigned to y[i], similar to your inner loop.

jimdempseyatthecove's picture

Crap, paste from Word removed the strike-throughs

Remove the first section of the ifeq (including the ifeq and else), and remove the endif


jimdempseyatthecove's picture

1. Makefile didn't run on my Win 7 x64 Pro - I will compile the 32 pieces manually

Why not edit the make file. I think all you will need is to cut out the Linux portion of the header

ifeq ($(windir),)

CC = icc

CFLAGShost = -openmp $(OPT) -std=c99 -vec-report=3

MICLIBS = /opt/intel/composerxe/lib/mic

MIC_OPT = -mmic

HOST = -xHost

OUT = "-o"


CC = icl

CFLAGShost = -openmp $(OPT) /Qstd=c99 /Qvec-report3

MIC_OPT = /Qmic

HOST = /QxHost

OUT = /Fe


2. Could you please elaborate on how to better accomplish the shifted y write at/near the time y itself is written?...

Before we (you) go through the gymnastics it may be pertinent to look at the arena. By this I mean to look at the code you use now before we make improvements.

Yes, array is always shifted, but shifted version is not re-used, but despite small count, this particular and few comparable loops are a bottleneck (somewhat similar to your example i this sense)

You mean it is used once? In this case it would be very difficult to justify performing the shift by an HT sibling. But for a small section of code, that is a bottleneck of substance, it may be worthwhile investigating alternatives.

The Xeon Phi, with 4-wide HT, leaves plenty of opportunity to experiment. Keep in mind that the shepherding is not free, unless, it can occur during stall time by the other HT siblings. Or conversely, the compute HT siblings can perform their work during the stall times of the shepherd thread. The authors in the book referenced in the blog found using 3 HTs/core yielded best results. One of the improvement attempts I made was to use the 4th HT sibling in the core as a shepherd thread to preload the caches. This did not yield the benefit I sought. Note, I also discovered compiler inserted prefetches decreased performance (the implicit hardware prefetching was sufficient).

The hardware thread scheduler of each cored is round-robin amongst the hardware threads that are active (the O/S null job must issue _mm_delay_32(someNumber) for the hardware thread to yield its instruction time slice in its core). When the three (or two) remaining threads have no stall time, then adding the 4th thread is hardly justified, unless the additional registers buys you time.

3. The compiler optimization guru's are smart and getting smarter all the time. I a amazed at what they can accomplish. Not posted in the blog were the attempts to do just what you suggest: keeping the loops inline with the main code with various #pragmas. In the attempts I made, there still remained the issue of the peel and remainder code. My gut feel is when the code is inline, the compiler optimization are stingy about holding onto registers, and/or the offset calculations (visible to the compiler) induce uncertainty. This may have affected the optimization choices available. Using __assume(... on the index variables may have corrected this. I think what might be of assistance to the compiler is a scoping hint pagma

... code
#pragma heuristics release all
... all resources (registers) available here

or something like that

And for the 'postamble', although not applicable to your optimization case, but in cases when you can afford to go above the array's higher bound (array is padded appropriately, and no data of interest is stored beyond loop's higher bound),


If you look at the code, you will find that padding was performed one cache line before and after the array. The before part was to handle the [i-1] (actually the vector incursion prior to actual data) and the after part for the [i+1]. Additional care was taken to pre-populate the pad data such that when used, it would not produce a Signaling Not a Number (SNAN) when used in the loop.

The __assume(nIterations%N_REALS_PER_CACHE_LINE == 0) may have assisted in removing postamble. Let me know what you find out.

Jim Dempsey

Vladimir G.'s picture

Jim, thanks a lot for your answers! Took some time to comprehend, but extremely helpful.

1. Makefile didn't run on my Win 7 x64 Pro - I will compile the 32 pieces manually. Will take some time to send the results back though (won't have access to the system for a few days).

2. Could you please elaborate on how to better accomplish the shifted y write at/near the time y itself is written? Yes, array is always shifted, but shifted version is not re-used, but despite small count, this particular and few comparable loops are a bottleneck (somewhat similar to your example i this sense). For the relevant part of the calculations, in my case y is not leaving L1 cache, and even having a shifted version of it will not change this pattern, but quite interesting to know what the options are in both within/outside L1 cases.

(to be able to follow your HT suggestion, that sounds quite interesting in many cases, need to do some more homework first...)


"One of the major objectives of using the separate function was to eliminate the preamble (peel) and postamble (remainder) code." 

Sorry to ask another question how to do same thing in 10 different ways, but for my and others' future reference to simplify writing, making scripts more efficient on-the-run with not much effort... Would it be possible to achieve the first goal (no peeling) with #pragma vector unaligned after you inserted __assume_aligned section? Will compiler still take the __assume_aligned() information into account?

And for the 'postamble', although not applicable to your optimization case, but in cases when you can afford to go above the array's higher bound (array is padded appropriately, and no data of interest is stored beyond loop's higher bound), read about -opt-assume-safe-padding ( Is there maybe a way to specify this/similar option on a per-loop basis, instead of for the whole project?

And in a more general case of mostly aligned accesses, when it is known that the loop count is a multiple of elements that fit in the register, again a similar question... would __assume(N%16==0) be of any help here to eliminate the 'postamble' code?  (if CACHE_LINE_SIZE/sizeof(type_used) = 16, of course)

jimdempseyatthecove's picture

>>I would gladly volunteer to run your code on 2P E5-2630v2 if you can give me a Windows version (the only Linux that I have at the moment is on Phi, that is, of course, creating some problems, so might be forced to switch to Linux at some point)

The supplied makefile will run (NMAKE) on Windows (at least it used to).

Let me know if you have issues. My system dual boots Cent O/S and Windows 7 x64 Pro. I'd rather not reboot unless there is an issue. The system has been up for ~6 months.

Also note, the make file on either host system, also makes the host execuitables. This will illustrate how the code optimizations also help on host based platforms.

Jim Dempsey

jimdempseyatthecove's picture


A1 There were two principal reasons for moving the inner loop to a separate function. The first was to aid the compiler in eliminating the, or much of the, loop preamble code (peel code) and postamble (remainder) code. The second was to assist the compiler in selecting which (6) of the array references (8) would be aligned. Using two base pointers and __assume(...%N_REALS_PER_CACHE_LINE==0) and  __assume(...%N_REALS_PER_CACHE_LINE!=0) may also have worked. This would have reduced the number of arguments passed. Someone could try this with the code supplied and report back. I my gut feel is a separate function will work better .AND. the __assume technique could be faster (less args passed) "assuming" the compiler optimized according to your wishes. (thus warranting an experiment).

... What would happen for those cases, when offset_xxx is actually not a multiple of 16?

By using __assume(... you made an explicit contract with the compiler that the offset_xxx is a multiple of N_REALS_PER_CACHE_LINE (and that the base pointer c was also aligned). Your failure to abide by your contract may result in severe consequences.

A2 simd and restrict are not quite one and the same, although in this case they might yield the same. While simd will vectorize the loop, it will/may not affect the preamble (peel) and postamble (remainder) code generation. Whereas restrict will in cases where it can. One of the major objectives of using the separate function was to eliminate the preamble (peel) and postamble (remainder) code.

A3 "y[i] = a0*x[i] + a1*x[i+1];" ... Would it make any sense to allocate some aligned x1 once and in front of every inner loop (i++) do a memcpy of x starting from position 1 to the end into an aligned location *x1

Yes, this is a similar optimization technique as used in Matrix Multiply where you effectively transpose one of the matrices (or subset of columns). In the Matrix Multiply, the benefit is generally larger since your tradeoff is elimination of a gather (as opposed to elimination of unaligned load as in your case). I am not sure how many times you will have to reused the data in your case to find a return in your investment.

Note, one technique to improve on the array shift is when you write the original array, that you also write the shifted arrays. If you were always going to shift them, doing it in the write pass eliminates the copy operation.

On HT system you can team the siblings up and have one sibling HT thread running in advance of the other(s) performing the shifts. Being in the same core, this also primes the data in L1 and/or L2. You would want to coordinate the activities without using formal barriers, rather use progress flags/counters as was done in this article.

Jim Dempsey

Vladimir G.'s picture

I would gladly volunteer to run your code on 2P E5-2630v2 if you can give me a Windows version (the only Linux that I have at the moment is on Phi, that is, of course, creating some problems, so might be forced to switch to Linux at some point)

Vladimir G.'s picture

Jim, thanks a lot for your posts on developer zone's forums! They are very helpful, when looking through a lot of forum posts for an answer to some question.

I have some questions on the part 4 of your blog (i am just a C/C++ beginner, sorry if the answers are too obvious).

Q1. You are introducing void diffusion_tiled_aligned function. Am I right assuming that its main purpose is to let the compiler know that 6 out of 8 variables are aligned? Just in general, would one be able to achieve the same result by specifying (after __assume_aligned section) something like:

s = c + offset_c; n = c + offset_n; b = c + offset_b; t = c + offset_t; 

followed by

__assume(offset_n%16==0); __assume(offset_n%16==0); __assume(offset_b%==0); __assume(offset_t%16==0);

(and doing the relevant corrections for countX-1 after that)

if Cilk array notations are not used? (I saw examples of this approach only in Cilk context). What would happen for those cases, when offset_xxx is actually not a multiple of 16? (will the program safely crash, or start producing wrong results)


Q2. In general, for other code is there a need to pass arguments as '*restrict' if later on #pragma simd is used anyway?


Q3. And a question to which I am trying to find an answer for quite a while... Excellent, huge improvement after you told the compiler that 6 out of 8 accesses are aligned. But what to do with the other 2 that simply cannot be aligned with all the rest simultaneously in a reasonable Phi use-case?

Let's even simplify, concentrating on only 3 accesses out of 8. Assume:

float *const y and float *const x are floats _mm_malloced on 64 byte bounds and padded 64 or more bytes.

a0 and a1 are const float scalars.

__assume(N%16==0); // with the intention of always passing residual loop unmasked as a vector, since padding
                   // allows to do that (if no NaNs are ensured)
__assume_aligned(y, 64);
__assume_aligned(x, 64);
int i;
#pragma simd
#pragma loop count min=16, avg=160, max=2048 // Q4. does it help here???
for (i = 0; i < N; i++)
y[i] = a0*x[i] + a1*x[i+1];

How can this loop be optimized? And deviating from your optimization case from Chapter 4, please further assume that N is relatively small, so that only ~1-2KB of cache would be needed to load complete x and y, and the loop is repeated many times, re-using the same pre-allocated once x and y with different N values, but re-organizing inner-outer loops is not an option due to complex dependencies at each iteration in the outer loop.

Would it make any sense to allocate some aligned x1 once and in front of every inner loop (i++) do a memcpy of x starting from position 1 to the end into an aligned location *x1, so that #pragma vector aligned can be used on top of 

for (i = 0; i < N; i++); {
y[i] = a0 * x[i] + a1 * x1[i]; }

loop? Or the memcpy will require more work than all the benefits of all (8/8 in your case) elements-aligned memory accesses? What else might help to speed up this sub-case?


Add a Comment

Have a technical question? Visit our forums. Have site or software product issues? Contact support.