OpenMP scaling for PHI

I've got a very long-running simulation with a very simple OpenMP structure: some serial setup, and then a bunch of threads start computing, only interacting when they write to a standard output array managed by thread 0.  It's Fortran on Windows; the OpenMP part is offloaded and very little data comes back (and the amount of data does not vary with the number of threads).
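To give a sense of the shape, here is a schematic only, not our real code: the names are made up for illustration, the offload directives assume Intel's Fortran offload extensions, and the way thread 0 manages the output array is elided.

program scaling_sketch
  implicit none
  integer, parameter :: n_work = 100000
  real, allocatable :: input(:), output(:)
  integer :: i

  allocate(input(n_work), output(n_work))

  ! serial setup on the host
  do i = 1, n_work
     input(i) = real(i)
  end do

  ! offloaded section: threads compute independently; only the output
  ! array (whose size does not depend on the thread count) comes back
!dir$ offload begin target(mic:0) in(input) out(output)
!$omp parallel do schedule(static)
  do i = 1, n_work
     output(i) = sqrt(input(i))   ! stand-in for the real per-thread work
  end do
!$omp end parallel do
!dir$ end offload

  print *, output(1), output(n_work)
end program scaling_sketch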

I attach a plot which shows the productivity of the same code on three machines: a pair of E5s, an i7, and a Phi.  They all access an 80 MB data set for testing (the real data set is 6 GB).


As you can see from the plot, the i7 is better than the E5s but has the same slope.  In the E5 testing up to 32 threads (all it has), the scaling is well behaved (the bump at 12 threads is real, but we don't care about it).  So we were surprised by many things about the MIC behavior.  First, we expected it to be straight out to 56 threads (for this plot we chose to allocate the cores first, through 56, then linearly added the threads).  Letting the Phi choose the affinity did no good.

According to VTune, all the threads are running full bore.  What is especially odd is that the slope becomes good after we start assigning the third thread per core, but by then the damage has been done.  The terrible bump starts at about 8 threads, each on its own CPU.  Is this some issue with the cache structure?  All the threads are constantly accessing the same shared data set.  If so, why doesn't it just continue to have a bad slope?  Or, again, has all the damage already been done?

Any ideas where to look, folks?

[Attachment: Intel comparison plot.jpg]

With respect to hardware threads, the MIC architecture is different from the host processor and its HT (Hyper-Threading).

What you want to do on the MIC for a scaling test is, with affinity pinning, to schedule two threads per core before moving on to the next core:


0:0, 0:1, 1:0, 1:1,...55:0, 55:1

then 3rd thread per core

0:2, 1:2, ... 55:2

then 4th thread per core

0:3, 1:3, ... 55:3

If you look at your chart you will see rather poor scaling until you reach #threads = #cores, then it increases.

What this shows me is that your thread scheduling is 1 per core first (scattered).



With KMP_PLACE_THREADS=56c,2t set, vary your thread counts from 1 to 112.

Then remove the KMP_PLACE_THREADS environment variable and vary your thread counts from 113 onwards.
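For example (a sketch only; adjust the core count to your card, and in offload mode make sure these settings actually reach the coprocessor environment):

Threads 1 to 112 (two per core, filled core by core):
KMP_AFFINITY=compact
KMP_PLACE_THREADS=56c,2t
OMP_NUM_THREADS=1 ... 112

Threads 113 onwards:
KMP_AFFINITY=compact
KMP_PLACE_THREADS=        (cleared)
OMP_NUM_THREADS=113 ... 224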

Jim Dempsey

As you expect your threads to share cache, it's important to set KMP_AFFINITY=compact or OMP_PROC_BIND=close, as Jim mentioned, when running more than 1 thread/core.  The easy way to test 1, 2, 3, or 4 threads per core is by the KMP_AFFINITY variations.

KMP_PLACE_THREADS=56c,2t will set 112 threads unless you override it; likewise, 56c,3t sets 168 threads.
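Spelled out, the sweep of threads per core on a 56-core part would look something like this (each combined with KMP_AFFINITY=compact):

KMP_PLACE_THREADS=56c,1t   ->  56 threads, 1 per core
KMP_PLACE_THREADS=56c,2t   -> 112 threads, 2 per core
KMP_PLACE_THREADS=56c,3t   -> 168 threads, 3 per core
KMP_PLACE_THREADS=56c,4t   -> 224 threads, 4 per core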

If your application sees reduced scaling beyond 8 cores, it could be a memory bandwidth question; cache locality might help.  The Xeon models vary in how effective the L3 cache is at saving you from having to consider cache locality, and of course it varies a great deal with application characteristics.


Hi Jim & Tim,

The Intel forum once again (now 3 times) lost my response.  You'd think by now I'd know enough to save it before hitting the submit button.  I'm going to be out of town for a few days, but I'll be trying these hints as soon as I get back.

Yes, for the plot we did apply them one thread per core to start with, as we assumed that would be the best way to go.  (Although we noted that as balanced, which is what I think you are suggesting; we'll have to double-check that.)  We started with the default affinity (is that scattered?), which had strange behavior: some threads taking twice as long as others in certain ranges (like the 32-64 thread range, but not afterwards until somewhere over 200 threads, where it started again).

Cache locality is difficult.  The 3-D data are only local in one dimension, and that will fail as soon as the stochastic threads diverge.  I've been worried about delays in the L2 interconnect cache ring, but I find it hard to believe that it would start to become saturated at only eight threads and that this would continue to get worse until 112 threads, where, I suppose, it couldn't get any worse and the performance returns to scaling.  If this is true for the test data set, I'm hosed, especially when the real, full-memory data set is used.  Then the Phi just becomes a training tool for a hopefully better Knights Landing with a better interconnect scheme.

I suppose one question is: does the affinity matter if we run all possible threads?  If the answer is no, then even breaking the problem into parts (which we can do) will not help, since a run, in no matter how many parts, still uses all cores.


On Xeon Phi you cannot use the generic/shorthand compact, scatter, or balanced settings for running benchmarks from 1 to n threads. It might be time to add a new KMP_... variable that says, in one setting: fill up the requested number of threads at 2/core first, then 3/core next, then 4/core last. KMP_PLACE_THREADS cannot do this with one set value.

If, on the other hand, you intend to run your application on all cores and with as many threads per core as produce the best performance, then use scatter and run three tests: 112 threads (2x cores), 168 threads, and 224 threads. Also note that in offload mode you may want to reserve one of the cores. Once you determine threads/core, use KMP_PLACE_THREADS to specify the number of threads/core and then vary your thread count. What you are looking for is that with different numbers of threads, the volume of data per thread changes; the two in combination affect data alignment (false-sharing evictions) and memory bandwidth demands. You may find a sweet spot that is not the full complement of cores for the chosen number of threads/core.
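On the false-sharing point: if every thread updates its own slot of a shared results array, padding each slot out to a full 64-byte cache line keeps one thread's stores from evicting its neighbours' lines. A generic sketch (not your code; the slot layout is invented for illustration):

program false_sharing_sketch
  use omp_lib
  implicit none
  ! one 64-byte cache line per thread: one 8-byte value plus padding
  type :: padded_slot
     real(8) :: value
     real(8) :: pad(7)
  end type padded_slot
  type(padded_slot), allocatable :: slot(:)
  integer :: tid, i, nthreads

  nthreads = omp_get_max_threads()
  allocate(slot(0:nthreads-1))          ! ideally also 64-byte aligned
  slot(:)%value = 0.0d0

!$omp parallel private(tid, i)
  tid = omp_get_thread_num()
  do i = 1, 1000000
     ! each thread hammers only its own cache line, so no false sharing
     slot(tid)%value = slot(tid)%value + 1.0d0
  end do
!$omp end parallel

  print *, 'total = ', sum(slot(:)%value)
end program false_sharing_sketch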

>>Cache locality is difficult.  The 3-D data are only local in one dimension and that will fail as soon as the stochastic threads diverge.

This can be problematic; however, with more code and some rearranging of the data, cache locality can often be improved.
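As a generic illustration (again, not your code) of the kind of rearrangement I mean: block the 3-D sweep so each thread finishes a tile that fits in its core's L2 before moving on, keeping the unit-stride dimension innermost. The tile size nb below is invented and would need tuning:

program blocking_sketch
  implicit none
  integer, parameter :: n = 128, nb = 16     ! nb = tile size, to be tuned
  real, allocatable :: a(:,:,:), b(:,:,:)
  integer :: i, j, k, jj, kk

  allocate(a(n,n,n), b(n,n,n))
  a = 1.0

  ! each (jj,kk) tile is small enough to stay in a core's L2 while the
  ! unit-stride i loop sweeps it; threads take whole tiles
!$omp parallel do collapse(2) private(i, j, k)
  do kk = 1, n, nb
     do jj = 1, n, nb
        do k = kk, min(kk + nb - 1, n)
           do j = jj, min(jj + nb - 1, n)
              do i = 1, n
                 b(i,j,k) = 0.5 * (a(i,j,k) + a(min(i+1,n), j, k))
              end do
           end do
        end do
     end do
  end do
!$omp end parallel do

  print *, b(1,1,1)
end program blocking_sketch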

BTW, your first priority is to get the vectorization working well.
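A starting point (a sketch, nothing specific to your code): have the compiler report what it vectorized, e.g. with ifort on Windows something like /Qopt-report-phase:vec (or /Qvec-report on older compilers), and check that the hot inner loops reduce to this kind of unit-stride, dependence-free form:

program simd_sketch
  implicit none
  integer, parameter :: n = 1024
  real :: a(n), b(n), scale_factor
  integer :: i

  a = 1.0
  scale_factor = 2.0

  ! unit stride, no dependences between iterations: a good vectorization
  ! candidate; the directive just makes the intent explicit
!$omp simd
  do i = 1, n
     b(i) = a(i) * scale_factor + 1.0
  end do

  print *, b(n)
end program simd_sketch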

Jim Dempsey

Yes, we have 57 cores and we treat it as a 56-core machine.  VTune actually shows that the last core is not all that busy, but at this point that is not a major issue.

I intend to have an intern work on the vectorization issue this summer from the ground up.  Vectorization of this code is a major job, as the basic design is from many years ago, when I believed that the long-term solution was having many computers combining results.  So going to OpenMP was very easy once we understood OpenMP, and that is why it scales so well on the Xeons.  But the heart of the core process is a series of Markov chains with dynamically changing criteria competing against each other.  We did make one run at vectorization, but it performed with the same efficiency (loss) on machines with different vector register sizes, so we're going to start from the ground up to first fully understand alignment and vectorization; if we did understand them, we would be getting different gains (err, losses) on different machines.  Only then can we contemplate a major redesign of this central core part of the program.

Oddly, right now I can get better absolute performance out of a couple of Xeons than out of a Phi board, but the Phi is cheaper.

