Interesting performance graph

I am trying the example from chapter 4 of the "High Performance Programming for Intel Xeon Phi Coprocessors" book. I am running the most optimized version of the program that the authors present.

Executing the program with different numbers of threads and plotting the number of flops gives an interesting result. The Phi version I use is a 57-core part (3110, IIRC).

The performance gain per thread seems to diminish slowly, as expected. However, at 228 threads there is a performance boost that doesn't follow the rest of the data points. I repeated the experiment a couple of times and it gives consistent results. Do you have any idea what could cause this?

I run the program with KMP_AFFINITY=scatter and the appropriate OMP_NUM_THREADS.
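
For concreteness, each run is launched roughly like this (the binary name is just a placeholder for the chapter 4 executable, not the book's actual naming):

    # Spread threads across physical cores first, then vary the thread count per run.
    export KMP_AFFINITY=scatter
    export OMP_NUM_THREADS=228        # changed for each data point in the graph
    ./diffusion_omp                   # placeholder name for the chapter 4 binary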




I suppose you're using MKL library functions, which are hand-coded for peak performance when using all threads.

KMP_AFFINITY=scatter (or the equivalent OpenMP setting OMP_PROC_BIND=spread) is probably a reasonable setting if you aren't willing to tune the KMP environment for each number of threads.

Alternatively, you could control the number of threads with KMP_PLACE_THREADS, using OMP_PROC_BIND=close. You might then find a few settings above your quoted curve, I would guess at settings such as 55c,3t, for example.
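
Roughly, that alternative would look like this (55c,3t is just the example above, and the executable name is a placeholder):

    # Expose only 55 cores x 3 hardware threads to the OpenMP runtime,
    # then pack threads onto consecutive hardware threads.
    export KMP_PLACE_THREADS=55c,3t
    export OMP_PROC_BIND=close
    export OMP_NUM_THREADS=165        # 55 cores * 3 threads per core
    ./diffusion_omp                   # placeholder name for the chapter 4 binary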

The code does not make explicit use of MKL, but just as loops are recognized by the compiler and vectorized, there could be other structures that end up optimized for the case where all cores are used.

If you can get VTune to work it may shed some light on the issue. I was unable to get VTune working when I retuned the Chapter 4 programs for a blog series going up on the IDZ Blogs. At the top of this page click on Resources | Blogs, then look for part 1 of the 5-part blog series titled The Chronicles of Phi.

The book's testers had a system with 61 cores, and they found the best results using 3 threads per core (183 threads); your system has 57 cores and peaks at 4 threads per core (228 threads). A purely speculative conjecture would be that the book's runs with 244 threads (4 per core) experienced cache line evictions where your 228-thread runs did not (or not as much).

My system has the 5110P (two of them), with 60 cores each. When using one of them, the results followed the book's experience, though for some reason my 5110P had slightly better performance. The book's authors used a preproduction system.

In my blog (5 parts) I will show you some new techniques (different from the book's) to boost the performance of this application.

Jim Dempsey


The choices of thread counts in the performance graph are almost certainly not optimal for a 57-core system due to load imbalance.

With KMP_AFFINITY=scatter, you probably want to try 57, 114, 171, and 228 threads. These will give 1, 2, 3, and 4 threads per physical core.

You will probably get more consistent results if you use KMP_PLACE_THREADS as suggested above to leave one core free for miscellaneous OS activity.  Then you would want to use 56, 112, 168, and 224 threads.
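
A sketch of that sweep, assuming the runtime skips the last core when only 56 of the 57 are requested (the executable name is a placeholder):

    # Reserve one core for the OS; sweep 1-4 threads per core on the remaining 56.
    export OMP_PROC_BIND=close
    for tpc in 1 2 3 4; do
        export KMP_PLACE_THREADS=56c,${tpc}t
        export OMP_NUM_THREADS=$((56 * tpc))      # 56, 112, 168, 224
        ./diffusion_omp                           # placeholder for the chapter 4 binary
    done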

I often find that using all of the cores gives very slightly better "best case" performance, but with significantly more performance variability than using N-1 cores.
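
One way to see that variability is to repeat the same configuration a handful of times and compare the spread; a rough sketch (the binary name and its output format are assumptions):

    # Re-run the same configuration several times to gauge run-to-run variation.
    export KMP_AFFINITY=scatter
    export OMP_NUM_THREADS=228
    for run in 1 2 3 4 5; do
        ./diffusion_omp | grep -i flops           # assumes the program reports its FLOP rate
    done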

"Dr. Bandwidth"

Oh OK, that makes sense. If one core has two threads assigned and all the others have only one, those two threads will take longer to finish their work, and that will impact the total time.

Thanks for the advice.
