I am trying the example from chapter 4 of the "High Performance Programming for Intel Xeon Phi Coprocessors" book (lotsofcores.com). I am running the most optimized version of the program that the authors present.
Executing the program with different numbers of threads and plotting the number of flops gives an interesting result. The Phi I use is a 57-core model (3110, IIRC).
The performance gain per thread seems to diminish slowly, as expected. However, at 228 threads there is a performance boost that doesn't follow the trend of the other data points. I repeated the experiment a couple of times and the results are consistent. Do you have any idea what could cause this?
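To make the thread counts concrete, here is a minimal sketch of how many threads would land on each core. It assumes 4 hardware threads per core and a simple round-robin placement across the 57 cores, which is my understanding of what a scatter-style placement does. The function name is made up for illustration, and the script only does the arithmetic; it does not touch the coprocessor.

```python
from collections import Counter

CORES = 57             # core count of the card mentioned above
THREADS_PER_CORE = 4   # assumption: 4 hardware threads per core


def threads_per_core(num_threads, cores=CORES):
    """Return the number of threads placed on each core, assuming
    threads are dealt out round-robin (thread i goes to core i % cores)."""
    counts = [0] * cores
    for i in range(num_threads):
        counts[i % cores] += 1
    return counts


if __name__ == "__main__":
    for n in (57, 114, 171, 200, CORES * THREADS_PER_CORE):
        # Summarise as {threads on a core: number of such cores}
        print(n, dict(Counter(threads_per_core(n))))
```

Under these assumptions, 228 = 57 × 4 is the only count between 172 and 228 where every core carries exactly 4 threads; at 200, for example, 29 cores carry 4 threads and 28 carry 3. Whether that imbalance has anything to do with the jump is exactly what I am unsure about.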
I run the program with KMP_AFFINITY=scatter and OMP_NUM_THREADS set to the thread count for each run.
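For reference, the sweep can be sketched roughly like this. It is only an illustration of the setup, not the exact script I used: the command `./diffusion` and the helper names are hypothetical placeholders, and it assumes the benchmark can be started from wherever this script runs.

```python
import os
import subprocess


def make_env(num_threads, affinity="scatter"):
    """Copy the current environment and set the two OpenMP variables."""
    env = dict(os.environ)
    env["KMP_AFFINITY"] = affinity
    env["OMP_NUM_THREADS"] = str(num_threads)
    return env


def run_once(command, num_threads):
    """Run the benchmark once and return whatever it prints to stdout."""
    result = subprocess.run(
        command,
        env=make_env(num_threads),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    COMMAND = ["./diffusion"]  # hypothetical name for the benchmark binary
    for n in range(57, 229, 57):  # 57, 114, 171, 228
        print(n, run_once(COMMAND, n).strip())
```

Each run gets its own copy of the environment, so the settings from one thread count cannot leak into the next.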