Questions regarding Performance of Hyper threading on Xeon Phi

Can anyone answer the following questions?

1. Am I correct if I say

"Although Xeon Phi has 61 cores, up to 120 threads the scheduler alternates the hardware threads of a core on alternate cycles. Beyond that, the extra threads are really hyper-threads, and they only get scheduled when the other threads running on the same core are stalled, e.g. on I/O."

Can I get any intuition behind how those hyper threads are scheduled? What is the expected performance benefit of using hyper-threading?


2. In my job script I used

export MIC_KMP_AFFINITY=scatter

I know this works for OpenMP threads, but I am not sure whether it also works for Cilk Plus threads. If not, what happens for Cilk Plus threads?


When using 2 threads per core, the benefit of hyper-threading over a single thread is obvious. After that, I think it depends: if 2 threads already reach the maximum flops and memory bandwidth, it is meaningless to use more; otherwise, using more threads on a single core should give some performance benefit.


They used to tell us that 2 threads per core could at best keep the VPU busy 90% of the time.  Many workloads do in fact gain about that much from a 3rd thread on the core, provided there is no cache capacity problem.  Where there is cache data sharing, it's usually necessary to optimize the affinity setting.



The reason for using scatter, before the balanced and PLACE_THREADS options were available, was to spread work as evenly as possible across cores, when there was no good option for 3 per core other than to enumerate all of the threads.
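For concreteness, a hedged sketch of the two approaches from a job script: the simple scatter setting versus the pre-"balanced" workaround of enumerating logical CPUs explicitly (the CPU numbers below are purely illustrative, not the coprocessor's actual numbering):

```shell
# Sketch, not a recipe: two ways to spread work across cores.

# Let the runtime round-robin threads over cores, one per core first:
export MIC_KMP_AFFINITY=scatter

# Pre-"balanced" workaround: enumerate logical CPUs explicitly to get,
# say, 3 threads on each core. CPU numbers here are illustrative only.
export MIC_KMP_AFFINITY="granularity=fine,proclist=[1,2,3,5,6,7],explicit"
```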

There's not a lot of rhyme or reason to it.  An application I'm working on runs fairly well with 4 MPI ranks of 60 threads and better yet with 6 ranks of 30 threads, the latter using KMP_AFFINITY=balanced.

The compilers are much better than they used to be at realizing the potential of varying numbers of threads per core.

Cilk(tm) Plus doesn't use any KMP environment variables.  Due presumably to the lack of a means to affinitize and share cache, it tends to gain less than OpenMP beyond 1 worker per core, even though some of the VPU instructions are executed at higher rate by alternating workers between cycles.  Its claim to fame is the flexibility to keep cores busy dynamically, particularly where the data volume isn't too large.
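As a minimal sketch (treating the exact variable names as an assumption on my part), the Cilk Plus runtime takes its worker count from CILK_NWORKERS rather than from any KMP_ variable:

```shell
# Cilk Plus ignores KMP_* variables; its runtime reads CILK_NWORKERS.
export CILK_NWORKERS=61        # one worker per core on a 61-core card
export OMP_NUM_THREADS=61      # the OpenMP counterpart, for comparison
```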

The Intel Xeon Phi coprocessor core implements a "smart" round-robin multithreading. According to my understanding, this is how the core schedules instructions: if in cycle N the core issued instructions from hardware thread 0, then in cycle N+1 it will try to issue instructions from hardware thread 1, hardware thread 2, or hardware thread 3, in that order. The core cannot issue instructions from the same hardware thread in back-to-back cycles.

You can read more about the multi-threading in the Intel Xeon Phi coprocessor cores in the Software Developers Guide:

Intel® Xeon Phi™ coprocessor implements a "smart" round-robin multithreading.  See the Intel® Xeon Phi™ System Software Developer's Guide for details.

I don't think there is a way to pin Intel® Cilk™ Plus threads to processors.   KMP_AFFINITY applies to OpenMP® threads.  Similar topic:

Thanks all for your reply.


Do you mean that if I use 120 threads, the OpenMP program will pin 2 threads to each core, while Cilk Plus will assign those workers to cores arbitrarily, so we cannot predict whether all 61 cores are being used?

I have implemented graph algorithms and am trying to measure their performance on MIC. For some large scale-free graphs, these algorithms scale linearly up to 243 threads.

So I am confused about why they scale up to 243 threads. Does that mean the threads are waiting on I/O?

Any insightful opinion will be appreciated.

Thanks again,

Best Regards,


Many applications perform best using four threads per core.  For example, if the flop-to-byte ratio is large, use all the threads (leaving one core for the OS).  Pinning threads to cores is less important when an application uses all threads.
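A hedged job-script sketch of the four-threads-per-core case, using KMP_PLACE_THREADS to keep one core free for the OS:

```shell
# Use 60 of the 61 cores with all 4 hardware threads each (240 threads),
# leaving the last core for the OS and offload daemon.
export MIC_KMP_PLACE_THREADS=60c,4t
export MIC_KMP_AFFINITY=compact   # adjacent thread numbers share a core
```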

In the original post you said you were setting KMP_AFFINITY=scatter, which would spread the threads out across the cores but not put adjacent threads on the same core. If you use 243 threads, you don't need KMP_AFFINITY or KMP_PLACE_THREADS just to get the work spread out evenly (!), as Gregg said.  I would expect Cilk(tm) Plus also to spread the workers across cores.  What do you see with the micsmc/micsmc-gui visualization, when you turn on the bar graph view of core activity?

No doubt, there may be applications which don't need cache sharing between adjacent threads and may even scale up to 243 workers with Cilk(tm) Plus.  In the case of MKL ?gemm (which does depend on efficient caching) it took expert hand coding to get good scaling to 244 threads.


>>So I am confused why they are scaling till 243 threads. Does that mean that they are waiting on I/O s?

The typical runtime environment of Xeon Phi does not exhibit a behavior of software threads migrating amongst logical processors (pinned or not pinned). On a server or desktop, where other processes are running, when a thread is not pinned to a logical processor (or group thereof), these threads are subject to migration by the O/S.

What TimP is saying is that by not merely pinning threads to logical processors, but pinning specific threads to specific logical processors, and thus knowing which threads reside in which core, expert hand coding can yield superior results, as in TimP's MKL ?gemm example.

In your case, "observing scaling up to 243 threads" is indicative of a) memory bandwidth not being an issue, and b) shared resources within the core not being an issue (the vector units not being over-utilized). These types of programs have a preponderance of non-vector integer calculations whose data resides in L2 cache, or inter-register calculations. Note, if the data preponderantly resides in L1 cache, then you may see a plateauing effect when adding the 3rd and 4th thread of each core, and a new increase when adding the next core.

Additional note: under the preceding circumstance, where the data preponderantly resides in L1 cache, when using export MIC_KMP_AFFINITY=compact the plateauing effect may show as a ripple in your scaling chart, whereas with export MIC_KMP_AFFINITY=scatter you may observe a more pronounced plateauing effect as you reach 1/2 and 3/4 of the total threads. Also, if you look only at the linearity of the line without looking at total throughput, you may falsely assume the straight line shows better scaling. IOW, do not skew your performance for the purpose of showing linear scaling.
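To make that measurable, a throwaway sweep script along these lines (./graph_app is a placeholder for your binary) records throughput at each point instead of eyeballing the line shape:

```shell
# Sweep thread counts under both affinity modes; compare wall times,
# not just the straightness of the scaling line.
for aff in compact scatter; do
  for t in 61 122 183 244; do
    export MIC_KMP_AFFINITY=$aff
    export MIC_OMP_NUM_THREADS=$t
    echo "affinity=$aff threads=$t"
    # /usr/bin/time -p ./graph_app   # placeholder: your real workload
  done
done
```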

Jim Dempsey


KMP_AFFINITY=compact is problematic as well when not using all the threads, as it packs the threads into the minimum number of cores.  This is the reason for the introduction of KMP_PLACE_THREADS early this year.

When using OpenMP under MPI, KMP_AFFINITY=balanced is useful.  balanced or compact will place adjacent numbered threads on a core, helping with cache locality when they share data.
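A hedged sketch of such a hybrid launch (binary name, host, and the rank/thread split are illustrative, roughly matching the 6-ranks-of-30-threads case mentioned above):

```shell
# 6 MPI ranks x 30 OpenMP threads, adjacent threads sharing a core.
export KMP_AFFINITY=balanced
export OMP_NUM_THREADS=30
# mpirun -n 6 -host mic0 ./app.mic   # illustrative launch line
```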

Thanks all for your responses. These are useful.



Four threads per core can be useful when there is clean data parallelism with no interdependencies. I think the hardware can issue up to two instructions per clock cycle, utilizing Pipe0 and Pipe1, and the round-robin algorithm cycles through the hardware threads' instruction streams, trying to keep the VPU, x87, and ALU units busy by scheduling thread 2, thread 3, and thread 4 in turn.
