When plotting scaling graphs for Intel processors which support multiple hardware threads per core, NEVER use an X-axis labelled “Threads”; always use “Cores”, and plot separate data for 1 thread/core, 2 threads/core and, on Intel® Xeon Phi™ processors, 4 threads/core.
We often show scaling graphs for our parallel programs to try to understand the limits of their performance, and how well they are working on a given machine. A classic way to do this is to show a “Speedup” graph as we increase the hardware resources that are allocated to the program.
The critical issue for us here is “What should be plotted on the X-axis?”
OpenMP tempts us to use “Number of threads”, since it provides the OMP_NUM_THREADS environment variable that makes it easy for us to control the number of threads from the script we’re using to run our experiments. However, if we’re testing a processor which has hardware support for more than one logical CPU per core, the number of threads is the wrong thing to be using as our X axis.
This is because not all threads are equal. If we run two threads placing one on each of two cores, we’re using two whole cores and their private caches, whereas with two threads on a single core we're only using one core and one set of caches. The first case can therefore use twice as much hardware as the second. We’d certainly hope to get better performance by using twice as much hardware.
A secondary effect is that the amount of cache a thread can use without risking interference from another thread depends on where it is placed. On Intel® Xeon® processors, the level 1 and level 2 caches are private to each core (on the Intel® Xeon Phi™ X200 series (Knights Landing) the L2 cache is shared between the two cores in the same tile). Therefore a thread which is running alone on a core effectively has twice as much L1 cache as one which is sharing the core with another thread.
Even on Xeon processors it's therefore critical to distinguish between the one thread/core and the two threads/core cases. On Intel® Xeon Phi™ processors, where we have up to four logical CPUs/core, things can get even more confusing. Consider a “60 thread” measurement. Is that:
- 60 cores at 1 thread/core?
- 30 cores at 2 threads/core?
- 20 cores at 3 threads/core?
- 15 cores at 4 threads/core?
Each of those will have a different performance.
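The ambiguity is plain arithmetic; this small sketch just enumerates the hardware footprints that could sit behind a single thread count (no OpenMP runtime involved):

```shell
#!/bin/sh
# Enumerate the hardware footprints that could sit behind "60 threads".
threads=60
for tpc in 1 2 3 4; do
  cores=$((threads / tpc))
  echo "${threads} threads could be ${cores} cores at ${tpc} threads/core"
done
```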
But... How can I collect results like that?
If you’re using the Intel OpenMP compilers, it’s easy, since we provide an environment variable to allow you to restrict the hardware used by the OpenMP program.
It’s called KMP_HW_SUBSET (see note 1). By setting KMP_HW_SUBSET=1t you restrict your code to one thread/core. You can then use OMP_NUM_THREADS to choose the number of threads to execute, or you can do both things in one go by saying KMP_HW_SUBSET=24c,1t (run 24 threads, each on a separate core). This second form is simpler, since it avoids the danger of accidentally creating an imbalanced allocation (consider KMP_HW_SUBSET=2t OMP_NUM_THREADS=3, which necessarily places two threads on one core and one thread on another). By specifying the number of cores and threads/core together in KMP_HW_SUBSET you can’t create that imbalance.
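As a sketch of how this fits into an experiment script (the binary name ./my_app is a placeholder, and the core counts assume a 24-core processor):

```shell
#!/bin/sh
# Scaling sweep at one thread per core: vary the core count via KMP_HW_SUBSET.
for cores in 1 2 4 8 12 24; do
  KMP_HW_SUBSET="${cores}c,1t"
  export KMP_HW_SUBSET
  echo "Measuring with KMP_HW_SUBSET=${KMP_HW_SUBSET}"
  # ./my_app >> results_1tpc.txt   # placeholder for the real experiment
done
```

Repeating the sweep with ${cores}c,2t (and, on Xeon Phi processors, ${cores}c,4t) gives one curve per threads/core setting, ready to plot against cores on the X-axis.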
Similarly, you can use KMP_HW_SUBSET=12c,2t (run 24 threads on 12 cores with two threads/core).
Or (on Xeon Phi processors) KMP_HW_SUBSET=1s,6c,4t (run 24 threads on 6 cores with four threads/core inside a single socket).
If you are working with multi-socket Xeon machines, you need to limit sockets as well as cores and threads, since the default behaviour of KMP_HW_SUBSET is to treat unspecified qualifiers as meaning "all of them". That is convenient, since you can use KMP_HW_SUBSET=1t or KMP_HW_SUBSET=2t to run on all cores in the machine, but more confusing when KMP_HW_SUBSET=2c,2t gives you eight cores in a four-socket machine!
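A sketch of that arithmetic, assuming a four-socket machine (the socket count is an assumption for illustration):

```shell
#!/bin/sh
# Unspecified qualifiers in KMP_HW_SUBSET mean "all of them",
# so 2c,2t selects 2 cores on EVERY socket of an assumed 4-socket machine.
sockets=4
cores_all_sockets=$((2 * sockets))   # KMP_HW_SUBSET=2c,2t    -> 8 cores in total
cores_one_socket=2                   # KMP_HW_SUBSET=1s,2c,2t -> 2 cores in total
echo "2c,2t uses ${cores_all_sockets} cores; 1s,2c,2t uses ${cores_one_socket}"
```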
Note that in all of these cases you are guaranteed that you have the same number of threads on each core, and are thus ensuring that all of your threads will naturally run at the same speed.
If you simply use OMP_NUM_THREADS and KMP_AFFINITY without also using KMP_HW_SUBSET, you very rapidly start measuring cases where the number of cores being used is hugely different between the two affinities. For instance, if I use OMP_NUM_THREADS=48 KMP_AFFINITY=scatter on a 60-core Intel® Xeon Phi™ processor, I’ll be using 48 cores each with one thread, whereas if I use OMP_NUM_THREADS=48 KMP_AFFINITY=compact, I’ll use 12 cores with four threads on each. It’s unsurprising that the two affinities therefore show very different performance for "48 threads", since one of them is using four times the hardware resources of the other!
You can (and probably should) still experiment with an affinity setting to choose how the threads are enumerated inside your OpenMP code when you use KMP_HW_SUBSET, because this can make a difference. However, by ensuring that you use a consistent set of hardware resources with each affinity type, the difference you see will be the real one (which threads are operating on data that is nearby in memory, and so can constructively share a cache), not the spurious one shown above, where "scatter" affinity uses four times the hardware resources of "compact".
Note 1: In earlier compilers the environment variable was called KMP_PLACE_THREADS. The name was changed to KMP_HW_SUBSET since the effect is not to place threads, but to restrict the hardware used by the code.
Documentation on KMP_HW_SUBSET (scroll down or search in the page to find it).