In my previous article, “Where are my threads?”, I briefly described the usage models for the Sample Over Time (SOT) feature of Intel® VTune™ Performance Analyzer. I also explained how VTune analyzer can help to break down the events sampled per core/processors. One of the key benefits of SOT that I highlighted is how it can help detect scheduling issues. In this article, I will explain this benefit in more detail.
Figure 1 shows the SOT view of a multi-threaded application that is threaded using OpenMP (http://www.openmp.org). This particular application was executed and analyzed on an Intel Mobile Core 2 Duo T7200-based system. Throughout the article I use the RS_UOPS_DISPATCHED.CYCLES_NONE event (aka stall cycles) instead of CPU_CLK_UNHALTED.CORE (aka clockticks) in order to familiarize you with the stall-cycle event.
Note: CPU_CLK_UNHALTED.CORE ~= RS_UOPS_DISPATCHED.CYCLES_ANY + RS_UOPS_DISPATCHED.CYCLES_NONE
On Core 2 architecture, cycles dispatching μops can be counted with the RS_UOPS_DISPATCHED.CYCLES_ANY event while cycles where no μops were dispatched (stalls) can be counted with the RS_UOPS_DISPATCHED.CYCLES_NONE event.
From Figure 1, we can easily tell that this particular application is running almost identically on each core. However, what we don’t know is how the threads are scheduled on the two available cores. Selecting the Thread view while the CPU button is also selected gives us insight into the execution and scheduling of these threads (Figure 2). A closer look at thread9 and thread59 in Figure 2 reveals how the operating system (OS), Windows XP SP3 in this case, schedules the threads on both cores: each thread runs almost the same amount of time on each core.
You can zoom in to any region on the timeline by selecting the region of interest with the mouse and then choosing "Zoom In" from the context menu (right-click menu). Figure 3, which shows the zoomed region (0–1.8 secs), reveals how threads are actually tossed back and forth between the cores. The OS scheduler simply doesn’t keep the threads on the same core (e.g., Thread9 on Core 0 and Thread59 on Core 1, or vice versa). This scheduling pattern might not be an issue on such a system, since both cores share the same second-level cache, but on multi-socket systems it will be a problem.
At this point, it is important to introduce the concept of thread affinity. Thread affinity restricts execution of certain threads to a subset of the physical processing units in a multiprocessor computer. Depending on the topology of the machine, thread affinity can have a dramatic effect on the execution speed of an application. However, you should have a good reason, and be cautious, before interfering with the OS scheduler's ability to schedule threads effectively across processors and cores. Most recent operating systems and their schedulers have improved significantly; generally speaking, modern schedulers perform efficiently.
The Intel compiler’s OpenMP runtime library has the ability to bind OpenMP threads to physical processing units. Thread affinity is supported on Windows systems and on Linux systems whose kernels support thread affinity. There are three types of interfaces you can use to specify this binding, collectively referred to as the Intel OpenMP Thread Affinity Interface. The affinity types supported by the Intel OpenMP runtime library are: none (default), compact, disabled, explicit, and scatter.
You select an affinity type by setting the KMP_AFFINITY environment variable before launching the application:

set KMP_AFFINITY=scatter or set KMP_AFFINITY=compact

For this exercise and on this particular system, choosing scatter or compact makes no difference: with only two threads on a single dual-core socket, both policies end up placing one thread on each core. Please see the Intel compiler's OpenMP documentation for more details.
By setting an affinity type, we are changing the default behavior of the Intel compiler’s OpenMP runtime library. By default, the runtime library does not bind OpenMP threads to particular thread contexts, so the OS scheduler freely decides how to schedule each thread.
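If you want to see which processing units the runtime actually bound each thread to, the Intel OpenMP runtime also accepts a verbose modifier in KMP_AFFINITY that prints the binding decisions at startup (shown here with the same Windows `set` syntax as above; the exact message format depends on the runtime version, and `myapp.exe` is a placeholder name):

```shell
set KMP_AFFINITY=verbose,scatter
myapp.exe
```

This is a convenient cross-check on what the SOT timeline shows you after the fact.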
After setting the affinity and running the application under VTune analyzer, we get the SOT results as shown in Figure 4. Comparing Figure 4 and Figure 1 (where no affinity was set) shows no significant differences; the two results look almost identical. However, the difference lies in the details. Zooming in on the timeline shown in Figure 4 will give us an idea of what has changed from the previous run.
Figure 5 shows that threads 17 and 64 each stayed on a single core. Thread17 briefly ran on both Core 0 and Core 1 when it was first scheduled, but it stayed on Core 0 for the remainder of the run.