Identifying OS thread migration using Intel® VTune™ Amplifier XE

What is thread migration? 

Modern operating systems use a scheduler to assign application threads, also called software (SW) threads, to processor cores. The scheduler chooses where to place each thread based on factors such as system state and system policies. A SW thread may execute on a core for some period of time before being swapped out to wait, for example because it is blocked on I/O. If another SW thread is ready, it may be given a chance to execute on this core. When the original SW thread is ready to execute again, the scheduler may migrate it to another core to ensure timely execution.

Migration is a problem on modern architectures because it separates the thread from the data it has already fetched into the caches of the previous core, which results in longer data access latencies. The problem is amplified on Non-Uniform Memory Access (NUMA) architectures, where each processor has its own local memory that it can access directly with a distinct performance advantage. On a NUMA system, when a SW thread is migrated to a core on another processor, the data stored in the earlier processor's local memory becomes remote and memory access times increase significantly. Thread migration can therefore hurt performance, so it is important to identify whether it is occurring in your application.

How to identify thread migration 

Intel VTune Amplifier XE can be used to identify SW thread migration in applications running on Intel architectures, using either the GUI or the command line interface. In both cases, start by running a hotspot analysis or lightweight hotspot analysis on your application.

To identify thread migration in the Intel VTune Amplifier XE GUI, open the result and switch the grouping to “Core/Thread/Function/Call Stack”. In general, you want the total number of SW threads to be less than or equal to the total number of hardware threads supported by the CPU, and you want those threads to be distributed equally across the cores. If you see more than the expected number of SW threads under any core in your result, thread migration is occurring in your application. In the example result, 8 OpenMP worker threads executed on Core_6 instead of the expected 2 (this Intel® Xeon® processor supports two hardware threads per core). This indicates thread migration.

Alternatively, you can view these results directly from the command line by using the following command:
$ /opt/intel/vtune_amplifier_xe/bin64/amplxe-cl -group-by thread,cpuid -report hotspots -r <results_directory> -s "H/W Context" -q | less

Command Line Interface:
Thread                          H/W Context  CPU Time:Self
------------------------------  -----------  -------------
OMP Worker Thread #5 (0x3d86)    cpu_0                0.004
matmul-intel64 (0x3d52)          cpu_1                0.013
OMP Worker Thread #15 (0x3d90)   cpu_10               2.418
matmul-intel64 (0x3d52)          cpu_10               2.023
OMP Worker Thread #8 (0x3d89)    cpu_10               0.687
OMP Worker Thread #13 (0x3d8e)   cpu_10               0.097
OMP Worker Thread #6 (0x3d87)    cpu_10               0.065
OMP Worker Thread #4 (0x3d85)    cpu_10               0.059
OMP Worker Thread #1 (0x3d82)    cpu_10               0.048
OMP Worker Thread #9 (0x3d8a)    cpu_10               0.034
OMP Worker Thread #11 (0x3d8c)   cpu_10               0.009

As in the GUI view, the report shows a large number of OpenMP worker threads running on cpu_10, which indicates thread migration.
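
For longer reports, the same check can be automated. The following Python sketch is not part of Intel VTune Amplifier XE; it assumes the report text has the three-column layout shown above, that each H/W context (cpu_N) is one logical CPU, and that at most one SW thread is expected per context. The function names and the max_threads parameter are hypothetical and chosen for illustration only.

```python
import re
from collections import defaultdict

# One data row: thread name, H/W context (cpu_N), CPU time.
# Header and separator lines do not match and are skipped.
ROW = re.compile(
    r"^(?P<thread>.+?)\s+(?P<ctx>cpu_\d+)\s+(?P<time>\d+(?:\.\d+)?)\s*$"
)


def parse_rows(report_text):
    """Yield (thread, context, cpu_time) for every data row in the report."""
    for line in report_text.splitlines():
        match = ROW.match(line)
        if match:
            yield match.group("thread"), match.group("ctx"), float(match.group("time"))


def crowded_contexts(report_text, max_threads=1):
    """Return {context: thread count} for contexts with too many SW threads."""
    threads = defaultdict(set)
    for thread, ctx, _ in parse_rows(report_text):
        threads[ctx].add(thread)
    return {ctx: len(t) for ctx, t in threads.items() if len(t) > max_threads}


def migrated_threads(report_text):
    """Return {thread: set of contexts} for threads seen on several contexts."""
    contexts = defaultdict(set)
    for thread, ctx, _ in parse_rows(report_text):
        contexts[thread].add(ctx)
    return {t: c for t, c in contexts.items() if len(c) > 1}


# Excerpt of the report shown above, used as sample input.
SAMPLE = """\
Thread                          H/W Context  CPU Time:Self
------------------------------  -----------  -------------
OMP Worker Thread #5 (0x3d86)    cpu_0                0.004
matmul-intel64 (0x3d52)          cpu_1                0.013
OMP Worker Thread #15 (0x3d90)   cpu_10               2.418
matmul-intel64 (0x3d52)          cpu_10               2.023
OMP Worker Thread #8 (0x3d89)    cpu_10               0.687
OMP Worker Thread #13 (0x3d8e)   cpu_10               0.097
OMP Worker Thread #6 (0x3d87)    cpu_10               0.065
OMP Worker Thread #4 (0x3d85)    cpu_10               0.059
OMP Worker Thread #1 (0x3d82)    cpu_10               0.048
OMP Worker Thread #9 (0x3d8a)    cpu_10               0.034
OMP Worker Thread #11 (0x3d8c)   cpu_10               0.009
"""

if __name__ == "__main__":
    for ctx, count in sorted(crowded_contexts(SAMPLE).items()):
        print(f"{ctx}: {count} SW threads")
    for thread, ctxs in sorted(migrated_threads(SAMPLE).items()):
        print(f"{thread} ran on: {', '.join(sorted(ctxs))}")
```

On the excerpt above, this flags cpu_10 with 9 SW threads and reports that matmul-intel64 (0x3d52) ran on both cpu_1 and cpu_10. To check a full result, save the command output to a file and pass its contents to these functions instead of SAMPLE.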

How to correct thread migration 

You can correct the effects of thread migration by setting thread affinity. Thread affinity restricts the execution of certain threads to a subset of the physical processing units in a multiprocessor computer. The Intel® OpenMP runtime library can bind OpenMP threads to physical processing units, and it provides the KMP_AFFINITY and KMP_PLACE_THREADS environment variables for setting the thread affinity of your application.
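
As an illustration, the following Python sketch launches an application with KMP_AFFINITY set in its environment. It is a minimal sketch under stated assumptions: the helper names are hypothetical, the value "granularity=fine,compact" is only one possible setting and not a recommendation from this article, and the variable has an effect only if the application is linked against the Intel OpenMP runtime.

```python
import os
import subprocess


def openmp_affinity_env(affinity="granularity=fine,compact", base_env=None):
    """Return a copy of the environment with KMP_AFFINITY set.

    The base environment (os.environ by default) is not modified.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["KMP_AFFINITY"] = affinity
    return env


def run_with_affinity(command, affinity="granularity=fine,compact"):
    """Run `command` (a list of strings) with the affinity setting applied."""
    return subprocess.run(command, env=openmp_affinity_env(affinity), check=True)


if __name__ == "__main__":
    # Adding the "verbose" modifier asks the runtime to print the bindings
    # it applies, which helps confirm that the setting took effect.
    env = openmp_affinity_env("verbose,granularity=fine,compact")
    print("KMP_AFFINITY=" + env["KMP_AFFINITY"])
    # Hypothetical application path; replace with your own binary:
    # run_with_affinity(["./matmul-intel64"])
```

After rerunning the analysis on the pinned application, the per-core thread counts in the result should show whether the migration has been reduced.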
