What is thread migration?
Modern operating systems use a scheduler to assign application threads, also called software (SW) threads, to processor cores. The scheduler chooses where to place each thread based on factors such as system state and system policies. A SW thread may execute on a core for some time before being swapped out to wait, for example because it is blocked on I/O. While it waits, another SW thread, if one is available, may be given a chance to execute on that core. When the original SW thread is ready to run again, the scheduler may migrate it to a different core to ensure timely execution.
This migration is a problem on modern architectures because it separates the thread from data that has already been fetched into the original core's caches, resulting in longer data access latencies. The problem is amplified on Non-Uniform Memory Access (NUMA) architectures, where each processor has its own local memory that it can access faster than memory attached to other processors. When a SW thread is migrated to a core on another processor, the data in the earlier processor's local memory becomes remote and memory access times increase significantly. Thread migration can therefore hurt performance, so it is important to identify whether it is occurring in your application.
How to identify thread migration
Intel VTune Amplifier XE can be used to identify SW thread migration in applications running on Intel architectures. You can use either the Intel VTune Amplifier XE GUI or the command line interface. To identify OS thread migration, run a hotspot analysis or lightweight hotspot analysis on your application, then view the results grouped by thread and H/W context.
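The analysis has to be collected before any report can be generated. A minimal sketch of a command-line collection follows. The result directory name `r001hs` and the application path `./matmul-intel64` are placeholders chosen for illustration, and it assumes `amplxe-cl` is already on your PATH.

```shell
# Placeholders for illustration only; substitute your own values.
RESULT_DIR=r001hs
APP=./matmul-intel64

# Collect a hotspots analysis if the VTune command-line tool is on PATH.
if command -v amplxe-cl >/dev/null 2>&1; then
  amplxe-cl -collect hotspots -r "$RESULT_DIR" -- "$APP"
else
  echo "amplxe-cl not found on PATH; skipping collection" >&2
fi
```

The same result directory is then passed to the report command with `-r`.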
Command Line Interface:
$ /opt/intel/vtune_amplifier_xe/bin64/amplxe-cl -group-by thread,cpuid -report hotspots -r <results_directory> -s "H/W Context" -q | less
Thread                          H/W Context  CPU Time:Self
------------------------------  -----------  -------------
OMP Worker Thread #5 (0x3d86)   cpu_0        0.004
matmul-intel64 (0x3d52)         cpu_1        0.013
OMP Worker Thread #15 (0x3d90)  cpu_10       2.418
matmul-intel64 (0x3d52)         cpu_10       2.023
OMP Worker Thread #8 (0x3d89)   cpu_10       0.687
OMP Worker Thread #13 (0x3d8e)  cpu_10       0.097
OMP Worker Thread #6 (0x3d87)   cpu_10       0.065
OMP Worker Thread #4 (0x3d85)   cpu_10       0.059
OMP Worker Thread #1 (0x3d82)   cpu_10       0.048
OMP Worker Thread #9 (0x3d8a)   cpu_10       0.034
OMP Worker Thread #11 (0x3d8c)  cpu_10       0.009
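In this report, a large number of OpenMP worker threads have accumulated CPU time on cpu_10, and the main matmul-intel64 thread appears on both cpu_1 and cpu_10. A thread that shows up on more than one H/W context has been migrated during the run.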
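Scanning a long report by eye is error-prone, so the check can be scripted. The sketch below is not part of VTune. `migrated_threads` is a hypothetical helper that assumes each data row ends with a `cpu_N` context followed by a CPU time, as in the report above, and prints every thread that appears on more than one H/W context.

```shell
# Hypothetical helper: list threads seen on more than one H/W context.
# Reads report rows of the form "<thread name> cpu_N <seconds>" on stdin.
migrated_threads() {
  awk '
    NF >= 3 && $(NF-1) ~ /^cpu_[0-9]+$/ {
      cpu = $(NF-1)
      # The thread name is everything before the last two fields.
      name = $1
      for (i = 2; i <= NF - 2; i++) name = name " " $i
      # Count each (thread, cpu) pair only once.
      if (!((name, cpu) in seen)) { seen[name, cpu] = 1; count[name]++ }
    }
    END {
      for (n in count) if (count[n] > 1) print n ": " count[n] " contexts"
    }
  '
}

# Sample rows copied from the report above.
printf '%s\n' \
  'OMP Worker Thread #5 (0x3d86)   cpu_0   0.004' \
  'matmul-intel64 (0x3d52)         cpu_1   0.013' \
  'matmul-intel64 (0x3d52)         cpu_10  2.023' | migrated_threads
# prints: matmul-intel64 (0x3d52): 2 contexts
```

In practice the report command's output would be piped into `migrated_threads` in place of the sample rows. Header and separator lines are skipped because their second-to-last field is not a `cpu_N` value.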
How to correct thread migration
You can correct the effects of thread migration by setting the thread affinity. Thread affinity restricts the execution of certain threads to a subset of the physical processing units in a multiprocessor computer. The Intel® OpenMP runtime library can bind OpenMP threads to physical processing units, and you control this binding for your application through the KMP_AFFINITY and KMP_PLACE_THREADS environment variables.
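As a sketch of how these variables might be set before launching an OpenMP application built with the Intel compiler: the values below are illustrative rather than a recommendation for any particular machine, and the application name is a placeholder taken from the report above.

```shell
# Illustrative settings; tune the values for your own topology.
# verbose:          have the runtime print the binding it chose at startup
# granularity=fine: bind each OpenMP thread to a single hardware thread context
# compact:          place consecutive OpenMP threads on adjacent contexts
export KMP_AFFINITY=verbose,granularity=fine,compact

# Alternative shape (assumption: your runtime supports KMP_PLACE_THREADS):
# restrict the run to 4 cores with 2 threads per core.
# export KMP_PLACE_THREADS=4c,2t

# ./matmul-intel64   # placeholder application, run with the settings above
```

If the binding took effect, repeating the hotspots collection and the grouped report should show each thread's CPU time concentrated on a single H/W context.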