Setting thread affinity on SMT or HT enabled systems for better performance


Intel® Hyper-Threading Technology (Intel® HT Technology) is especially effective when each thread performs different types of operations and when there are under-utilized resources on the processor. However, Intel MKL fits neither of these criteria because the threaded portions of the library execute at high efficiencies using most of the available resources and perform identical operations on each thread. You may obtain higher performance by disabling Intel HT Technology.

If you run with Intel HT Technology enabled, performance may be especially impacted if you run on fewer threads than physical cores. Moreover, if, for example, there are two threads to every physical core, the thread scheduler may assign two threads to some cores and ignore the other cores altogether. If you are using the OpenMP* library of the Intel Compiler, read the respective User Guide on how to best set the thread affinity interface to avoid this situation.
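To make this failure mode concrete, the sketch below counts how many distinct physical cores a group of threads occupies. It is only an illustration and not part of Intel MKL or OpenMP. The function name `count_busy_cores` and the input array `core_of_thread` are hypothetical; you would need to collect the core ID of each thread yourself, for example from the operating system's topology information.

```c
#include <stddef.h>

/* Hypothetical helper (not an Intel MKL or OpenMP API).
 * core_of_thread[i] is the physical core ID that thread i runs on;
 * IDs are assumed to lie in 0 .. ncores-1, with ncores <= 1024.
 * Returns the number of distinct cores in use, or -1 on invalid input. */
int count_busy_cores(const int *core_of_thread, size_t nthreads, int ncores)
{
    unsigned char seen[1024] = {0};
    int busy = 0;

    if (core_of_thread == NULL || ncores <= 0 || ncores > 1024)
        return -1;

    for (size_t i = 0; i < nthreads; ++i) {
        int core = core_of_thread[i];
        if (core < 0 || core >= ncores)
            return -1;          /* core ID outside the stated range */
        if (!seen[core]) {
            seen[core] = 1;     /* first thread found on this core */
            ++busy;
        }
    }
    return busy;
}
```

With four cores and four threads, a placement of `{0, 0, 1, 1}` yields 2 busy cores, which is the situation described above: two cores carry two threads each while the other two sit idle. A placement of `{0, 1, 2, 3}` yields 4.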

For Intel MKL, you can obtain the best performance on systems with multi-core processors by requiring that threads do not migrate from core to core. To do this, bind the threads to the CPU cores by setting an affinity mask for each thread. Use one of the following options:

  • OpenMP facilities (if available), for example, the KMP_AFFINITY environment variable using the Intel OpenMP library:

set KMP_AFFINITY=granularity=fine,compact,1,0

  • A system function, for example, sched_setaffinity
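The KMP_AFFINITY value above follows the general form `[modifier,...]type[,permute][,offset]` used by the Intel OpenMP runtime: `granularity=fine` is a modifier, `compact` is the type, and the trailing `1` and `0` are the optional permute and offset arguments. As a minimal sketch of that structure, the hypothetical helper `format_kmp_affinity` below (an illustration, not an Intel API) assembles such a string from its parts.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper (not an Intel API): writes
 * "<modifiers>,<type>,<permute>,<offset>" into buf.
 * Returns the string length on success, or -1 if an argument is
 * NULL or the result does not fit into len bytes. */
int format_kmp_affinity(char *buf, size_t len, const char *modifiers,
                        const char *type, int permute, int offset)
{
    int n;

    if (buf == NULL || modifiers == NULL || type == NULL)
        return -1;

    n = snprintf(buf, len, "%s,%s,%d,%d", modifiers, type, permute, offset);
    if (n < 0 || (size_t)n >= len)
        return -1;              /* encoding error or truncated output */
    return n;
}
```

Calling `format_kmp_affinity(buf, sizeof buf, "granularity=fine", "compact", 1, 0)` reproduces the value shown above. The string still has to reach the environment before the program starts. Note that `set` is the Windows command-prompt syntax; in a Linux shell such as bash, use `export KMP_AFFINITY=...` instead.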

Consider the following performance issue:

  • The system has two sockets with two cores each, for a total of four cores (CPUs).
  • The application sets the number of OpenMP threads to two and calls Intel MKL to perform a Fourier transform. This call takes considerably different amounts of time from run to run.

To resolve this issue, before calling Intel MKL, set an affinity mask for each OpenMP thread using the KMP_AFFINITY environment variable or the sched_setaffinity system function. The following code example, built with the Intel compiler, resolves the issue by setting the affinity mask through the operating system. The code calls the function sched_setaffinity to bind the threads to cores on different sockets. Then the Intel MKL FFT functions are called:

#define _GNU_SOURCE //for using the GNU CPU affinity
/****** (works with the appropriate kernel and glibc) ********/
/******************  Set affinity mask  **************************/
#include <sched.h>
#include <stdio.h>
#include <unistd.h>
#include <omp.h>
int main(void) {
    int NCPUs = sysconf(_SC_NPROCESSORS_CONF);
    printf("Using thread affinity on %i NCPUs\n", NCPUs);
    #pragma omp parallel default(shared)
    {
        cpu_set_t new_mask;
        cpu_set_t was_mask;
        int tid = omp_get_thread_num();

        /* Start from an empty mask before selecting a CPU */
        CPU_ZERO(&new_mask);

        /* 2 packages x 2 cores/pkg x 1 threads/core (4 total cores) */
        CPU_SET(tid==0 ? 0 : 2, &new_mask);

        if (sched_getaffinity(0, sizeof(was_mask), &was_mask) == -1) {
            printf("Error: sched_getaffinity(%d, sizeof(was_mask), &was_mask)\n", tid);
        }
        if (sched_setaffinity(0, sizeof(new_mask), &new_mask) == -1) {
            printf("Error: sched_setaffinity(%d, sizeof(new_mask), &new_mask)\n", tid);
        }
        printf("tid=%d new_mask=%08X was_mask=%08X\n", tid, *(unsigned int*)(&new_mask), *(unsigned int*)(&was_mask));
    }
    /******* Call Intel MKL FFT function  *********/

    // ..................
    return 0;
}

Compile the application with the Intel compiler using the following command:

icc test_application.c -openmp 

where test_application.c is the filename for the application.

After building the application, run it with two threads, for example, by using the OMP_NUM_THREADS environment variable to set the number of threads:

env OMP_NUM_THREADS=2 ./a.out

See the Linux Programmer's Manual (in man pages format) for particulars of the sched_setaffinity function used in the above example.
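The expression `tid==0 ? 0 : 2` in the example hard-codes the two-socket, two-core layout. The sketch below generalizes the same idea: it spreads consecutive thread IDs across sockets first. The function name `scatter_cpu` is hypothetical, and the code assumes that logical CPUs are numbered socket by socket (CPU = socket * cores_per_socket + core). That numbering is an assumption about the machine and should be checked against the actual topology before the result is passed to CPU_SET.

```c
/* Hypothetical helper: choose a logical CPU for OpenMP thread `tid`
 * so that consecutive threads land on different sockets first.
 * Assumes CPUs are numbered socket by socket:
 *     cpu = socket * cores_per_socket + core
 * Returns the CPU number, or -1 on invalid arguments. */
int scatter_cpu(int tid, int sockets, int cores_per_socket)
{
    int socket, core;

    if (tid < 0 || sockets <= 0 || cores_per_socket <= 0)
        return -1;

    socket = tid % sockets;                       /* rotate over sockets */
    core = (tid / sockets) % cores_per_socket;    /* then over cores     */
    return socket * cores_per_socket + core;
}
```

For the system described above (two sockets, two cores each), `scatter_cpu(0, 2, 2)` gives CPU 0 and `scatter_cpu(1, 2, 2)` gives CPU 2, matching the hard-coded choice in the example. Thread IDs beyond the number of cores wrap around and share a CPU with an earlier thread.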

For additional information on how to improve performance on Intel Xeon Phi coprocessors, refer to the Intel MKL Developer Guide.



1 comment

Aleksey Y.

BTW, in the latest version of Composer XE 2013 on Windows, MKL linpack uses "KMP_AFFINITY=nowarnings,compact,granularity=fine", which gives half of peak performance on my HT-enabled Core i7 950.

The Linux version of MKL uses the right option, which you wrote above.
