Thread affinity compiler options and environment variables for Intel® processors


Introduction : Version 11.1 of the compiler adds options that allow thread affinity to be compiled into the executable. The compiled-in affinity overrides any setting of the affinity environment variables KMP_AFFINITY and GOMP_CPU_AFFINITY. This article demonstrates the use of the affinity compiler options and environment variables, which are supported only for Intel® processors.


Version : Intel® C++ and Fortran Compilers for Windows* (versions 11.1.048 or higher)
Intel® C++ and Fortran Compilers for Linux* (versions 11.1.056 or higher)



Application Notes : The thread affinity compiler options are /Qpar-affinity (Windows*) or -par-affinity (Linux*). You must compile the main program with these options for them to have any effect. Further, these options only have an effect if /Qopenmp and/or /Qparallel (Windows*), or -openmp and/or -parallel (Linux*) have also been specified. The /Qopenmp (-openmp) and /Qparallel (-parallel) compiler options are available for both Intel® and non-Intel processors, but may result in more optimizations for Intel® processors than for non-Intel processors.


Obtaining Source Code : The compiler OpenMP samples may be used.


Prerequisites : Thread affinity is supported on Windows* OS systems and versions of Linux* OS systems that have kernel support for thread affinity. The compiler, the OS, and the hardware must all support thread affinity for the thread affinity compiler options and environment variables to have any effect.


Configuration Set Up : For compiled-in affinity to be effective, the processor configuration of the deployment target must be known, including number of processor packages, number of cores per package, and number of threads per core.
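
One way to collect this information on a Linux* deployment target is to read the CPU topology that the kernel exports through sysfs. The sketch below is only an illustration. It assumes the kernel provides /sys/devices/system/cpu/cpuN/topology/physical_package_id and core_id, which very old kernels may not. The function name list_topology and its output format are hypothetical, and the core IDs reported by the kernel are hardware IDs that need not be numbered consecutively.

```shell
#!/bin/sh
# list_topology [CPU_ROOT]
# Print one line per OS proc ID: "<os_proc> <package> <core>".
# CPU_ROOT defaults to the sysfs CPU directory. The function name and
# the output format are illustrative only.
list_topology() {
    root="${1:-/sys/devices/system/cpu}"
    for dir in "$root"/cpu[0-9]*; do
        # Skip entries that expose no topology data (for example, offline CPUs).
        [ -r "$dir/topology/core_id" ] || continue
        id="${dir##*/cpu}"
        pkg=$(cat "$dir/topology/physical_package_id")
        core=$(cat "$dir/topology/core_id")
        echo "$id $pkg $core"
    done | sort -n
}

list_topology
```

Two OS proc IDs that report the same package and core values are hyper-threads of one core.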


Source Code Changes : N/A


Building and Running the Application : The thread affinity compiler options have the same syntax as the KMP_AFFINITY environment variable. The following example uses the 'explicit' affinity type, which assigns threads to a list of operating system (OS) processor (proc) IDs. This provides the most precise control, but requires knowledge of which processing elements the OS proc IDs represent. The optional 'verbose' modifier is used to display the OS proc IDs when the program is run.

The test machine contains two Intel® Xeon® X5560 (Nehalem) processors (2.8 GHz clock speed, 8 MB L3 cache). Each processor has 4 hyper-threaded cores, giving 8 cores and 16 logical processors in total. The operating system is SLES 10 x86_64 Linux version 2.6.16.60-0.21-smp, and the test machine has 6 GB of memory.

The program is run on only 4 threads to show clearly how thread placement can substantially affect runtime performance, and how the affinity compiler options override the KMP_AFFINITY environment variable setting.

1) Set the number of threads to 4 (this could also be done programmatically), and set KMP_AFFINITY:
> export OMP_NUM_THREADS=4
> export KMP_AFFINITY=verbose,granularity=thread,proclist=[0,1,2,3],explicit

2) First, compile and run the program *WITHOUT* -par-affinity:
> icc -V
Intel® C Intel® 64 Compiler Professional for applications running on Intel® 64, Version 11.1 Build x Package ID: l_cproc_p_11.1.072

> icc -openmp matmul2_thread_util.cpp -o matmul2_thread_util.x
> time ./matmul2_thread_util.x

Using omp_get_wtime() for wall clock time
Problem size: c(1600,6400) = a(1600,3200)*b(3200,6400)
OMP: Info #204: KMP_AFFINITY: decoding cpuid leaf 11 APIC ids.
OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #156: KMP_AFFINITY: 16 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 2 packages x 4 cores/pkg x 2 threads/core (8 total cores)
OMP: Info #206: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 0 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 1 maps to package 0 core 0 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 2 maps to package 0 core 1 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 3 maps to package 0 core 1 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 4 maps to package 0 core 2 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 5 maps to package 0 core 2 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 6 maps to package 0 core 3 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 7 maps to package 0 core 3 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 8 maps to package 1 core 0 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 9 maps to package 1 core 0 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 10 maps to package 1 core 1 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 11 maps to package 1 core 1 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 12 maps to package 1 core 2 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 13 maps to package 1 core 2 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 14 maps to package 1 core 3 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 15 maps to package 1 core 3 thread 1

Calculating product 1 time(s)
We are using 4 thread(s)...

OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {0}
OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {1}
OMP: Info #147: KMP_AFFINITY: Internal thread 2 bound to OS proc set {2}
OMP: Info #147: KMP_AFFINITY: Internal thread 3 bound to OS proc set {3}

Time for thread 0 = 21.6395 sec
Time for thread 1 = 22.1331 sec
Time for thread 3 = 22.4741 sec
Time for thread 2 = 22.4742 sec
Matmul kernel wall clock time = 22.4742 sec
MFlops = 2916.06

Utilization for thread 0 = 96.2863 %
Utilization for thread 1 = 98.4826 %
Utilization for thread 2 = 99.9999 %
Utilization for thread 3 = 99.9998 %
Expected value for each matrix element is 5121600
Checking that all 10240000 elements of c[i][j] = 5121600...done

===>>> Solution Validates <<<===

real 0m22.696s
user 1m27.597s
sys 0m0.080s
>

In the above, the internal threads were bound to OS procs [0,1,2,3], which are the two hyper-threads on each of cores 0 and 1 of package (processor) 0. No use was made of cores 2 or 3 on package 0 (OS procs 4-7), nor of any of the cores on package 1 (OS procs 8-15).
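
The relationship between OS proc IDs and cores on this machine can be sketched with a few lines of shell arithmetic. This is an illustration only: it hard-codes the uniform layout reported by the verbose output (4 cores per package, 2 threads per core, with the hyper-threads of a core numbered consecutively), and the function names map_proc and count_cores are hypothetical. Other machines may number OS procs differently, so check the verbose output before reusing it.

```shell
#!/bin/sh
# Layout taken from the verbose output of the test machine. This is an
# assumption for any other system.
CORES_PER_PKG=4
THREADS_PER_CORE=2

# map_proc ID: print the package, core, and thread for one OS proc ID.
map_proc() {
    pkg=$(( $1 / ($CORES_PER_PKG * $THREADS_PER_CORE) ))
    core=$(( ($1 / $THREADS_PER_CORE) % $CORES_PER_PKG ))
    thr=$(( $1 % $THREADS_PER_CORE ))
    echo "OS proc $1 maps to package $pkg core $core thread $thr"
}

# count_cores ID...: print how many distinct physical cores a proclist uses.
count_cores() {
    for id in "$@"; do
        # Machine-wide core index: both hyper-threads of a core share it.
        echo $(( $id / $THREADS_PER_CORE ))
    done | sort -u | wc -l | tr -d ' '
}

map_proc 3            # prints: OS proc 3 maps to package 0 core 1 thread 1
count_cores 0 1 2 3   # prints: 2
count_cores 0 2 4 6   # prints: 4
```

A proclist of 0,1,2,3 therefore shares two cores among four threads, while a list such as 0,2,4,6 gives each thread its own core.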

3) Now, compile with -par-affinity and run the program. Two points are illustrated. First, the -par-affinity setting overrides the KMP_AFFINITY setting, which is still proclist=[0,1,2,3] from step 1. Second, one thread is assigned to each core in package 0, which gives a substantial program speedup compared with running two threads on each of two hyper-threaded cores.

> icc -openmp matmul2_thread_util.cpp -o matmul2_thread_util.x -par-affinity=verbose,granularity=thread,proclist=[0,2,4,6],explicit

> time ./matmul2_thread_util.x

Using omp_get_wtime() for wall clock time
Problem size: c(1600,6400) = a(1600,3200)*b(3200,6400)
OMP: Info #204: KMP_AFFINITY: decoding cpuid leaf 11 APIC ids.
OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #156: KMP_AFFINITY: 16 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 2 packages x 4 cores/pkg x 2 threads/core (8 total cores)
OMP: Info #206: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 0 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 1 maps to package 0 core 0 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 2 maps to package 0 core 1 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 3 maps to package 0 core 1 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 4 maps to package 0 core 2 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 5 maps to package 0 core 2 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 6 maps to package 0 core 3 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 7 maps to package 0 core 3 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 8 maps to package 1 core 0 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 9 maps to package 1 core 0 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 10 maps to package 1 core 1 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 11 maps to package 1 core 1 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 12 maps to package 1 core 2 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 13 maps to package 1 core 2 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 14 maps to package 1 core 3 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 15 maps to package 1 core 3 thread 1

OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {0}
OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {2}
OMP: Info #147: KMP_AFFINITY: Internal thread 2 bound to OS proc set {4}
OMP: Info #147: KMP_AFFINITY: Internal thread 3 bound to OS proc set {6}

Calculating product 1 time(s)
We are using 4 thread(s)...

Time for thread 0 = 11.7956 sec
Time for thread 3 = 11.8539 sec
Time for thread 2 = 11.854 sec
Time for thread 1 = 11.854 sec
Matmul kernel wall clock time = 11.854 sec
MFlops = 5528.6

Utilization for thread 0 = 99.5075 %
Utilization for thread 1 = 99.9999 %
Utilization for thread 2 = 99.9998 %
Utilization for thread 3 = 99.9995 %
Expected value for each matrix element is 5121600
Checking that all 10240000 elements of c[i][j] = 5121600...done

===>>> Solution Validates <<<===

real 0m12.000s
user 0m47.119s
sys 0m0.048s
>

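Verifying Correctness : The 'verbose' modifier of -par-affinity can be used to verify that the threads execute on the desired processing elements. In the runs above, the "Internal thread N bound to OS proc set {...}" lines report the binding of each thread.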
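
For repeated runs, the binding lines can also be checked mechanically. The following sketch is illustrative only: the function name bound_procs is hypothetical, and it assumes the message wording shown in the runs above and granularity=thread, so that each thread is bound to a single OS proc.

```shell
#!/bin/sh
# bound_procs: read verbose affinity output on stdin and print the OS proc
# of each internal thread, in thread order, as a comma-separated list.
bound_procs() {
    sed -n 's/.*Internal thread \([0-9][0-9]*\) bound to OS proc set {\([0-9,]*\)}.*/\1 \2/p' |
        sort -n |
        awk '{ printf "%s%s", sep, $2; sep = "," } END { print "" }'
}

# Sample lines copied from the -par-affinity run above.
bound_procs <<'EOF'
OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {0}
OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {2}
OMP: Info #147: KMP_AFFINITY: Internal thread 2 bound to OS proc set {4}
OMP: Info #147: KMP_AFFINITY: Internal thread 3 bound to OS proc set {6}
EOF
# prints: 0,2,4,6
```

In practice the program output would be piped in, for example with ./matmul2_thread_util.x 2>&1 | bound_procs, and the result compared with the intended proclist.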


Benefits : Compiled-in thread affinity allows a multi-threaded executable to be tuned for a specific processor configuration without having to use environment variables or other thread affinity mechanisms. This suits deployment on homogeneous targets, such as the nodes commonly used in clusters.

In this example, running 4 threads on 4 cores reduced the elapsed time from 22.696 s to 12.000 s, a speedup of 22.696/12.000 = 1.89, or about 89% faster than running 4 threads on 2 cores using hyper-threading.
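
The arithmetic can be reproduced with awk. The sketch below is illustrative: the function names speedup and mflops are hypothetical, and the MFlops check assumes the conventional count of 2*m*k*n floating-point operations for the product of an m x k matrix and a k x n matrix, which may or may not be the exact formula used by the sample program. Under that assumption the results agree with the reported 2916.06 and 5528.6 MFlops to within rounding.

```shell
#!/bin/sh
# speedup BASELINE_SECONDS TUNED_SECONDS: ratio of the two elapsed times.
speedup() {
    LC_ALL=C awk -v base="$1" -v tuned="$2" \
        'BEGIN { printf "%.2f\n", base / tuned }'
}

# mflops M K N SECONDS: rate for c(M,N) = a(M,K) * b(K,N), assuming
# 2*M*K*N floating-point operations in total.
mflops() {
    LC_ALL=C awk -v m="$1" -v k="$2" -v n="$3" -v t="$4" \
        'BEGIN { printf "%.1f\n", 2 * m * k * n / t / 1000000 }'
}

speedup 22.696 12.000            # prints: 1.89
mflops 1600 3200 6400 22.4742    # prints: 2916.1
mflops 1600 3200 6400 11.854     # prints: 5528.6
```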


Known Issues or Limitations : The thread affinity interface is supported only for Intel® processors, and is not supported by Mac OS* X.


Application Notes : For this particular example, a simple matrix multiplication problem, hyper-threading was not as effective as using one thread per processor core. This may or may not be the case for any given program, depending on many factors including cache size and architecture, processor type, and operating system. Experimentation is encouraged!

An alternative way of specifying that each hyper-threaded core is to run only one thread is to limit the total number of threads to the number of cores (or subset of cores) and compile with -par-affinity=scatter.

On Linux* systems (only), environment variable GOMP_CPU_AFFINITY is available as an alternative way of explicitly specifying OS proc IDs. For this to work, the program must be compiled to use the Intel OpenMP compatibility libraries (-openmp-lib compat), which is the default. For example, to specify that the program is to run on OS proc IDs [0,2,4,6]:
export GOMP_CPU_AFFINITY=0,2,4,6

The equivalent way of specifying this using -par-affinity:
-par-affinity=granularity=thread,proclist=[0,2,4,6],explicit

Note that if both KMP_AFFINITY and GOMP_CPU_AFFINITY have been set in the environment, KMP_AFFINITY takes precedence. So on Linux* systems, the thread placement precedence is:
-par-affinity > KMP_AFFINITY > GOMP_CPU_AFFINITY
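
The precedence rule can be written out as a small decision function. This is only a model of the rule stated above, not code from the OpenMP runtime. The function name effective_affinity is hypothetical, and its argument stands for the value given to -par-affinity at compile time (empty if the option was not used).

```shell
#!/bin/sh
# effective_affinity COMPILED_IN
# Print which setting determines thread placement on Linux, following
# -par-affinity > KMP_AFFINITY > GOMP_CPU_AFFINITY.
effective_affinity() {
    if [ -n "$1" ]; then
        printf '%s\n' "-par-affinity=$1"
    elif [ -n "${KMP_AFFINITY:-}" ]; then
        printf '%s\n' "KMP_AFFINITY=$KMP_AFFINITY"
    elif [ -n "${GOMP_CPU_AFFINITY:-}" ]; then
        printf '%s\n' "GOMP_CPU_AFFINITY=$GOMP_CPU_AFFINITY"
    else
        printf '%s\n' "no affinity setting"
    fi
}

KMP_AFFINITY='granularity=thread,proclist=[0,1,2,3],explicit'
GOMP_CPU_AFFINITY='0,2,4,6'

effective_affinity 'granularity=thread,proclist=[0,2,4,6],explicit'
# prints: -par-affinity=granularity=thread,proclist=[0,2,4,6],explicit
effective_affinity ''
# prints: KMP_AFFINITY=granularity=thread,proclist=[0,1,2,3],explicit
```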

Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804
