Developer Guide and Reference

Thread Affinity Interface (Linux* and Windows*)

The Intel® OpenMP* runtime library can bind OpenMP threads to physical processing units. The interface is controlled using the KMP_AFFINITY environment variable. Thread affinity restricts execution of certain threads (virtual execution units) to a subset of the physical processing units in a multiprocessor computer. Depending on the system (machine) topology, application, and operating system, thread affinity can have a dramatic effect on the execution speed of a program.
Thread affinity is supported on Windows* systems and on versions of Linux* systems that have kernel support for thread affinity.
There are three types of interfaces you can use to specify this binding, which are collectively referred to as the Intel OpenMP Thread Affinity Interface:
  • The high-level affinity interface uses an environment variable to determine the machine topology and assigns OpenMP threads to the processors based upon their physical location in the machine. This interface is controlled entirely by the KMP_AFFINITY environment variable.
  • The mid-level affinity interface uses an environment variable to explicitly specify which processors (labeled with integer IDs) are bound to OpenMP threads. This interface provides compatibility with the gcc* GOMP_CPU_AFFINITY environment variable, but you can also invoke it by using the KMP_AFFINITY environment variable. The GOMP_CPU_AFFINITY environment variable is supported on Linux systems only, but users on Windows or Linux systems can use the similar functionality provided by the KMP_AFFINITY environment variable.
  • The low-level affinity interface uses APIs to enable OpenMP threads to make calls into the OpenMP runtime library to explicitly specify the set of processors on which they are to run. This interface is similar in nature to sched_setaffinity and related functions on Linux systems or to SetThreadAffinityMask and related functions on Windows systems. In addition, certain options of the KMP_AFFINITY environment variable affect the behavior of the low-level API interface. For example, you can set the affinity type of KMP_AFFINITY to disabled, which disables the low-level affinity interface, or you can use the KMP_AFFINITY or GOMP_CPU_AFFINITY environment variables to set the initial affinity mask and then retrieve the mask with the low-level API interface.
The following terms are used in this section:
  • The total number of processing elements on the machine is referred to as the number of OS thread contexts.
  • Each processing element is referred to as an Operating System processor, or OS proc.
  • Each OS processor has a unique integer identifier associated with it, called an OS proc ID.
  • The term package refers to a single or multi-core processor chip.
  • The term OpenMP Global Thread ID (GTID) refers to an integer which uniquely identifies all threads known to the Intel OpenMP runtime library. The thread that first initializes the library is given GTID 0. In the normal case, where all other threads are created by the library and there is no nested parallelism, nthreads-var - 1 new threads are created with GTIDs ranging from 1 to nthreads-var - 1, and each thread's GTID is equal to the OpenMP thread number returned by the function omp_get_thread_num(). The high-level and mid-level interfaces rely heavily on this concept; hence, their usefulness is limited in programs containing nested parallelism. The low-level interface does not make use of the concept of a GTID and can be used by programs containing arbitrarily many levels of parallelism.
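For illustration, here is a minimal sketch of that correspondence: in a single, non-nested parallel region, each thread's reported OpenMP thread number matches its GTID.
Example
#include <stdio.h>
#include <omp.h>

int main(void) {
    // In a non-nested parallel region whose threads are created by the
    // library, omp_get_thread_num() equals the thread's GTID.
    #pragma omp parallel
    {
        printf("OpenMP thread number (GTID): %d\n", omp_get_thread_num());
    }
    return 0;
}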
Some environment variables are available for both Intel® microprocessors and non-Intel microprocessors, but may perform additional optimizations for Intel® microprocessors that they do not perform for non-Intel microprocessors.

The KMP_AFFINITY Environment Variable

You must set the KMP_AFFINITY environment variable before the first parallel region, or before certain API calls including omp_get_max_threads(), omp_get_num_procs(), and any affinity API calls, as described in Low Level Affinity API, below.
The KMP_AFFINITY environment variable uses the following general syntax:
Syntax
KMP_AFFINITY=[<modifier>,...]<type>[,<permute>][,<offset>]
For example, to list a machine topology map, specify KMP_AFFINITY=verbose,none to use a modifier of verbose and a type of none.
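The variable is normally set in the shell before launching the program. As an illustrative sketch of an alternative, a program can also set it for itself at startup, provided this happens before the OpenMP runtime initializes; setenv() is POSIX (Linux), and on Windows _putenv() plays the same role.
Example
#include <stdlib.h>
#include <omp.h>

int main(void) {
    // Must run before the first parallel region or any affinity-sensitive
    // API call, because the runtime reads KMP_AFFINITY when it initializes.
    setenv("KMP_AFFINITY", "verbose,granularity=fine,compact", 1);

    #pragma omp parallel
    {
        // Threads in this region are bound according to the setting above.
    }
    return 0;
}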
The following table describes the supported arguments.

Argument: modifier
Default: noverbose,respect,granularity=core
Description: Optional. String consisting of keyword and specifier.
  • granularity=<specifier> takes the following specifiers: fine, thread, core, and tile
  • norespect
  • noverbose
  • nowarnings
  • proclist={<proc-list>}
  • respect
  • verbose
  • warnings
The syntax for <proc-list> is explained in the mid-level affinity interface.
On Windows with multiple processor groups, the norespect affinity modifier is assumed when the process affinity mask equals a single processor group (which is the default on Windows). Otherwise, the respect affinity modifier is used.

Argument: type
Default: none
Description: Required string. Indicates the thread affinity type to use.
  • balanced
  • compact
  • disabled
  • explicit
  • none
  • scatter
  • logical (deprecated; instead use compact, but omit any permute value)
  • physical (deprecated; instead use scatter, possibly with an offset value)
The logical and physical types are deprecated but supported for backward compatibility.

Argument: permute
Default: 0
Description: Optional. Positive integer value. Not valid with type values of explicit, none, or disabled.

Argument: offset
Default: 0
Description: Optional. Positive integer value. Not valid with type values of explicit, none, or disabled.

Affinity Types

Type is the only required argument.
type = none (default)
Does not bind OpenMP threads to particular thread contexts; however, if the operating system supports affinity, the compiler still uses the OpenMP thread affinity interface to determine machine topology. Specify KMP_AFFINITY=verbose,none to list a machine topology map.
type = balanced
Places threads on separate cores until all cores have at least one thread, similar to the scatter type. However, when the runtime must use multiple hardware thread contexts on the same core, the balanced type ensures that the OpenMP thread numbers are close to each other, which scatter does not do. This affinity type is supported on the CPU only for single-socket systems.
The OpenMP* environment variable OMP_PROC_BIND=spread is similar to KMP_AFFINITY=balanced and is available on all platforms, including multi-socket CPU systems.
type = compact
Specifying compact assigns OpenMP thread <n>+1 to a free thread context as close as possible to the thread context where OpenMP thread <n> was placed. In a topology map, the nearer a node is to the root, the more significance the node has when sorting the threads.
type = disabled
Specifying disabled completely disables the thread affinity interfaces. This forces the OpenMP runtime library to behave as if the affinity interface were not supported by the operating system. This includes the low-level API interfaces such as kmp_set_affinity and kmp_get_affinity, which have no effect and return a nonzero error code.
type = explicit
Specifying explicit assigns OpenMP threads to a list of OS proc IDs that have been explicitly specified by using the proclist= modifier, which is required for this affinity type. See Explicitly Specifying OS Processor IDs (GOMP_CPU_AFFINITY).
type = scatter
Specifying scatter distributes the threads as evenly as possible across the entire system. scatter is the opposite of compact; the leaves of the machine topology map are the most significant when sorting the threads.
Deprecated Types: logical and physical
Types logical and physical are deprecated and may become unsupported in a future release. Both are supported for backward compatibility.
For the logical and physical affinity types, a single trailing integer is interpreted as an offset specifier instead of a permute specifier. In contrast, with the compact and scatter types, a single trailing integer is interpreted as a permute specifier.
  • Specifying logical assigns OpenMP threads to consecutive logical processors, which are also called hardware thread contexts. The type is equivalent to compact, except that the permute specifier is not allowed. Thus, KMP_AFFINITY=logical,n is equivalent to KMP_AFFINITY=compact,0,n (this equivalence holds regardless of whether a granularity=fine modifier is present).
  • Specifying physical assigns threads to consecutive physical processors (cores). For systems where there is only a single thread context per core, the type is equivalent to logical. For systems where multiple thread contexts exist per core, physical is equivalent to compact with a permute specifier of 1; that is, KMP_AFFINITY=physical,n is equivalent to KMP_AFFINITY=compact,1,n (regardless of whether a granularity=fine modifier is present). This equivalence means that when the compiler sorts the map, it permutes the innermost level of the machine topology map (presumably the thread context level) to the outermost. This type does not support the permute specifier.
Examples of Types compact and scatter
The following figure illustrates the topology for a machine with two processors, where each processor has two cores and each core has Intel® Hyper-Threading Technology (Intel® HT Technology) enabled.
The following figure also illustrates the binding of OpenMP threads to hardware thread contexts when specifying KMP_AFFINITY=granularity=fine,compact.
Specifying scatter on the same system, the OpenMP threads would be assigned the thread contexts as shown in the following figure, which shows the result of specifying KMP_AFFINITY=granularity=fine,scatter.

permute and offset combinations

For both compact and scatter, permute and offset are allowed; however, if you specify only one integer, the compiler interprets the value as a permute specifier. Both permute and offset default to 0.
The permute specifier controls which levels are most significant when sorting the machine topology map. A value of n for permute makes the n most significant levels of the sort the least significant, inverting the order of significance. The root node of the tree is not considered a separate level for the sort operations.
The offset specifier indicates the starting position for thread assignment.
The following figure illustrates the result of specifying KMP_AFFINITY=granularity=fine,compact,0,3.
Consider the hardware configuration from the previous example, running an OpenMP application that exhibits data sharing between consecutive iterations of loops. We would therefore like consecutive threads to be bound close together, as is done with KMP_AFFINITY=compact, so that communication overhead, cache line invalidation overhead, and page thrashing are minimized. Now, suppose the application also has a number of parallel regions which do not utilize all of the available OpenMP threads. Since a thread normally executes faster on a core where it is not competing for resources with another active thread on the same core, you might want to avoid binding multiple threads to the same core while leaving other cores unused. The following figure illustrates this strategy, using KMP_AFFINITY=granularity=fine,compact,1,0 as the setting.
OpenMP thread n+1 is bound to a thread context as close as possible to OpenMP thread n, but on a different core. Once each core has been assigned one OpenMP thread, the subsequent OpenMP threads are assigned to the available cores in the same order, but on different thread contexts.
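To see how a given permute and offset combination actually maps threads onto OS procs on a particular machine, a small probe program can help. The following is an illustrative sketch; sched_getcpu() is Linux/glibc-specific.
Example
#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>
#include <omp.h>

int main(void) {
    // Run with, e.g., KMP_AFFINITY=granularity=fine,compact,1,0 and
    // compare the reported OS procs against the intended placement.
    #pragma omp parallel
    {
        printf("OpenMP thread %d runs on OS proc %d\n",
               omp_get_thread_num(), sched_getcpu());
    }
    return 0;
}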

Modifier Values for Affinity Types

Modifiers are optional arguments that precede type. If you do not specify a modifier, the noverbose, respect, and granularity=core modifiers are used automatically.
Modifiers are interpreted in order from left to right, and can negate each other. For example, specifying KMP_AFFINITY=verbose,noverbose,scatter is equivalent to setting KMP_AFFINITY=noverbose,scatter, or just KMP_AFFINITY=scatter.
modifier = noverbose (default)
Does not print verbose messages.
modifier = verbose
Prints messages concerning the supported affinity. The messages include information about the number of packages, number of cores in each package, number of thread contexts for each core, and OpenMP thread bindings to physical thread contexts.
Information about binding OpenMP threads to physical thread contexts is indirectly shown in the form of the mappings between hardware thread contexts and the operating system (OS) processor (proc) IDs. The affinity mask for each OpenMP thread is printed as a set of OS processor IDs.
For example, specifying KMP_AFFINITY=verbose,scatter on a dual-core system with two processors, with Intel® Hyper-Threading Technology (Intel® HT Technology) disabled, results in a message listing similar to the following when the program is executed:
Verbose, scatter message
KMP_AFFINITY: Affinity capable, using global cpuid info
KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3}
KMP_AFFINITY: 4 available OS procs - Uniform topology of
KMP_AFFINITY: 2 packages x 2 cores/pkg x 1 threads/core (4 total cores)
KMP_AFFINITY: OS proc to physical thread map ([] => level not in map):
KMP_AFFINITY: OS proc 0 maps to package 0 core 0 [thread 0]
KMP_AFFINITY: OS proc 2 maps to package 0 core 1 [thread 0]
KMP_AFFINITY: OS proc 1 maps to package 3 core 0 [thread 0]
KMP_AFFINITY: OS proc 3 maps to package 3 core 1 [thread 0]
KMP_AFFINITY: Internal thread 0 bound to OS proc set {0}
KMP_AFFINITY: Internal thread 2 bound to OS proc set {2}
KMP_AFFINITY: Internal thread 3 bound to OS proc set {3}
KMP_AFFINITY: Internal thread 1 bound to OS proc set {1}
The verbose modifier generates several standard, general messages. The following table summarizes how to read the messages.

Message String: "affinity capable"
Description: Indicates that all components (compiler, operating system, and hardware) support affinity, so thread binding is possible.

Message String: "using global cpuid info"
Description: Indicates that the machine topology was discovered by binding a thread to each operating system processor and decoding the output of the cpuid instruction.

Message String: "using local cpuid info"
Description: Indicates that the compiler is decoding the output of the cpuid instruction issued by only the initial thread, and is assuming a machine topology based on the number of operating system processors.

Message String: "using /proc/cpuinfo"
Description: Linux only. Indicates that cpuinfo is being used to determine machine topology.

Message String: "using flat"
Description: The operating system processor ID is assumed to be equivalent to the physical package ID. This method of determining machine topology is used if none of the other methods will work, but it may not accurately detect the actual machine topology.

Message String: "uniform topology of"
Description: The machine topology map is a full tree with no missing leaves at any level.

The mapping from operating system processors to thread context IDs is printed next, followed by the binding of OpenMP threads, unless the affinity type is none. The thread level is contained in brackets in the listing shown above, which indicates that there is no representation of the thread context level in the machine topology map. For more information, see Determining Machine Topology.
modifier = granularity
Binding OpenMP threads to particular packages and cores will often result in a performance gain on systems with Intel processors with Intel® Hyper-Threading Technology (Intel® HT Technology) enabled; however, it is usually not beneficial to bind each OpenMP thread to a particular thread context on a specific core. Granularity describes the lowest topology levels within which OpenMP threads are allowed to float.
This modifier supports the following additional specifiers.

Specifier: core
Description: Default. Allows all the OpenMP threads bound to a core to float between the different thread contexts.

Specifier: fine or thread
Description: The finest granularity level. Causes each OpenMP thread to be bound to a single thread context. The two specifiers are functionally equivalent.

Specifier: tile
Description: Allows all the OpenMP threads bound to a tile to float between the different thread contexts of the cores that make up the tile.
Specifying KMP_AFFINITY=verbose,granularity=core,compact on the same dual-core system with two processors as in the previous section, but with Intel® Hyper-Threading Technology (Intel® HT Technology) enabled, results in a message listing similar to the following when the program is executed:
Verbose, granularity=core,compact message
KMP_AFFINITY: Affinity capable, using global cpuid info
KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7}
KMP_AFFINITY: 8 available OS procs - Uniform topology of
KMP_AFFINITY: 2 packages x 2 cores/pkg x 2 threads/core (4 total cores)
KMP_AFFINITY: OS proc to physical thread map ([] => level not in map):
KMP_AFFINITY: OS proc 0 maps to package 0 core 0 thread 0
KMP_AFFINITY: OS proc 4 maps to package 0 core 0 thread 1
KMP_AFFINITY: OS proc 2 maps to package 0 core 1 thread 0
KMP_AFFINITY: OS proc 6 maps to package 0 core 1 thread 1
KMP_AFFINITY: OS proc 1 maps to package 3 core 0 thread 0
KMP_AFFINITY: OS proc 5 maps to package 3 core 0 thread 1
KMP_AFFINITY: OS proc 3 maps to package 3 core 1 thread 0
KMP_AFFINITY: OS proc 7 maps to package 3 core 1 thread 1
KMP_AFFINITY: Internal thread 0 bound to OS proc set {0,4}
KMP_AFFINITY: Internal thread 1 bound to OS proc set {0,4}
KMP_AFFINITY: Internal thread 2 bound to OS proc set {2,6}
KMP_AFFINITY: Internal thread 3 bound to OS proc set {2,6}
KMP_AFFINITY: Internal thread 4 bound to OS proc set {1,5}
KMP_AFFINITY: Internal thread 5 bound to OS proc set {1,5}
KMP_AFFINITY: Internal thread 6 bound to OS proc set {3,7}
KMP_AFFINITY: Internal thread 7 bound to OS proc set {3,7}
The affinity mask for each OpenMP thread is shown in the listing above as the set of operating system processors to which the OpenMP thread is bound.
The following figure illustrates the machine topology map for the above listing, with OpenMP thread bindings.
In contrast, specifying KMP_AFFINITY=verbose,granularity=fine,compact or KMP_AFFINITY=verbose,granularity=thread,compact binds each OpenMP thread to a single hardware thread context when the program is executed:
Verbose, granularity=fine,compact message
KMP_AFFINITY: Affinity capable, using global cpuid info
KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7}
KMP_AFFINITY: 8 available OS procs - Uniform topology of
KMP_AFFINITY: 2 packages x 2 cores/pkg x 2 threads/core (4 total cores)
KMP_AFFINITY: OS proc to physical thread map ([] => level not in map):
KMP_AFFINITY: OS proc 0 maps to package 0 core 0 thread 0
KMP_AFFINITY: OS proc 4 maps to package 0 core 0 thread 1
KMP_AFFINITY: OS proc 2 maps to package 0 core 1 thread 0
KMP_AFFINITY: OS proc 6 maps to package 0 core 1 thread 1
KMP_AFFINITY: OS proc 1 maps to package 3 core 0 thread 0
KMP_AFFINITY: OS proc 5 maps to package 3 core 0 thread 1
KMP_AFFINITY: OS proc 3 maps to package 3 core 1 thread 0
KMP_AFFINITY: OS proc 7 maps to package 3 core 1 thread 1
KMP_AFFINITY: Internal thread 0 bound to OS proc set {0}
KMP_AFFINITY: Internal thread 1 bound to OS proc set {4}
KMP_AFFINITY: Internal thread 2 bound to OS proc set {2}
KMP_AFFINITY: Internal thread 3 bound to OS proc set {6}
KMP_AFFINITY: Internal thread 4 bound to OS proc set {1}
KMP_AFFINITY: Internal thread 5 bound to OS proc set {5}
KMP_AFFINITY: Internal thread 6 bound to OS proc set {3}
KMP_AFFINITY: Internal thread 7 bound to OS proc set {7}
The OpenMP thread to hardware context binding for this example was illustrated in the first example.
Specifying granularity=fine always causes each OpenMP thread to be bound to a single OS processor. This is equivalent to granularity=thread, currently the finest granularity level.
modifier = respect (default)
Respect the process's original affinity mask; more specifically, respect the affinity mask in place for the thread that initializes the OpenMP runtime library. The behavior differs between Linux and Windows:
  • On Windows: Respect the original affinity mask for the process.
  • On Linux: Respect the affinity mask for the thread that initializes the OpenMP runtime library.
Specifying KMP_AFFINITY=verbose,compact for the same system used in the previous example, with Intel® Hyper-Threading Technology (Intel® HT Technology) enabled, and invoking the library with an initial affinity mask of {4,5,6,7} (thread context 1 on every core) causes the compiler to model the machine as a dual-core, two-processor system with Intel® HT Technology disabled.
Verbose,compact message
KMP_AFFINITY: Affinity capable, using global cpuid info
KMP_AFFINITY: Initial OS proc set respected: {4,5,6,7}
KMP_AFFINITY: 4 available OS procs - Uniform topology of
KMP_AFFINITY: 2 packages x 2 cores/pkg x 1 threads/core (4 total cores)
KMP_AFFINITY: OS proc to physical thread map ([] => level not in map):
KMP_AFFINITY: OS proc 4 maps to package 0 core 0 [thread 1]
KMP_AFFINITY: OS proc 6 maps to package 0 core 1 [thread 1]
KMP_AFFINITY: OS proc 5 maps to package 3 core 0 [thread 1]
KMP_AFFINITY: OS proc 7 maps to package 3 core 1 [thread 1]
KMP_AFFINITY: Internal thread 0 bound to OS proc set {4}
KMP_AFFINITY: Internal thread 1 bound to OS proc set {6}
KMP_AFFINITY: Internal thread 2 bound to OS proc set {5}
KMP_AFFINITY: Internal thread 3 bound to OS proc set {7}
KMP_AFFINITY: Internal thread 4 bound to OS proc set {4}
KMP_AFFINITY: Internal thread 5 bound to OS proc set {6}
KMP_AFFINITY: Internal thread 6 bound to OS proc set {5}
KMP_AFFINITY: Internal thread 7 bound to OS proc set {7}
Because there are eight thread contexts on the machine, by default the compiler creates eight threads for an OpenMP parallel construct.
The brackets around thread 1 indicate that the thread context level is ignored and is not present in the topology map. The following figure illustrates the corresponding machine topology map.
When using the local cpuid information to determine the machine topology, it is not always possible to distinguish between a machine that does not support Intel® Hyper-Threading Technology (Intel® HT Technology) and a machine that supports it but has it disabled. Therefore, the compiler does not include a level in the map if the elements (nodes) at that level have no siblings, with the exception of the package level, which is always modeled and always appears in the topology map, even if there is only a single package in the machine.
modifier = norespect
Do not respect the original affinity mask for the process. Binds OpenMP threads to all operating system processors.
In early versions of the OpenMP runtime library that supported only the physical and logical affinity types, norespect was the default and was not recognized as a modifier. The default was changed to respect when the compact and scatter types were added; therefore, thread bindings for the logical and physical affinity types may have changed with the newer compilers in situations where the application specified a partial initial thread affinity mask.
modifier = nowarnings
Do not print warning messages from the affinity interface.
modifier = warnings (default)
Print warning messages from the affinity interface.

Determining Machine Topology

On IA-32 and Intel® 64 architecture systems, if the package has an APIC (Advanced Programmable Interrupt Controller), the compiler uses the cpuid instruction to obtain the package ID, core ID, and thread context ID. Under normal conditions, each thread context on the system is assigned a unique APIC ID at boot time. The compiler obtains other pieces of information by using the cpuid instruction, which, together with the number of OS thread contexts (the total number of processing elements on the machine), determine how to break the APIC ID down into the package ID, core ID, and thread context ID.
There are two ways to specify the APIC ID in the cpuid instruction: the legacy method in leaf 4, and the more modern method in leaf 11. Only 256 unique APIC IDs are available in leaf 4; leaf 11 has no such limitation.
Normally, all core IDs on a package and all thread context IDs on a core are contiguous; however, numbering assignment gaps are common for package IDs, as shown in the figure above.
If the compiler cannot determine the machine topology using any other method, but the operating system supports affinity, a warning message is printed and the topology is assumed to be flat. A flat topology assumes that operating system processor N maps to package N, and that there exists only one thread context per core and only one core per package.
If the machine topology cannot be accurately determined as described above, you can manually copy /proc/cpuinfo to a temporary file, correct any errors, and specify the machine topology to the OpenMP runtime library via the environment variable KMP_CPUINFO_FILE=<temp_filename>, as described in the section KMP_CPUINFO_FILE and /proc/cpuinfo.
Regardless of the method used in determining the machine topology, if there is only one thread context per core for every core on the machine, the thread context level will not appear in the topology map. If there is only one core per package for every package in the machine, the core level will not appear in the machine topology map. The topology map need not be a full tree, because different packages may contain a different number of cores, and different cores may support a different number of thread contexts.
The package level will always appear in the topology map, even if there is only a single package in the machine.

KMP_CPUINFO_FILE and /proc/cpuinfo

One of the methods the Intel® oneAPI DPC++/C++ Compiler OpenMP runtime library can use to detect the machine topology on Linux systems is to parse the contents of /proc/cpuinfo. If the contents of this file (or a device mapped into the Linux file system) are insufficient or erroneous, you can copy its contents to a writable temporary file <temp_file>, correct it or extend it with the necessary information, and set KMP_CPUINFO_FILE=<temp_file>.
If you do this, the OpenMP runtime library reads the <temp_file> location pointed to by KMP_CPUINFO_FILE instead of the information contained in /proc/cpuinfo, and does not attempt to detect the machine topology by decoding the APIC IDs. That is, the information contained in the <temp_file> overrides these other methods. You can also use the KMP_CPUINFO_FILE interface on Windows systems, where /proc/cpuinfo does not exist.
The content of /proc/cpuinfo or <temp_file> should contain a list of entries for each processing element on the machine. Each processing element contains a list of entries (a descriptive name and value on each line). A blank line separates the entries for each processing element. Only the following fields are used to determine the machine topology from each entry, either in <temp_file> or /proc/cpuinfo:
Field: processor
Description: Specifies the OS ID for the processing element. The OS ID must be unique. The processor and physical id fields are the only ones that are required to use the interface.

Field: physical id
Description: Specifies the package ID, which is a physical chip ID. Each package may contain multiple cores. The package level always exists in the compiler's OpenMP runtime library model of the machine topology.

Field: core id
Description: Specifies the core ID. If it does not exist, it defaults to 0. If every package on the machine contains only a single core, the core level will not exist in the machine topology map (even if some of the core ID fields are nonzero).

Field: thread id
Description: Specifies the thread ID. If it does not exist, it defaults to 0. If every core on the machine contains only a single thread, the thread level will not exist in the machine topology map (even if some thread ID fields are nonzero).

Field: node_<n> id
Description: An extension to the normal contents of /proc/cpuinfo that can be used to specify the nodes at different levels of the memory interconnect on Non-Uniform Memory Access (NUMA) systems. Arbitrarily many levels <n> are supported. The node_0 level is closest to the package level; multiple packages comprise a node at level 0, multiple nodes at level 0 comprise a node at level 1, and so on.
Each entry must be spelled exactly as shown, in lowercase, followed by optional whitespace, a colon (:), more optional whitespace, then the integer ID. Fields other than those listed are simply ignored.
It is common for the thread id field to be missing from /proc/cpuinfo on many Linux variants, and for a field labeled siblings to specify the number of threads per node or number of nodes per package. The Intel OpenMP runtime library ignores fields labeled siblings so that it can distinguish between the thread id and siblings fields. When this situation arises, the warning message Physical node/pkg/core/thread ids not unique appears (unless the nowarnings modifier is specified).
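For illustration, a minimal hand-written <temp_file> for the dual-core, dual-package machine without Intel® HT Technology used in the earlier verbose listings (packages 0 and 3) might contain the following; point KMP_CPUINFO_FILE at its path before running the program.
Example
processor   : 0
physical id : 0
core id     : 0
thread id   : 0

processor   : 2
physical id : 0
core id     : 1
thread id   : 0

processor   : 1
physical id : 3
core id     : 0
thread id   : 0

processor   : 3
physical id : 3
core id     : 1
thread id   : 0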

Windows Processor Groups

On a 64-bit Windows operating system, a machine with more than 64 processors is divided into multiple processor groups; each group contains at most sixty-four (64) processors.
If multiple processor groups are detected, the default is to model the machine as a 2-level tree, where level 0 represents the processors in a group and level 1 represents the different groups. Threads are assigned to a group until there are as many OpenMP threads bound to the group as there are processors in the group. Subsequent threads are assigned to the next group, and so on.
By default, threads are allowed to float among all processors in a group; that is, granularity equals the group (granularity=group). You can override this binding and explicitly use another affinity type such as compact or scatter. If you do so, the granularity must be sufficiently fine to prevent a thread from being bound to multiple processors in different groups.

Using a Specific Machine Topology Modeling Method (KMP_TOPOLOGY_METHOD)

You can set the KMP_TOPOLOGY_METHOD environment variable to force OpenMP to use a particular machine topology modeling method.

Value: cpuid_leaf11
Description: Decodes the APIC identifiers as specified by leaf 11 of the cpuid instruction.

Value: cpuid_leaf4
Description: Decodes the APIC identifiers as specified by leaf 4 of the cpuid instruction.

Value: cpuinfo
Description: If KMP_CPUINFO_FILE is not specified, forces OpenMP to parse /proc/cpuinfo to determine the topology (Linux only). If KMP_CPUINFO_FILE is specified as described above, uses it (Windows or Linux).

Value: group
Description: Models the machine as a 2-level map, with level 0 specifying the different processors in a group and level 1 specifying the different groups (Windows 64-bit only).

Value: flat
Description: Models the machine as a flat (linear) list of processors.

Value: hwloc
Description: Models the machine as the Portable Hardware Locality* (hwloc) library does. This model is the most detailed and includes, but is not limited to: NUMA nodes, packages, cores, hardware threads, caches, and Windows processor groups.

Explicitly Specifying OS Processor IDs (GOMP_CPU_AFFINITY)

You must set the GOMP_CPU_AFFINITY environment variable before the first parallel region, or before certain API calls including omp_get_max_threads(), omp_get_num_procs(), and any affinity API calls, as described in Low Level Affinity API, below.
Instead of allowing the library to detect the hardware topology and automatically assign OpenMP threads to processing elements, the user may explicitly specify the assignment by using a list of operating system (OS) processor (proc) IDs. However, this requires knowledge of which processing elements the OS proc IDs represent.
On Linux systems, when using the Intel OpenMP compatibility libraries enabled by the compiler option -qopenmp-lib compat, you can use the GOMP_CPU_AFFINITY environment variable to specify a list of OS processor IDs. Its syntax is identical to that accepted by libgomp (assume that <proc_list> encompasses the entire GOMP_CPU_AFFINITY environment string):
<proc_list> := <elem> | <elem> , <proc_list> | <elem> <whitespace> <proc_list>
<elem> := <proc_spec> | <range>
<proc_spec> := <proc_id>
<range> := <proc_id> - <proc_id> | <proc_id> - <proc_id> : <int>
<proc_id> := <positive_int>
OS processors specified in this list are then assigned to OpenMP threads, in order of OpenMP Global Thread IDs. If more OpenMP threads are created than there are elements in the list, then the assignment occurs modulo the size of the list. That is, OpenMP Global Thread ID n is bound to list element n mod <list_size>.
Consider the machine previously mentioned: a dual-core, dual-package machine without Intel® Hyper-Threading Technology (Intel® HT Technology) enabled, where the OS proc IDs are assigned in the same manner as in the previous figure. Suppose that the application creates six OpenMP threads instead of 4 (the default), oversubscribing the machine. If GOMP_CPU_AFFINITY=3,0-2, then OpenMP threads are bound as shown in the figure below, just as should happen when compiling with gcc and linking with libgomp:
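Concretely, the list expands to {3,0,1,2}, so by the modulo rule above (thread n gets list element n mod 4) the six threads are bound as follows: thread 0 to OS proc 3, thread 1 to OS proc 0, thread 2 to OS proc 1, thread 3 to OS proc 2, thread 4 to OS proc 3, and thread 5 to OS proc 0.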
The same syntax can be used to specify the OS proc ID list in the proclist=[<proc_list>] modifier in the KMP_AFFINITY environment variable string. There is one slight difference: to have strictly the same semantics as the gcc OpenMP runtime library libgomp, the GOMP_CPU_AFFINITY environment variable implies granularity=fine. If you specify the OS proc list in the KMP_AFFINITY environment variable without a granularity= specifier, then the default granularity is not changed; that is, OpenMP threads are allowed to float between the different thread contexts on a single core. Thus GOMP_CPU_AFFINITY=<proc_list> is an alias for KMP_AFFINITY="granularity=fine,proclist=[<proc_list>],explicit".
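For example, setting GOMP_CPU_AFFINITY=3,0-2 behaves the same as setting KMP_AFFINITY="granularity=fine,proclist=[3,0-2],explicit".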
In the KMP_AFFINITY environment variable string, the syntax is extended to handle operating system processor ID sets. The user may specify a set of operating system processor IDs among which an OpenMP thread may execute ("float"), enclosed in braces:

<proc_list> := <proc_id> | { <float_list> }
<float_list> := <proc_id> | <proc_id> , <float_list>

This allows functionality similar to the granularity= specifier, but it is more flexible. The OS processors on which an OpenMP thread executes may exclude other OS processors nearby in the machine topology, but include other distant OS processors. Building upon the previous example, we may allow OpenMP threads 2 and 3 to float between OS processor 1 and OS processor 2 by using KMP_AFFINITY="granularity=fine,proclist=[3,0,{1,2},{1,2}],explicit", as shown in the figure below:
If verbose were also specified, the output when the application is executed would include:
KMP_AFFINITY="verbose,granularity=fine,proclist=[3,0,{1,2},{1,2}],explicit"
KMP_AFFINITY: Affinity capable, using global cpuid info
KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3}
KMP_AFFINITY: 4 available OS procs - Uniform topology of
KMP_AFFINITY: 2 packages x 2 cores/pkg x 1 threads/core (4 total cores)
KMP_AFFINITY: OS proc to physical thread map ([] => level not in map):
KMP_AFFINITY: OS proc 0 maps to package 0 core 0 [thread 0]
KMP_AFFINITY: OS proc 2 maps to package 0 core 1 [thread 0]
KMP_AFFINITY: OS proc 1 maps to package 3 core 0 [thread 0]
KMP_AFFINITY: OS proc 3 maps to package 3 core 1 [thread 0]
KMP_AFFINITY: Internal thread 0 bound to OS proc set {3}
KMP_AFFINITY: Internal thread 1 bound to OS proc set {0}
KMP_AFFINITY: Internal thread 2 bound to OS proc set {1,2}
KMP_AFFINITY: Internal thread 3 bound to OS proc set {1,2}
KMP_AFFINITY: Internal thread 4 bound to OS proc set {3}
KMP_AFFINITY: Internal thread 5 bound to OS proc set {0}

Low Level Affinity API

Instead of relying on the user to specify the OpenMP thread to OS proc binding by setting an environment variable before program execution starts (or by using the kmp_settings interface before the first parallel region is reached), each OpenMP thread can determine the desired set of OS procs on which it is to execute and bind to them with the kmp_set_affinity API call.
When you use this affinity interface, you take complete control of the hardware resources on which your threads run. To do that sensibly, you need to understand in detail how the logical CPUs (the enumeration of hardware threads controlled by the OS) map to the physical hardware of the specific machine on which you are running. That mapping can be, and likely is, different on different machines, so you risk binding machine-specific information into your code, which can result in explicitly forcing bad affinities when your code runs on a different machine. And if you are concerned with optimization at this level of detail, your code is probably valuable, and therefore will probably move to another machine.
This interface may also allow you to ignore the resource limitations that were set by the program startup mechanism, such as the Message Passing Interface (MPI), specifically those intended to prevent multiple OpenMP processes on the same node from using the same hardware threads. Again, this can result in explicitly forcing affinities that cause bad performance, and the OpenMP runtime will neither prevent this from happening nor warn you when it does. These are expert interfaces and you must use them with caution.
It is therefore recommended to use the higher-level affinity settings whenever you can, because they are more portable and do not require this low-level knowledge.
The C/C++ API interfaces follow, where the type name kmp_affinity_mask_t is defined in omp.h:
Some of these interfaces have offload equivalents. The offload equivalent takes two additional arguments to specify the target type and target number. For more information, see Calling Functions on the CPU to Modify the Coprocessor's Execution Environment. Offload is not supported on Windows systems.
Syntax: int kmp_set_affinity (kmp_affinity_mask_t *mask)
Description: Sets the affinity mask for the current OpenMP thread to *mask, where *mask is a set of OS proc IDs that has been created using the API calls listed below; the thread will execute only on OS procs in the set. Returns zero (0) upon success or a nonzero error code.

Syntax: int kmp_get_affinity (kmp_affinity_mask_t *mask)
Description: Retrieves the affinity mask for the current OpenMP thread and stores it in *mask, which must have previously been initialized with a call to kmp_create_affinity_mask(). Returns zero (0) upon success or a nonzero error code.

Syntax: int kmp_get_affinity_max_proc (void)
Description: Returns the maximum OS proc ID on the machine, plus 1. All OS proc IDs are guaranteed to be between 0 (inclusive) and kmp_get_affinity_max_proc() (exclusive).

Syntax: void kmp_create_affinity_mask (kmp_affinity_mask_t *mask)
Description: Allocates a new OpenMP thread affinity mask and initializes *mask to the empty set of OS procs. The implementation is free to use an object of kmp_affinity_mask_t either as the set itself, a pointer to the actual set, or an index into a table describing the set. Do not make any assumptions about the actual representation.

Syntax: void kmp_destroy_affinity_mask (kmp_affinity_mask_t *mask)
Description: Deallocates the OpenMP thread affinity mask. For each call to kmp_create_affinity_mask(), there should be a corresponding call to kmp_destroy_affinity_mask().

Syntax: int kmp_set_affinity_mask_proc (int proc, kmp_affinity_mask_t *mask)
Description: Adds the OS proc ID proc to the set *mask, if it is not already present. Returns zero (0) upon success or a nonzero error code.

Syntax: int kmp_unset_affinity_mask_proc (int proc, kmp_affinity_mask_t *mask)
Description: Removes the OS proc ID proc from the set *mask, if it is present. Returns zero (0) upon success or a nonzero error code.

Syntax: int kmp_get_affinity_mask_proc (int proc, kmp_affinity_mask_t *mask)
Description: Returns 1 if the OS proc ID proc is in the set *mask; otherwise, returns 0.
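As an illustrative sketch combining several of the calls documented above, the following function prints the set of OS procs in the calling thread's current affinity mask:
Example
#include <stdio.h>
#include <omp.h>

// Print the OS proc IDs in the calling thread's affinity mask.
void printAffinity(void) {
    kmp_affinity_mask_t mask;
    kmp_create_affinity_mask(&mask);
    if (kmp_get_affinity(&mask) == 0) {
        for (int proc = 0; proc < kmp_get_affinity_max_proc(); proc++) {
            if (kmp_get_affinity_mask_proc(proc, &mask))
                printf("%d ", proc);
        }
        printf("\n");
    }
    kmp_destroy_affinity_mask(&mask);  // pair the create with a destroy
}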
Once an OpenMP thread has set its own affinity mask via a successful call to kmp_set_affinity(), that thread remains bound to the corresponding OS proc set until at least the end of the parallel region, unless it is reset via a subsequent call to kmp_set_affinity().
Between parallel regions, the affinity mask (and the corresponding OpenMP thread to OS proc bindings) can be considered thread private data objects with the same persistence as described in the OpenMP Application Program Interface. For more information, see the OpenMP API specification (http://www.openmp.org), the relevant part of which is quoted below:
"In order for the affinity mask and thread binding to persist between two consecutive active parallel regions, all three of the following conditions must hold:
  • Neither parallel region is nested inside another explicit parallel region.
  • The number of threads used to execute both parallel regions is the same.
  • The value of the dyn-var internal control variable in the enclosing task region is false at entry to both parallel regions."
Therefore, by creating a parallel region at the start of the program whose sole purpose is to set the affinity mask for each thread, you can mimic the behavior of the KMP_AFFINITY environment variable with low-level affinity API calls, provided program execution obeys the three aforementioned rules from the OpenMP specification.
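The following illustrative sketch follows that pattern, assuming (hypothetically) that binding OpenMP thread n to OS proc n gives the desired placement; the proc ID arithmetic is machine-specific and must be adapted to the actual topology.
Example
#include <stdio.h>
#include <omp.h>

int main(void) {
    // Initial parallel region: its sole purpose is to bind each thread.
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        kmp_affinity_mask_t mask;
        kmp_create_affinity_mask(&mask);
        kmp_set_affinity_mask_proc(tid, &mask);  // machine-specific choice
        if (kmp_set_affinity(&mask) != 0)
            printf("kmp_set_affinity failed for thread %d\n", tid);
        kmp_destroy_affinity_mask(&mask);
    }

    // Subsequent parallel regions reuse these bindings as long as the
    // three persistence conditions quoted above continue to hold.
    #pragma omp parallel
    {
        // ... application work ...
    }
    return 0;
}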
The following example shows how these low-level interfaces can be used. This code binds the executing thread to the specified logical CPU:
Example
// Force the executing thread to execute on logical CPU i.
// Returns 1 on success, 0 on failure.
int forceAffinity(int i) {
    kmp_affinity_mask_t mask;
    kmp_create_affinity_mask(&mask);
    kmp_set_affinity_mask_proc(i, &mask);
    int success = (kmp_set_affinity(&mask) == 0);
    kmp_destroy_affinity_mask(&mask);  // pair the create with a destroy
    return success;
}
This program fragment was written with knowledge about the mapping of the OS proc IDs to the physical processing elements of the target machine. On another machine, or on the same machine with a different OS installed, the program would still run, but the OpenMP thread to physical processing element bindings could differ and you might explicitly force a bad distribution.

Product and Performance Information

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.