Compiler Methodology for Intel® MIC Architecture
Efficient Parallelization, OpenMP Thread Affinity Control
The Intel® OpenMP* runtime library has the ability to bind OpenMP threads to physical processing units. The interface is controlled using the KMP_AFFINITY environment variable, and with compilers version 13.1.0 (Intel® Composer XE 2013 Update 2) and newer, the KMP_PLACE_THREADS environment variable. Depending on the system (machine) topology, application, and operating system, thread affinity can have a dramatic effect on the application speed.
Thread affinity restricts execution of certain threads (virtual execution units) to a subset of the physical processing units in a multiprocessor computer.
There are two considerations for OpenMP threading and affinity: first, how many threads to use, and second, how to bind those threads to specific processor cores.
First, the Intel® Xeon Phi™ coprocessor supports 4 thread contexts per core, so an initial consideration is how many application threads are optimal for this processor. This depends on your application. In general, more threads help to hide latencies inherent in your application (while 1 thread is stalled waiting for memory, another 1-3 threads can be scheduled on the core). On Intel® Xeon® architecture, users have found that CPU-intensive HPC applications generally do not benefit from hyperthreading. This is not true on the Intel® Many Integrated Core Architecture (Intel® MIC Architecture). Thus, it is important to try using more than one of the 4 thread contexts available on each core of the Intel® Xeon Phi™ coprocessor.
The best advice for an offload program: try different numbers of threads from N-1 to 4x(N-1), where N is the number of physical cores on the coprocessor. Four simple experiments can be run: run the application with N-1, 2x(N-1), 3x(N-1), and 4x(N-1) threads to determine whether the additional thread contexts give your application a performance benefit. Why N-1 instead of N? OS overhead: the OS and MPSS threads take processor cycles, and it is inefficient to schedule worker threads on a core where OS threads are contending for cycles. This naturally leads to the second consideration: how to place threads on the cores of the Intel® Xeon Phi™ coprocessor. Specifically, which core are the OS threads using, and how can you avoid scheduling application threads on that core? For an application running natively, you may also want to try all available threads (4xN), since the OS overhead is lower in that case. Note that the default values of OpenMP parameters vary between offloaded and native execution.
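The four experiments can be scripted. The following is a minimal sketch, assuming a 61-core coprocessor (N=61) and an offload application; `./my_offload_app` is a hypothetical name, and the run line is left commented out. It also assumes MIC_ENV_PREFIX=PHI is exported so that PHI_OMP_NUM_THREADS reaches the coprocessor runtime.

```shell
# Assumption: N is the number of physical cores on the coprocessor.
N=61
counts=""
for k in 1 2 3 4; do
  # k thread contexts per core on N-1 cores (one core left for the OS)
  t=$(( k * (N - 1) ))
  counts="$counts $t"
  echo "experiment $k: $t threads"
  # Hypothetical run line; substitute your own application:
  # PHI_OMP_NUM_THREADS=$t ./my_offload_app
done
```

Timing each run and comparing the four results shows whether the extra thread contexts pay off for your application.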
Affinity and KMP_AFFINITY environment variable
If you are not familiar with OpenMP thread affinity and the use of the KMP_AFFINITY environment variable, please first familiarize yourself with the concepts in your compiler documentation. Start with the Intel Compiler XE User and Reference Guides (sections titled "Thread Affinity Interfaces" and "Programming for Intel® MIC Architecture"). Consult the online documentation for the Thread Affinity Interface as well, and make sure you fully understand the COMPACT, SCATTER, and BALANCED affinity types.
If you do not set a value for KMP_AFFINITY, the OpenMP runtime is allowed to choose affinity for you. The value chosen depends on the CPU architecture and may change depending on what affinity is deemed most efficient FOR A VARIETY OF APPLICATIONS for that architecture. It may or may not be ideal for any particular application, however. In addition, the affinity settings may change from one compiler version to another. Thus, the advice is that if your application can take advantage of a certain affinity setting then you should explicitly specify that setting. Otherwise you will get a setting that is generally efficient for a wide variety of applications.
As of early 2013, with Intel Composer XE 2013 Update 1, the runtime sets the following DEFAULT affinity settings. These may change in any future release, so the best advice is to set affinity explicitly. Each entry below gives the host configuration / coprocessor configuration, followed by the default affinity for HOST / PHI:
- Xeon Host OMP / No offload: 'none' / NA
- Xeon Host serial / Offload OMP: NA / 'granularity=fine,scatter'
- Xeon OMP / Offload OMP: 'none' / 'granularity=fine,scatter'
- No host / Native compiler OMP: NA / 'granularity=fine,scatter'
Also note that for an offload program, the total number of cores “seen” by the OMP runtime is one less than the maximum (configurations 2 and 3 in the list above), since the last core is reserved for OS processes. This is in contrast to the “native” mode (configuration 4 in the list above) where all cores are available to the OMP runtime and the default uses all available threads across all the cores.
If you want to use the last core also for an offload program (say by forcing OMP_NUM_THREADS to 4*N where N is the number of cores), make sure you add the “norespect” clause as part of the KMP_AFFINITY setting. Without this, there will be over-subscription since the “respect” clause is also part of the default settings.
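As an illustration, here is a minimal sketch of such a setting, assuming a 61-core coprocessor and MIC_ENV_PREFIX=PHI. The affinity type after "norespect" simply repeats the default 'granularity=fine,scatter' and is an assumption, not a recommendation.

```shell
# Assumption: 61 physical cores, 4 thread contexts per core.
N=61
export MIC_ENV_PREFIX=PHI
# Use every thread context, including those on the last core.
export PHI_OMP_NUM_THREADS=$(( 4 * N ))
# "norespect" tells the runtime not to honor the inherited affinity mask
# that otherwise excludes the last core.
export PHI_KMP_AFFINITY='norespect,granularity=fine,scatter'
echo "threads requested: $PHI_OMP_NUM_THREADS"
```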
If you run OpenMP regions on the host AND offload OMP to Phi, remember that you have 2 OpenMP runtimes and you will need to set KMP_AFFINITY for each. To do this:
For the host OMP Runtime: use env var KMP_AFFINITY
For the offload runtime: pass the affinity setting down to the Phi runtime through a separate, prefixed environment variable. MIC_ENV_PREFIX names the prefix; variables carrying that prefix are passed to the coprocessor with the prefix removed. For example:
export MIC_ENV_PREFIX=PHI
export PHI_KMP_AFFINITY='...your affinity settings of choice...'
If you are already familiar with OpenMP thread affinity control and just want quick advice on processor mapping for MIC, try these two options for KMP_AFFINITY:
TIP: use the VERBOSE modifier on KMP_AFFINITY to get a detailed list of bindings. Example:
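A minimal sketch, assuming an offload program with MIC_ENV_PREFIX=PHI; the base setting 'granularity=fine,balanced' is only an illustrative choice. The verbose modifier is prepended to whatever affinity setting you already use.

```shell
# Assumption: this is the affinity setting you want to inspect.
affinity='granularity=fine,balanced'
export MIC_ENV_PREFIX=PHI
# Prepend "verbose" so the runtime prints the thread-to-processor bindings.
export PHI_KMP_AFFINITY="verbose,$affinity"
echo "PHI_KMP_AFFINITY=$PHI_KMP_AFFINITY"
```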
TIP: Note that the coprocessor micro OS runs its threads on "OS proc 0", which maps to your highest-numbered core. For example, in a 61-core Phi, cores are numbered 0..60. Using the 'verbose' modifier will show:
OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 60 thread 0
Thus, a handy performance tip is to always leave the highest-numbered core free of application threads to prevent interference from OS threads.
The environment variable KMP_PLACE_THREADS was added in Intel Composer XE 2013 Update 2 (compiler version 13.1.0, Build 20130121) to help simplify thread placement on Intel(R) Xeon Phi(TM) Coprocessors (this variable is not used by the Intel OpenMP Runtime for Xeon). This is particularly helpful on Intel(R) Xeon Phi(TM) Coprocessor enabled systems, since these systems have many cores and 4 thread contexts available per core. Users need the ability to specify a subset of cores to run threads on, the number of threads per core, an offset from the first core, and so on. This can be done with KMP_AFFINITY using explicit affinity settings; however, the explicit bindings list can be cumbersome and error-prone. KMP_PLACE_THREADS was created to give the programmer flexibility in thread placement with a simple, compact, and easy-to-understand syntax.
NOTE: KMP_PLACE_THREADS DOES NOT REPLACE KMP_AFFINITY! These two environment variables work together, as we'll see. The syntax is shown below:
KMP_PLACE_THREADS Environment Variable (13.1.0)
This environment variable allows the user to simplify the specification of the number of cores and threads per core used by an OpenMP application, as an alternative to writing explicit affinity settings or a process affinity mask.
value = ( int [ "C" | "T" ] [ delim ] | delim ) [ int [ "T" ] [ delim ] ] [ int [ "O" ] ];
Intel® Fortran Composer XE 2013 for Linux* Installation Guide and Release Notes
"int" is a simple integer constant
"delim" is either a comma "," or a lower case x "x"
Specifies the number of cores, with optional offset value and number of threads per core to use.
- "C" indicates Cores,
- "T" indicates Threads
- "O" (letter O, not zero) is used to specify an Offset. Offset ignores granularity: it is the number of Cores to skip, counting from 0 ("Core 0"). Thus 1O starts at the 2nd core in the package, aka "Core 1". Default is 0O
5C,3T,1O - use 5 cores with offset 1, 3 threads per core
5,3,1 - same as above
24 - use first 24 cores, 4 threads per core
2T - use all cores, 2 threads per core
,2 - same as above
3x2 - use 3 cores, 2 threads per core
4C,12O - use 4 cores with offset 12, all available threads per core
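To make the arithmetic behind these values concrete, the following sketch decodes a fully explicit value of the form <cores>C,<threads>T,<offset>O. The function `describe_place_threads` is a hypothetical helper written for this illustration; it is not part of the Intel OpenMP runtime and does not handle the abbreviated forms.

```shell
# Hypothetical helper: explain a value such as 5C,3T,1O.
describe_place_threads() {
  spec=$1
  cores=${spec%%C,*}      # text before "C,"
  rest=${spec#*C,}        # text after "C,"
  threads=${rest%%T,*}    # text before "T,"
  offset=${rest#*T,}      # text after "T,"
  offset=${offset%O}      # drop the trailing letter O
  first=$offset
  last=$(( offset + cores - 1 ))
  total=$(( cores * threads ))
  echo "cores $first..$last, $threads threads/core, $total hardware threads"
}

describe_place_threads 5C,3T,1O
```

The total is simply cores times threads per core, and the offset shifts which cores are used without changing that total.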
If you look carefully at the definition of KMP_PLACE_THREADS you will note that it's used to specify THE TOPOLOGY of the system to the OpenMP runtime. Note that it says nothing about how the threads are bound within that topology. This is where KMP_AFFINITY is used in conjunction with KMP_PLACE_THREADS. Specifically, the user should consider the COMPACT, SCATTER, and BALANCED affinity types. Some examples might help:
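A minimal sketch of one such combination for an offload program, assuming a 61-core Phi and MIC_ENV_PREFIX=PHI. The values are an illustration (60 cores, 3 threads per core, compact binding), not a recovered original listing.

```shell
export MIC_ENV_PREFIX=PHI
# Topology: first 60 cores, 3 thread contexts on each.
export PHI_KMP_PLACE_THREADS=60C,3T
# Fill the topology completely: 60 * 3 threads.
export PHI_OMP_NUM_THREADS=$(( 60 * 3 ))
# Binding within that topology.
export PHI_KMP_AFFINITY=compact
echo "$PHI_KMP_PLACE_THREADS with $PHI_OMP_NUM_THREADS threads, $PHI_KMP_AFFINITY"
```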
This sets up a topology on a Phi using the first 60 cores, 3 threads per core, 180 threads total, using COMPACT binding of threads. Other combinations have proved useful for various applications. These assume a 61-core Phi with MIC_ENV_PREFIX=PHI exported, so that PHI_-prefixed variables are passed to the coprocessor.
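One further combination, sketched under the same assumptions (61-core Phi, MIC_ENV_PREFIX=PHI), undersubscribes a 60-core, 3-threads-per-core topology and binds the threads with the balanced type. The specific numbers are illustrative.

```shell
export MIC_ENV_PREFIX=PHI
# Same topology: 60 cores, 3 thread contexts per core (180 available).
export PHI_KMP_PLACE_THREADS=60C,3T
# Undersubscribe: only 120 of the 180 contexts are used.
export PHI_OMP_NUM_THREADS=120
# Balanced spreads the 120 threads evenly, 2 per core.
export PHI_KMP_AFFINITY=balanced
echo "threads per core: $(( PHI_OMP_NUM_THREADS / 60 ))"
```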
Note that we use the same topology of 60 cores at 3 threads per core. Here we undersubscribe this topology with just 120 threads (PHI_OMP_NUM_THREADS), bound with the BALANCED type. This places threads 0 and 1 on the first core, 2 and 3 on the second, ... and threads 118 and 119 on the 60th core.
Note that we used a topology of 60 cores out of 61 possible (number of cores depends on your Phi product type, 61 is max core count as of this article). This is to leave the uppermost core free to run system threads and not interfere with compute thread synchronization.
Offset is useful for partitioning the cores into subsets for separate processes. This is helpful when you want to share the Phi cores between processes. For example, suppose we have a 61-core Phi and decide to use 60 cores for an application, with all 4 threads per core. The application is an offload application on the host with 2 processes. We want to use cores 0..29 for process 1 and cores 30..59 for process 2, leaving core 60 to the OS. Here is an example of doing this:
- Process 1 offload environment var setup:
- Process 2 offload environment var setup:
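A minimal sketch of the two setups, assuming MIC_ENV_PREFIX=PHI and 30 cores with 4 threads per core for each process. `place_for_process` is a hypothetical helper used only to show how the offset is derived; each process would export its own value in its own environment.

```shell
CORES_PER_PROC=30
THREADS_PER_CORE=4

# Hypothetical helper: build the KMP_PLACE_THREADS value for the
# 1-based process number given as $1.
place_for_process() {
  offset=$(( ($1 - 1) * CORES_PER_PROC ))
  echo "${CORES_PER_PROC}C,${THREADS_PER_CORE}T,${offset}O"
}

echo "process 1: export PHI_KMP_PLACE_THREADS=$(place_for_process 1)"
echo "process 2: export PHI_KMP_PLACE_THREADS=$(place_for_process 2)"
```

Process 1 starts at core 0 and process 2 at core 30, so the two sets of cores do not overlap and core 60 stays free.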
This is how Offset can be used to effectively partition the Phi cores and prevent processes from colliding on the same cores.
Here are some generally useful configurations:
1 thread per core:
2 threads per core, balanced (generally the best affinity type for a generic application)
3 threads per core
4 threads per core
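The four configurations can be generated from a single parameter. This is a sketch under stated assumptions: a 61-core Phi with 60 cores used, MIC_ENV_PREFIX=PHI, and the balanced type for every case. `set_threads_per_core` is a hypothetical helper, not an Intel-provided command.

```shell
CORES=60   # assumption: 61-core Phi, highest-numbered core left to the OS

# Hypothetical helper: configure $1 threads per core on CORES cores.
set_threads_per_core() {
  tpc=$1
  export MIC_ENV_PREFIX=PHI
  export PHI_KMP_PLACE_THREADS="${CORES}C,${tpc}T"
  export PHI_OMP_NUM_THREADS=$(( CORES * tpc ))
  export PHI_KMP_AFFINITY='granularity=fine,balanced'
}

set_threads_per_core 2
echo "$PHI_KMP_PLACE_THREADS $PHI_OMP_NUM_THREADS"
```

Calling the helper with 1, 3, or 4 gives the other three configurations.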
Building Affinity Into an Executable:
One downside of using environment variables is that you have to remember to set them before a run (either at the command line or via a run script, batch script, or 'dot' file). For native applications built with -mmic, the compiler option -par-affinity can be used to fix an affinity setting in the executable. Building a native application with this option lets the user specify affinity at compile time, so the user does not need to set KMP_AFFINITY. If KMP_AFFINITY is set anyway, it is overridden by the setting specified in -par-affinity; the environment variable is effectively ignored.
Performance Tip: KMP_BLOCKTIME Parameter
Another OMP parameter that can affect performance is KMP_BLOCKTIME. It sets how long a thread waits, after completing a parallel region, before going to sleep; the default value is 200 (milliseconds). The OpenMP runtime keeps its thread pool active for that interval in the expectation that another OpenMP parallel region will follow shortly. You may want to try KMP_BLOCKTIME values of 0, 50, and infinite to see whether any of them improves performance. Typically, a value of 0 should help cases where there is a lot of load imbalance. (Note that for an offload application, you may have to use the MIC_KMP_BLOCKTIME environment variable along with MIC_ENV_PREFIX=MIC to set the value properly.)
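A minimal sketch of trying the three values for an offload application, using MIC_ENV_PREFIX=MIC as noted above. `./my_offload_app` is a hypothetical application name, so the run line is commented out.

```shell
export MIC_ENV_PREFIX=MIC
for bt in 0 50 infinite; do
  # Passed to the coprocessor runtime as KMP_BLOCKTIME.
  export MIC_KMP_BLOCKTIME=$bt
  echo "run with MIC_KMP_BLOCKTIME=$MIC_KMP_BLOCKTIME"
  # Hypothetical run line; substitute your own application:
  # ./my_offload_app
done
```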
Performance Tip: -opt-threads-per-core Option
If you use an OMP affinity setting that utilizes fewer than 4 threads per core, you can add the option -opt-threads-per-core=n, where n is 1, 2, 3, or 4, to tell the compiler how many threads per core to assume so that it can make better scheduling decisions. For more details, see the article "Scheduling for Multiple Threads on Intel® MIC Architecture".
The OpenMP runtime provides mechanisms to bind OpenMP worker threads to specific processors. The Intel® Xeon Phi™ coprocessor supports 4 thread contexts per core. Generally, applications benefit from using a good portion of these thread contexts. Thus, you should experiment with creating from N-1 to 4x(N-1) threads on the Intel® Xeon Phi™ coprocessor, where N is the number of cores. Four simple experiments can be run: run the application with N-1, 2x(N-1), 3x(N-1), and 4x(N-1) threads to determine whether the additional thread contexts give your application a performance benefit. Each application will have its own ideal number of threads. Next, consider avoiding conflicts with OS threads by keeping application threads off the core that runs the OS threads. Some sample settings to try: