Compiler Methodology for Intel® MIC Architecture
Efficient Parallelization, OpenMP Thread Affinity Control
The Intel® OpenMP* runtime library has the ability to bind OpenMP threads to physical processing units. The interface is controlled using the KMP_AFFINITY environment variable. Depending on the system (machine) topology, application, and operating system, thread affinity can have a dramatic effect on the application speed.
Thread affinity restricts execution of certain threads (virtual execution units) to a subset of the physical processing units in a multiprocessor computer.
There are two considerations for OpenMP threading and affinity: first, how many threads to use, and second, how to bind those threads to specific processor cores.
First, the Intel® Xeon Phi™ coprocessor supports 4 thread contexts per core, so an initial consideration is how many application threads are optimal for this processor. This will depend on your application. In general, more threads help to hide latencies inherent in your application (while one thread is stalled waiting for memory, one to three other threads can be scheduled on the same core). On Intel® Xeon® Architecture, users have found that CPU-intensive HPC applications GENERALLY do not benefit from hyperthreading. This is NOT true on the Intel® Many Integrated Core Architecture (Intel® MIC Architecture). Thus, it is important to try using some number of the 4 thread contexts available on each core of the Intel® Xeon Phi™ coprocessor.
Good advice: try different numbers of threads, from N-1 threads to 4x(N-1) threads, where N is the number of physical cores on the processor. Four simple experiments can be run: run the application with N-1, 2x(N-1), 3x(N-1), and 4x(N-1) threads to determine whether the additional thread contexts give a performance benefit to your application. Why N-1 instead of N? OS overhead: the OS and MPSS threads do take processor cycles, and it is inefficient to schedule worker threads on a core where the OS threads are contending for cycles. This leads naturally to the second consideration: how to place threads on the cores of the Intel® Xeon Phi™ coprocessor. Specifically, which core are the OS threads using, and how can you avoid scheduling worker threads on that core?
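The four experiments above can be enumerated mechanically. The following is a minimal sketch, not part of the original guide: the function name candidate_thread_counts is hypothetical, the core count of 61 is an assumed value used only for illustration, and OMP_NUM_THREADS is the standard OpenMP environment variable for setting the thread count (it is not named in the text above).

```python
def candidate_thread_counts(physical_cores, contexts_per_core=4):
    """Return the thread counts k*(N-1) for k = 1..contexts_per_core.

    N is the number of physical cores; one core is left to the OS and
    MPSS threads, as the text recommends.
    """
    if physical_cores < 2:
        raise ValueError("need at least 2 cores to leave one for the OS")
    worker_cores = physical_cores - 1
    return [k * worker_cores for k in range(1, contexts_per_core + 1)]


if __name__ == "__main__":
    # 61 is an assumed core count, chosen only for illustration.
    for threads in candidate_thread_counts(61):
        print(f"OMP_NUM_THREADS={threads}")
```

Each printed value corresponds to one run of the application. Which count performs best has to be measured for each application.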
Affinity and KMP_AFFINITY environment variable
If you are not familiar with OpenMP thread affinity and the use of the KMP_AFFINITY environment variable, please first familiarize yourself with the concepts in your compiler documentation. Start with the Compiler XE User and Reference Guides (sections titled "Thread Affinity Interfaces" and "Programming for Intel® MIC Architecture"). Here is the ONLINE doc for Thread Affinity Interface
If you are already familiar with OpenMP thread affinity control and just want quick advice on processor mapping, start here: a quick overview of the OMP affinity settings for MIC
For the definitive guide to OpenMP usage on Intel® Xeon Phi™ coprocessors, read the white paper Best Known Methods for Using OpenMP* on Intel® Many Integrated Core (Intel® MIC) Architecture [PDF 259KB] (This document is hosted on MIC-DEV. Ensure that you have access to this web portal)
While not specifically about thread affinity, many advanced users are curious about OpenMP thread pool creation and thread join latencies. For information on this topic, read the white paper OpenMP* Thread Pools with Intel® Many Integrated Core (Intel® MIC) [PDF 250KB] (This document is hosted on MIC-DEV. Ensure that you have access to this web portal)
The OpenMP runtime provides mechanisms to bind OpenMP worker threads to specific processors. The Intel® Xeon Phi™ coprocessor supports 4 thread contexts per core. Generally, users see application benefit from using a good portion of these thread contexts. Thus, you should experiment with creating from N-1 to 4x(N-1) threads on the Intel® Xeon Phi™ coprocessor, where N is the number of cores. Four simple experiments can be run: run the application with N-1, 2x(N-1), 3x(N-1), and 4x(N-1) threads to determine whether the additional thread contexts give a performance benefit to your application. Each application will find an ideal number of threads. Next, consider avoiding conflicts with OS threads by avoiding the core running OS threads. Some sample settings to try:
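As one illustration of such a sweep (a sketch under stated assumptions, not the guide's own list of settings), the snippet below pairs each thread count with a KMP_AFFINITY value. The affinity types compact, scatter, and balanced and the granularity=fine modifier are taken from the Intel OpenMP runtime's Thread Affinity Interface documentation. The function name sample_settings and the core count of 61 are hypothetical, and OMP_NUM_THREADS is the standard OpenMP thread-count variable.

```python
# Affinity types to compare; "balanced" is documented for Intel MIC Architecture.
AFFINITY_TYPES = ("compact", "scatter", "balanced")


def sample_settings(physical_cores, contexts_per_core=4):
    """Build environment-variable settings for a thread-count/affinity sweep.

    Thread counts are k*(N-1) for k = 1..contexts_per_core, so that one
    core's worth of thread contexts is left unused by worker threads.
    """
    settings = []
    for k in range(1, contexts_per_core + 1):
        threads = k * (physical_cores - 1)
        for affinity in AFFINITY_TYPES:
            settings.append({
                "OMP_NUM_THREADS": str(threads),
                "KMP_AFFINITY": f"granularity=fine,{affinity}",
            })
    return settings


if __name__ == "__main__":
    # 61 is an assumed core count, chosen only for illustration.
    for env in sample_settings(61):
        print(" ".join(f"{name}={value}" for name, value in env.items()))
```

Whether a given setting actually keeps worker threads off the core used by the OS depends on how the runtime maps threads to thread contexts. Adding the verbose modifier to KMP_AFFINITY makes the runtime print the bindings it chose, which is one way to check.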
It is essential that you read this guide from start to finish, using the built-in hyperlinks to guide you along a path to a successful port and tuning of your application(s) on the Intel® Xeon Phi™ coprocessor. The paths provided in this guide reflect the steps necessary to get the best possible application performance.