Thread affinity in offload codes

Ronald Green's paper on thread affinity control ( ) mentions that MPSS uses some of the coprocessor's cores to manage offload processes. The paper recommends setting the number of threads to 4*(N-1) in offload applications on an N-core coprocessor, in order to leave four logical cores free for MPSS.

However, I remember that older documents suggested that some specific logical cores are allocated for MPSS. It was cores 0, 1, 238 and 239 for a 60-core coprocessor, or something of that sort. Is this still the case in MPSS Gold and later?

What I want to know is — for best performance, should I set MIC_KMP_AFFINITY=explicit,proclist={something specific}, or will it be enough to just set MIC_OMP_NUM_THREADS=236 and MIC_KMP_AFFINITY=balanced/compact/scatter (depending on the application)?
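For a 60-core coprocessor, the two alternatives I have in mind could be sketched as shell exports like the following (the explicit proclist is my assumption based on the 4*(N-1) rule; nothing here launches an actual offload run):

```shell
#!/bin/sh
# 60-core coprocessor: 4*(60-1) = 236 OpenMP threads, leaving the
# MPSS-preferred core's logical processors (0, 237, 238, 239) free.
N=60
threads=$((4 * (N - 1)))

# Option A: explicit pinning with a hand-built proclist (assumed layout).
export MIC_OMP_NUM_THREADS=$threads
export MIC_KMP_AFFINITY="explicit,proclist=[1-236]"
echo "explicit: MIC_OMP_NUM_THREADS=$MIC_OMP_NUM_THREADS"

# Option B: rely on a generic affinity type instead.
export MIC_KMP_AFFINITY=balanced
echo "generic: MIC_KMP_AFFINITY=$MIC_KMP_AFFINITY"
```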


Logical processor 0 and the three highest-numbered logical processors are associated with the core that MPSS prefers for its own threads. It should no longer be necessary to set a full proclist for the usual cases, which are covered by the MIC_KMP_PLACE_THREADS environment variable, e.g. MIC_KMP_PLACE_THREADS=59C,4t to use 59 cores with 4 threads/core. MIC_KMP_AFFINITY=compact|balanced works along with this setting; balanced without PLACE_THREADS is likely to place threads on the core you wish to avoid. Adding the verbose option to MIC_KMP_AFFINITY will give you a list of the effective settings.
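Concretely, that suggestion amounts to something like the following (a sketch; `verbose` only adds diagnostic output and does not change placement):

```shell
#!/bin/sh
# Reserve the MPSS core by restricting OpenMP to 59 of the 60 cores.
export MIC_KMP_PLACE_THREADS=59C,4t        # 59 cores x 4 threads/core = 236
export MIC_KMP_AFFINITY=balanced,verbose   # or compact; verbose logs the binding
echo "PLACE_THREADS=$MIC_KMP_PLACE_THREADS AFFINITY=$MIC_KMP_AFFINITY"
```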

I assume you're aware of the MIC_PREFIX=MIC_ setting for enabling the scheme of keeping separate settings for the host and for MIC offload.
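A sketch of that scheme (the MIC_PREFIX spelling follows the mention above; the thread counts are arbitrary examples):

```shell
#!/bin/sh
# Variables carrying the chosen prefix are forwarded to the coprocessor
# with the prefix stripped; unprefixed ones apply to the host.
export MIC_PREFIX=MIC_
export OMP_NUM_THREADS=16          # host OpenMP threads
export MIC_OMP_NUM_THREADS=236     # becomes OMP_NUM_THREADS=236 on the card
export MIC_KMP_AFFINITY=balanced   # affects only the offloaded regions
echo "host=$OMP_NUM_THREADS mic=$MIC_OMP_NUM_THREADS"
```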

Thank you, Tim! The "verbose" option is a great suggestion. It gives a lot of useful info and can probably help me to answer my own question.

I inspected the output with "MIC_KMP_AFFINITY=balanced,verbose" and found that I get the same mapping of threads to procs regardless of whether I set MIC_KMP_PLACE_THREADS=59C,4t or leave it unset. However, that does not seem to matter: with or without MIC_KMP_PLACE_THREADS, the offloaded code maps the OS procs only to cores 0 through 58, so core 59 is excluded. This is the output that I get with MIC_KMP_AFFINITY=balanced,verbose and MIC_KMP_PLACE_THREADS="" on a 60-core coprocessor:

OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236}

OMP: Info #156: KMP_AFFINITY: 236 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #159: KMP_AFFINITY: 1 packages x 59 cores/pkg x 4 threads/core (59 total cores)
OMP: Info #206: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #171: KMP_AFFINITY: OS proc 1 maps to package 0 core 0 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 2 maps to package 0 core 0 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 3 maps to package 0 core 0 thread 2


OMP: Info #171: KMP_AFFINITY: OS proc 233 maps to package 0 core 58 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 234 maps to package 0 core 58 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 235 maps to package 0 core 58 thread 2
OMP: Info #171: KMP_AFFINITY: OS proc 236 maps to package 0 core 58 thread 3
OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {1}
OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {5}
OMP: Info #147: KMP_AFFINITY: Internal thread 2 bound to OS proc set {9}


OMP: Info #147: KMP_AFFINITY: Internal thread 232 bound to OS proc set {224}
OMP: Info #147: KMP_AFFINITY: Internal thread 233 bound to OS proc set {228}
OMP: Info #147: KMP_AFFINITY: Internal thread 234 bound to OS proc set {232}
OMP: Info #147: KMP_AFFINITY: Internal thread 235 bound to OS proc set {236}

So, correct me if I am wrong: this output suggests that MPSS is actually already smart enough to avoid scheduling calculations on its designated core. Or are there situations where this will not be the case?

P.S.: if the variable MIC_KMP_PLACE_THREADS is documented in any public resources, could you please point me to them?

Yes, in offload mode the core used by the OS is avoided by default. There is an override option for KMP_AFFINITY.

I think it would be helpful if the current references were posted at the top of this forum (or via a handy search applet); now that you mention it, I'm having difficulty searching for them. Does this content index work for you?

After a few more searches, I did find:

Several of the presentations are offered with both video and .pdf text at

Treatment of OpenMP and MPI environment variables appears not to be entirely up to date.

KMP_PLACE_THREADS is documented in the release notes for compiler 2013 update 2.

By the way, there's also an obscure reference to two of the new OpenMP 4 features being implemented, in the simd and target categories:

I've searched unsuccessfully for adequate documentation.

is another recent bit of important documentation.

Thank you for all the good information, Tim! Do you know whether the documentation for Intel C++ Compiler XE 2013 Update 2 that you mentioned is available publicly? A lot of the features that I find in white papers and on forums are not documented in the currently available reference (this).

The following reference refers to offload use of KMP_PLACE_THREADS:

I'm running an OpenMP 4 test program for publication in which


typically gives me a 50% performance boost over leaving these environment variables unset. A majority of individual tests perform better at 59c,1t in offload mode, although they scale to 2 threads/core when running natively on the MIC.

For those who are interested: with MKL automatic offload and large enough data sets, the default of using (nearly) all the MIC hardware threads, by leaving the environment variables unset, is best. On the other hand, I find it unusual to be able to use 4 threads per MIC core effectively with compiled source code, or to get effective use of the core which runs the MPSS. When I watch this benchmark with micsmc-gui, both core 0 and the last core are occupied by system activity, while user activity is spread evenly across cores 1-58. Note that default placement in offload mode reserves a core for system activity, unlike MIC native mode, but it still pays to restrict user activity to a slightly smaller number of cores.
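A sketch of sweeping placements as described above (the benchmark binary is hypothetical, so the loop only prints each candidate setting):

```shell
#!/bin/sh
# Try one, two, and four threads per core on 59 cores; substitute the
# real offload benchmark for the echo to time each configuration.
for placement in 59c,1t 59c,2t 59c,4t; do
    export MIC_KMP_PLACE_THREADS=$placement
    echo "run with MIC_KMP_PLACE_THREADS=$MIC_KMP_PLACE_THREADS"
    # ./offload_bench            # hypothetical benchmark invocation
done
```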

I have my test benchmark working with ifort 14.0.2.  I'm still working on a version where the omp target (offload) work is done from C++, using C++ because inner_product() is more effective on MIC than omp simd reduction(+:  ) (while another benchmark kernel works well with Fortran or C but not so well with C++ STL).
