topology "levels" definitions for KMP affinity

I noticed that the KMP_PLACE_THREADS variable (newly supported on non-Xeon-Phi platforms starting with the version 16 compilers) refuses to work on systems with HyperThreading disabled in the BIOS. At run time, the KMP_AFFINITY=verbose option results in the error message:

OMP: Warning #246: KMP_PLACE_THREADS ignored: only three-level topology is supported.

This seems strange to me.  KMP_PLACE_THREADS was initially developed to work on Xeon Phi (Knights Corner), where the topology has only two levels: cores and threads.  The Intel 16 compiler documentation for KMP_PLACE_THREADS says that it supports a 3-level topology (sockets, cores, threads), but it does not say that the option will be ignored if any of those levels are missing.

Does this also mean that KMP_PLACE_THREADS will not work on a single-socket system with a two-level [cores,threads] topology?   (Like, for example, a Xeon Phi?)

There is (to me) a clear parallel between the original use of KMP_PLACE_THREADS for [cores,threads] on Xeon Phi (KNC) and the use of KMP_PLACE_THREADS for [sockets,cores] on mainstream Xeon systems.   It does not seem like a third level of topology is required for KMP_PLACE_THREADS to be useful, and it does seem like truncating the lowest level (by disabling HyperThreading) should be something that the topology routines could deal with....

I understand that there are other KMP environment variables that may provide the same distribution as KMP_PLACE_THREADS, but it would be much easier for me to change the contents of the KMP_PLACE_THREADS variable for HT-enabled vs HT-disabled systems, rather than using entirely different environment variables to control the placement....
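
For reference, a few lines of C/OpenMP are enough to see where each thread actually lands under a given setting. Here is a minimal sketch, assuming Linux/glibc for sched_getcpu(); the KMP_PLACE_THREADS value in the comments is only an illustration:

/* placement_check.c -- minimal sketch for verifying where OpenMP threads land.
 * Assumes Linux/glibc (sched_getcpu); nothing here depends on a particular
 * KMP_* setting.
 *
 * Build:  icc -qopenmp placement_check.c -o placement_check
 * Run:    KMP_PLACE_THREADS=12c,1t KMP_AFFINITY=verbose ./placement_check
 *         (the 12c,1t value is only an illustration)
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        /* Each OpenMP thread reports the logical processor it is running on. */
        printf("OMP thread %3d of %3d on logical CPU %3d\n",
               omp_get_thread_num(), omp_get_num_threads(), sched_getcpu());
    }
    return 0;
}

Running this with KMP_AFFINITY=verbose on an HT-disabled node shows both the warning quoted above and the placement the runtime actually falls back to.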

 

"Dr. Bandwidth"

Hi John.

What you described is obviously a bug.  We will fix it in the next compiler release (17.0 gold, coming to the public this summer, and of course in later compilers, including future updates of the 16.0 compiler).  At the same time, the variable name KMP_PLACE_THREADS will be deprecated and replaced by KMP_HW_SUBSET. This is being done because people complained that the old name is confusing (e.g., PLACE is now an official term in the OpenMP specification); the old name will still work for some time. Also, KMP_HW_SUBSET will work with KMP_AFFINITY=none, which is not possible in current releases.

BTW, a single socket still forms a level of the topology from the OpenMP runtime's point of view; that's why the variable works on Phi.  But the absence of hyperthreads reduces the number of levels to two.

Regards, 
Andrey

In the recently released beta compiler, the documentation still advises use of KMP_PLACE_THREADS, but that apparently has no effect (on MIC) except to produce a message about requiring the new undocumented name.

Thanks for the quick response!

I was hoping for this one to be usable immediately because KMP_PLACE_THREADS was very helpful on Xeon Phi (KNC) and we have at least four different core numbering conventions across our many Intel-based systems.  This makes it difficult to write scripts that set up the same affinity on different systems.   I have not really looked at the recent OpenMP extensions to see if they have the right level of abstraction to do what I need....

For general amusement, here are the core numbering schemes on different TACC systems:

  • Without HyperThreading (Stampede, Maverick, Hikari)

    • "Interleaved" distribution: alternate even/odd logical processors across the two sockets
    • "Blocked" distribution: logical processors 0..N/2-1 on socket 0, logical processors N/2..N-1 on socket 1
    • "Double-Block" distribution:
      • logical processors 0..(N/4-1) and (N/2)..(3*N/4-1) on socket 0
      • logical processors (N/4)..(N/2-1) and (3*N/4)..(N-1) on socket 1
    • Stampede
      • login nodes use "interleaved" distribution
      • compute nodes use "Blocked" distribution
    • Maverick -- both login and compute nodes use the "Blocked" distribution
    • Hikari
      • login nodes are 2x10-core with a "Double-Block" distribution
      • compute nodes are 2x12-core with a "Double-Block" distribution
  • With HyperThreading (Lonestar5, Wrangler)
    • Let "N" be the number of Logical Processors in the 2-socket system

      • N/2 is the number of Physical cores in the 2-socket system
      • N/4 is the number of Physical cores per socket
    • "Two-Pass Blocked"
      • Logical processors 0..N/4-1 map to thread context 0 on each of the N/4 physical cores in socket 0
      • Logical processors N/4..N/2-1 map to thread context 0 on each of the N/4 physical cores in socket 1
      • Logical processors N/2..3*N/4-1 map to thread context 1 on each of the N/4 physical cores in socket 0
      • Logical processors 3*N/4..N-1 map to thread context 1 on each of the N/4 physical cores in socket 1
    • "Two-Pass Interleaved"
      • Logical processors 0..N/2-2 by 2 map to thread context 0 on each of the N/4 physical cores in socket 0
      • Logical processors 1..N/2-1 by 2 map to thread context 0 on each of the N/4 physical cores in socket 1
      • Logical processors N/2..N-2 by 2 map to thread context 1 on each of the N/4 physical cores in socket 0
      • Logical processors N/2+1..N-1 by 2 map to thread context 1 on each of the N/4 physical cores in socket 1
    • Lonestar5
      • login nodes have HyperThreading disabled and use an "Interleaved" distribution
      • compute nodes use the "Two-Pass Blocked" distribution
    • Wrangler -- both login and compute nodes use the "Two-Pass Interleaved" distribution
  • Xeon Phi
    • Knights Corner coprocessors

      • distribution is too complicated to explain here
    • Knights Landing
      • Details "Real Soon Now"
      • Based on public disclosures it is clear that there must be different numbering schemes for the single-NUMA-node configuration and the four-NUMA-node configuration

That gets us to at least 8 different core numbering schemes across these five systems.   Blecch!
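
Rather than hard-coding each of these schemes, the mapping can be recovered at run time from the Linux sysfs topology files. A minimal sketch, assuming the standard /sys/devices/system/cpu/cpuN/topology layout and contiguously numbered, online CPUs:

/* cpu_topology_dump.c -- sketch: recover the logical-CPU -> (socket, core)
 * mapping from sysfs instead of hard-coding per-system numbering schemes.
 *
 * Build:  cc cpu_topology_dump.c -o cpu_topology_dump
 */
#include <stdio.h>

/* Read a single integer from a sysfs file; return -1 if it cannot be read. */
static int read_int(const char *path)
{
    FILE *f = fopen(path, "r");
    int v = -1;
    if (f) {
        if (fscanf(f, "%d", &v) != 1)
            v = -1;
        fclose(f);
    }
    return v;
}

int main(void)
{
    char path[256];
    printf("%8s %8s %8s\n", "logical", "socket", "core");
    for (int cpu = 0; cpu < 4096; cpu++) {
        /* physical_package_id is the socket; core_id is the core within it. */
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/topology/physical_package_id", cpu);
        int socket = read_int(path);
        if (socket < 0)     /* no cpuN directory: assume we have seen them all */
            break;
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/topology/core_id", cpu);
        printf("%8d %8d %8d\n", cpu, socket, read_int(path));
    }
    return 0;
}

Logical CPUs that report the same (socket, core) pair are HyperThread siblings, so the same output also shows which of the HT-enabled and HT-disabled layouts above a given node uses.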

"Dr. Bandwidth"

I see that the newer library prints the message about requiring the change to KMP_HW_SUBSET, but it does not mention the need to open a new shell; the old name is not simply deprecated, it prevents OpenMP from running.

My Intel Xeon Phi installation is old enough that it probably is not supported.  While launching native OpenMP in BusyBox, I could not find a satisfactory way to set affinity with the libiomp5 of the 17.0 beta update 1 other than by closing the shell and opening a new one with the new wording.  I am now in an area without wired internet, several hours of travel from that machine.

KMP_PLACE_THREADS and KMP_HW_SUBSET are two names for the same setting.  The former is somewhat ambiguous, implying the placement of threads, which is not what this setting does.

The old name, KMP_PLACE_THREADS, should continue to work.  I am still using it with the beta 2017 and even the latest internal compilers.
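
If a particular runtime build ends up honoring only one of the two names, a small launcher along the following lines can mirror whichever variable is set into the other before exec'ing the real program. This is only a sketch of a possible workaround, not anything shipped with the compiler:

/* kmp_name_shim.c -- hypothetical wrapper that copies KMP_PLACE_THREADS into
 * KMP_HW_SUBSET (or vice versa) before launching the real program, so one job
 * script works with runtimes that recognize only one of the two names.
 *
 * Build:  cc kmp_name_shim.c -o kmp_name_shim
 * Usage:  ./kmp_name_shim ./a.out [args...]
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s program [args...]\n", argv[0]);
        return 1;
    }

    const char *old_name = getenv("KMP_PLACE_THREADS");
    const char *new_name = getenv("KMP_HW_SUBSET");

    /* Mirror whichever name is set into the other; never overwrite. */
    if (old_name && !new_name)
        setenv("KMP_HW_SUBSET", old_name, 0);
    else if (new_name && !old_name)
        setenv("KMP_PLACE_THREADS", new_name, 0);

    execvp(argv[1], &argv[1]);
    perror("execvp");   /* reached only if the exec itself fails */
    return 1;
}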

 

 
