Interpretation of environment variables for reductions

While browsing the OpenMP runtime source, I ran across a number of undocumented environment variables, plus one environment variable that appears to be spelled incorrectly in the Intel 13.1 compiler documentation.

(1) In "kmp_settings.c", I see that the variable KMP_FORCE_REDUCTION can be set to "critical", "atomic", or "tree".  

  • "tree" appears to be the default.
  • "critical" is very slow (50x slower on 240 threads)
  • "atomic" is quite a bit faster than the default on Xeon Phi SE10P

Using the EPCC OpenMP_Bench_C_v2/syncbench.c test on Xeon Phi SE10P, I measure about 33% lower overhead with "atomic" on 60 cores, whether using 1, 2, 3, or 4 threads per core.   Is this expected?   Is there a reason why this should not be the default?

(2) The variable that is referred to as "KMP_DETERMINISTIC_REDUCTIONS" in the Intel 13.1 compiler guide online is actually "KMP_DETERMINISTIC_REDUCTION" in the source code (again in "kmp_settings.c").    With either spelling, setting the variable to 1 does not change the performance reported by the EPCC OpenMP syncbench code.   The correct spelling appears to be KMP_DETERMINISTIC_REDUCTION, since this is the variable that is printed out when I run a job with the KMP_SETTINGS variable set to "1" --- when I set KMP_DETERMINISTIC_REDUCTION to 1, the runtime prints "KMP_DETERMINISTIC_REDUCTION=true".

Is it expected that the deterministic reduction gives the same performance as the default?    In general the Intel compilers are very aggressive with reordering of operations inside OpenMP reduction clauses, so it seems odd that the default algorithm would be the deterministic algorithm rather than the fastest algorithm.

(3) In "kmp_settings.c" there are environment variables "KMP_REDUCTION_BARRIER" and "KMP_REDUCTION_BARRIER_PATTERN".  These are also printed out by the runtime when I set "KMP_SETTINGS=1".    The available values for the KMP_REDUCTION_BARRIER_PATTERN" are "linear", "tree", and "hyper".   The default value is "hyper,hyper".   The KMP_REDUCTION_BARRIER variable takes two numerical arguments, with a default value of "1,1".

I did not find that changing these values made any noticeable difference to the reduction operation overhead reported by the EPCC OpenMP syncbench benchmark.

Is there a simple set of words that can describe which part of the reduction operation these refer to and what the first and second arguments correspond to?

(4) Repeat question (3) for "KMP_PLAIN_BARRIER" and "KMP_FORKJOIN_BARRIER", which have the same structure (but slightly different defaults on my Xeon Phi SE10P).

Thanks for any comments!

John D. McCalpin, PhD
"Dr. Bandwidth"

Hi John,

The incorrect spelling of the environment variable KMP_DETERMINISTIC_REDUCTION was fixed in the 14.0 compiler, as you can see in the documentation: http://software.intel.com/en-us/node/459658.

The default reduction method was chosen according to results of performance testing on the hardware available at that time. Since then, changes in hardware and software could have affected performance.

If you look deeper into the code, you can see that the reduction method depends on the number of threads in a team. For a small number of threads "atomic" is used, whereas for a large number of threads (>8 on Xeon Phi) "tree" is used.

"Tree" reduction is deterministic by nature, so it is used whenever a user set KMP_DETERMINISTIC_REDUCTION=true.

Thank you.

Some more answers:

(1) and (2) were answered by Olga. We will consider changing the default settings for reduction on Xeon Phi. The problem here is that the library has no information on the type of the reduced data, so what is better for integer data may be worse for floating-point data. Anyway, thank you for pointing this out.

(3) and (4) concern low-level adjustment of the library's behaviour at barriers. Let me shed a bit of light here. In our implementation each barrier has two phases: a join (gather) phase and a fork (release) phase. Thus each variable takes two values: the first for the gather phase of a barrier, the second for the release phase. If KMP_*_BARRIER_PATTERN has the value "linear" for some phase, then the corresponding value in KMP_*_BARRIER is ignored, as that variable affects only the tree and hyper kinds of barrier.

The KMP_REDUCTION_BARRIER* variables affect the reduction operation only when the library has decided that the reduction should be of the "tree" kind; the settings are ignored otherwise. For a tree reduction a barrier is executed, and KMP_REDUCTION_BARRIER=1,2, for example, means that in the gather phase the barrier will be structured as a binary tree, i.e. threads will be grouped in pairs (parent + child), while in the release phase threads will be grouped in quartets (parent + 3 children). For the value 3, threads are grouped in octets, etc. So each value is a power-of-2 exponent that specifies the width of the subtrees. In the gather phase of the reduction barrier, each parent thread accumulates the private data of its children, in each group, at each level of the tree. The master thread of the team finally updates the shared reduction variable after the reduction barrier.

Note also that these low-level tuning possibilities are not documented, which means they can be changed or removed at any time.

Regards, 
Andrey

Thanks very much for the overview of the KMP_*_BARRIER variables -- this is exactly what I was looking for!

John D. McCalpin, PhD
"Dr. Bandwidth"
