While browsing the OpenMP runtime source, I ran across a number of undocumented environment variables and one environment variable that appears to be spelled incorrectly in the Intel 13.1 compiler documentation.
(1) In "kmp_settings.c", I see that the variable KMP_FORCE_REDUCTION can be set to "critical", "atomic", or "tree".
- "tree" appears to be the default.
- "critical" is very slow (50x slower on 240 threads).
- "atomic" is quite a bit faster than the default on Xeon Phi SE10P.
Using the EPCC OpenMP_Bench_C_v2/syncbench.c test on Xeon Phi SE10P, I measure about 33% lower overhead with "atomic" on 60 cores, whether using 1, 2, 3, or 4 threads per core. Is this expected? Is there a reason why this should not be the default?
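For anyone who wants to try this without the full EPCC suite, here is a minimal sketch of the kind of measurement I mean. It is not the EPCC code: the function names are my own, it times a parallel region whose only work is a sum reduction, and it does not subtract a reference time the way syncbench does, so the number includes fork/join cost. The clock() fallback is only there so the file still builds when OpenMP is not enabled.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Timestamp in seconds. Wall-clock when built with OpenMP; otherwise CPU
   time from clock(), which is only meaningful for the serial fallback. */
double now_sec(void) {
#ifdef _OPENMP
    return omp_get_wtime();
#else
    return (double)clock() / CLOCKS_PER_SEC;
#endif
}

/* One parallel region whose only work is a sum reduction: every thread
   contributes 1, so the result equals the team size (1 without OpenMP). */
int one_reduction(void) {
    int sum = 0;
#pragma omp parallel reduction(+ : sum)
    {
        sum += 1;
    }
    return sum;
}

/* Average time per reduction region in microseconds over `reps` repeats.
   Hypothetical helper, not part of EPCC; includes fork/join overhead. */
double avg_reduction_usec(int reps) {
    assert(reps > 0);
    volatile int sink = 0; /* keeps the compiler from dropping the loop */
    double t0 = now_sec();
    for (int i = 0; i < reps; i++) {
        sink += one_reduction();
    }
    double t1 = now_sec();
    (void)sink;
    return (t1 - t0) * 1e6 / reps;
}
```

Calling avg_reduction_usec(10000) from main and running the same binary once with KMP_FORCE_REDUCTION unset and once each with "critical", "atomic", and "tree" should be enough to see whether the differences above show up on other hardware.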
(2) The variable that the online Intel 13.1 compiler guide calls "KMP_DETERMINISTIC_REDUCTIONS" is actually "KMP_DETERMINISTIC_REDUCTION" in the source code (again in "kmp_settings.c"). The correct spelling appears to be KMP_DETERMINISTIC_REDUCTION, since that is the name the runtime prints when I run a job with the KMP_SETTINGS variable set to "1": when I set KMP_DETERMINISTIC_REDUCTION to 1, the runtime prints "KMP_DETERMINISTIC_REDUCTION=true". With either spelling, setting the variable to 1 does not change the performance reported by the EPCC OpenMP syncbench code.
Is it expected that the deterministic reduction gives the same performance as the default? In general the Intel compilers are very aggressive with reordering of operations inside OpenMP reduction clauses, so it seems odd that the default algorithm would be the deterministic algorithm rather than the fastest algorithm.
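The check I did by eye on the KMP_SETTINGS=1 output can be sketched as a small helper. This assumes each setting is echoed on its own line as NAME=value, possibly with leading blanks; I have only confirmed the "KMP_DETERMINISTIC_REDUCTION=true" text itself, and the function name is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if `line` is exactly "<name>=<value>", ignoring leading blanks
   and trailing whitespace; returns 0 otherwise. Hypothetical helper for
   scanning the text the runtime prints when KMP_SETTINGS=1. */
int settings_line_is(const char *line, const char *name, const char *value) {
    assert(line != NULL && name != NULL && value != NULL);

    /* Skip indentation in front of the variable name. */
    while (*line == ' ' || *line == '\t') {
        line++;
    }

    /* The name must be followed immediately by '=', so a longer name such
       as the "...REDUCTIONS" spelling does not match the shorter one. */
    size_t name_len = strlen(name);
    if (strncmp(line, name, name_len) != 0 || line[name_len] != '=') {
        return 0;
    }
    line += name_len + 1;

    size_t value_len = strlen(value);
    if (strncmp(line, value, value_len) != 0) {
        return 0;
    }
    line += value_len;

    /* Only whitespace may follow the value. */
    while (*line == ' ' || *line == '\t' || *line == '\r' || *line == '\n') {
        line++;
    }
    return *line == '\0';
}
```

Feeding each captured output line to settings_line_is(line, "KMP_DETERMINISTIC_REDUCTION", "true") shows whether the runtime actually picked up the setting, which is how the spelling difference becomes visible.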
(3) In "kmp_settings.c" there are environment variables "KMP_REDUCTION_BARRIER" and "KMP_REDUCTION_BARRIER_PATTERN". These are also printed out by the runtime when I set "KMP_SETTINGS=1". The available values for "KMP_REDUCTION_BARRIER_PATTERN" are "linear", "tree", and "hyper". The default value is "hyper,hyper". The KMP_REDUCTION_BARRIER variable takes two numerical arguments, with a default value of "1,1".
I did not find that changing these values made any noticeable difference to the reduction operation overhead reported by the EPCC OpenMP syncbench benchmark.
Is there a simple way to describe which part of the reduction operation these variables control, and what the first and second arguments correspond to?
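To sweep the pattern variable systematically, the candidate values can be generated rather than typed by hand. This is only a sketch: it assumes each of the two positions independently accepts any of the three names listed above, which I have not verified for every combination, and the function and macro names are mine.

```c
#include <assert.h>
#include <stdio.h>

#define NUM_PATTERNS 3
#define PAIR_LEN 16 /* "linear,linear" needs 14 bytes including the NUL */

/* The three pattern names accepted by KMP_REDUCTION_BARRIER_PATTERN. */
static const char *const PATTERNS[NUM_PATTERNS] = {"linear", "tree", "hyper"};

/* Fills `out` with every "first,second" combination, up to `capacity`
   entries, and returns how many were written (9 if capacity allows).
   Assumption: both positions take the same set of names. */
int pattern_pairs(char out[][PAIR_LEN], int capacity) {
    assert(out != NULL);
    int count = 0;
    for (int i = 0; i < NUM_PATTERNS; i++) {
        for (int j = 0; j < NUM_PATTERNS; j++) {
            if (count == capacity) {
                return count;
            }
            snprintf(out[count], PAIR_LEN, "%s,%s", PATTERNS[i], PATTERNS[j]);
            count++;
        }
    }
    return count;
}
```

Each generated string is a value to export as KMP_REDUCTION_BARRIER_PATTERN before one benchmark run; the last entry is the default "hyper,hyper", which serves as the baseline.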
(4) Repeat question (3) for "KMP_PLAIN_BARRIER" and "KMP_FORKJOIN_BARRIER", which have the same structure (but slightly different defaults on my Xeon Phi SE10P).
Thanks for any comments!