Reduction operation

Hello,

I would like to understand the behavior of the OpenMP reduction in the Intel OMP Runtime.

Taking a look at the code, I found that in the source file "kmp_csupport.c" there are two functions, "__kmpc_reduce_nowait" and "__kmpc_reduce", that, in my understanding, select the reduction method to apply to the data; but after that I don't understand how the results are combined.

Could somebody please give me some guidelines so that I can find the right place where the final result of the reduction is computed?

Thanks a lot for your help.

Best Regards,

Simone


Hi Simone,

The final result of the reduction is computed outside of the OpenMP RTL by code generated by the compiler.

Regards,
Andrey

Hello Andrey,

Thanks for your clarification; that was also my guess.

So, from the runtime we can only tell what kind of reduction will be applied (critical, atomic, tree)?

Is there any way to get information about the thread scheduling during the reduction operation, the chunk size for each thread, the operation (sum, product, ...), and the type of the reduction variable?

I can probably get information about the scheduling in the section where the reduction is transformed into parallel code, but what about the other information?

Thanks

Regards,

Simone

Hi Simone,

The library gets information from the compiler about the generated reduction methods, the number of reduction variables, and the total size of the data; there is no info on the data type(s) or reduction operation(s). The library then decides which particular method should be used from the list of methods provided by the compiler, reports the method back to the compiler, and acts depending on the method. I agree that knowing the data type(s) and operation(s) could help to choose the optimal reduction method, but that is not covered by the current design of the interface between the library and the compiler.

I am not sure what you meant by thread scheduling and chunk size; these terms are not applicable to the reduction, because it happens once per thread of the team at the end of the OpenMP construct.

Regards,
Andrey

Thanks again Andrey,

that really helps!

Regarding the information about thread scheduling and chunk size, I meant this: the reduction construct is transformed into a threaded program. As the reference guide says, if we have a program like this:

extern float foo( void );

int main () {
    int i;
    float r = 0.0;
    #pragma omp parallel for schedule(dynamic) reduction(+:r)
    for ( i = 0; i < 10; i ++ ) {
        r += foo();
    }
    return 0;
}

the transformed code looks like this:

extern float foo( void );
int main () {
    static int zero = 0;
    auto int gtid;
    auto float r = 0.0;
    __kmpc_begin( & loc3, 0 );
    // The gtid is not actually required in this example so could be omitted;
    // We show its initialization here because it is often required for calls into
    // the runtime and should be locally cached like this.
    gtid = __kmpc_global_thread_num( & loc3 );
    __kmpc_fork_call( & loc7, 1, main_7_parallel_3, & r );
    __kmpc_end( & loc0 );
    return 0;
}
struct main_10_reduction_t_5 { float r_10_rpr; };
static kmp_critical_name lck = { 0 };
static ident_t loc10; // loc10.flags should contain KMP_IDENT_ATOMIC_REDUCE bit set
                      // if compiler has generated an atomic reduction.
void main_7_parallel_3( int *gtid, int *btid, float *r_7_shp );

Here the function "main_7_parallel_3" is the thread function that will execute a part of the reduction, so it will have lower and upper bounds to index the data to compute (chunk indexes), a thread id, a pointer to the data, etc.

My question is whether it is possible to know how these threads are scheduled (the order in which they reduce; in other words, the order in which they hand their computed results back to the main thread that aggregates the final result). Does that make sense?

For example, the function "__kmp_for_static_init" in kmp_sched.cpp seems to compute the lower and upper bounds of the chunk for a given thread (tid), but I imagine that the order in which "__kmp_for_static_init" is called for the threads is not the same as the order in which the reduction combines the results, right?

Thanks again,
Best Regards,
Simone

Hi Simone,

In the general case it is possible to know the order of operations in each thread by looking at the thread id (omp_get_thread_num()) in each iteration of the loop, so you can see which data each thread accumulates into its local copy of the reduction variable. But it is not possible to know the order in which the threads' final data are combined into the shared variable. If the library chose the "atomic" reduction method, then each thread's data is accumulated into the shared variable using an atomic operation, and it is not generally possible to know the order of atomic operations.

Though it is possible to use deterministic reduction (e.g. set KMP_DETERMINISTIC_REDUCTION=true); then you can be sure that the order of combining the threads' final results into the shared final result is the same as the order of the threads gathering at the reduction barrier. In this case the reduction method will be "tree", so the reduction function (e.g. __kmpc_reduce_nowait()) will call __kmp_barrier, which will call __kmp_hyper_barrier_gather, which will call the reduction routine generated by the compiler in a particular order: each parent thread will combine its own local final result with the local final results of its children. Thus all partial results will eventually be accumulated into the master's copy of the reduction variable in a particular order. After the master thread returns from __kmpc_reduce_nowait it will accumulate its reduction result into the shared variable (in the code generated by the compiler inside the main_7_parallel_3 routine in your example). If the loop schedule is static, then it is guaranteed that the result of the reduction will be the same in different runs of the application. If the loop schedule is dynamic, then only the second phase of the reduction can be deterministic: the accumulation of the threads' final results. The first phase, the accumulation inside a thread, cannot be deterministic.

So, knowing the order of partial-result accumulation in each thread and then between threads, you can figure out the total order of the reduction (of course, only in the case of deterministic reduction).

Regards,
Andrey

Thanks a lot Andrey! This clarifies all my doubts.

I have just one last simple question: how can I activate the debug mode of the runtime? I mean, I'd like to print all the debug messages that are in the code, like "KA_TRACE(...)". I have tried defining some DEBUG variable, but nothing happens.

Thanks

Best,

Simone

To see KA_TRACE() prints you should set the environment variable KMP_A_DEBUG=&lt;value&gt; (the bigger the value, the more prints you will see). Similarly, to see KD_TRACE() you need to set KMP_D_DEBUG=100 (or another value, depending on the level of detail you want to see). E.g. KMP_A_DEBUG=100 will enable all traces of the kinds KA_TRACE(10,...), KA_TRACE(20,...), ..., KA_TRACE(100,...), but not KA_TRACE(1000,...). Of course, all prints are enabled only in a debug build of the library (e.g. built using the command "make mode=debug").
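Putting the steps above together, a typical session might look like this (the application name is a placeholder; the build command and variable names are the ones mentioned above):

```shell
# Build a debug version of the runtime library, then raise the trace
# levels before running the program linked against it.
make mode=debug             # debug build of the OpenMP runtime
export KMP_A_DEBUG=100      # show KA_TRACE() messages up to level 100
export KMP_D_DEBUG=100      # show KD_TRACE() messages up to level 100
./your_openmp_app           # hypothetical application binary
```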

Regards,
Andrey
