Avoiding unnecessary OpenMP synchronization

Say I wish to add a number of vectors:

cblas_daxpy(n, 1.0, a, 1, b, 1);
cblas_daxpy(n, 1.2, c, 1, d, 1);
cblas_daxpy(n, 1.4, e, 1, f, 1);
cblas_daxpy(n, 1.6, g, 1, h, 1);

MKL will use OpenMP to parallelize each of these vector additions internally. However, all of the OpenMP threads synchronize between the daxpy calls, adding overhead. Since I know the calls are independent of each other, this synchronization is unnecessary.

I could do

#pragma omp parallel sections
{
    #pragma omp section
    cblas_daxpy(n, 1.0, a, 1, b, 1);
    #pragma omp section
    cblas_daxpy(n, 1.2, c, 1, d, 1);
    #pragma omp section
    cblas_daxpy(n, 1.4, e, 1, f, 1);
    #pragma omp section
    cblas_daxpy(n, 1.6, g, 1, h, 1);
}

which parallelizes across the calls instead, but then I might not use all of my cores (by default MKL drops to a single thread when called inside a parallel region, so only four threads do work), and I can't take advantage of any load balancing if the vectors aren't all the same size, for example.

What's the recommended way to achieve maximum performance for code like this? Is the best practice the same on the Xeon Phi?


I would suppose that any gain from running multiple MKL calls in parallel depends on n being small enough that a single call doesn't scale linearly across all of your cores. In that case you would want to divide your cores among the MKL instances and pin each instance to its own group of cores. This may be easier to accomplish with MPI than with nested OpenMP parallelism.
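For the nested OpenMP route, a minimal sketch might look like the following. It assumes the threaded MKL linked with the Intel OpenMP runtime; the four-instance split and the add_vectors wrapper are illustrative, and the actual pinning of each instance to its own group of cores still has to come from the environment (e.g. KMP_AFFINITY or OMP_PLACES/OMP_PROC_BIND), which the code alone can't express:

#include <mkl.h>
#include <omp.h>

/* Illustrative wrapper: run four independent daxpy calls concurrently,
   with nested MKL parallelism inside each section. */
void add_vectors(int n, double *a, double *b, double *c, double *d,
                 double *e, double *f, double *g, double *h)
{
    int instances = 4;
    int per_instance = omp_get_num_procs() / instances;

    omp_set_max_active_levels(2);  /* allow the outer sections plus inner MKL teams */
    mkl_set_dynamic(0);            /* keep MKL from dropping to 1 thread inside a parallel region */

    #pragma omp parallel sections num_threads(instances)
    {
        #pragma omp section
        { mkl_set_num_threads_local(per_instance); cblas_daxpy(n, 1.0, a, 1, b, 1); }
        #pragma omp section
        { mkl_set_num_threads_local(per_instance); cblas_daxpy(n, 1.2, c, 1, d, 1); }
        #pragma omp section
        { mkl_set_num_threads_local(per_instance); cblas_daxpy(n, 1.4, e, 1, f, 1); }
        #pragma omp section
        { mkl_set_num_threads_local(per_instance); cblas_daxpy(n, 1.6, g, 1, h, 1); }
    }
}

Whether this beats four sequential, fully threaded daxpy calls still depends on n; for long vectors daxpy is memory-bandwidth bound and extra concurrency is unlikely to help.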

That can be a fairly effective scheme on MIC as well. Note that daxpy isn't among the MKL functions set up for automatic offload, since it's unlikely you could recover the cost of copying the data between the MIC and the host.
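For the MPI route mentioned above, a minimal sketch, assuming four ranks that each own one (alpha, x, y) triple; pinning each rank (and the MKL threads it spawns) to a disjoint group of cores is left to the MPI launcher's process-pinning controls:

#include <mpi.h>
#include <mkl.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int n = 1 << 20;                          /* illustrative vector length */
    double alpha[] = { 1.0, 1.2, 1.4, 1.6 };
    double *x = mkl_malloc(n * sizeof(double), 64);
    double *y = mkl_malloc(n * sizeof(double), 64);
    /* ... each rank fills x and y with its own pair of vectors ... */

    /* Launched with 4 ranks, each rank performs one independent daxpy.
       MKL's threads stay inside the rank's core group when the launcher
       pins ranks to disjoint sets of cores. */
    cblas_daxpy(n, alpha[rank % 4], x, 1, y, 1);

    mkl_free(x);
    mkl_free(y);
    MPI_Finalize();
    return 0;
}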
