Potential conflict between omp threads and cilk+ workers?

Dear everyone, 

The main part of my application uses the work-stealing approach provided by Cilk Plus or TBB. However, for some BLAS level 1 routines that I have no time to implement one by one, I chose to use MKL. That leads to a potential dilemma: I know that MKL employs OpenMP threads, whereas Cilk Plus and TBB have their own threading runtimes. Am I caught in a trap caused by a potential conflict between OpenMP threads and Cilk Plus workers? By conflict I mean risks such as oversubscription, which would harm overall performance.

Thanks for any comments and suggestions. 




I am not too familiar with the latest versions of MKL, but unfortunately, I think you may be correct in your analysis.  If you are calling the parallel versions of BLAS1 routines from MKL, and they are parallelized using OpenMP, then I think you risk oversubscription if you are using default values for CILK_NWORKERS and OMP_NUM_THREADS.

If you have enough parallel work outside the BLAS1 calls, then your application may still have enough parallelism if you just call serial BLAS1 routines. Alternatively, you might try to partition workers between OpenMP and Cilk Plus, e.g., if you have 8 cores total, then perhaps set CILK_NWORKERS to 6 and OMP_NUM_THREADS to 2. Unfortunately, I would guess that neither of these approaches would work as well as native Cilk Plus or TBB implementations of the routines you want.
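To make the 8-core example concrete, here is a minimal sketch of that arithmetic in C++. `WorkerSplit`, `split_cores`, and `as_env_prefix` are hypothetical helpers made up for illustration; they are not part of Cilk Plus, OpenMP, or MKL. Only the variable names CILK_NWORKERS and OMP_NUM_THREADS come from the discussion.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper type (not from any library): how many cores each
// runtime is allowed to use.
struct WorkerSplit {
    int cilk_workers;  // value intended for CILK_NWORKERS
    int omp_threads;   // value intended for OMP_NUM_THREADS
};

// Reserve `omp_threads` cores for OpenMP (i.e. MKL) and give the rest to
// Cilk Plus. The request is clamped so each runtime keeps at least one core.
WorkerSplit split_cores(int total_cores, int omp_threads) {
    assert(total_cores >= 2 && "need at least one core per runtime");
    if (omp_threads < 1) omp_threads = 1;
    if (omp_threads > total_cores - 1) omp_threads = total_cores - 1;
    return WorkerSplit{total_cores - omp_threads, omp_threads};
}

// Format the split as environment assignments for the launching shell.
std::string as_env_prefix(const WorkerSplit& s) {
    return "CILK_NWORKERS=" + std::to_string(s.cilk_workers) +
           " OMP_NUM_THREADS=" + std::to_string(s.omp_threads);
}
```

For 8 cores with 2 reserved for OpenMP, `as_env_prefix(split_cores(8, 2))` yields `CILK_NWORKERS=6 OMP_NUM_THREADS=2`, which could be placed in front of the program on the shell command line. A static split like this only bounds the total thread count if threaded MKL routines are not called from several Cilk workers at once.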

Which routines are you using?    Perhaps someone out there has parallelized some of the routines you are looking for in Cilk Plus, and might be willing to share their code?



Do you really mean BLAS 1?  If so, unless the vector sizes are huge, you may as well use a serial version of them.  I don't know how to force MKL into serial mode offhand, but that is probably the best solution; I would set OMP_NUM_THREADS to 1 and see how it goes, but there may be a better solution.

Alternatively, you can get the untuned BLAS libraries as source, but MKL will almost certainly be faster, as it will almost certainly use AVX/SSE/etc. vectorisation.
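For a sense of how little work a BLAS 1 routine does per element, here is a minimal serial sketch of two typical operations in C++. `axpy_serial` and `dot_serial` are hypothetical names; this is neither MKL's nor the reference BLAS code, and it assumes unit stride and equal-length vectors.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// y <- a*x + y, a unit-stride analogue of the BLAS 1 "axpy" operation.
// Illustrative only; a tuned library would vectorise this loop.
void axpy_serial(double a, const std::vector<double>& x,
                 std::vector<double>& y) {
    assert(x.size() == y.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] += a * x[i];
    }
}

// Returns the sum of x[i] * y[i], a unit-stride analogue of "dot".
double dot_serial(const std::vector<double>& x,
                  const std::vector<double>& y) {
    assert(x.size() == y.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}
```

Each loop makes a single pass over the data with one multiply and one add per element, which is why a serial call inside an already parallel Cilk Plus or TBB region is often a reasonable choice unless the vectors are very large.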

Is there any way out of the oversubscription problem other than carefully parceling out threads?  I know that there are oversubscription problems when you nest parallelism with OpenMP.  I'm not sure you can fix this even if you modify your top level to use OpenMP instead of TBB or Cilk.

Note that I don't follow OpenMP, so this may have been fixed/ameliorated by new features in OpenMP 4.

    - Barry

If you call MKL inside a cilk_for or cilk_spawn, you may wish to consider linking the sequential MKL library (Intel compiler option -mkl=sequential, or /Qmkl:sequential on Windows).

When you run an MKL threaded function in a cilk parallel region, it will not default to single thread as it would when called from an OpenMP parallel region, so you would need to limit the total number of threads to the number of cores, and you would lose any advantage of OpenMP affinity setting.
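The thread budget can be estimated with a small calculation. The sketch below is in C++ and rests on a stated assumption: every Cilk worker that calls a threaded MKL routine gets its own OpenMP team of OMP_NUM_THREADS threads. `peak_threads` and `max_omp_threads` are hypothetical helpers, not library functions.

```cpp
#include <cassert>

// Worst-case number of runnable threads, ASSUMING each Cilk worker that
// calls a threaded MKL routine gets its own OpenMP team of `omp_threads`
// threads (the calling worker counts as one member of its team).
int peak_threads(int cilk_workers, int omp_threads) {
    assert(cilk_workers >= 1 && omp_threads >= 1);
    return cilk_workers * omp_threads;
}

// Largest OMP_NUM_THREADS value that keeps peak_threads() within `cores`
// under the same assumption; never returns less than 1.
int max_omp_threads(int cores, int cilk_workers) {
    assert(cores >= 1 && cilk_workers >= 1);
    int t = cores / cilk_workers;
    return t < 1 ? 1 : t;
}
```

Under that assumption, 8 Cilk workers with OMP_NUM_THREADS=8 on an 8-core machine could reach `peak_threads(8, 8)`, i.e. 64 runnable threads, while `max_omp_threads(8, 8)` gives 1, which matches the advice to use sequential MKL when it is called from every worker.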

When you return from an MKL threaded function, OpenMP will hang onto its threads for the interval set by KMP_BLOCKTIME, so those threads will not be available to Cilk workers until BLOCKTIME times out.  Even if you set OMP_NUM_THREADS=1, you may expect that thread to be reserved until timeout.
