How to set affinity of threads spawned by MKL?

I have a program which invokes MKL from within an OpenMP parallel region. It sets $MKL_DYNAMIC and $MKL_NUM_THREADS so that MKL will exploit nested parallelism, and calls MKL to work on different sets of data from different OpenMP threads. Is it possible to set the affinity mask of threads spawned by MKL from a specific function call?

You may be able to set the environment variable KMP_AFFINITY or GOMP_AFFINITY prior to the parallel region. I don't think this will be effective when MKL_DYNAMIC is set. If these are your questions, it would be good to have an answer from the library experts.
I'm wondering why I don't find documentation on KMP_AFFINITY=physical, which appears to be the favored setting for HyperThreading.

My program sets MKL_DYNAMIC to FALSE. KMP_AFFINITY is basically something I try to avoid because it doesn't seem to work on AMD machines. What I hope to see is that threads executing a call to MKL will inherit the affinity mask of the calling OpenMP thread, or can have their affinity masks specified (perhaps through some sched_setaffinity magic?).

OK, then MKL_DYNAMIC should not be interfering. When I set KMP_AFFINITY=compact,0,verbose with the 10.1 compiler on a recent AMD machine, it gives me the non-support message, but tells me it is setting affinity as if there are 8 single core CPUs. This is effectively the same as taskset -c 0-7, as far as I can see. I don't see any reasonable behavior other than for the same affinity mask to persist in the nested OpenMP. According to the doc, sched_setaffinity() would be the mechanism used for KMP_AFFINITY, so what you see by sched_getaffinity() should be what MKL is using under OMP_NESTED, subject to its own determination of how many additional threads to use.
I agree with your implication that failing to support affinity mask in a similar way on Intel and AMD platforms would be a serious deficiency.
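
As a rough illustration of the sched_getaffinity() check mentioned above, here is a minimal sketch for Linux with glibc. The helper names allowed_cpu_count() and print_affinity() are made up for this example and are not part of MKL or any OpenMP runtime. It assumes the machine has no more CPUs than a default cpu_set_t can hold.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Hypothetical helper: number of CPUs in the calling thread's
   affinity mask, or -1 if the mask cannot be read. */
int allowed_cpu_count(void)
{
    cpu_set_t mask;

    CPU_ZERO(&mask);
    /* pid 0 means "the calling thread" */
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0)
        return -1;
    return CPU_COUNT(&mask);
}

/* Hypothetical helper: print the CPUs the calling thread may run on. */
void print_affinity(const char *label)
{
    cpu_set_t mask;
    int cpu;

    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_getaffinity");
        return;
    }
    printf("%s:", label);
    for (cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &mask))
            printf(" %d", cpu);
    printf("\n");
}
```

Calling print_affinity("before MKL") from the OpenMP thread just before the MKL call shows the mask that thread actually has at that point.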

Hello,

The MKL User's Guide has a section with examples of setting the affinity mask through the operating system. The section should be named something like "Managing Performance and Memory>Tips and Techniques to Improve Performance>Managing Multi-Core Performance". Keep in mind that the affinity mask is a per-thread attribute (on Linux, at least), so it should be set after the top-level OpenMP threads have been created.

Hope this helps
Thanks
Dima

I tried that, but it did not quite work. I pinned an OpenMP thread to a core (other threads were simply put to wait on a "#pragma omp barrier"), then called DGEMM from it and expected all MKL threads to get stuffed onto one core. But it seemed that MKL did not quite honor the affinity mask I set; the threads were spread over all cores. Of course this looks crazy. But given that, I really don't know what to do so that on a dual-socket quad-core machine, I can have one (physical) processor handle one DGEMM call and the other processor handle another call from inside the same parallel region.

Best Reply

Hi styc,

The instructions in the MKL User's Guide seem to be incomplete. The code snippet there apparently does not identify the thread correctly: instead of getpid() one should use syscall(SYS_gettid). Another issue is that the OpenMP layer applies affinity in terms of OpenMP threads, while these are dynamically mapped to OS threads. This can be worked around by setting the environment variable KMP_AFFINITY=disabled (see Thread Affinity Interface); this may have performance implications though, I don't know.

In summary, could you try this function for binding current thread to cpus?

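#define _GNU_SOURCE
#include <sched.h>
#include <sys/syscall.h>
#include <unistd.h>

// Handle up to 32 cpus
void bind_me_to(unsigned cpumask)
{
    cpu_set_t mask;
    pid_t tid = syscall(SYS_gettid); // OS thread id, not the process id
    int cpuid;

    CPU_ZERO(&mask);
    for (cpuid = 0; cpuid < 32; cpuid++)
    {
        // add every CPU whose bit is set in cpumask
        if (cpumask & (1u << cpuid)) CPU_SET(cpuid, &mask);
    }
    sched_setaffinity(tid, sizeof(mask), &mask);
}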

This function is assumed to be called in the following setup, if I understood you correctly (ensure the environment variables OMP_DYNAMIC=false and MKL_DYNAMIC=false are set to allow MKL threading in nested parallel regions):

#pragma omp parallel default(shared) num_threads(2)
{
    int omp_tid = omp_get_thread_num();
    omp_set_nested(1); // nested parallel regions should be enabled
    if (omp_tid==0)
    {
        bind_me_to(0x0f); // four threads on one socket
        omp_set_num_threads(4);
        do_dgemm();
    }
    if (omp_tid==1)
    {
        bind_me_to(0xf0); // four threads on another socket
        omp_set_num_threads(4);
        do_fft();
    }
}
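
Since the earlier attempt in this thread found that the mask did not seem to be honored, it may help to read the mask back after setting it. Below is a minimal sketch for Linux with glibc. The name bind_and_verify() is hypothetical, and it passes pid 0 (which Linux treats as the calling thread) instead of the gettid value used in bind_me_to().

```c
#define _GNU_SOURCE
#include <sched.h>

/* Hypothetical variant of bind_me_to(): bind the calling thread to the
   CPUs in cpumask (bits 0..31), then read the mask back.
   Returns 0 if the kernel reports exactly the requested CPUs, -1 otherwise. */
int bind_and_verify(unsigned cpumask)
{
    cpu_set_t want, got;
    int cpu;

    CPU_ZERO(&want);
    for (cpu = 0; cpu < 32; cpu++)
        if (cpumask & (1u << cpu))
            CPU_SET(cpu, &want);

    /* pid 0 means "the calling thread"; an empty or unusable mask fails */
    if (sched_setaffinity(0, sizeof(want), &want) != 0)
        return -1;

    CPU_ZERO(&got);
    if (sched_getaffinity(0, sizeof(got), &got) != 0)
        return -1;

    return CPU_EQUAL(&want, &got) ? 0 : -1;
}
```

This only confirms the mask of the thread that calls it. Whether the threads MKL starts afterwards end up with the same mask still depends on the OpenMP runtime settings discussed above.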

I hope this will help
Thanks
Dima

It seems that the key is "KMP_AFFINITY=disabled". The program now works as I expected. Thanks for your response!
