Problem with calling MKL gemm with pthread

Here is my problem:

I create 4 pthreads, and each thread calls MKL dgemm. I don't want to use the sequential MKL; I want to use the multicore MKL, since I am running on an Intel Xeon Phi. I want to map the first MKL dgemm to cores 1-15, the second dgemm to cores 16-30, the third dgemm to cores 31-45, and the last dgemm to cores 46-60.

The reason I want to do it this way is that I am running small dgemms; I think running these dgemms in parallel would make maximal use of the hardware resources.

How can I achieve this? I used kmp_set_affinity, but I didn't get the expected performance.


It's certainly possible to run multiple threaded copies of dgemm using MPI.

Your usage is more like OMP_NESTED, but I don't think kmp_affinity works well for that.

If this is a MIC platform, the MIC forum might be a place to ask.

>>...How can I achieve it? I use kmp_set_affinity, but I didn't get the correct performance...

In order to force an OpenMP thread to be executed on a dedicated CPU you need to use a trick based on the omp_get_thread_num() OpenMP function ( a long time ago I provided sources on IDZ ). However, threaded MKL could ignore it and use its own thread management (!).

Two possible solutions:

- Create 4 threads that will be executed on 4 different CPUs ( set-affinity functionality needs to be used / depends on the OS ) and the non-threaded MKL version needs to be used
or
- Create 4 additional processes that will be executed on 4 different CPUs ( set-affinity functionality needs to be used / depends on the OS ) and the non-threaded MKL version needs to be used

Quote:

TimP (Intel) wrote:

It's certainly possible to run multiple threaded copies of dgemm using MPI.

Your usage is more like OMP_NESTED, but I don't think kmp_affinity works well for that.

If this is a MIC platform, the MIC forum might be a place to ask.

I am not able to use MPI in my application. My application is a thread pool; each task in the pool is an MKL dgemm, and I have 4 threads to execute these tasks in parallel. I don't think there is a way to implement a thread pool with MPI.

I don't think my problem is related to MIC: it is just MKL, so the solution should be the same for both CPU and MIC. Few people in the MIC forum know MKL :)

Quote:

Sergey Kostrov wrote:

>>...How can I achieve it? I use kmp_set_affinity, but I didn't get the correct performance...

In order to force an OpenMP thread to be executed on a dedicated CPU you need to use a trick based on the omp_get_thread_num() OpenMP function ( a long time ago I provided sources on IDZ ). However, threaded MKL could ignore it and use its own thread management (!).

Two possible solutions:

- Create 4 threads that will be executed on 4 different CPUs ( set-affinity functionality needs to be used / depends on the OS ) and the non-threaded MKL version needs to be used
or
- Create 4 additional processes that will be executed on 4 different CPUs ( set-affinity functionality needs to be used / depends on the OS ) and the non-threaded MKL version needs to be used

Based on your solution, I need to use the sequential MKL, but I have 4 threads and 60 cores. If I call the sequential MKL on 15 cores, I cannot get the performance of 15 cores. I plan to run 4 multi-threaded MKL calls in parallel across 60 cores in total, each MKL call taking 15 cores.

I can only use multiple threads, not multiple processes, because I have shared variables.

>>...Based on your solution, I need to use the sequential MKL, but I have 4 threads and 60 cores. If I call the sequential MKL on 15 cores,
>>I cannot get the performance of 15 cores. I plan to run 4 multi-threaded MKL calls in parallel across 60 cores in total, each taking 15 cores.
>>
>>I can only use multiple threads, not multiple processes, because I have shared variables...

I understand that you need to calculate a product of two or three matrices ( based on what dgemm does, that is C = alpha*A*B + beta*C ). So, how big are these matrices?

Hi Wei,

Is there any API function that can lock each of your threads to exactly 15 cores (in your case)? Could the OS scheduler override such behaviour?

Quote:

Sergey Kostrov wrote:

>>...Based on your solution, I need to use the sequential MKL, but I have 4 threads and 60 cores. If I call the sequential MKL on 15 cores,
>>I cannot get the performance of 15 cores. I plan to run 4 multi-threaded MKL calls in parallel across 60 cores in total, each taking 15 cores.
>>
>>I can only use multiple threads, not multiple processes, because I have shared variables...

I understand that you need to calculate a product of two or three matrices ( based on what dgemm does, that is C = alpha*A*B + beta*C ). So, how big are these matrices?

Less than 1K x 1K. If I sequentially call the multi-threaded version of MKL and let MKL take all the cores, the performance is not good. So I am wondering: if I batch the MKL calls in parallel, can I get better performance?

Here is a link to a page with some information about OpenMP thread pools on MIC. Unfortunately I cannot download it.
Link: //software.intel.com/en-us/articles/openmp-thread-affinity-control-0

Quote:

iliyapolak wrote:

Here is a link to a page with some information about OpenMP thread pools on MIC. Unfortunately I cannot download it.
Link: //software.intel.com/en-us/articles/openmp-thread-affinity-control-0

Thanks very much. I use pthreads to implement the multi-threading. I just found out that even if I set affinity for a pthread, it does not actually set affinity for the MKL threads. I will re-write the multi-threading with OpenMP to see what happens.

Unfortunately I do not know pthreads (I assume that you are on Linux). Does MKL internally use pthreads to implement multithreading? I suppose that internally the routines which set processor affinity use the cpuid instruction to recognize the preferred logical processor and pin the thread to it. That is how pinning a thread to a logical processor is done on Windows; I think that Linux can use the same mechanism, albeit implemented differently. The OS scheduler can at any time preempt an affinity-pinned thread from running on its preferred core when a more privileged thread is scheduled (ready) to run, or when a DIRQL interrupt occurs (on Windows).

>>...less than 1K *1K. if I sequentially calling multi-thread version MKL, and let MKL takes all the cores...

I still do not fully understand the essence of your processing ( still fuzzy... ). Are you going to calculate the product of two ~1Kx1K matrices on 60 CPUs? Or do you actually have ~60Kx60K matrices and want to partition them into 60 ~1Kx1K matrices for calculation on 60 CPUs?

Another thing: if your "atomic" processing involves just two ~1Kx1K matrices, then the OpenMP overhead of threaded MKL could affect performance. I really expected matrix sizes of 64Kx64K or 128Kx128K, not 1Kx1K (!).

Quote:

Sergey Kostrov wrote:

>>...less than 1K *1K. if I sequentially calling multi-thread version MKL, and let MKL takes all the cores...

I still do not fully understand the essence of your processing ( still fuzzy... ). Are you going to calculate the product of two ~1Kx1K matrices on 60 CPUs? Or do you actually have ~60Kx60K matrices and want to partition them into 60 ~1Kx1K matrices for calculation on 60 CPUs?

Another thing: if your "atomic" processing involves just two ~1Kx1K matrices, then the OpenMP overhead of threaded MKL could affect performance. I really expected matrix sizes of 64Kx64K or 128Kx128K, not 1Kx1K (!).

I have a bunch of matrices whose sizes are between 256x256 and 1Kx1K. I don't want MKL to execute them one by one on 60 cores. I plan to batch them, as I said before: 4 matrices at a time, each taking 15 cores.

I have the same problem. I am using the MKL function zheev.

I want to parallelize this code. Let's say I have 16 cores in total.

I want MKL to use 4 cores. Can I evaluate 4 MKL function calls (zheev) simultaneously?

This way I can use the 16 cores (4x4) more effectively. I could not find anything about this in the MKL documentation.

You should be able to execute multiple copies of zheev in parallel. I believe zheev has not yet had any work done on internal threading.

Hi Saurabh, Wei,

Could you please provide a small runnable code sample, so we can test it on our side?

Best Regards,

Ying

Quote:

Saurabh Pradhan wrote:

I have the same problem. I am using the MKL function zheev.

I want to parallelize this code. Let's say I have 16 cores in total.

I want MKL to use 4 cores. Can I evaluate 4 MKL function calls (zheev) simultaneously?

This way I can use the 16 cores (4x4) more effectively. I could not find anything about this in the MKL documentation.

You can try to use pthread_setaffinity or sched_setaffinity to bind one pthread to 4 cores, and then call zheev. Don't forget to set MKL_NUM_THREADS to 4 and link with the multithreaded MKL.

Make sure OMP_NESTED is enabled, and that KMP_AFFINITY, MKL_DYNAMIC and OMP_DYNAMIC are disabled.

If this method doesn't work, let's try another way.

Hi Wei,

As I understand, there are two questions here.

The first one: how to achieve the thread affinity,

for example, bind one pthread to 4 cores, with each of the threads calling zheev.

The second: you can't get the wanted performance on MIC with the code.

What is your real question, the first one or the second one?

From your reply, it seems you have completed the first one. Is that assumption right?

@saurabh,

Regarding the affinity, you may refer to the MKL user guide, for example:

Consider the following performance issue:

  • The system has two sockets with two cores each, for a total of four cores (CPUs)
  • The two-thread parallel application that calls the Intel MKL FFT happens to run faster than in four threads, but the performance in two threads is very unstable

The following code example shows how to resolve this issue by setting an affinity mask by operating system means using the Intel compiler. The code calls the system function sched_setaffinity to bind the threads to the cores on different sockets. Then the Intel MKL FFT function is called:

#define _GNU_SOURCE // for using the GNU CPU affinity
// (works with the appropriate kernel and glibc)
// Set affinity mask
#include <sched.h>
#include <stdio.h>
#include <unistd.h>
#include <omp.h>

int main(void) {
    int NCPUs = sysconf(_SC_NPROCESSORS_CONF);
    printf("Using thread affinity on %i NCPUs\n", NCPUs);
    #pragma omp parallel default(shared)
    {
        cpu_set_t new_mask;
        cpu_set_t was_mask;
        int tid = omp_get_thread_num();
        CPU_ZERO(&new_mask);
        // 2 packages x 2 cores/pkg x 1 threads/core (4 total cores)
        CPU_SET(tid == 0 ? 0 : 2, &new_mask);
        if (sched_getaffinity(0, sizeof(was_mask), &was_mask) == -1) {
            printf("Error: sched_getaffinity(%d, sizeof(was_mask), &was_mask)\n", tid);
        }
        if (sched_setaffinity(0, sizeof(new_mask), &new_mask) == -1) {
            printf("Error: sched_setaffinity(%d, sizeof(new_mask), &new_mask)\n", tid);
        }
        printf("tid=%d new_mask=%08X was_mask=%08X\n", tid,
               *(unsigned int*)(&new_mask), *(unsigned int*)(&was_mask));
    }

    // Call Intel MKL FFT function
    return 0;
}

Compile the application with the Intel compiler using the following command:

icc test_application.c -openmp 

where test_application.c is the filename for the application.

Build the application. Run it in two threads, for example, by using the environment variable to set the number of threads:

env OMP_NUM_THREADS=2 ./a.out

See the Linux Programmer's Manual (in man pages format) for particulars of the sched_setaffinity function used in the above example.

Best Regards,

Ying

If you are starting your zheev jobs on MIC from host threads, you should take advantage of the means to separate host affinity from MIC affinity.  You want each MIC instance of zheev to get a distinct KMP_PLACE_THREADS assignment, e.g.

export MIC_ENV_PREFIX=MIC

thread 0:

export MIC_KMP_PLACE_THREADS=15C,1t,0O

thread 1:

export MIC_KMP_PLACE_THREADS=15C,1t,15O

...

This is more often done when running MPI on host (starting each MIC task from a separate process by the offload mechanism).  If you can show that the Automatic Offload would be effective, you would probably want to file a feature request for that support on premier.intel.com as I haven't seen zheev in the automatic offload list.

If setting KMP_PLACE_THREADS on MIC rather than remotely from host by offload, you don't use the MIC_ prefix.

Intel MPI running on MIC native accomplishes the distribution of ranks across cores by default, using

OMP_NUM_THREADS=15

KMP_AFFINITY=balanced

If you have an OpenMP threaded zheev, this would cause each copy to spread across 15 cores, 1 thread per core.  I don't know whether threading zheev has to be accomplished in some of the functions called by zheev where there may be more opportunities.

If you want to use the KMP functions to deal with threads you started yourself with pthreads, it may be possible by making an omp_set_num_threads() call to ask this mechanism to take control of affinity masks.

Quote:

Ying H (Intel) wrote:

Hi Wei,

As I understand, there are two questions here.

The first one: how to achieve the thread affinity,

for example, bind one pthread to 4 cores, with each of the threads calling zheev.

The second: you can't get the wanted performance on MIC with the code.

What is your real question, the first one or the second one?

From your reply, it seems you have completed the first one. Is that assumption right?

@saurabh,

Regarding the affinity, you may refer to the MKL user guide, for example:

Consider the following performance issue:

  • The system has two sockets with two cores each, for a total of four cores (CPUs)
  • The two-thread parallel application that calls the Intel MKL FFT happens to run faster than in four threads, but the performance in two threads is very unstable

The following code example shows how to resolve this issue by setting an affinity mask by operating system means using the Intel compiler. The code calls the system function sched_setaffinity to bind the threads to the cores on different sockets. Then the Intel MKL FFT function is called:

#define _GNU_SOURCE // for using the GNU CPU affinity
// (works with the appropriate kernel and glibc)
// Set affinity mask
#include <sched.h>
#include <stdio.h>
#include <unistd.h>
#include <omp.h>

int main(void) {
    int NCPUs = sysconf(_SC_NPROCESSORS_CONF);
    printf("Using thread affinity on %i NCPUs\n", NCPUs);
    #pragma omp parallel default(shared)
    {
        cpu_set_t new_mask;
        cpu_set_t was_mask;
        int tid = omp_get_thread_num();
        CPU_ZERO(&new_mask);
        // 2 packages x 2 cores/pkg x 1 threads/core (4 total cores)
        CPU_SET(tid == 0 ? 0 : 2, &new_mask);
        if (sched_getaffinity(0, sizeof(was_mask), &was_mask) == -1) {
            printf("Error: sched_getaffinity(%d, sizeof(was_mask), &was_mask)\n", tid);
        }
        if (sched_setaffinity(0, sizeof(new_mask), &new_mask) == -1) {
            printf("Error: sched_setaffinity(%d, sizeof(new_mask), &new_mask)\n", tid);
        }
        printf("tid=%d new_mask=%08X was_mask=%08X\n", tid,
               *(unsigned int*)(&new_mask), *(unsigned int*)(&was_mask));
    }

    // Call Intel MKL FFT function
    return 0;
}

Compile the application with the Intel compiler using the following command:

icc test_application.c -openmp 

where test_application.c is the filename for the application.

Build the application. Run it in two threads, for example, by using the environment variable to set the number of threads:

env OMP_NUM_THREADS=2 ./a.out

See the Linux Programmer's Manual (in man pages format) for particulars of the sched_setaffinity function used in the above example.

Best Regards,

Ying

Thanks Ying,

I already fixed the problem; my solution is similar to yours, but I use kmp set affinity. BTW, in your program, only one MKL call runs, since it is outside the pragma, right? But MKL_NUM_THREADS=2.

Right now, I have another issue with MKL. There are 4 threads; threads 1-3 are bound to cores 1-3 (each thread bound to 1 core), and thread 0 is bound to all cores (60 on Intel MIC). At some point, I send threads 1-3 to sleep and let thread 0 call an MKL function like dgemm with MKL_NUM_THREADS=240 (4 HT per core), but I am not able to use all 60 cores even though threads 1-3 have already gone to sleep. I did 3 experiments, and the results are pretty interesting:

Case 1: thread 0 calls mkl_set_num_threads_local(240), threads 1-3 call mkl_set_num_threads_local(0) and are sent to sleep.

Thread 0 gets very bad performance.

Case 2: thread 0 calls mkl_set_num_threads_local(240), threads 1-3 call mkl_set_num_threads_local(0) and are sent to sleep, BUT threads 1-3 are NEVER bound to any cores; I just create them and leave them there.

Now thread 0 can get the peak performance of 60 cores.

Case 3: thread 0 calls mkl_set_num_threads_local(240-3*4), threads 1-3 call mkl_set_num_threads_local(0) and are sent to sleep.

In this case, thread 0 can get the peak performance of the remaining 57 cores.

So it looks like MKL does not allow overlapping core bindings. It will avoid the cores taken by other threads, even if those cores are not busy at all.

Is there any way to solve this problem?

Hi Wei, 

It seems related to how the OS schedules the pthreads and OpenMP threads on MIC. Could you please set KMP_AFFINITY=verbose and see whether some cores are overloaded with threads in the three cases?

Best Regards,

Ying 
