MKL FFT library performance varies from run to run by almost 100%


hello world

Hi there,

I'm trying to use the MKL 1D FFT library; specifically, I run a batch of 1M FFTs, each of size 1K, in MKL single precision.

If I just run the library call by itself, the performance is very steady and very fast: about 0.3 seconds on my machine.

However, if I include the library call in my application, which is multi-threaded, the performance of the library call varies from 0.3-0.6 seconds, with 0.5 seconds occurring most often.
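For concreteness, the call described above corresponds roughly to the following sketch of MKL's DFTI interface (the descriptor settings are my assumption of the usage model, error checking is omitted, and this requires linking against MKL, e.g. with -mkl):

```c
/* Sketch only: a batch of 1M transforms, each a 1K single-precision
   complex FFT, stored contiguously and transformed in place. */
#include "mkl_dfti.h"

void run_batch(float *data /* 1M x 1K interleaved complex */)
{
    DFTI_DESCRIPTOR_HANDLE h;
    DftiCreateDescriptor(&h, DFTI_SINGLE, DFTI_COMPLEX, 1, (MKL_LONG)1024);
    DftiSetValue(h, DFTI_NUMBER_OF_TRANSFORMS, (MKL_LONG)1000000);
    DftiSetValue(h, DFTI_INPUT_DISTANCE, (MKL_LONG)1024); /* elements between transforms */
    DftiCommitDescriptor(h);
    DftiComputeForward(h, data);
    DftiFreeDescriptor(&h);
}
```

In real code each call's status should be checked against DFTI_NO_ERROR.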

I was wondering whether anyone else has experienced this, whether I am making a mistake, and whether there is a way to achieve steady, good performance?

Thanks in advance!

13 posts / 0 new

Last post
Tim Prince

If you are running on a platform without a single unified cache, it might be particularly important to make a suitable setting of KMP_AFFINITY, assuming no other jobs are running. 

If you call the threaded MKL from a thread which is not OpenMP, you risk over-subscribing the hardware threads, and your threading will not be recognized by OpenMP. It would then be your responsibility to control the number of threads in MKL as well as in your application.

I hope these possibilities convince you that some specifics are needed.

hello world

Quote:

TimP (Intel) wrote:

If you are running on a platform without a single unified cache, it might be particularly important to make a suitable setting of KMP_AFFINITY, assuming no other jobs are running. 

If you call the threaded MKL from a thread which is not OpenMP, you risk over-subscribing the hardware threads, and your threading will not be recognized by OpenMP. It would then be your responsibility to control the number of threads in MKL as well as in your application.

I hope these possibilities convince you that some specifics are needed.

While the batched FFT is running, no other jobs are running. The FFT is called by the main thread, and after that there is some parallel work using pthreads.

I tried to modify the KMP_AFFINITY setting according to:

http://software.intel.com/en-us/articles/using-kmp-affinity-to-create-op...

setenv KMP_AFFINITY "verbose,granularity=fine,proclist=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],explicit"

My machine has two Xeon E5-2690 processors.

I also tried proclist=[0,2,4,6,8,10,12,14,1,3,5,7,9,11,13,15], but the performance still varies from 0.3 sec to 0.6 sec for both settings.

Could you please give a hint about a "suitable" setting? HyperThreading has already been disabled on my machine. Thanks! :-)

Ying H (Intel)

Hi hello world,

Are you linking the threaded MKL or the sequential MKL? If sequential, then the affinity setting is not needed.

There are several factors involved, such as memory alignment, the FFT usage model, KMP_AFFINITY, and how time is measured, as shown in this article: http://software.intel.com/en-us/articles/tuning-the-intel-mkl-dft-functions-performance-on-intel-xeon-phi-coprocessors . It is written for the Xeon Phi coprocessor, but some of its tips apply here too. For example, how do you allocate the 1M batch? Is each 1K transform aligned?

You mentioned that if you just run the library call, the performance is stable, but if you include the library call in your application, the performance varies. Could you please show your usage and timing model, or provide a simple test code?

Best Regards,

Ying

 

Tim Prince

Your KMP_AFFINITY settings make some sense if you have disabled HyperThreading, but you would need to set OMP_NUM_THREADS so that you don't exceed 16 threads counting your simultaneously active pthreads.  If you pinned your pthreads to specific cores, I believe you would want to omit those from the proclist.  If you didn't pin the pthreads, you have a chance that the OpenMP mechanism will pin them to different cores from the MKL threads, and you could simply use KMP_AFFINITY=compact.

The hints Ying gave about aligning the buffers could be significant.
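Concretely, Tim's suggestion could look like the following csh settings (the core numbers assume the 16-core dual E5-2690 machine with HyperThreading off; the 12/4 split is purely illustrative, not a recommendation):

```
# Reserve cores 12-15 for pinned pthreads; give OpenMP/MKL the rest.
setenv OMP_NUM_THREADS 12
setenv KMP_AFFINITY "granularity=fine,proclist=[0,1,2,3,4,5,6,7,8,9,10,11],explicit"

# Or, if the pthreads are left unpinned:
setenv KMP_AFFINITY compact
```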

hello world

Quote:

TimP (Intel) wrote:

Your KMP_AFFINITY settings make some sense if you have disabled HyperThreading, but you would need to set OMP_NUM_THREADS so that you don't exceed 16 threads counting your simultaneously active pthreads.  If you pinned your pthreads to specific cores, I believe you would want to omit those from the proclist.  If you didn't pin the pthreads, you have a chance that the OpenMP mechanism will pin them to different cores from the MKL threads, and you could simply use KMP_AFFINITY=compact.

The hints Ying gave about aligning the buffers could be significant.

Thanks for your reply! I had a look at Ying's post, and it seems there are two things that may have affected the performance: 1) in my original code, memory was 16-byte aligned rather than 64-byte aligned; 2) I used g++ rather than icpc.

The combination of 64-byte memory alignment and icpc -mkl -openmp gives steadier library performance than using g++.

Using icpc gives 0.3-0.4 seconds, but using g++ gives 0.3-0.6 seconds.

However, specifying KMP_AFFINITY seems to degrade my pthread code significantly.

My code is something like:

a) batched 1D Forward FFT using threaded MKL

b) pthread work

c) batched 1D inverse FFT using threaded MKL

1) If I don't set OMP_NUM_THREADS to cover all the cores, the MKL FFT does not run at full speed.

2) If I do set OMP_NUM_THREADS to all the cores, my intermediate pthread work is significantly slowed down.
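One possible way around the tension between 1) and 2), assuming the threaded MKL is linked, is to change the MKL thread count per phase with MKL's mkl_set_num_threads service function; the phase functions below are hypothetical placeholders for the three steps above:

```c
#include "mkl.h" /* mkl_set_num_threads; requires linking MKL */

/* hypothetical phase functions, standing in for steps a), b), c) */
void forward_fft_batch(void);
void pthread_work(void);
void inverse_fft_batch(void);

void pipeline(void)
{
    mkl_set_num_threads(16); /* a) give MKL all cores for the forward FFT */
    forward_fft_batch();

    mkl_set_num_threads(1);  /* b) keep MKL quiet while pthreads own the cores */
    pthread_work();

    mkl_set_num_threads(16); /* c) all cores again for the inverse FFT */
    inverse_fft_batch();
}
```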

hello world

Quote:

Ying H (Intel) wrote:

Hi hello world,

Are you linking the threaded MKL or the sequential MKL? If sequential, then the affinity setting is not needed.

There are several factors involved, such as memory alignment, the FFT usage model, KMP_AFFINITY, and how time is measured, as shown in this article: http://software.intel.com/en-us/articles/tuning-the-intel-mkl-dft-functions-performance-on-intel-xeon-phi-coprocessors . It is written for the Xeon Phi coprocessor, but some of its tips apply here too. For example, how do you allocate the 1M batch? Is each 1K transform aligned?

You mentioned that if you just run the library call, the performance is stable, but if you include the library call in your application, the performance varies. Could you please show your usage and timing model, or provide a simple test code?

Best Regards,

Ying

 

Hi Ying,

I tried the optimization possibilities (where applicable) from the link you posted. It looks like if the memory is aligned to 64 bytes, the performance is steady and reasonably good. I used to have it aligned to 16 bytes. Thanks!

I'm trying to see how Tip 5 (using huge memory pages) would affect the performance. :-)
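For what it's worth, on Linux the transparent-huge-page state can be read without root via the standard sysfs path:

```
cat /sys/kernel/mm/transparent_hugepage/enabled
# typical output: [always] madvise never
```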

Best Regards,

Jing

Ying H (Intel)

Hi Jing,

Any results?

I had tried it on Xeon Phi. Since the OS already supported transparent huge pages, the performance changed only slightly.

Best Regards,

Ying

hello world

Hi Ying,

I tried to use huge pages, but that needs root privileges and may affect the performance of other parts of my code.

So I tried other ways to get steady, good performance: icpc -mkl -openmp -Os gave me pretty good performance, and I will just stick to it. :-)

Thanks for your help!!:-)

Best,

Jing

Quote:

Ying H (Intel) wrote:

Hi Jing,

Any results?

I had tried it on Xeon Phi. Since the OS already supported transparent huge pages, the performance changed only slightly.

Best Regards,

Ying

Sergey Kostrov

>>...However, if I include the library call in my application, which is multi-threaded, the performance of the library
>>call varies from 0.3-0.6 seconds, with 0.5 seconds occurring most often...

Please verify whether virtual memory (paging to disk) is being used when execution slows down. Also, it is not clear what stack size value is set for the OpenMP threads in your environment.

hello world

Hi Sergey,

could you elaborate a little bit about the two points you made? Thanks!!

Best,

Jing

Quote:

Sergey Kostrov wrote:

>>...However, if I include the library call in my application, which is multi-threaded, the performance of the library
>>call varies from 0.3-0.6 seconds, with 0.5 seconds occurring most often...

Please verify whether virtual memory (paging to disk) is being used when execution slows down. Also, it is not clear what stack size value is set for the OpenMP threads in your environment.

Sergey Kostrov

>>>>Please verify whether virtual memory (paging to disk) is being used when execution slows down. Also, it is not clear
>>>>what stack size value is set for the OpenMP threads in your environment...
>>
>>...could you elaborate a little bit about the two points you made?

1. Virtual Memory (VM) settings can be verified in the System applet of the Control Panel on Windows. On Linux, a similar configuration utility needs to be used to verify VM settings.

2. At runtime, the stack size value for OpenMP threads can be checked as follows:
...
#include <stdio.h>
#include <stdlib.h>
...
printf( "OMP_STACKSIZE=%s\n", getenv( "OMP_STACKSIZE" ) );
...

Sergey Kostrov

This is a follow up.

>>2. At runtime, the stack size value for OpenMP threads can be checked as follows:
>>...
>>#include <stdio.h>
>>#include <stdlib.h>
>>...
>>printf( "OMP_STACKSIZE=%s\n", getenv( "OMP_STACKSIZE" ) );
>>...

And for KMP_STACKSIZE as follows:

...
printf( "KMP_STACKSIZE=%s\n", getenv( "KMP_STACKSIZE" ) );
...
