How to get peak performance in FFT


Hi,

I want to get peak performance using batched FFTs. The sample code I have written is given below. I compiled it on the host using the command icc -mkl fftcheck.cpp -o fftchecknew -L/opt/intel/mkl/lib/intel64, and I have also set the environment variables listed below.

export MIC_ENV_PREFIX=MIC
export MIC_KMP_AFFINITY=scatter,granularity=fine
export MIC_OMP_NUM_THREADS=240
export MIC_USE_2MB_BUFFERS=64K

./fftchecknew 512 1024 1024

Output:

1024 512 627.922913

Here the number of operations = 2.5 * M * N * log2(M*N) * numberOfTransforms, with M = 1024, N = 1024, and numberOfTransforms = 512.

So GFLOPS = operations / time = 26843545600 / 0.62793 s ≈ 42 GFLOPS. I have tried all the optimizations suggested in http://software.intel.com/en-us/articles/tuning-the-intel-mkl-dft-functi....

But there they are using a different FFT function and getting 100+ GFLOPS. Is that the reason for the lower GFLOPS here? Can someone please tell me how to improve the performance?
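For clarity, this is the arithmetic I use to get the GFLOPS number above (a small stand-alone helper written just for this post; it is not part of the benchmark code below):

  #include <math.h>

  /* GFLOPS from the 2.5*M*N*log2(M*N)*batch estimate; time_ms is the measured time per batch in ms. */
  double fft_gflops(double M, double N, double batch, double time_ms)
  {
      double ops = 2.5 * M * N * log2(M * N) * batch;   /* 26843545600 for 512 transforms of 1024x1024 */
      return ops / (time_ms * 1.0e6);                   /* ops / seconds / 1e9  ==  ops / (ms * 1e6)  */
  }
  /* fft_gflops(1024, 1024, 512, 627.92) ~= 42.7 */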

#pragma offload_attribute(push, target(mic))
#include <mkl.h>
#include <immintrin.h>
#pragma offload_attribute(pop)
#include <sys/time.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>

#define NUM_THREADS 1

/* Split-complex data: one array of real parts, one of imaginary parts (DFTI_REAL_REAL storage). */
__declspec(target(mic)) float* sxdreal;
__declspec(target(mic)) float* sxdimag;
__declspec(target(mic)) DFTI_DESCRIPTOR_HANDLE hand1[NUM_THREADS];

int main(int argc, char* argv[])
{
    int transforms = atoi(argv[1]);
    int rows = atoi(argv[2]);
    int cols = atoi(argv[3]);

    /* Allocate the data and create/commit the batched 2D descriptors on the coprocessor. */
#pragma offload target(mic) in(transforms, rows, cols) nocopy(sxdreal, sxdimag, hand1)
    {
        sxdreal = (float*)_mm_malloc(sizeof(float)*transforms*rows*cols, 64);
        sxdimag = (float*)_mm_malloc(sizeof(float)*transforms*rows*cols, 64);
        MKL_LONG status;
        MKL_LONG N1[2]; N1[0] = rows; N1[1] = cols;
        for (int i = 0; i < NUM_THREADS; i++)
        {
            status = DftiCreateDescriptor(&hand1[i], DFTI_SINGLE, DFTI_COMPLEX, 2, N1);
            status = DftiSetValue(hand1[i], DFTI_COMPLEX_STORAGE, DFTI_REAL_REAL);
            status = DftiSetValue(hand1[i], DFTI_NUMBER_OF_TRANSFORMS, transforms);
            status = DftiSetValue(hand1[i], DFTI_INPUT_DISTANCE, rows*cols);
            if (0 != status) printf("failed\n");
            status = DftiCommitDescriptor(hand1[i]);
        }
    }

    /* Warm-up transform (not timed). */
#pragma offload target(mic) nocopy(sxdreal, sxdimag, hand1)
    {
        MKL_LONG status;
        status = DftiComputeForward(hand1[0], sxdreal, sxdimag);
    }

    timeval m_timer;
    gettimeofday(&m_timer, NULL);

    /* Timed region: 10 batched forward transforms, all executed on the coprocessor. */
#pragma offload target(mic) nocopy(sxdreal, sxdimag, hand1)
    {
        MKL_LONG status;
        int thread = 0;//omp_get_thread_num();
        for (int i = 0; i < 10; i++)
            status = DftiComputeForward(hand1[thread], sxdreal, sxdimag);
        if (0 != status) printf("failed\n");
    }

    timeval now;
    gettimeofday(&now, NULL);
    double secs  = now.tv_sec  - m_timer.tv_sec;
    double usecs = now.tv_usec - m_timer.tv_usec;
    if (usecs < 0)
    {
        secs--;
        usecs += 1000000;
    }
    float diff = secs*1000 + usecs/1000 + 0.5;        /* elapsed time in ms for the 10 batches */
    printf("%d %d %f\n", rows, transforms, diff/10);  /* rows, transforms, average ms per batch */

    /* Release the descriptors and buffers on the coprocessor. */
#pragma offload target(mic) nocopy(sxdreal, sxdimag, hand1)
    {
        for (int i = 0; i < NUM_THREADS; i++)
            DftiFreeDescriptor(&hand1[i]);
        _mm_free(sxdreal);
        _mm_free(sxdimag);
    }

    return 0;
}


Did you really intend to set thread=0 below?


    int thread = 0;//omp_get_thread_num();

    for(int i=0; i< 10; i++)
        status = DftiComputeForward(hand1[thread],sxdreal, sxdimag);

Jim Dempsey
 

export  MIC_OMP_NUM_THREADS=240

That is a very bad idea. You are almost certainly over-subscribing the machine, since in offload mode one core (four threads) is used by the offload daemon.  Much better not to force it at all, and let the system do the right thing. 

If you want to prove to yourself that you are over-subscribed, then set your affinity to include ",verbose" and look where the threads have been placed.

If you want to ignore the advice not to use the last core, then include ",norespect" in your affinity and keep the 240 thread OMP_NUM_THREADS.
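
Concretely, something like this (just a sketch of the settings being discussed, starting from the exports in the original post):

  export MIC_ENV_PREFIX=MIC
  # Leave the thread count unset so the offload runtime picks it and avoids the core used by the offload daemon.
  unset MIC_OMP_NUM_THREADS
  # Add ",verbose" to see where the threads are actually placed.
  export MIC_KMP_AFFINITY=scatter,granularity=fine,verbose
  # Or, to knowingly use the last core as well:
  # export MIC_KMP_AFFINITY=scatter,granularity=fine,norespect
  # export MIC_OMP_NUM_THREADS=240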

Dear shiva rama krishna bharadwaj I.,

Your formula for flops implies that you want to do a 2D real-to-complex FFT (a complex-to-complex transform is usually counted as 5*M*N*log2(M*N) operations, twice the real-to-complex count).

Please see the correct setup below -- I hope you can easily match the fixups against your code.

You may also want to look at the MKL FFT examples basic_sp_real_dft_2d.c and config_number_of_transforms.c in examples/dftc.

Thank you.

Evgueni.

  // ...
  sxdreal = (float*)_mm_malloc(sizeof(float)*transforms*rows*(cols/2+1)*2,64);  // padded rows: room for the in-place conjugate-even result
  // ...
  status = DftiCreateDescriptor(&hand1[i], DFTI_SINGLE, DFTI_REAL, 2,N1);
  status = DftiSetValue(hand1[i], DFTI_CONJUGATE_EVEN_STORAGE, DFTI_COMPLEX_COMPLEX);
  status = DftiSetValue(hand1[i], DFTI_NUMBER_OF_TRANSFORMS, transforms);
  status = DftiSetValue(hand1[i], DFTI_INPUT_DISTANCE, rows*(cols/2+1)*2);   // distance in real elements
  status = DftiSetValue(hand1[i], DFTI_OUTPUT_DISTANCE, rows*(cols/2+1));    // distance in complex elements
  // ...
  status = DftiComputeForward(hand1[0],sxdreal);
  // ...

 

Yes jimdempseyatthecove, I did that intentionally.

Hi James Cownie, this time I did not set the environment variable MIC_OMP_NUM_THREADS and let it run normally, but I did not see any improvement in performance.

Hi Evgueni Petrov aka espetrov, I want to do a complex-to-complex transform only.

I tried to saturate the FFT by varying the number of transforms as well as the FFT size, and 42 GFLOPS was the maximum throughput I got. My problem is that even though the MIC has about 2 TFLOPS (single precision) of computational capacity, I am getting only around 42 GFLOPS using Intel MKL, while the example mentioned in http://software.intel.com/en-us/articles/tuning-the-intel-mkl-dft-functi... reaches 100+ GFLOPS. Please tell me what my mistake could be.

 

Thanks

sivaramakrishna

 

Oh, I see. In this case the correct flops formula is 5*N*M*log2(N*M)*NumberOfTransforms and the correct setup is as follows.

  // ...
  sxdreal = (float*)_mm_malloc(sizeof(float)*transforms*rows*cols*2,64);  // interleaved complex: 2 floats per element
  // ...
  status = DftiCreateDescriptor(&hand1[i], DFTI_SINGLE, DFTI_COMPLEX, 2,N1);
  status = DftiSetValue(hand1[i], DFTI_NUMBER_OF_TRANSFORMS, transforms);
  status = DftiSetValue(hand1[i], DFTI_INPUT_DISTANCE, rows*cols);
  // ...
  status = DftiComputeForward(hand1[0],sxdreal);
  // ...

 

Hi Evgueni Petrov aka espetrov,

If I had only one complex array, then the configuration would be as you said. But here I have two arrays, one holding the real values and the other the imaginary values; that is why my configuration is the way it is.

 

Thanks

sivaramakrishna

(1) I also think that the operation count should be doubled.  The descriptor is set up for DFTI_COMPLEX and DFTI_COMPLEX_STORAGE, so these are 1024x1024 complex transforms that should have a 5*N*log(N,2) work estimate.  Splitting the data into separate real and imaginary arrays with DFTI_REAL_REAL does not change the work requirement -- it is still a set of 1024x1024 single precision complex FFTs.

This changes the result from 42 GFLOPS to 84 GFLOPS, which is much closer to the best case reported values.

(2) The reference quoted at the very top for an MKL FFT getting over 100 GFLOPS was for a native code, not an offload code.  I have modified that code and used it to obtain average performance of about 122 GFLOPS for single-precision complex-to-complex transforms of length 2^20.   For these transforms I got the best performance with 1 thread per core (60 threads on my Xeon Phi SE10P Coprocessors).  (Note that this transform size is cache-contained, which probably explains the performance boost relative to the reference case.)

(3) Although it looks like you are doing all the right things to avoid data transfers across the PCI bus, it would probably be helpful to test this same problem size in a native version to see whether using the offload model is causing any performance loss.

(4) It makes me nervous to see that your array spacing is a large power of 2.  Even with the unusual address mapping features on Xeon Phi you can still get DRAM bank conflicts for large power-of-two spacings.  In this case the starting points of each of your real & imaginary fields are 4 MiB apart and I have seen significant slowdowns with 8 MiB spacing when running the STREAM benchmark.   I found that I had to pad the arrays so that the starting points were at least 8000 Bytes away from (an integral multiple of) 8 MiB in order to avoid the slowdown due to bank conflicts.  (A sketch of this kind of padding follows at the end of this list.)

(5) Each of your fields is 4 MiB + 4 MiB = 8 MiB, so 512 of them take up 4096 MiB.  This is a lot bigger than your cache, and I don't know how well Intel has tuned MKL to handle multiple FFTs with proper consideration for cache size.  The Intel example that you cited initially was for a 4Kx4K single-precision real transform.  Although this is bigger than the cache, each of the rows/columns is only 16 KiB, which fits in an L1 data cache.  With 512 transforms, there are opportunities to try to execute more than one transform at a time (to increase parallelism), which can easily cause you to overflow the caches and end up running slower.  Obviously the MKL people understand this, but their heuristics have to be derived from a finite number of test cases, so the execution may not be optimal in this case.  For 512 1Kx1K transforms you might get the best performance by simply distributing the transforms across the cores (rather than parallelizing the execution of each transform), and maybe MKL does this.  In any case it is clear that there are too many possible implementation strategies to identify the cause of the performance limitation without a lot more testing.
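
To illustrate the padding idea in (4): one approach is to over-allocate each field and increase the batch distance so that consecutive fields are no longer exactly 4 MiB apart. This is only a sketch against the descriptor setup in the original post, and the pad value is an arbitrary example, not a tuned one:

  MKL_LONG pad  = 2064;                          /* a bit over 8 KB of floats, deliberately not a power of two */
  MKL_LONG dist = (MKL_LONG)rows*cols + pad;     /* element distance between consecutive fields in the batch */
  sxdreal = (float*)_mm_malloc(sizeof(float)*transforms*dist, 64);
  sxdimag = (float*)_mm_malloc(sizeof(float)*transforms*dist, 64);
  // ...
  status = DftiSetValue(hand1[i], DFTI_INPUT_DISTANCE,  dist);
  status = DftiSetValue(hand1[i], DFTI_OUTPUT_DISTANCE, dist);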

"Dr. Bandwidth"

Thank you shiva rama krishna bharadwaj I. and John!

Please note that we have spent much more time tuning MKL FFTs for DFTI_COMPLEX_COMPLEX (used in the article referenced at the top of this thread) than for DFTI_REAL_REAL.

As John says, it makes sense to pad each row (in C layout) so that the distance between column elements is not a power of 2 -- please check http://software.intel.com/en-us/articles/performance-tips-of-using-intel-mkl-on-intel-xeon-phi-coprocessor.
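
For example, matched against the descriptor setup earlier in this thread (the pad of 16 elements is only an illustration, not a tuned value):

  MKL_LONG pad = 16;                      /* illustrative; any small pad that breaks the power of 2 */
  MKL_LONG ld  = cols + pad;              /* padded row length */
  MKL_LONG strides[3] = {0, ld, 1};       /* row-major: element (r,c) sits at r*ld + c */
  // allocate transforms*rows*ld elements per array instead of transforms*rows*cols
  status = DftiSetValue(hand1[i], DFTI_INPUT_STRIDES,  strides);
  status = DftiSetValue(hand1[i], DFTI_OUTPUT_STRIDES, strides);
  status = DftiSetValue(hand1[i], DFTI_INPUT_DISTANCE,  rows*ld);
  status = DftiSetValue(hand1[i], DFTI_OUTPUT_DISTANCE, rows*ld);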

Thanks John for your inputs.

1) As per your suggestion I ran the code in native mode. I noticed no change in time.

2) As you suggested padding some elements to avoid bank conflicts, I tried this experiment with varying padding sizes. I tried two different variations of padding: the first pads after each slice (i.e. after col*row elements) in both the real and imaginary arrays; the second pads at the end of the real array. In both experiments I varied the padding length up to 3000 elements. I did not find any improvement in the timings. I have attached the codes for both of these experiments.

3) I have also tried the batch FFT using a complex (interleaved) array for the same input. I noticed a significant difference between the FFT timings when I use a complex array and when I use two separate real and imaginary arrays. For an FFT size of 1024*1024 and number of transforms = 512, the timings are as follows. I have attached the file.

FFT using a complex array = 278 ms; FFT using two separate real and imaginary arrays = 630 ms.

Can someone give me any insight into this?

Thanks

sivaramakrishna

Attachments: 

It looks like you confirmed the point made by Evgueni Petrov -- Intel has spent less time optimizing the code that handles DFTI_REAL_REAL than they spent optimizing the code that handles DFTI_COMPLEX_COMPLEX.

Using the 5*N*log2(N) work estimate, the 278 ms timing corresponds to 193 GFLOPS, which seems like a very good number to me.

Just reading the array from memory once and writing it back will take about 60 milliseconds (with heavily optimized code copying data at 150 GB/s -- 75 GB/s read + 75 GB/s write concurrently).   2D FFTs usually move the data several times, but since you are doing multiple independent FFTs that are individually cacheable, it is hard to say what the actual data traffic should be.  (It is easy enough to measure, but hard to do enough analysis on all the possible ways that the operations could be ordered and parallelized to know whether the observed traffic is the "right amount" for the optimal algorithm on this particular piece of hardware for this particular problem configuration.)
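
Working those numbers out for reference: 5 * 2^20 * log2(2^20) * 512 = 5 * 1048576 * 20 * 512 ≈ 53.7e9 operations, and 53.7e9 / 0.278 s ≈ 193 GFLOPS. For the traffic estimate, 512 interleaved complex 1024x1024 fields are 512 * 8 MiB ≈ 4.3 GB, so one read plus one write of the whole data set is ≈ 8.6 GB, which takes ≈ 57 ms at 150 GB/s.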

"Dr. Bandwidth"

>> but since you are doing multiple independent FFTs that are individually cacheable,

In batch mode, you might be able to transfer the next array (and the previous results) between the host and the coprocessor while you are computing the current array. This applies in offload mode.

Jim Dempsey

The original code is in offload mode, but it is not copying the data back and forth between the host and the coprocessor, and it is running the forward transform on the same data 10 times in a row.    Unless heroic manual optimizations were applied it is likely that the "final" result of each transform is left as modified data in the cache, so its transfer to memory will occur as the lines are chosen as victims while the input data is being (re-)loaded for the next transform.  This provides automatic overlap of input and output.   What is less clear is how much data MKL works on at one time -- it could perform one transform at a time, two transforms at a time, etc, all the way up to 512 transforms at a time.  These choices have very different cache footprints, and also present different opportunities and constraints on parallelism.

"Dr. Bandwidth"

From my limited understanding of, and experience with, MKL: it comes in two flavors of library, single-threaded and multi-threaded.

Unlike a decade or so ago, "single-threaded" in this sense now means "thread-safe, but using one thread per call".

And "multi-threaded" means instantiate a thread pool to the extent possible using either the process limited available hardware threads or the environment variable limited number of software threads. Then on each call to MKL, look at the problem size (for the call) and choose the most effective team size (sizes). Unfortunately, if you parallel the parallel you end up with over-subscription and/or excessive cache evictions.

I do not know if the newer versions of MKL permit the application to pass in a bit mask of (or number of) logical processors on which to restrict the current call. The following assumes it does not.

If the user application can make use of multiple concurrent, independent transformations, then:

When the number of concurrent transformations is large (on the order of, or larger than, the number of available logical processors), the application should be linked with the "single-threaded" MKL library, and the program should be set up to process the transformations in a parallel region (e.g. #pragma omp parallel for; see the sketch at the end of this post).

However, when the number of independent transformations is small (and the size of each transformation is large), the application should be linked with the "multi-threaded" MKL library and the program set up to process each transformation serially.

The problem then is what to do when the number of transformations is medium and the size of each transformation is medium. In this situation, it might be best to have the application run as multiple processes, each process restricted to a different set of hardware threads (affinities), and have each process run the variant of the compute section that uses the multi-threaded MKL library. This could be done using OpenMPI or the co-arrays feature (which uses OpenMPI).
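
A minimal sketch of the "many small independent transforms" case (assuming the sequential MKL library, e.g. icc -mkl=sequential, interleaved single-precision complex data, and illustrative names -- this is not code from the thread):

  #include <mkl.h>
  #include <omp.h>

  /* Each OpenMP thread gets its own descriptor and processes whole transforms. */
  void batch_fft(MKL_Complex8 *data, MKL_LONG rows, MKL_LONG cols, int transforms)
  {
      #pragma omp parallel
      {
          MKL_LONG n[2] = { rows, cols };
          DFTI_DESCRIPTOR_HANDLE h;
          DftiCreateDescriptor(&h, DFTI_SINGLE, DFTI_COMPLEX, 2, n);   /* per-thread descriptor */
          DftiCommitDescriptor(h);

          #pragma omp for
          for (int t = 0; t < transforms; t++)
              DftiComputeForward(h, data + (size_t)t * rows * cols);   /* in-place forward FFT on field t */

          DftiFreeDescriptor(&h);
      }
  }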

Jim Dempsey
