Different parallelization techniques and Intel® MKL FFT

The following techniques can be used to parallelize applications that use the FFT functions from Intel MKL. The examples in this article are threaded with OpenMP at the user level.

1. You do not create threads in your application but specify the parallel mode within the FFT module of Intel MKL.

Example:  Using Intel MKL Internal Threading Mode

#include "mkl_dfti.h"

int main (void) {
    float x[200][100];
    DFTI_DESCRIPTOR_HANDLE my_desc1_handle;
    MKL_LONG status, len[2];

    //...put input data into x[j][k] 0<=j<=199, 0<=k<=99

    len[0] = 200; len[1] = 100;

    // no user threads are created here; any threading happens inside Intel MKL
    status = DftiCreateDescriptor (&my_desc1_handle, DFTI_SINGLE, DFTI_REAL, 2, len);
    status = DftiCommitDescriptor (my_desc1_handle);
    status = DftiComputeForward (my_desc1_handle, x);
    status = DftiFreeDescriptor (&my_desc1_handle);

    return 0;
}

See Intel® MKL 10.0 threading for more information on how to do this.

2. You create threads in the application yourself and have each thread perform all stages of FFT implementation, including descriptor initialization, FFT computation, and descriptor deallocation.

In this case, each descriptor is used only within its corresponding thread. It is recommended to set single-threaded mode for Intel MKL.

Specify the number of threads as follows:

Set MKL_NUM_THREADS to 1 so that Intel MKL works in single-threaded mode (recommended), or call the mkl_set_num_threads( 1 ) threading control function.

Set OMP_NUM_THREADS to n, where n is the number of cores, so that your program runs in multi-threaded mode if it is threaded with OpenMP.

The configuration parameter DFTI_NUMBER_OF_USER_THREADS must keep its default value of 1.
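
Because these settings live in the environment, it can be useful to check them before the program starts its own threads. The following is a minimal sketch in standard C only; parse_thread_count and mkl_env_is_single_threaded are hypothetical helpers introduced here for illustration and are not part of Intel MKL or OpenMP. The sketch inspects the environment variable only and cannot see a value set through mkl_set_num_threads.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: parse a thread-count string such as the value of
   MKL_NUM_THREADS. Returns `fallback` when the string is missing, empty,
   not entirely a decimal integer, or not a positive value. */
int parse_thread_count(const char *text, int fallback)
{
    char *end = NULL;
    long value;

    assert(fallback >= 0);
    if (text == NULL || *text == '\0')
        return fallback;
    value = strtol(text, &end, 10);
    if (*end != '\0' || value < 1 || value > 65536)
        return fallback;
    return (int)value;
}

/* Returns 1 when the environment requests exactly one Intel MKL thread,
   as recommended above. An unset or invalid variable returns 0. */
int mkl_env_is_single_threaded(void)
{
    return parse_thread_count(getenv("MKL_NUM_THREADS"), 0) == 1;
}
```

A program could call mkl_env_is_single_threaded() at startup and print a warning when it returns 0.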

Example: Using Parallel Mode with Multiple Descriptors Initialized in a Parallel Region

#include "mkl_dfti.h"
#include "omp.h"

int main (void) {
    float _Complex x[200][100];
    MKL_LONG len[2];

    //...put input data into x[j][k] 0<=j<=199, 0<=k<=99

    len[0] = 50; len[1] = 100;

    // each of the 4 threads calculates a complex FFT for one 50*100 block of x
    #pragma omp parallel num_threads(4)
    {
        DFTI_DESCRIPTOR_HANDLE my_desc_handle;
        MKL_LONG myStatus;
        int myID = omp_get_thread_num ();

        myStatus = DftiCreateDescriptor (&my_desc_handle, DFTI_SINGLE,
                                         DFTI_COMPLEX, 2, len);
        myStatus = DftiCommitDescriptor (my_desc_handle);
        myStatus = DftiComputeForward (my_desc_handle, &x [myID * len[0]] [0]);
        myStatus = DftiFreeDescriptor (&my_desc_handle);
    } /* End OpenMP parallel region */

    return 0;
}
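
The expression &x [myID * len[0]] [0] in the example above gives each thread one contiguous block of rows. A minimal sketch of that index arithmetic, generalized to row counts that do not divide evenly among the threads, might look as follows. slab_for_thread is a hypothetical helper introduced here for illustration; it is not part of Intel MKL or OpenMP.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: computes which rows thread `thread_id` owns when
   `total_rows` rows are split across `num_threads` threads. The first
   (total_rows % num_threads) threads receive one extra row each. */
void slab_for_thread(size_t total_rows, size_t num_threads, size_t thread_id,
                     size_t *first_row, size_t *row_count)
{
    size_t base, extra;

    assert(num_threads > 0 && thread_id < num_threads);
    base = total_rows / num_threads;   /* rows every thread gets */
    extra = total_rows % num_threads;  /* leftover rows to hand out */

    *row_count = base + (thread_id < extra ? 1 : 0);
    *first_row = thread_id * base + (thread_id < extra ? thread_id : extra);
}
```

For 200 rows and 4 threads this yields the starting rows 0, 50, 100, and 150 with 50 rows each, matching the example. With uneven blocks, each thread would need a descriptor created with its own row count in len[0].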

3. You create threads in the application yourself after initializing all FFT descriptors.

This implies that threading is employed for parallel FFT computation only, and the descriptors are released upon return from the parallel region.

In this case, each descriptor is used only within its corresponding thread. You must explicitly set single-threaded mode for Intel MKL. Otherwise, the actual number of threads may differ from one, because the DftiCommitDescriptor function is not called in a parallel region.

 

Example: Using Parallel Mode with Multiple Descriptors Initialized in One Thread

#include "mkl_dfti.h"
#include "omp.h"

int main (void) {
    float _Complex x[200][100];
    MKL_LONG len[2];
    MKL_LONG i;

    //...put input data into x[j][k] 0<=j<=199, 0<=k<=99

    len[0] = 50; len[1] = 100;

    DFTI_DESCRIPTOR_HANDLE my_desc_handle[4];
    MKL_LONG myStatus;

    // create and commit all 4 descriptors in one thread, outside the parallel region
    for (i=0;i<4;i++) {
        myStatus = DftiCreateDescriptor (&my_desc_handle[i], DFTI_SINGLE, DFTI_COMPLEX, 2, len);
        myStatus = DftiCommitDescriptor (my_desc_handle[i]);
    }

    // each of the 4 threads calculates a complex FFT for one 50*100 block of x
    #pragma omp parallel num_threads(4) private(myStatus)
    {
        int myID = omp_get_thread_num ();

        myStatus = DftiComputeForward (my_desc_handle[myID], &x [myID * len[0]] [0]);
    } /* End OpenMP parallel region */

    // release all 4 descriptors after the parallel region
    for (i=0;i<4;i++) myStatus = DftiFreeDescriptor (&my_desc_handle[i]);

    return 0;
}

 

Specify the number of threads as follows:

Set MKL_NUM_THREADS to 1 so that Intel MKL works in single-threaded mode (obligatory), or call the mkl_set_num_threads( 1 ) threading control function.

Set OMP_NUM_THREADS to n, where n is the number of cores, so that your program runs in multi-threaded mode if you are using OpenMP for threading.

The configuration parameter DFTI_NUMBER_OF_USER_THREADS must keep its default value of 1.

 

4. You create threads in the application yourself using OpenMP after initializing a single FFT descriptor.

This implies that threading is employed for parallel FFT computation only, and the descriptor is released upon return from the parallel region. In this case, each thread uses the same descriptor.

The following example illustrates a parallel user program with a common descriptor used in several threads.

Example: Using Parallel Mode with a Common Descriptor

// There is no need to set the number of threads inside Intel MKL,
// since single-threaded mode for Intel MKL is forced automatically.
// Set OMP_NUM_THREADS = 4 so that the application runs in multi-threaded mode.

#include "mkl_dfti.h"
#include "omp.h"

int main (void) {
    float _Complex x[200][100];
    MKL_LONG status;
    DFTI_DESCRIPTOR_HANDLE desc_handle;
    int nThread = omp_get_max_threads ();  // expected to be 4: x holds exactly four 50*100 blocks
    MKL_LONG len[2];

    //...put input data into x[j][k] 0<=j<=199, 0<=k<=99

    len[0] = 50; len[1] = 100;

    status = DftiCreateDescriptor (&desc_handle, DFTI_SINGLE, DFTI_COMPLEX, 2, len);
    // tell the descriptor how many threads will share it, before committing
    status = DftiSetValue (desc_handle, DFTI_NUMBER_OF_USER_THREADS, (MKL_LONG) nThread);
    status = DftiCommitDescriptor (desc_handle);

    // each thread calculates a complex FFT for one 50*100 block of x
    #pragma omp parallel num_threads(nThread)
    {
        MKL_LONG myStatus;
        int myID = omp_get_thread_num ();

        myStatus = DftiComputeForward (desc_handle, &x [myID * len[0]] [0]);
    } /* End OpenMP parallel region */

    status = DftiFreeDescriptor (&desc_handle);

    return 0;
}

 

In this case, you must not change the number of threads, or any other configuration parameter, after the DftiCommitDescriptor() function has completed FFT initialization.

In cases "1", "2", and "3", listed above, set the parameter DFTI_NUMBER_OF_USER_THREADS to 1 (its default value), since each particular descriptor instance is used only in a single thread.

In case "4", you must use the DftiSetValue() function to set DFTI_NUMBER_OF_USER_THREADS to the actual number of FFT computation threads, because multiple threads use the same descriptor. If you omit this setting, your program will work incorrectly or fail, since the descriptor contains individual data for each thread.
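
The rule in the two paragraphs above can be written as a one-line decision. The sketch below is only an illustration of that rule; user_threads_setting is a hypothetical helper, not an Intel MKL function, and the value it returns would still have to be passed to DftiSetValue() before the descriptor is committed.

```c
#include <assert.h>

/* Hypothetical helper: returns the value that the text above prescribes for
   DFTI_NUMBER_OF_USER_THREADS.
   descriptor_is_shared: nonzero when several threads call the compute
   function on the same descriptor (case "4"), zero otherwise.
   compute_threads: number of threads that perform FFT computation. */
int user_threads_setting(int descriptor_is_shared, int compute_threads)
{
    assert(compute_threads >= 1);
    return descriptor_is_shared ? compute_threads : 1;
}
```

For cases "1", "2", and "3" the function returns 1 regardless of the thread count; for case "4" with four computation threads it returns 4.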

 

Warning:

• Do not parallelize your program and employ Intel MKL internal threading at the same time, because this degrades performance. Note that in case "4" above, FFT computation is automatically initiated in single-threaded mode.

• You must not change the number of threads after the DftiCommitDescriptor() function has completed FFT initialization.

 
