Performance Tools for Software Developers - Auto-parallelization and /Qpar-threshold

The auto-parallelization feature of the Intel C++ Compiler automatically translates serial portions of the input program into semantically equivalent multithreaded code. Automatic parallelization determines which loops are good work-sharing candidates, performs the dataflow analysis needed to verify correct parallel execution, and partitions the data for threaded code generation, as would otherwise be done by hand when programming with OpenMP directives. Applications built with OpenMP or auto-parallelization can benefit from the performance gains of shared memory on multiprocessor systems, on both IA-32 and Intel 64.

The following options control auto-parallelization:

/Qparallel
Enables the auto-parallelizer to generate multithreaded code for loops that can be safely executed in parallel.

/Qpar-threshold:n
This option sets a threshold for the auto-parallelization of loops based on the probability of profitable execution of the loop in parallel. To use this option, you must also specify -parallel (Linux and Mac OS X) or /Qparallel (Windows). The default is /Qpar-threshold:100.

This option is useful for loops whose computation work volume cannot be determined at compile-time. The threshold is usually relevant when the loop trip count is unknown at compile-time.

The compiler applies a heuristic that balances the overhead of creating multiple threads against the amount of work available to be shared among those threads.
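As an illustration of a loop whose work volume cannot be determined at compile time, consider the minimal sketch below. The function name sum_squares is hypothetical and not part of the original sample. The trip count n arrives as a run-time argument, so the compiler has to estimate whether threading would pay off rather than compute it.

```c
#include <stddef.h>

/* Hypothetical helper: returns the sum of the squares of n values.
   n is a run-time argument, so the trip count of the loop below is
   unknown when this function is compiled. */
double sum_squares(const double *x, size_t n)
{
    double sum = 0.0;
    size_t i;

    /* Simple sum reduction over a run-time trip count */
    for (i = 0; i < n; i++)
        sum += x[i] * x[i];

    return sum;
}
```

Whether a given compiler version actually parallelizes this loop depends on its analysis and on the threshold in effect; the report produced by /Qpar-report3 is the way to check.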

Here n is an integer from 0 through 100 that sets the threshold for the auto-parallelization of loops. If n is 0, loops are always auto-parallelized, regardless of computation work volume. If n is 100, loops are auto-parallelized only when the compiler analysis data predicts a performance gain, that is, only if profitable parallel execution is almost certain. The intermediate values 1 to 99 represent the percentage probability of a profitable speed-up. For example, n=50 directs the compiler to parallelize a loop only if it estimates at least a 50% probability that the code will speed up when executed in parallel.
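One way to read the threshold values is as a simple comparison. The sketch below is only a mental model and is not the compiler's actual implementation: the function would_parallelize and its estimated_profit_pct parameter are hypothetical names introduced here for illustration.

```c
/* Illustrative model only, NOT the compiler's real decision logic.
   estimated_profit_pct: hypothetical 0..100 estimate of how likely the
                         loop is to run faster in parallel.
   threshold:            the n passed to /Qpar-threshold:n.
   Returns 1 if the loop would be parallelized under this model, else 0. */
int would_parallelize(int estimated_profit_pct, int threshold)
{
    return estimated_profit_pct >= threshold;
}
```

Under this model a threshold of 0 accepts every loop, a threshold of 100 accepts only loops whose benefit is certain, and a threshold of 50 accepts a loop estimated at 60% but rejects one estimated at 40%.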

Also, to be "100%" sure that a loop will benefit from parallelization, the compiler needs to know the iteration count at compile time. For a "99%" or lower threshold, knowing the iteration count at compile time is not a requirement.

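This leads to a big difference in the number of loops parallelized at a threshold of 99 compared to 100. For many applications, 99 is the better setting, but applications with many short loops can slow down at 99, because the overhead of creating threads outweighs the small amount of work in each loop.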
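The difference between the two settings can be pictured with the following minimal sketch. The names FIXED_N, fill_fixed and fill_runtime are hypothetical and chosen only for illustration. The first loop has a trip count the compiler can see; the second depends on a run-time argument.

```c
#include <stddef.h>

#define FIXED_N 100000  /* hypothetical size, for illustration only */

/* Trip count is the compile-time constant FIXED_N, so the compiler
   knows the full amount of work when it compiles this function. */
void fill_fixed(double *a)
{
    size_t i;
    for (i = 0; i < FIXED_N; i++)
        a[i] = (double)i * 0.5;
}

/* Trip count n is known only at run time. If callers usually pass a
   small n, threads created for this loop may cost more than they save. */
void fill_runtime(double *a, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        a[i] = (double)i * 2.0;
}
```

Going by the description above, a loop like the one in fill_runtime is the kind that a threshold of 100 would leave serial and a threshold of 99 might parallelize. The actual outcome depends on the compiler version and its analysis, so the /Qpar-report3 output should be consulted for each build.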

The following example, int_sin.c, is not auto-parallelized when compiled with /Qpar-threshold:100 using the command line below:

C:\> icl -c /Qparallel /Qpar-report3 /Qpar-threshold:100 int_sin.c

If we use /Qpar-threshold:99 then it is parallelized.

Example:

// int_sin.c
// Intel C++ compiler sample program

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <mathimf.h>

// Function to be integrated
// Define and prototype it here
// | sin(x) |
#define INTEG_FUNC(x) fabs(sin(x))

// Prototype timing function
double dclock(void);

int main(void)
{
    // Loop counters and number of interior points
    unsigned int i, j, N;
    // Stepsize, independent variable x, and accumulated sum
    double step, x_i, sum;
    // Timing variables for evaluation
    double start, finish, duration;
    // Start integral from
    double interval_begin = 0.0;
    // Complete integral at
    double interval_end = 2.0 * 3.141592653589793238;

    // Start timing for the entire application
    start = clock();

    printf("\n");
    printf(" Number of | Computed Integral |\n");
    printf(" Interior Points | |\n");
    for (j = 2; j < 10; j++)
    {
        printf("-------------------------------------\n");

        // Compute the number of (internal rectangles + 1)
        N = 1 << j;

        // Compute stepsize for N-1 internal rectangles
        step = (interval_end - interval_begin) / N;

        // Approx. 1/2 area in first rectangle: f(x0) * [step/2]
        sum = INTEG_FUNC(interval_begin) * step / 2.0;

        // Apply midpoint rule:
        // Given length = f(x), compute the area of the
        // rectangle of width step
        // Sum areas of internal rectangles: f(x_i) * step
        for (i = 1; i < N; i++)
        {
            x_i = i * step;
            sum += INTEG_FUNC(x_i) * step;
        }

        // Approx. 1/2 area in last rectangle: f(xN) * [step/2]
        sum += INTEG_FUNC(interval_end) * step / 2.0;

        // N is unsigned, so print it with %u
        printf(" %10u | %14e |\n", N, sum);
    }
    finish = clock();
    duration = (finish - start);
    printf("\n");
    printf(" Application Clocks = %10e\n", duration);
    printf("\n");
    return 0;
}
 
For more complete information about compiler optimizations, see our Optimization Notice.