OpenMP Loop Collapse Directive

Compiler Methodology for Intel® MIC Architecture


Use the OpenMP* collapse clause to increase the total number of iterations that are partitioned across the available OpenMP threads, thereby reducing the granularity of work done by each thread. If the amount of work per thread is still non-trivial after collapsing, this can improve the parallel scalability of the application.

You can improve performance by avoiding use of the collapsed-loop indices inside the collapsed loop nest where possible. The compiler must recreate them from the single collapsed index using divide/modulo operations, and when the uses are complicated enough, those operations are not removed by dead-code elimination:

#pragma omp parallel for collapse(2)
  for (i = 0; i < imax; i++) {
    for (j = 0; j < jmax; j++)
      a[j + jmax*i] = 1.0;
  }

Note that a naive attempt to avoid the index computation by writing through a shared counter (a[k++]) is incorrect: under collapse(2) the iterations are partitioned across threads, so incrementing a shared k is a race condition. A safe alternative is to collapse the loop nest manually and use the single loop index directly:

#pragma omp parallel for
  for (int ij = 0; ij < imax*jmax; ij++)
    a[ij] = 1.0;


It is essential that you read this guide from start to finish using the built-in hyperlinks to guide you along a path to a successful port and tuning of your application(s) on Intel® Xeon Phi™ coprocessor. The paths provided in this guide reflect the steps necessary to get best possible application performance.



1 comment

Cownie, James H (Intel):

As pointed out on StackOverflow, the second example here (using k++) is broken.

It introduces a race condition on k (which is also undeclared!).

A safe and correct optimised example would look like this:

#pragma omp parallel for
for (int ij=0; ij<imax*jmax; ij++)
    a[ij] = 1.0;


