Intel® C++ Compiler Classic Developer Guide and Reference

ID 767249
Date 12/16/2022
Public


Enabling Further Loop Parallelization for Multicore Platforms

Parallelizing loops for multicore platforms is subject to certain conditions. Three requirements must be met for the compiler to parallelize a loop:

  • The number of iterations must be known before entry into a loop to ensure that the work can be divided in advance. A do while loop, for example, usually cannot be made parallel.
  • There can be no jumps into or out of the loop.
  • The loop iterations must be independent (no cross-iteration dependencies).

Correct results must not logically depend on the order in which the iterations are executed. There may be slight variations in the accumulated rounding error, for example, when the same quantities are added in a different order. In some cases, such as summing an array or other uses of temporary scalars, the compiler may be able to remove an apparent dependency by a simple transformation.
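
For example, a sum reduction updates the same scalar in every iteration; this looks like a dependency, but the compiler can typically transform it by giving each thread a private partial sum and combining the results. The sketch below is illustrative only.

float total(const float *a, int n) {
  float sum = 0.0f;           // apparent cross-iteration dependency on sum
  for (int i = 0; i < n; i++) {
    sum += a[i];              // the compiler can privatize partial sums per thread
  }
  return sum;
}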

Potential aliasing of pointers or array references is another common impediment to safe parallelization. Two pointers are aliased if both point to the same memory location. The compiler may not be able to determine whether two pointers or array references point to the same memory location, for example, if they depend on function arguments, runtime data, or the results of complex calculations.

If the compiler cannot prove that pointers or array references are safe, it does not parallelize the loop, except in limited cases where it deems it worthwhile to generate alternative code paths that test explicitly for aliasing at runtime.

In C, an alternative way to assert that a pointer is not aliased is to use the restrict keyword in the pointer declaration, together with the [Q]restrict command-line option. In any case, the compiler never parallelizes a loop that it can prove to be unsafe.
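
The following sketch (the function name is illustrative) uses restrict to assert that the two pointer parameters never refer to the same memory; compile with the [Q]restrict option, in addition to [Q]parallel, as described above.

void scale(float *restrict a, const float *restrict b, int n) {
  // restrict asserts that a and b never point to the same memory,
  // so the compiler is free to parallelize the loop
  for (int i = 0; i < n; i++) {
    a[i] = 2.0f * b[i];
  }
}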

If you know parallelizing a particular loop is safe and that potential aliases can be ignored, you can instruct the compiler to parallelize the loop using the #pragma parallel pragma.

Parallelizing Loops with Cross-iteration Dependencies

Before the compiler can auto-parallelize a loop, it must prove that the loop does not have potential cross-iteration dependencies that prevent parallelization. A cross-iteration dependency exists if a memory location is written to in one iteration of a loop and accessed (read from or written to) in another iteration of the loop. Cross-iteration dependencies often occur in loops that access overlapping array ranges, such as a loop that reads elements a[1] through a[100] and writes elements a[0] through a[99].
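
For instance, the following illustrative loop (not taken from this guide) has a flow dependency: iteration i reads a[i - 1], which was written by iteration i - 1, so the iterations cannot safely run in parallel.

void running_sum(float *a, int n) {
  for (int i = 1; i < n; i++) {
    a[i] = a[i] + a[i - 1];   // reads the value written in the previous iteration
  }
}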

Sometimes, even though a loop does not have cross-iteration dependencies, the compiler does not have enough information to prove it and therefore does not parallelize the loop. In such cases, you can assist the compiler by providing additional information about the loop using the #pragma parallel pragma. Adding #pragma parallel before a for loop informs the compiler that the loop has no cross-iteration dependencies, and the auto-parallelization analysis then ignores the potential dependencies it would otherwise have to assume. However, the compiler may still decline to parallelize the loop if its heuristics estimate that parallelization is unlikely to improve the loop's performance.

The #pragma parallel always pragma ignores potential dependencies in the same way as #pragma parallel, but it also overrides the compiler heuristics that estimate how likely parallelization of the loop is to improve performance. It allows a loop to be parallelized even when the compiler estimates that parallelization might not improve performance.

The #pragma noparallel pragma prevents auto-parallelization of the immediately following for loop. Unlike #pragma parallel, which is a hint, #pragma noparallel is guaranteed to prevent parallelization of that loop.

These pragmas take effect only if auto-parallelization is enabled by the option [Q]parallel.
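
The following sketch (the loops and names are illustrative only) shows the three pragmas applied to simple for loops; compile with the [Q]parallel option so they take effect.

void update(float *a, const float *b, const float *c, int n) {
  #pragma parallel          // hint: assume no cross-iteration dependencies
  for (int i = 0; i < n; i++)
    a[i] = b[i] + c[i];

  #pragma parallel always   // also override the profitability heuristics
  for (int i = 0; i < n; i++)
    a[i] = a[i] * b[i];

  #pragma noparallel        // never auto-parallelize this loop
  for (int i = 0; i < n; i++)
    a[i] = a[i] + 1.0f;
}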

Parallelizing Loops with Private Clauses

When you use the Guided Auto Parallelism feature, the compiler’s auto-parallelizer gives you advice on where to alter your program to enhance parallelization. For instance, you may be advised to check whether a condition that the compiler could not prove actually holds. If it does, you can insert #pragma parallel in your source code so that the associated loop is parallelized when you recompile.

To specify that it is legal for each thread to create a new, private copy of a variable (not visible to other threads) and to replace the original variable in the loop with that private copy, use the #pragma parallel pragma with the private clause. The private clause lets you list scalar and array variables and specify the number of array elements to privatize.

Use the firstprivate clause to specify private variables that need to be initialized with the original value before entering the parallel loop.

Use the lastprivate clause to specify privatized variables whose values you want to reuse after the parallelized loop exits. When you use the lastprivate clause for a particular privatized variable, the value from the last iteration is copied back to the original variable when the loop exits.

NOTE:

Do not use the same variable in both private and lastprivate clauses for the same loop. You will get an error message.
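
The following is a minimal sketch of the clause syntax, assuming the firstprivate and lastprivate clauses of #pragma parallel described above; the function and variable names are illustrative only.

void scale_and_keep_last(float *a, const float *b, int n) {
  float factor = 2.0f;   // must carry its initial value into the loop
  float last = 0.0f;     // value wanted after the loop exits
  #pragma parallel firstprivate(factor) lastprivate(last)
  for (int i = 0; i < n; i++) {
    last = b[i] * factor;   // each thread works on its own private copies
    a[i] = last;            // last from the final iteration is copied back
  }
}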

Parallelizing Loops with External Function Calls

The compiler can only effectively analyze loops with a relatively simple structure. For example, the compiler cannot determine the thread safety of a loop containing external function calls because it does not know whether the function call might have side effects that introduce dependencies. You can invoke interprocedural optimization with the [Q]ipo option. Using this option allows the compiler to analyze the called function for side effects.
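
For example, in the following sketch (the function names are illustrative) the loop calls a function defined in another source file; without interprocedural analysis the compiler must assume the call might have side effects, but compiling with the [Q]ipo option lets it analyze the callee.

extern float transform(float x);   // defined in another translation unit

void apply(float *a, int n) {
  for (int i = 0; i < n; i++) {
    a[i] = transform(a[i]);   // side effects unknown without interprocedural analysis
  }
}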

Parallelizing Loops with OpenMP*

When the compiler is unable to automatically parallelize loops you know to be parallel, use OpenMP*. OpenMP* is the preferred solution because you understand the code better than the compiler and can express parallelism at a coarser granularity. Alternatively, automatic parallelization can be effective for nested loops, such as those in a matrix multiply. Moderately coarse-grained parallelism results from threading of the outer loop, allowing the inner loops to be optimized for fine-grained parallelism using vectorization or software pipelining.
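
For illustration, the following sketch (not taken from this guide) uses OpenMP* to thread the outer loop of a matrix multiply, leaving the inner loops available for fine-grained optimization such as vectorization; compile with the compiler's OpenMP* option (for example, -qopenmp on Linux*).

void matmul(int n, const float *a, const float *b, float *c) {
  #pragma omp parallel for          // coarse-grained: outer-loop iterations split across threads
  for (int i = 0; i < n; i++) {
    for (int j = 0; j < n; j++) {
      float sum = 0.0f;
      for (int k = 0; k < n; k++) {
        sum += a[i * n + k] * b[k * n + j];   // inner loop is a candidate for vectorization
      }
      c[i * n + j] = sum;
    }
  }
}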

Threshold Parameter to Parallelize Loops

If a loop can be parallelized, it does not necessarily mean that it should be parallelized. The compiler uses a threshold parameter to decide whether to parallelize a loop; the [Q]par-threshold compiler option adjusts this behavior. The threshold ranges from 0 to 100, where 0 instructs the compiler to always parallelize a safe loop and 100 instructs the compiler to parallelize only those loops for which a performance gain is highly probable. Use the -qopt-report-phase=par (Linux* and macOS) or /Qopt-report-phase:par (Windows*) option to determine which loops were parallelized. The compiler also reports which loops could not be parallelized and indicates the probable reasons why. See OpenMP* and Parallel Processing Options for more information on using these compiler options.
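
For example, a command similar to the following (the source file name is hypothetical) sets the threshold to 50 and requests a parallelization report:

//Linux* and macOS
icpc -c -parallel -par-threshold50 -qopt-report-phase=par mysource.cpp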

The following example illustrates using the options in combination.

Example code

void add (int k, float *a, float *b) {
  for (int i = 1; i < 10000; i++) {
    a[i] = a[i+k] + b[i];
  }
}

Entering a compiler command similar to the following at the command line results in the compiler issuing parallelization messages:

//Linux* and macOS
icpc -c -parallel -qopt-report-phase=par -qopt-report=3 add.cpp

The compiler might report results similar to the reports listed below:

Sample results

add.cpp
procedure: add
serial loop: line 2
anti data dependence assumed from line 2 to line 2, due to "a"
flow data dependence assumed from line 2 to line 2, due to "a"
flow data dependence assumed from line 2 to line 2, due to "a"

Because the compiler does not know the value of k, it must assume that the iterations depend on each other, as they would, for example, if k equals -1, even if the actual value never creates a dependency. You can override this assumption by inserting the #pragma parallel pragma.

Example

void add(int k, float *a, float *b) {
  #pragma parallel
  for (int i = 0; i < 10000; i++) {
    a[i] = a[i+k] + b[i];
  }
}

CAUTION:

Do not call this function with a value of k that is less than 10000; passing such a value could lead to incorrect results.