Developer Guide and Reference


Worksharing Using OpenMP*

To get the maximum performance benefit from a processor with multi-core and Intel® Hyper-Threading Technology (Intel® HT Technology), an application needs to be executed in parallel. Parallel execution requires threads, and threading an application is not a simple thing to do; using OpenMP* can make the process a lot easier. Using the OpenMP* pragmas, most loops with no loop-carried dependencies can be threaded with one simple statement. This topic explains how to start using OpenMP* to parallelize loops, which is also called worksharing.
Options that use OpenMP* are available for both Intel® and non-Intel microprocessors, but these options may perform additional optimizations on Intel® microprocessors than they perform on non-Intel microprocessors. The list of major, user-visible OpenMP* constructs and features that may perform differently on Intel® microprocessors than on non-Intel microprocessors includes: locks (internal and user visible), the SINGLE construct, barriers (explicit and implicit), parallel loop scheduling, reductions, memory allocation, and thread affinity and binding.
Most loops can be threaded by inserting one pragma immediately prior to the loop. Further, by leaving the details to the Intel® C++ Compiler and OpenMP*, you can spend more time determining which loops should be threaded and how to best restructure the algorithms for maximum performance. The maximum performance of OpenMP* is realized when it is used to thread hotspots, the most time-consuming loops in your application.
The power and simplicity of OpenMP* is demonstrated by looking at an example. The following loop converts a 32-bit RGB (red, green, blue) pixel to an 8-bit gray-scale pixel. One pragma, which has been inserted immediately before the loop, is all that is needed for parallel execution.
#pragma omp parallel for for (i=0; i < numPixels; i++) { pGrayScaleBitmap[i] = (unsigned BYTE) (pRGBBitmap[i].red * 0.299 + pRGBBitmap[i].green * 0.587 + pRGBBitmap[i].blue * 0.114); }
First, the example uses worksharing, which is the general term used in OpenMP* to describe distribution of work across threads. When worksharing is used with the
construct, as shown in the example, the iterations of the loop are distributed among multiple threads so that each loop iteration is executed exactly once with different iterations executing if there is more than one available threads. Since there is no explicit
clause, OpenMP* determines the number of threads to create and how to best create, synchronize, and destroy them. OpenMP* places the following five restrictions on which loops can be threaded:
  • The loop variable must be of type signed or unsigned integer, random access iterator, or pointer.
  • The comparison operation must be in the form
    loop_variable <, <=, >, or >= loop_invariant_expression
    of a compatible type.
  • The third expression or increment portion of the
    loop must be either addition or subtraction by a loop invariant value.
  • If the comparison operation is < or <=, the loop variable must increment on every iteration; conversely, if the comparison operation is > or >=, the loop variable must decrement on every iteration.
  • The loop body must be single-entry-single-exit, meaning no jumps are permitted from inside to outside the loop, with the exception of the
    statement that terminates the whole application. If the statements
    are used, the statements must jump within the loop, not outside it. Similarly, for exception handling, exceptions must be caught within the loop.
Although these restrictions might sound somewhat limiting, non-conforming loops can frequently be rewritten to follow these restrictions.

Basics of Compilation

Using the OpenMP* pragmas requires an OpenMP-compatible compiler and thread-safe libraries. Adding the
option to the compiler instructs the compiler to pay attention to the OpenMP* pragmas and to insert threads. If you omit the
option, the compiler will ignore OpenMP* pragmas, which provides a very simple way to generate a single-threaded version without changing any source code.
For conditional compilation, the compiler defines the
macro. If needed, the macro can be tested as shown in the following example.
#ifdef _OPENMP fn(); #endif

A Few Simple Examples

The following examples illustrate how simple OpenMP* is to use. In common practice, additional issues need to be addressed, but these examples illustrate a good starting point.
In the first example, the following loop clips an array to the range from 0 to 255.
// clip an array to 0 <= x <= 255 for (i=0; i < numElements; i++) { if (array[i] < 0) array[i] = 0; else if (array[i] > 255) array[i] = 255; }
You can thread it using a single OpenMP* pragma; insert the pragma immediately prior to the loop:
#pragma omp parallel for for (i=0; i < numElements; i++) { if (array[i] < 0) array[i] = 0; else if (array[i] > 255) array[i] = 255; }
In the second example, the loop generates a table of square roots for the numbers from 0 to 100.
double value; double roots[100]; for (value = 0.0; value < 100.0; value ++) { roots[(int)value] = sqrt(value); }
Thread the loop by changing the loop variable to a signed integer or unsigned integer and inserting a
#pragma omp parallel
int value; double roots[100]; #pragma omp parallel for for (value = 0; value < 100; value ++) { roots[value] = sqrt((double)value); }

Avoiding Data Dependencies and Race Conditions

When a loop meets all five loop restrictions (listed above) and the compiler threads the loop, the loop still might not work correctly due to the existence of data dependencies.
Data dependencies exist when different iterations of a loop (more specifically a loop iteration that is executed on a different thread) read or write the same location in shared memory. Consider the following example that calculates factorials.
// Each loop iteration writes a value that a different iteration reads. #pragma omp parallel for for (i=2; i < 10; i++) { factorial[i] = i * factorial[i-1]; }
The compiler will thread this loop, but the threading will fail because at least one of the loop iterations is data-dependent upon a different iteration. This situation is referred to as a race condition. Race conditions can only occur when using shared resources (like memory) and parallel execution. To address this problem either rewrite the loop or pick a different algorithm, one that does not contain the race condition.
Race conditions are difficult to detect because, for a given case or system, the threads might win the race in the order that happens to make the program function correctly. Because a program works once does not mean that the program will work under all conditions. Testing your program on various machines, some with Intel® Hyper-Threading Technology and some with multiple physical processors, is a good starting point to help identify race conditions.
Traditional debuggers are useless for detecting race conditions because they cause one thread to stop the race while the other threads continue to significantly change the runtime behavior; however, thread checking tools can help.

Managing Shared and Private Data

Nearly every loop (in real applications) reads from or writes to memory; it's your responsibility, as the developer, to instruct the compiler what memory should be shared among the threads and what memory should be kept private. When memory is identified as shared, all threads access the same memory location. When memory is identified as private, however, a separate copy of the variable is made for each thread to access in private. When the loop ends, the private copies are destroyed. By default, all variables are shared except for the loop variable, which is private.
Memory can be declared as private in two ways:
  • Declare the variable inside the loop-really inside the parallel OpenMP* pragma-without the static keyword.
  • Specify the private clause on an OpenMP* pragma.
The following loop fails to function correctly because the variable
is shared. It should be private.