Worksharing Using OpenMP*

To get the maximum performance benefit from a processor with multi-core and Intel® Hyper-Threading Technology (Intel® HT Technology), an application needs to be executed in parallel. Parallel execution requires threads, and threading an application is not trivial; OpenMP* can make the process much easier. Using the OpenMP* pragmas, most loops with no loop-carried dependencies can be threaded with one simple statement. This topic explains how to start using OpenMP* to parallelize loops, a process also called worksharing.
Options that use OpenMP* are available for both Intel® and non-Intel microprocessors, but these options may perform additional optimizations on Intel® microprocessors that they do not perform on non-Intel microprocessors. The list of major, user-visible OpenMP* constructs and features that may perform differently on Intel® microprocessors than on non-Intel microprocessors includes: locks (internal and user visible), the SINGLE construct, barriers (explicit and implicit), parallel loop scheduling, reductions, memory allocation, and thread affinity and binding.
Most loops can be threaded by inserting one pragma immediately prior to the loop. Further, by leaving the details to the Intel® C++ Compiler and OpenMP*, you can spend more time determining which loops should be threaded and how to best restructure the algorithms for maximum performance. The maximum performance of OpenMP* is realized when it is used to thread hotspots, the most time-consuming loops in your application.
The power and simplicity of OpenMP* are best demonstrated with an example. The following loop converts each 32-bit RGB (red, green, blue) pixel to an 8-bit gray-scale pixel. One pragma, inserted immediately before the loop, is all that is needed for parallel execution.
Example
#pragma omp parallel for
for (i=0; i < numPixels; i++)
{
    pGrayScaleBitmap[i] = (unsigned BYTE)
        (pRGBBitmap[i].red * 0.299 +
         pRGBBitmap[i].green * 0.587 +
         pRGBBitmap[i].blue * 0.114);
}
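The fragment above omits the surrounding declarations. The following is a minimal, self-contained sketch of the same loop; the BYTE typedef, the RGBA struct layout, and the image size are assumptions made for illustration, and the program must be compiled with an OpenMP* option (for example, -qopenmp with the Intel® C++ Compiler).
Example
#include <stdlib.h>

typedef unsigned char BYTE;                             /* assumed 8-bit gray-scale type */
typedef struct { BYTE red, green, blue, alpha; } RGBA;  /* assumed 32-bit pixel layout */

int main(void)
{
    const int numPixels = 1024 * 768;                   /* assumed image size */
    RGBA *pRGBBitmap      = (RGBA *)malloc(numPixels * sizeof(RGBA));
    BYTE *pGrayScaleBitmap = (BYTE *)malloc(numPixels * sizeof(BYTE));
    int i;

    /* ... fill pRGBBitmap with image data here ... */

    /* One pragma immediately before the loop is all that is
       needed for parallel execution. */
    #pragma omp parallel for
    for (i = 0; i < numPixels; i++)
    {
        pGrayScaleBitmap[i] = (BYTE)(pRGBBitmap[i].red   * 0.299 +
                                     pRGBBitmap[i].green * 0.587 +
                                     pRGBBitmap[i].blue  * 0.114);
    }

    free(pRGBBitmap);
    free(pGrayScaleBitmap);
    return 0;
}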
First, the example uses worksharing, which is the general term used in OpenMP* to describe the distribution of work across threads. When worksharing is used with the for construct, as shown in the example, the iterations of the loop are distributed among multiple threads so that each loop iteration is executed exactly once, with different iterations executing in parallel when more than one thread is available. Since there is no explicit num_threads clause, OpenMP* determines the number of threads to create and how to best create, synchronize, and destroy them (a short sketch with an explicit num_threads clause follows the list below). OpenMP* places the following five restrictions on which loops can be threaded:
  • The loop variable must be of type signed or unsigned integer, random access iterator, or pointer.
  • The comparison operation must be in the form loop_variable <, <=, >, or >= loop_invariant_expression of a compatible type.
  • The third expression, or increment portion, of the for loop must be either addition or subtraction by a loop-invariant value.
  • If the comparison operation is < or <=, the loop variable must increment on every iteration; conversely, if the comparison operation is > or >=, the loop variable must decrement on every iteration.
  • The loop body must be single-entry-single-exit, meaning no jumps are permitted from inside the loop to outside it, with the exception of the exit statement, which terminates the whole application. If goto or break statements are used, they must jump within the loop, not outside it. Similarly, for exception handling, exceptions must be caught within the loop.
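As noted above, the gray-scale example has no num_threads clause, so the OpenMP* runtime chooses the number of threads. The following minimal sketch shows how a specific count can be requested explicitly; the value 4 and the use of printf are illustrative assumptions, not requirements.
Example
#include <omp.h>
#include <stdio.h>

int main(void)
{
    int i;
    /* Request four threads explicitly rather than letting the
       runtime decide; the value 4 is arbitrary, for illustration only. */
    #pragma omp parallel for num_threads(4)
    for (i = 0; i < 8; i++)
    {
        printf("iteration %d ran on thread %d\n", i, omp_get_thread_num());
    }
    return 0;
}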
Although these restrictions might sound somewhat limiting, most loops can easily be rewritten to conform to them.
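For example, a while loop that walks an array is not in the canonical for form described above, but it can usually be recast, as in the following hypothetical sketch (the function name scale_array and its parameters are assumptions made for illustration).
Example
#include <stddef.h>

/* Hypothetical helper: scale every element of an array of length n. */
void scale_array(double *a, size_t n, double factor)
{
    /* Non-conforming original: a while loop is not in the
       canonical for form that OpenMP* can analyze.

           size_t i = 0;
           while (i < n) {
               a[i] *= factor;
               i++;
           }
    */

    /* Conforming rewrite: an unsigned integer loop variable, a '<'
       comparison against a loop-invariant bound, and an increment by
       a loop-invariant value, satisfying the restrictions above. */
    size_t i;
    #pragma omp parallel for
    for (i = 0; i < n; i++)
    {
        a[i] *= factor;
    }
}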