Developer Guide and Reference

Worksharing Using OpenMP*

To get the maximum performance benefit from a processor with multi-core and Intel® Hyper-Threading Technology (Intel® HT Technology), an application needs to be executed in parallel. Parallel execution requires threads, and threading an application is not trivial; using OpenMP* can make the process much easier. Using the OpenMP* pragmas, most loops with no loop-carried dependencies can be threaded with one simple statement. This topic explains how to start using OpenMP* to parallelize loops, which is also called worksharing.
Options that use OpenMP* are available for both Intel® and non-Intel microprocessors, but these options may perform additional optimizations on Intel® microprocessors that they do not perform on non-Intel microprocessors. The list of major, user-visible OpenMP* constructs and features that may perform differently on Intel® microprocessors than on non-Intel microprocessors includes: locks (internal and user-visible), the SINGLE construct, barriers (explicit and implicit), parallel loop scheduling, reductions, memory allocation, and thread affinity and binding.
Most loops can be threaded by inserting one pragma immediately prior to the loop. Further, by leaving the details to the Intel® C++ Compiler and OpenMP*, you can spend more time determining which loops should be threaded and how to best restructure the algorithms for maximum performance. The maximum performance of OpenMP* is realized when it is used to thread hotspots, the most time-consuming loops in your application.
The power and simplicity of OpenMP* are best demonstrated with an example. The following loop converts each 32-bit RGB (red, green, blue) pixel to an 8-bit gray-scale pixel. One pragma, inserted immediately before the loop, is all that is needed for parallel execution.
Example
#pragma omp parallel for
for (i = 0; i < numPixels; i++)
{
    pGrayScaleBitmap[i] = (unsigned BYTE)
        (pRGBBitmap[i].red * 0.299 +
         pRGBBitmap[i].green * 0.587 +
         pRGBBitmap[i].blue * 0.114);
}
First, the example uses worksharing, the general term used in OpenMP* to describe the distribution of work across threads. When worksharing is used with the for construct, as shown in the example, the iterations of the loop are distributed among multiple threads so that each loop iteration is executed exactly once, with different iterations executing in parallel when more than one thread is available. Since there is no explicit num_threads clause, OpenMP* determines the number of threads to create and how to best create, synchronize, and destroy them; a sketch with an explicit num_threads clause appears after the list below. OpenMP* places the following five restrictions on which loops can be threaded:
  • The loop variable must be of type signed or unsigned integer, random access iterator, or pointer.
  • The comparison operation must be in the form loop_variable <, <=, >, or >= loop_invariant_expression of a compatible type.
  • The third expression, or increment portion, of the for loop must be either addition or subtraction by a loop-invariant value.
  • If the comparison operation is < or <=, the loop variable must increment on every iteration; conversely, if the comparison operation is > or >=, the loop variable must decrement on every iteration.
  • The loop must be a single-entry, single-exit loop: no jumps from the inside of the loop to the outside, or from the outside to the inside, are permitted.
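To make these restrictions concrete, the following sketch contrasts a loop that satisfies the canonical form with one that does not. The array names, the size N, and the early-exit condition are illustrative assumptions, not part of the example above.

Example

#include <stdio.h>

#define N 1000

int main(void)
{
    double a[N], b[N], c[N], sum = 0.0;
    int i;

    for (i = 0; i < N; i++) { b[i] = i; c[i] = 2.0 * i; }

    /* Threadable: integer loop variable, '<' comparison against a
       loop-invariant bound, increment by a loop-invariant value. */
    #pragma omp parallel for
    for (i = 0; i < N; i++)
        a[i] = b[i] + c[i];

    /* NOT threadable as written: the exit condition depends on a value
       computed inside the loop, so the trip count is unknown at entry. */
    for (i = 0; i < N && sum < 100.0; i++)
        sum += a[i];

    printf("sum = %f\n", sum);
    return 0;
}

The second loop can still be parallelized, but only after restructuring, for example by computing the full sum with a reduction and locating the exit point separately.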
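As noted earlier, when no num_threads clause is present the OpenMP* runtime chooses the team size. The following sketch shows the gray-scale loop with an explicit num_threads clause; the team size of four and the supporting definitions (NUM_PIXELS, the RGB struct, the sample data) are illustrative assumptions rather than part of the original example.

Example

#include <stdio.h>
#include <stdlib.h>

#define NUM_PIXELS 1024

typedef struct { unsigned char red, green, blue; } RGB;

int main(void)
{
    RGB *pRGBBitmap = malloc(NUM_PIXELS * sizeof(RGB));
    unsigned char *pGrayScaleBitmap = malloc(NUM_PIXELS * sizeof(unsigned char));
    int i;

    /* Fill the bitmap with sample pixel data. */
    for (i = 0; i < NUM_PIXELS; i++) {
        pRGBBitmap[i].red   = (unsigned char)(i % 256);
        pRGBBitmap[i].green = (unsigned char)((i * 2) % 256);
        pRGBBitmap[i].blue  = (unsigned char)((i * 3) % 256);
    }

    /* Request a team of four threads explicitly; without the clause,
       the runtime chooses the team size. */
    #pragma omp parallel for num_threads(4)
    for (i = 0; i < NUM_PIXELS; i++) {
        pGrayScaleBitmap[i] = (unsigned char)(pRGBBitmap[i].red   * 0.299 +
                                              pRGBBitmap[i].green * 0.587 +
                                              pRGBBitmap[i].blue  * 0.114);
    }

    printf("first gray value: %u\n", pGrayScaleBitmap[0]);
    free(pRGBBitmap);
    free(pGrayScaleBitmap);
    return 0;
}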