Advanced OpenMP* Programming


This is the last of three white papers that teach you, an experienced C/C++ programmer, how to get started with OpenMP*, which simplifies the creation, synchronization, and deletion of threads in your applications. The first paper introduced the most common feature of OpenMP: work sharing for loops. The second paper showed how to exploit non-loop parallelism and how to use the synchronization directives. This final paper discusses the library functions, the environment variables, how to debug your application when things go wrong, and some tips for maximizing performance.

Run-Time Library Functions

As you might remember, OpenMP is a set of pragmas, function calls, and environment variables. The first two papers discussed only the pragmas, which leaves the function calls and environment variables for this paper. The reason for this arrangement is simple: the pragmas are the reason for OpenMP, because they provide the highest degree of simplicity, they require only minimal source changes, and they can easily be ignored to generate a serial version of your code. Use of the function calls, on the other hand, requires program changes that may make it difficult to build a serial version when one is desired. When in doubt, always try to use the pragmas, and reserve the function calls for those times when they are absolutely necessary. To use the function calls, include the <omp.h> header file and, of course, continue to use the Intel® C++ Compiler command-line switch /Qopenmp. No additional libraries are required for linking.

The four most heavily used library functions, shown in the table below, retrieve the total number of threads, set the number of threads, return the current thread number, and return the number of available logical processors. The complete list of OpenMP library functions can be found on the OpenMP web site at www.openmp.org.

int omp_get_num_threads(void);
    Returns the number of threads currently in use. If called outside a parallel region, this function returns 1.

void omp_set_num_threads(int NumThreads);
    Sets the number of threads that will be used when entering a parallel region. It overrides the OMP_NUM_THREADS environment variable.

int omp_get_thread_num(void);
    Returns the current thread number, between 0 (the master thread) and the total number of threads minus 1.

int omp_get_num_procs(void);
    Returns the number of available logical processors. A processor with Hyper-Threading Technology enabled counts as two processors.


The example below uses these functions to print the alphabet.


#pragma omp parallel private(i)
{   // This code has a bug. Can you find it?
    int LettersPerThread = 26 / omp_get_num_threads();
    int ThisThreadNum = omp_get_thread_num();
    int StartLetter = 'a' + ThisThreadNum * LettersPerThread;
    int EndLetter = 'a' + ThisThreadNum * LettersPerThread + LettersPerThread;

    for (i = StartLetter; i < EndLetter; i++)
        printf("%c", i);
}



The example above illustrates a few important concepts about using the function calls instead of the pragmas. First, your code must be rewritten, and any rewrite means extra documentation, debugging, testing, and maintenance effort. Second, it becomes difficult or impossible to compile without OpenMP support. Third, it is very easy to introduce bugs, as in the loop above, which fails to print all the letters of the alphabet whenever 26 is not evenly divisible by the number of threads (the integer division silently drops the remainder). And finally, you lose the ability to adjust loop scheduling without creating your own work-queue algorithm, which is a lot of extra effort. You are limited to your own scheduling, which is most likely static scheduling, as shown in the example.
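For reference, here is one way the example could be corrected. This is only a sketch, assuming it is acceptable to let the last thread pick up the leftover letters; the variable names simply follow the original example.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int i;

    #pragma omp parallel private(i)
    {
        int NumThreads = omp_get_num_threads();
        int ThisThreadNum = omp_get_thread_num();
        int LettersPerThread = 26 / NumThreads;
        int StartLetter = 'a' + ThisThreadNum * LettersPerThread;
        int EndLetter = StartLetter + LettersPerThread;

        // The last thread takes any remainder, so the whole alphabet
        // is printed even when 26 % NumThreads != 0.
        if (ThisThreadNum == NumThreads - 1)
            EndLetter = 'a' + 26;

        for (i = StartLetter; i < EndLetter; i++)
            printf("%c", i);
    }
    printf("\n");
    return 0;
}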

The Environment Variables

The two most commonly used environment variables defined by the OpenMP specification are listed in the following table.

OMP_SCHEDULE
    Controls the scheduling of the for-loop work-sharing construct.
    Example: set OMP_SCHEDULE="guided, 2"

OMP_NUM_THREADS
    Sets the default number of threads. The omp_set_num_threads() function call can override this value.
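As a quick illustration of that override behavior, the sketch below assumes OMP_NUM_THREADS was set to 8 in the environment before the program started; the omp_set_num_threads() call then wins for subsequent parallel regions.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    // Assume the shell did: set OMP_NUM_THREADS=8
    // The function call below overrides that value.
    omp_set_num_threads(4);

    #pragma omp parallel
    {
        #pragma omp master
        printf("Threads in this team: %d\n", omp_get_num_threads());  // prints 4
    }
    return 0;
}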


Additional compiler-specific environment variables are usually available. Be sure to review your compiler's documentation to become familiar with them.


Debugging

Debugging threaded applications is tricky, because debuggers change the run-time timing of the code, which can mask race conditions. Even print statements can mask issues, because they use synchronization and operating system functions. OpenMP adds further complications: it inserts private variables, shared variables, and additional code that is impossible to examine and step through without a specialized OpenMP-aware debugger. Thus, your key debugging tool is the process of elimination.

First, realize that most mistakes are race conditions, and most race conditions are caused by shared variables that really should have been declared private. Start by looking at the variables inside the parallel regions and make sure that the variables are declared private when necessary. Also check functions called within parallel constructs. By default, variables declared on the stack are private, but the C/C++ keyword static changes the variable to static storage, making it shared across OpenMP threads. The default(none) clause, shown below, can be used to help find those hard-to-spot variables: if you specify default(none), then every variable used in the region must be declared with a data-sharing attribute clause.

#pragma omp parallel for default(none) private(x,y)
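A minimal sketch of how default(none) helps: every variable used inside the region must appear in an explicit clause, so a forgotten shared variable becomes a compile-time error rather than a silent race. The dot_product function and its names are hypothetical.

#include <stdio.h>

// With default(none), leaving any of a, b, n, i, or sum out of a
// clause is rejected by the compiler instead of silently defaulting
// to shared.
double dot_product(const double *a, const double *b, int n)
{
    double sum = 0.0;
    int i;

    #pragma omp parallel for default(none) private(i) shared(a, b, n) reduction(+:sum)
    for (i = 0; i < n; i++)
        sum += a[i] * b[i];

    return sum;
}

int main(void)
{
    double x[4] = {1, 2, 3, 4}, y[4] = {4, 3, 2, 1};
    printf("%f\n", dot_product(x, y, 4));
    return 0;
}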



Another common mistake is the use of uninitialized variables. Remember that private variables do not have initial values upon entering a parallel construct. Use the firstprivate clause to initialize each private copy from the original variable, and the lastprivate clause to copy the value from the sequentially last iteration back to the original variable, but only when necessary, because both add extra overhead.
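A small sketch of both clauses; the names and values here are made up for illustration.

#include <stdio.h>

int main(void)
{
    int i;
    int offset = 100;   // copied into each thread's private 'offset'
    int last = 0;       // receives the value from the final iteration

    // firstprivate initializes each private copy of 'offset' to 100;
    // lastprivate copies 'last' from the sequentially last iteration
    // (i == 9) back to the original variable after the loop.
    #pragma omp parallel for firstprivate(offset) lastprivate(last)
    for (i = 0; i < 10; i++)
        last = i + offset;

    printf("last = %d\n", last);   // prints 109
    return 0;
}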

Still can't find the bug? Maybe you are just working with too much code. Try a binary hunt. Force parallel sections to run serially again with if(0) on the parallel construct, or comment out the pragma altogether. Another method is to force large chunks of a parallel region into critical sections. Pick a region of the code that you think contains the bug and place it within a critical section. Try to find the section of code that suddenly works when it is within a critical section and fails when it is not. Now look at the variables and see if the bug is apparent. If that still doesn't work, try running the entire program serially by setting the Intel C++ Compiler-specific environment variable KMP_LIBRARY=serial.
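The sketch below (hypothetical code) shows both elimination techniques in one place: if(0) to force a suspect region serial without deleting the pragma, and a critical section wrapped around a suspect block.

#include <stdio.h>
#define N 1000

int main(void)
{
    static double results[N];
    int i;

    // Technique 1: if(0) creates a team of one thread. If the bug
    // disappears here, the problem is threading-related.
    #pragma omp parallel for if(0)
    for (i = 0; i < N; i++)
        results[i] = i * 0.5;

    // Technique 2: wrap the suspect block in a critical section.
    // If the code works with the critical section and fails without
    // it, a shared variable inside that block is the likely culprit.
    #pragma omp parallel for
    for (i = 0; i < N; i++)
    {
        #pragma omp critical
        {
            results[i] += 1.0;
        }
    }

    printf("%f\n", results[N - 1]);
    return 0;
}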

If the code is still not working, compile it without /Qopenmp to make sure the serial version works.

The Intel® Thread Checker

Not quite a debugger, not quite a lint tool, the Intel Thread Checker provides very valuable parallel-execution information and debugging hints. Using source-code or binary instrumentation, the Intel Thread Checker monitors OpenMP pragmas, Win32 threading APIs, and all memory accesses in an attempt to identify coding errors. It can find the infrequent errors that never seem to happen during testing but always seem to happen at a customer's site. The important thing to remember when using the tool is to exercise all the code paths while accessing the least amount of memory possible, which speeds up the data-collection process. Usually, a small change to the source code or data set is required to reduce the amount of data processed by the application.

The Intel Thread Checker is a plug-in for the VTune™ Performance Analyzer that is available at Intel® Software Development Products.


Performance

OpenMP threaded application performance is largely dependent upon the following things:

  • The underlying performance of the single-threaded code.
  • CPU utilization, idle threads, and poor load balancing.
  • The percentage of the application that is executed in parallel.
  • The amount of synchronization and communication among the threads.
  • The overhead needed to create, manage, destroy, and synchronize the threads, made worse by the number of single-to-parallel or parallel-to-single transitions called fork-join transitions.
  • Performance limitations of shared resources such as memory, bus bandwidth, and CPU execution units.
  • Memory conflicts caused by shared memory or falsely shared memory.


Threaded code performance primarily boils down to two things: 1) how well the single-threaded version runs, and 2) how well you divide the work among multiple processors with the least amount of overhead. Performance always begins with a well-architected parallel algorithm or application. It should be obvious that parallelizing a bubble sort, even one written in hand-optimized assembly language, is just not a good place to start. Also, keep scalability in mind: a program that runs well only on two CPUs is not as valuable as one that runs well on n CPUs. Remember, with OpenMP the number of threads is chosen by the OpenMP run-time environment, not by you, so programs that work well regardless of the number of threads are highly desirable. Producer/consumer architectures are rarely efficient, because they are designed specifically for two threads.

Once the algorithm is in place, it is time to make sure that the code runs efficiently on the Intel® architecture, and a single-threaded version can be a big help. By turning off the OpenMP compiler option, you can generate a single-threaded version and run it through the usual set of optimizations. A great reference for single-threaded optimizations is The Software Optimization Cookbook, 2nd Edition, by Richard Gerber, which is available everywhere great technical books are sold. Once you are satisfied with the single-threaded performance, it is time to generate the multi-threaded version and start doing some analysis.

First, look at the amount of time spent in the operating system's idle loop. The VTune Performance Analyzer is a great tool to help with the investigation. Idle time can indicate unbalanced loads, lots of blocked synchronization, and serial regions. Fix those issues, and then go back to the VTune Performance Analyzer to look for excessive cache misses and memory issues like false sharing. Solve these basic problems, and you will have a well-optimized parallel program that will run well on Hyper-Threading Technology as well as on multiple physical CPUs.
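To illustrate the false-sharing issue mentioned above, here is a sketch in which each thread accumulates into its own slot of an array. The padding size is an assumption (64-byte cache lines); without the padding, neighboring slots share a cache line and every update invalidates the other threads' cached copies.

#include <stdio.h>
#include <omp.h>

#define MAX_THREADS 64
#define PAD 8   /* 8 doubles = 64 bytes, an assumed cache-line size */

int main(void)
{
    // Only column 0 of each row is used; the rest is padding that
    // keeps each thread's accumulator on its own cache line.
    static double sums[MAX_THREADS][PAD];
    double total = 0.0;
    int i;

    #pragma omp parallel
    {
        int me = omp_get_thread_num();
        #pragma omp for
        for (i = 0; i < 1000000; i++)
            sums[me][0] += i * 0.001;
    }

    for (i = 0; i < MAX_THREADS; i++)
        total += sums[i][0];
    printf("total = %f\n", total);
    return 0;
}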

Sounds easy, almost magical, right? Well, optimization is really a combination of patience, trial and error, and practice. Make little test programs that mimic the way your application uses the computer's resources to get a feel for which things are faster than others. Be sure to try the different scheduling clauses for the parallel sections. If the overhead of a parallel region is large compared to the compute time, you may want to use the if clause, an example of which is shown below, to execute the section serially.

#pragma omp parallel for if(NumBytes > 50)
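A small timing sketch for this kind of experimentation; the workload, array size, and threshold are all assumptions, and omp_get_wtime() is used to compare elapsed times as you swap schedule(static), schedule(dynamic), and schedule(guided).

#include <stdio.h>
#include <omp.h>
#define N 100000

int main(void)
{
    static double results[N];
    int i;
    double start = omp_get_wtime();

    // Swap in different schedule clauses here and compare the times.
    #pragma omp parallel for schedule(guided, 8)
    for (i = 0; i < N; i++)
        results[i] = i * 0.5;

    printf("elapsed: %f seconds\n", omp_get_wtime() - start);

    // The if clause keeps a region serial when the parallel overhead
    // would outweigh the computation; the threshold is an assumption.
    int NumBytes = 10;
    #pragma omp parallel for if(NumBytes > 50)
    for (i = 0; i < N; i++)
        results[i] += 1.0;

    printf("%f\n", results[N - 1]);
    return 0;
}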


Since this series of papers is about OpenMP, a full, detailed approach to performance optimization is outside its scope. Many other papers on the Intel® Developer Zone Web site and elsewhere focus on performance optimization. Some of my favorites are listed in the references section at the end of this paper.

The Thread Profiler

Finally, don't forget about the Thread Profiler. Just like the Intel Thread Checker, it is a plug-in to the VTune Performance Analyzer that paints a graphical picture of the threading performance of an application. It shows the amount of time spent in parallel sections, serial sections, overhead, synchronization, and more. It works by replacing the OpenMP pragmas with performance measuring and monitoring ones to record the data. The Thread Profiler and the VTune Performance Analyzer are available at Intel® Software Development Products.


OpenMP is a flexible and simple set of pragmas, function calls, and environment variables that explicitly instruct the compiler how and where to thread your application. By taking advantage of OpenMP, threaded programming is not that much harder than single-threaded programming. Happy threading!

For additional information on OpenMP, be sure to read the OpenMP specification, available on the OpenMP web site at www.openmp.org.

Related Resources

Other articles in this series:


Getting Started with OpenMP*

More Work-Sharing with OpenMP*