Performance Obstacles for Threading: How do they affect OpenMP code?

by Paul Lindberg


Introduction

Now that multi-core processors are becoming mainstream, developers need to thread their code so it will run in parallel. OpenMP1 can provide a useful way to thread an application. But what should you know about your threaded code’s performance, and what does that mean when using OpenMP? In an earlier paper2, we discovered that all threading methods have a consistent startup cost, but that OpenMP has some performance advantages over typical Windows* threading, due to its use of thread pooling.

Understanding OpenMP performance is essential when deciding how and where to thread your code. This is especially true when making architectural trade-offs and algorithm changes for threading. Before prototyping and measuring OpenMP code in your application, you should know what performance impact OpenMP threading might have in your code.

In this paper, we will look at various aspects of threading with OpenMP, and understand the impact they can have on the performance of your code. We’ll look at some of the more common OpenMP directives. We’ll also look at the runtime costs when you hand-schedule your code, which can be necessary if your loop bodies vary a lot at runtime. This paper will help if you are threading existing serial code (or considering it), modifying and measuring existing parallel code, or creating a new parallel design from scratch. Threading existing code can be the hardest, so we’ll focus on that.


Why thread, and what does that have to do with performance?

You might thread your application for three different reasons. Each requires different measurements of performance.

  1. Doing the same work faster:
    If we have a fixed application workload (e.g., applying an effect to a still photo), we thread to get the work done faster. When measuring this code, we’ll record the execution times and the speedup gained by threading.
  2. Doing more work:
    If an application is extended to do more of the same workload (e.g., updating a larger buffer of pixels) or to add different work to the workload (e.g., adding particle effects to a game pipeline), then we thread to get more work through the application. To measure this code, we’ll record throughput: the “volume” of work completed against overall execution time.
  3. Offsetting the time taken by slow operations:
    If an application has long operations like file loads, then we thread to do these operations in advance, so the results are available as soon as they’re needed. There are several ways to measure this code, but the main measure is a qualitative one: are there any user-perceived delays from these operations? Depending on your workload, this may or may not be easy to measure. Many applications (especially games) are already using threading (and other technologies, like asynchronous disk I/O) to offset the time taken by slow operations.

 

Any combination of these might give you more performance in your code. Decide which of these matters to you. The way you thread and the benchmarks you’ll use will depend on your priorities.

Some code combines these approaches. For example, say you want to move the physics part of a game’s render pipeline into a separate thread. You probably want to do this so your code can run more-complete physics simulations. This code is in the render pipeline so it can act on object data as it flows through the pipeline. If you relax your requirements of consistency and timing, and let the physics subsystem work on a copy of data that’s one frame out of sync, then you have the ability to move physics into a separate thread. This also requires an extra layer of buffering, so the physics operation has a separate piece of data to work with.

A curious thing happens when you make this change; the additional buffering cost is often far less than the original physics calculation, so the rest of the render pipeline runs substantially faster. This tends to drive the game’s frame rate up, but you could also use it to add richer steps into the pipeline.

This kind of change gives a double benefit; in this case, higher frame rate and richer physics simulation.


How do I decide where to thread my code?

When threading code, look for ways to decompose the code so that parts can run independently, and assign those parts to separate threads. We divide the code into separate functions that can run concurrently (functional decomposition), or into pieces that perform the same work independently on separate chunks of data (data or domain decomposition). How do we find these opportunities for threading in code?

 
Amdahl’s Law

Amdahl’s Law describes the theoretical speedup possible for any given code.

For a serial fraction of your code F, you can expect a theoretical speedup on N processors of:

    Speedup = 1 / (F + (1 - F) / N)

If we thread 20% of the code (80% remains serial), we’ll get a maximum speedup on 4 processors of:

    Speedup = 1 / (0.8 + 0.2 / 4) = 1 / 0.85 ≈ 1.18

We can also use Amdahl’s law to predict the upper limit of our speedup (set N to 8). We use it more often to predict our speedup on the typical case of two or four processor cores, where we expect threading to give us a significant benefit.
 

 

The basic rule is to identify the regions of serial code that take the most time, and thread them. We can use Amdahl’s Law3 (see sidebar) to predict the maximum speedup we can get from this split in the code.

This says that you ideally want to turn all your serial code into parallel code. To start, you should try to thread your code as far toward your outermost loop as you can. Often, dependencies within (and between) loop iterations make this difficult. Look for these parts of your code, but don’t be surprised if dependencies keep you from threading there. There are a few algorithms that are naturally easy to thread (the “embarrassingly parallel” problems4), but most code has dependencies that keep you from threading your whole application.

On the other end of the scale, if you have long-running loops with no dependencies at the lowest level in your code, those are easy to thread. We’ll discuss some of the per-thread overheads below, but this approach is only practical when using a thread pool, as OpenMP does. If we had to create new threads every time we wanted to run a low-level loop, the thread creation cost would overwhelm any benefit.

You should start threading at the most-executed parts of your code (hotspots), but OpenMP makes it practical to add threading to all the lowest-level loops in your code. Each may have only a small incremental performance gain, but these may all add up to a significant gain overall.


How do I measure my threaded code?

You should measure any code that you have threaded. There are a few different ways to do this, depending on what you want to measure. In some cases, it works best to install local timing or profiling code; your code might already have a profiling mechanism built-in, or you might want to use the Windows PerformanceCounter API calls. If you want an overall view of your application’s performance, along with detailed measurements of all the hotspots in your code, you should consider a tool like the VTune™ Performance Analyzer. You should have a strategy for measuring original code as a benchmark and measuring subsequent changes in your code.
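
If your code has no profiling mechanism of its own, a minimal timing helper built on the Windows PerformanceCounter calls might look like the sketch below; the helper names (StartTimer, ElapsedMicroseconds) and the DoWork() call in the usage comment are illustrative, not part of any library.

#include <windows.h>
#include <stdio.h>

// Minimal wall-clock timer built on QueryPerformanceCounter().
static LARGE_INTEGER StartTimer()
{
    LARGE_INTEGER start;
    QueryPerformanceCounter(&start);
    return start;
}

static double ElapsedMicroseconds(LARGE_INTEGER start)
{
    LARGE_INTEGER now, freq;
    QueryPerformanceCounter(&now);
    QueryPerformanceFrequency(&freq);
    return (now.QuadPart - start.QuadPart) * 1000000.0 / freq.QuadPart;
}

// Usage: bracket the region you want to measure.
//     LARGE_INTEGER t = StartTimer();
//     DoWork();    // the code being measured
//     printf("DoWork took %.1f us\n", ElapsedMicroseconds(t));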

There are several things to bear in mind when you measure the performance of your threaded application. First, you need to understand the execution times of the serial and parallel parts of the code.

Serial time: Your code is running in a single thread, on a single processor core. Measure it by enclosing the serial code in timing code.

Parallel time: Your code is running on more than one thread, so it can run on more than one processor core at the same time. Measure around the threaded section and record the thread count; see below.

 

Measuring serial code is quite simple, so let’s look at the parallel code.


#pragma omp parallel for
for (int i = 0; i < middleLoopCount; i++) {
    Work(innerLoopCount);
}


Figure 1: Sample parallel loop

In the code in Figure 1, measure parallel time around the block shown (including the OpenMP #pragma). Record the thread count as well, so you can see how well your code speeds up from single-core machines to dual-core machines and beyond. An easy way to record thread count is with the omp_get_max_threads() call. See also the discussion of speedup below.
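
A minimal sketch of that measurement, assuming the Work() function and loop counts from Figure 1, and using the portable omp_get_wtime() timer (the PerformanceCounter helper shown earlier would work equally well):

#include <omp.h>
#include <stdio.h>

void Work(int innerLoopCount);   // the loop body from Figure 1

void MeasureParallelLoop(int middleLoopCount, int innerLoopCount)
{
    double start = omp_get_wtime();        // time the whole block,
                                           // including the #pragma
    #pragma omp parallel for
    for (int i = 0; i < middleLoopCount; i++) {
        Work(innerLoopCount);
    }
    double elapsed = omp_get_wtime() - start;

    // Record how many threads OpenMP could use for this region.
    printf("parallel time %.6f s, max threads %d\n",
           elapsed, omp_get_max_threads());
}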

It is rare to add significant threading to an application without some big transformations in the code. Sometimes, whole algorithms change. Often, loops contain data dependencies. To get around dependencies so you can thread, you might need to buffer extra copies of your data for each thread, and add synchronization code to reconcile any differences.
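
As a hedged sketch of that kind of transformation, the loop below fills a per-thread copy of a histogram (a hypothetical workload, not from this paper) so the hot loop shares no data, then reconciles the copies in a short synchronized step:

#include <omp.h>
#include <vector>

void BuildHistogram(const std::vector<int>& values, int bins,
                    std::vector<int>& histogram)
{
    histogram.assign(bins, 0);

    #pragma omp parallel
    {
        std::vector<int> localHist(bins, 0);     // extra per-thread copy

        #pragma omp for nowait
        for (int i = 0; i < (int)values.size(); i++) {
            localHist[values[i] % bins]++;       // no shared writes in the hot loop
        }

        #pragma omp critical
        for (int b = 0; b < bins; b++) {
            histogram[b] += localHist[b];        // reconcile the copies
        }
    }
}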

This kind of large-scale change might speed up or slow down your code, depending how you do it. Be careful when considering this approach.

Another temptation while transforming code is to make other changes, like refactoring. This is generally a good idea, but it can make it harder to measure performance changes. As a rule, try to separate these activities from each other. Refactor in one pass, then thread in another (or vice versa).

Measuring this kind of transformation can be tricky. It is hard to measure individual code blocks because you’re actively changing them. The best way to measure transformed code is to have an application level benchmark that you can measure before, during, and after changes.


What overhead does threading add to my code?

Before you can really understand your code’s performance, you’ll want to think about several other factors.

First let’s look at overhead. When threading, we always add some overhead; we add fixed startup and per-thread overhead. In some cases (typically small loops or low iteration counts), those overhead numbers are high enough that it doesn’t make sense to thread that code. In those cases, we’re better off leaving the code in its original serial form.

The following are some overhead factors in your threaded code. Most of the time, we can ignore these factors in OpenMP code, but below the table we’ll explore the cases when they can’t be ignored.

Factor: Thread library startup overhead
Description: One-time overhead when your code starts. Not significant for most code; it is bundled with the rest of application startup.
How to detect/measure: Falls outside typical in-code measurements; compare a serial run of the application with a threaded run.
What to do about it: Nothing; it is not tunable by the developer.

Factor: Thread startup overhead
Description: Time to create threads. A one-time cost in OpenMP due to thread pooling.
How to detect/measure: Measure multiple passes over the threaded code; see below.
What to do about it: Thread higher and/or larger loops where possible, to offset this overhead.

Factor: Per-thread (loop scheduling) overhead
Description: Time spent by the threading library scheduling chunks of work on each thread.
How to detect/measure: Measure only in very performance-sensitive code or unusual scheduling code; see below.
What to do about it: Adjust the schedule in your code; see below.

Factor: Lock management overhead
Description: Time spent managing locks on critical sections. Sometimes used when benchmarking OpenMP implementations against each other5.
How to detect/measure: Watch for your lock calls showing up as frequently called, using a tool like the VTune Performance Analyzer. Most of the time, lock-heavy code has much larger problems with blocking; see below.
What to do about it: Reduce blocking to reduce lock contention and management; see below for alternatives when you have low contention.


Figure 2: Overhead factors in OpenMP

 


Let’s look at where those various overheads happen.


Figure 3: Thread overhead locations

Measure these overheads where shown in Figure 3. Per-thread and library startup overheads are easy to measure, but it takes a couple of tests to measure thread startup overhead. First, measure the time to run exactly one pass through the OpenMP parallel for loop. Then, measure the time to run two passes. Because OpenMP uses thread pooling, the first pass pays all the thread startup overhead, so two passes should take much less than 2x the time of the first pass. The incremental time taken by the second pass is the “native” time for one pass through your loop; subtract it from the time of the first pass to find the thread startup overhead.
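
One way to realize that measurement in a single run is to time the first pass and the second pass separately; the pool already exists on the second pass, so the difference between the two approximates the thread startup overhead. A minimal sketch, assuming the Work() function and loop counts from the sample loop:

#include <omp.h>
#include <stdio.h>

void Work(int innerLoopCount);   // the loop body from the sample code

void MeasureThreadStartup(int middleLoopCount, int innerLoopCount)
{
    // First pass: the OpenMP runtime creates its thread pool here.
    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (int i = 0; i < middleLoopCount; i++) {
        Work(innerLoopCount);
    }
    double firstPass = omp_get_wtime() - t0;

    // Second pass: the pool already exists, so this is the "native" time.
    double t1 = omp_get_wtime();
    #pragma omp parallel for
    for (int i = 0; i < middleLoopCount; i++) {
        Work(innerLoopCount);
    }
    double secondPass = omp_get_wtime() - t1;

    printf("approx. thread startup overhead: %.1f us\n",
           (firstPass - secondPass) * 1.0e6);
}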

Earlier work2 found that thread startup overhead averages around 170-190 microseconds (when creating two threads on a dual-core machine). The good news is that most OpenMP implementations (including those in the latest Intel and Microsoft compilers) use thread pooling. Your code will only pay the thread startup overhead once, the first time threads are used. This makes thread startup overhead very low for most threads and calling patterns.

This is one distinct advantage of OpenMP over typical Windows threading; the OpenMP runtime automatically does thread pooling for you whenever possible. Since typical Windows threading does not use thread pooling, each thread encounters the thread startup overhead. This can be prohibitive for many applications. If you choose Windows threading over OpenMP, you can use Windows threads with a thread pool to avoid this overhead, but that is more complex.

The OpenMP library detects the best number of threads for your code on any given machine. This is another advantage of OpenMP over typical Windows threading.

There are some indications that OpenMP, as implemented in the Intel C++ Compiler, manages locks more efficiently than Windows native threads. This is good news, but probably doesn’t have a big impact on your code. While it can be a good idea to make sure your lock code isn’t in the “hotspot” list in your code (using a tool like VTune Performance Analyzer), it’s rarely an issue. This is because locks are often used where there really is some contention for a resource, so code must block while waiting for that resource. In these cases, blocking is a much more important consideration, so there’s little need to measure lock management overhead.

In typical loop code (moderately large iteration counts or moderately long sets of computations), none of these overheads have much effect. Read on for a discussion of blocking, typically a much more important consideration.


What about blocking?

Blocking is often the biggest factor keeping threaded applications from performing their best. It is a matter of both design and implementation to address blocking. We should design in the lowest practical amount of blocking. We should also test and tune our implementation, to make sure that our code is blocking only where and when we expect.

Ideally, code would spend no time blocking, but it’s rare to eliminate it.

You can get a rough sense of how much blocking will slow your code by looking at the ratio of critical-section code to the rest of your code. Large critical sections and a high ratio are both signs of trouble.

To see and measure where your code is blocking, how much, and why, it is essential to use a tool that can isolate blocking, like the Intel Thread Profiler.

Sometimes, you have locks around resources that are rarely contended (e.g. reference counts). If you see that locking is taking some significant slice of time and your code is not blocking, consider an optimization like the double-checked locking optimization6. This technique lets you use a lock-free algorithm most of the time, doing away with any significant lock management overhead.


How does your code speed up?

Every time you thread a region of your code, that region has a “native” speedup: The amount that this region could speed up under ideal conditions.

A nice feature of OpenMP is that the runtime will automatically create the best number of threads for your current machine. If you run your code on a single-core processor, it will run all your code on one thread. If you run the same code on a dual-core processor, it will run your code on two threads. This means your code automatically scales on its current machine as best it can, for the best speedup possible.

Consider again the code fragment we looked at above:

#pragma omp parallel for
for (int i = 0; i < middleLoopCount; i++) {
    Work(innerLoopCount);
}


Figure 4: Sample parallel loop

 

What kind of speedup could this loop have - how much faster could this run with an infinite number of threads? Here we have a loop that has ideal possible speedup, limited only by its loop count. We could use any number of threads up to the loop count, middleLoopCount. If we had that many threads, each could run one iteration of the for loop. Say we always call this loop with 1,000 as the middleLoopCount. This code looks like it could run 1,000 times faster, if given 1,000 threads that could run in parallel.

Could it really run that fast, even in ideal cases? We have to look deeper to know. In this case, it depends on what’s in the Work() function. If there is no blocking code in that function, then we still have ideal possible speedup, and this loop turns out to be a great place to thread. What if Work() contains only a large critical section, updating a shared array? Then we have bad possible speedup, and this loop isn’t a good candidate for threading.

In practice, even in cases where code has perfect possible speedup, developers will sometimes artificially limit speedup. This can be done by calling omp_set_num_threads(). This call will set the upper limit of the threads that OpenMP can create. You might wish to set the maximum thread count at the limit of the hardware on which you’ve tested. If you tested your code on machines with up to quad-core processors, then you might want to force it to run on no more than four threads because you have not tested above that point. If you wish to do this in your code, consider making it configurable. When newer hardware with more cores becomes available, you can then quickly test your code and turn on the greater speedup via a switch or new configuration option.
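
A minimal sketch of such a configurable cap; ReadConfiguredThreadCap() is a hypothetical helper standing in for your own config file, registry, or command-line handling:

#include <omp.h>

// Hypothetical helper: returns the highest thread count you have tested.
// Returning 4 here reflects testing on up to quad-core machines.
int ReadConfiguredThreadCap()
{
    return 4;
}

void ApplyThreadCap()
{
    int cap = ReadConfiguredThreadCap();
    int procs = omp_get_num_procs();             // cores visible to OpenMP
    omp_set_num_threads(cap < procs ? cap : procs);
}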

So what does this mean? On today’s hardware, we can run a few threads in parallel (two on a dual-core processor, four on a quad-core processor). We don’t want to create any more threads than we can use; all we do then is spend more time creating threads. If we understand the possible speedup of the various threaded regions in our code, we have a sense of the possible speedup of the whole application. A few small loops with great speedup don’t add up to very much benefit. A main loop with even 1.5x speedup provides a great benefit to your code. Do this exercise to understand your code’s capacity to speed up.

Here’s a different example. We functionally decomposed the code into two functions, each in its own section. (In this case, we artificially split a loop into two subsets using the section directive, so they don’t look very different from each other.)

#pragma omp parallel sections
{
    #pragma omp section
    for (int i = 0; i < chunkSize; i++) {
        Work(innerLoopCount);
    }

    #pragma omp section
    for (int i = chunkSize; i < middleLoopCount; i++) {
        Work(innerLoopCount);
    }
}


Figure 5: Sample parallel code with different functions

 

How this will speed up depends on the ability of the Work() function to speed up, as before, but also on how well we have balanced the work between the two functions. In the ideal case, this code can reach 2x speedup because it has been split into two subsets.

When choosing between the two threading techniques for this code, the answer is clear: thread it the first way (as shown in Figure 4) and get the better possible speedup. To understand the effect on more typical code, we need to understand the relative runtimes of the separate sections. In all cases, this code will be limited to (at best) 2x speedup, because it has been split into two pieces and can run on at most 2 threads.

Understanding the possible speedup of your code isn’t always so easy. What about running code that has variable-length loop bodies? You might face this when processing animations on all objects in a scene in a game. There may be a varied number of animations queued for each object, so it’s hard to predict how long each object will be processed.

OpenMP has scheduling options for loops like these, which let us manage the variability we see. See the next section for more on those options.

You can get a graphical idea of how well your code is speeding up by running it under Intel Thread Profiler.


Scheduling your own loops

In the previous section, we described how variable-duration loop bodies can keep your loops from getting good speedup. How can you tell if you have this problem? You probably already know where in your code this might happen, but you can also observe it with the Intel Thread Profiler. This will give you a picture of how often your code is running with the maximum number of threads. We want to get the threads as balanced as possible, so each thread runs about the same amount of work. This will let us approach the ideal speedup of the code.

Variable-duration loop bodies lead to unbalanced code. It’s very hard to predict the runtime behavior of this kind of code. Even worse, the runtime behavior will vary, so we can’t come up with a static solution.

By default, OpenMP schedules loops by splitting them into equal chunks and running each on its own thread. The default chunks are big enough that the loop finishes once all threads have made one pass through their chunks.

Luckily, OpenMP gives us some scheduling options. By adding schedule(static[, chunk_size]), schedule(dynamic[, chunk_size]), or schedule(guided[, chunk_size]) to the OpenMP directive, we can control how the loop iterations get scheduled onto threads. Experiment to find the right scheduling method for your code.

Code with varying loop bodies, like the animation example above, will probably perform best with a dynamic schedule. Experiment to find the right chunk_size for your code. In some cases, you might get the best performance if you can move the longest-running cases to the start of the loop.
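
As a sketch of that idea using the animation example: sort the objects with the most queued animations to the front, then let a dynamic schedule hand out small chunks as threads free up. The AnimatedObject type, the ProcessAnimations() function, and the use of the queued-animation count as a cost estimate are all hypothetical stand-ins for your own code.

#include <omp.h>
#include <vector>
#include <algorithm>

struct AnimatedObject
{
    int queuedAnimations;    // crude cost estimate: more animations, more work
    // ... other object state ...
};

void ProcessAnimations(AnimatedObject& obj);    // per-object work

bool LongerFirst(const AnimatedObject& a, const AnimatedObject& b)
{
    return a.queuedAnimations > b.queuedAnimations;
}

void UpdateScene(std::vector<AnimatedObject>& objects)
{
    // Put the most expensive objects first so they start early...
    std::sort(objects.begin(), objects.end(), LongerFirst);

    // ...and let a dynamic schedule assign chunks as threads become free.
    // The chunk size of 4 is only a starting point; measure to tune it.
    #pragma omp parallel for schedule(dynamic, 4)
    for (int i = 0; i < (int)objects.size(); i++) {
        ProcessAnimations(objects[i]);
    }
}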

Looking again at our sample parallel loop, we can schedule this a few different ways.

#pragma omp parallel for schedule(static, chunkSize)
for (int i = 0; i < middleLoopCount; i++) {
    Work(innerLoopCount);
}


Figure 6: Static schedule

 

Here, we ask for the work to be issued in chunks of chunkSize iterations, dealt out to the threads in round-robin order; each thread works through its assigned chunks in turn. If we don’t add the schedule clause, OpenMP’s default is a static schedule with a chunk size that divides the loop iterations evenly among the threads.

We can also make more-flexible scheduling decisions at runtime.

#pragma omp parallel for schedule(dynamic, chunkSize)
for (int i = 0; i < middleLoopCount; i++) {
    Work(innerLoopCount);
}


Figure 7: Dynamic schedule

 

In the dynamic case, whenever a thread needs more work, it gets the next chunk of iterations. This does not go in round-robin order like the static schedule does.

The compiler may be able to make some optimizations to the static case that it can’t make in the dynamic case.

Last, we can use a hybrid technique called a guided schedule. Here, each thread gets a big chunk at first and increasingly smaller chunks on later requests, down to the lower limit given by chunkSize. Each chunk size is figured dynamically when needed; it is like a dynamic schedule with a decreasing chunk size.

#pragma omp parallel for schedule(guided, chunkSize)
for (int i = 0; i < middleLoopCount; i++) {
    Work(innerLoopCount);
}


Figure 8: Guided schedule

 

Now, how do these methods compare?


Figure 9: Various scheduling methods on the same balanced code (inner loop count 1000, middle loop count 1000)

Which scheduling technique is best? It depends. This code is well balanced; you can run the same comparison in your unbalanced code to find the right schedule for your code.

Let’s understand the differences between these techniques. First, the chunkSize argument means something different to each schedule. For static schedules, chunks of that size are dealt out to the threads in round-robin order by thread number. For dynamic schedules, chunks are issued to threads as they need them, which makes more sense for unbalanced code. For guided schedules, each thread gets a large chunk at first, and the chunks get smaller on subsequent requests, down to a lower limit of chunkSize. In all cases, returning to the thread-scheduling code takes some extra overhead, which we trade off for flexibility. We only change the schedule from the default when some other issue (like thread imbalance) slows down the code; the small extra overhead is worth paying if it recovers more time than the imbalance was costing. In all the measurements here, the loop has 1,000 iterations split across two threads.

Remember that we’re looking at behavior on 2 threads. Now, what do these numbers tell us?

There’s little difference in the static run times. Smaller chunks take a little longer to run, which is expected because we run through the scheduling code more times. The last static measurement in the figure is OpenMP’s default scheduling method, whose chunk size is figured by dividing the iteration count by the number of threads; in this case, the chunk size is 500. You should always measure this case as a baseline when comparing schedules in your code.

In fact, the difference between chunks of size 1 and 10 (about 890 microseconds) tells us the static scheduling overhead, or the cost of setting up and starting each chunk. If we schedule chunks of 1, each thread must set up and start 500 chunks; for chunks of 10, each thread sets up and starts 50. Comparing the two, we encounter this overhead an extra 450 times in those 890 microseconds, so the scheduling overhead is just under 2 microseconds per chunk. This is not too bad, but you should still consider using a static schedule with slightly bigger chunks.

Dynamic schedules give the most flexibility, but take the biggest performance hit when scheduled wrong. For this well-balanced code, scheduling chunks of size 1 is clearly the wrong choice, but chunks of 10 aren’t bad. We can also figure the overhead of dynamic scheduling; there is a difference of about 8,700 microseconds between chunk sizes 1 and 10. The extra scheduling flexibility takes about 19 microseconds per chunk. This can become significant, so avoid very small chunk sizes unless your code is terribly unbalanced or you have very long-running loop bodies. For your code, the right chunk size would be a tradeoff between running fewer chunks and achieving better thread balance. Try various values and measure which runs fastest in your code.

Guided schedules work best with small chunk sizes as their lower limit; this gives the most flexibility. It’s not clear why, but they take too long when limited to large chunk sizes. The largest chunk a guided schedule issues here is one-half of the total loop size (the first chunk on two threads); for this well-balanced code, that is about equivalent to the static case of splitting the loop into equal parts for one pass.

One way to measure the effectiveness of your schedule policy is to run your imbalanced threads with the default schedule, run with various hand-coded schedules, and compare the speedups given by the schedules.
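
One convenient way to run those comparisons without recompiling is the standard schedule(runtime) clause, which takes the schedule from the OMP_SCHEDULE environment variable. Set OMP_SCHEDULE to, for example, "static,10", "dynamic,10", or "guided,10" before each timed run; the loop below assumes the Work() function and loop counts from the sample code.

#pragma omp parallel for schedule(runtime)
for (int i = 0; i < middleLoopCount; i++) {
    Work(innerLoopCount);
}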

When measuring schedule policies, make sure you measure them on as wide a variety of machines as possible. Take particular care to compare your schedule on single-core processors to the same schedule on dual-core processors. Your ideal schedule will behave well on both.


Protecting shared variables

In all threading environments, you need to make sure that multiple threads don’t access shared variables at the same time.

OpenMP provides the critical directive, which you use to define a critical section, with the guarantee that only one thread at a time will execute that code. This is the right solution for the general case, where you have some region of code that accesses a sequence of shared variables.

If you’re using a critical section to protect an increment or decrement of a single variable, there are better ways to do it. You can use the atomic directive and get better runtime in most cases. However, there’s an even better way.

Windows implements a set of Interlocked functions7 for this purpose. These functions are implemented by underlying atomic hardware instructions in the processor, so no application-level locking is required.

  • InterlockedIncrement()
  • InterlockedDecrement()
  • InterlockedExchange()
  • InterlockedExchangeAdd()
  • InterlockedCompareExchange()

 

When you can, you should use these functions instead of the OpenMP atomic or critical directives, for the fastest updates to your shared variables.
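
A small sketch comparing the three approaches on a single shared counter; the counter itself is only an illustration, and these functions are meant to be called from threads inside a parallel region.

#include <windows.h>
#include <omp.h>

volatile LONG sharedCount = 0;    // the shared variable being protected

// General case: a critical section protects any block of code.
void IncrementWithCritical()
{
    #pragma omp critical
    sharedCount++;
}

// Single-variable update: the atomic directive is usually cheaper.
void IncrementWithAtomic()
{
    #pragma omp atomic
    sharedCount++;
}

// Fastest for a simple counter: the hardware-backed Interlocked call.
void IncrementWithInterlocked()
{
    InterlockedIncrement(&sharedCount);
}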


Conclusion

This paper summarizes some of the reasons you might thread for performance, and helps you decide on ways to thread and measure your code. We show that OpenMP overheads are low, and how you can get good speedup for most code by using OpenMP.

The paper also discusses some techniques for finding the possible speedups in your code. Finally, the paper closes with some comparisons of various OpenMP thread scheduling methods and tips for protecting shared variables.

OpenMP provides an easy way to thread many applications, with relatively low overhead. Go forth and thread!


Configuration

All tests were run on a PC with a Pentium® D processor running at 3.4 GHz, with 1 GB RAM.

The OS was Microsoft Windows XP Professional SP2*, with various services and power management turned off to promote consistent benchmarking.

All code was developed under Microsoft Visual Studio .NET 2003*, and compiled with the Intel® C++ Compiler 9.0 for Windows.


References

 

