More Work-Sharing with OpenMP*

by Richard Gerber


Abstract

As you know, OpenMP* contains a very powerful set of pragmas that help you parallelize a loop. What you may not know is that OpenMP can be used to thread more than just loops. When the "parallel for" construct falls a little short, OpenMP has additional pragmas, constructs, and function calls that come to the rescue.

This is the second in a series of three white papers that teach you, an experienced C/C++ programmer, how to get started using OpenMP, simplifying the creation, synchronization, and deletion of threads in your applications. The first paper introduced you to the most common feature of OpenMP: work sharing for loops. This second paper teaches you how to exploit non-loop parallelism and some of the other general OpenMP features. The final paper discusses the OpenMP runtime library and how to debug your application when things go wrong.


The Parallel Construct

In the first white paper in this series, "Getting Started with OpenMP*," you saw how the parallel for directive could be used to split the iterations of a loop across multiple threads. When OpenMP encounters the parallel for directive, it creates a team of threads and divides the iterations of the loop among them. At the end of the parallel region, the threads are suspended and wait for the next parallel section. This creating and suspending of threads consumes some overhead and may be unnecessary when two loops are adjacent, as shown in the following example.

#pragma omp parallel for
for (i=0; i<x; i++)
    fn1();

#pragma omp parallel for   // adds unnecessary overhead
for (i=0; i<y; i++)
    fn2();

 

The overhead can be avoided by entering a parallel section once, then dividing the work within the parallel section. The following code is functionally identical to the preceding code but runs faster, because the overhead of entering a parallel section is incurred only once.

#pragma omp parallel
{
    #pragma omp for
    for (i=0; i<x; i++)
        fn1();

    #pragma omp for
    for (i=0; i<y; i++)
        fn2();
}

 

Ideally, all of the performance-critical portions of the application, also known as the hotspots, would be executed within a parallel section. Since very few programs are made up of just loops, additional constructs are used to handle non-loop code.


Sections

The sections construct tells OpenMP to divide the identified sections of your application across the multiple threads. The following example uses work-sharing for loops and sections:

#pragma omp parallel
{
    #pragma omp for
    for (i=0; i<x; i++)
        fn1();

    #pragma omp sections
    {
        #pragma omp section
        {
            TaskA();
        }
        #pragma omp section
        {
            TaskB();
        }
        #pragma omp section
        {
            TaskC();
        }
    }
}

 

Here, OpenMP first creates a team of threads and divides the iterations of the loop among them. When the loop is finished, the sections are divided among the threads so that each section is executed exactly once, in parallel with the others. If the program contains more sections than threads, the remaining sections are scheduled as threads finish their previous sections. Unlike loop scheduling, however, you have no say in how, when, or in what order the threads execute the sections; OpenMP controls that entirely. You can still control which variables are shared or private, using the private and reduction clauses in the same fashion as the loop construct, as in the sketch below.
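For example, the following sketch combines the sections construct with a reduction clause so that each section can add to a shared total safely. The task functions CountA and CountB are placeholders invented for this illustration, not part of the original example.

#include <stdio.h>

/* Placeholder tasks that each return a partial count. */
static int CountA(void) { return 10; }
static int CountB(void) { return 20; }

int main(void)
{
    int total = 0;

    /* Each section executes exactly once, on whichever thread OpenMP
       chooses; the private copies of total are combined at the end. */
    #pragma omp parallel sections reduction(+: total)
    {
        #pragma omp section
        total += CountA();

        #pragma omp section
        total += CountB();
    }

    printf("total = %d\n", total);  /* always prints 30 */
    return 0;
}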


Barriers and Nowait

Barriers are the form of synchronization OpenMP uses to make threads wait for each other: every thread stops at the barrier until all the threads in the parallel section have reached the same point. You have been using implied barriers without realizing it in the work-sharing for construct. At the end of the loop, an implicit barrier causes execution to wait for all threads to finish the loop before any go on to execute additional work. This barrier can be removed with the nowait clause, as shown in the following code sample.

#pragma omp parallel
{
    #pragma omp for nowait
    for (i=0; i<100; i++)
        compute_something();

    #pragma omp for
    for (j=0; j<500; j++)
        more_computations();
}

 

In the previous example, the threads that process the first loop continue immediately to the second loop without waiting for all threads to finish the first loop. Depending upon your situation, this behavior may be very beneficial, because it can reduce the amount of time that threads are idle. The nowait clause can also be used with the sections construct to remove its implicit barrier.
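As a rough sketch, nowait on a sections construct looks like the fragment below. The function and variable names are placeholders assumed to be declared elsewhere, and the pattern is only safe because the loop that follows does not depend on anything the sections produce.

#pragma omp parallel
{
    #pragma omp sections nowait
    {
        #pragma omp section
        init_table_a();     /* placeholder setup task */

        #pragma omp section
        init_table_b();     /* placeholder setup task */
    }

    /* No barrier after the sections: a thread that finishes its
       section early starts on this loop immediately. */
    #pragma omp for
    for (i=0; i<n; i++)
        independent_work(i);
}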

The reverse, explicitly adding a barrier, is also supported, as shown below.

#pragma omp parallel
{
    // A bunch of work...

    #pragma omp barrier

    // Additional work...
}

 

This construct is quite useful when all threads need to finish a task before any more work can be completed, as would be the case in updating a frame buffer before displaying it.
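A sketch of that frame-buffer case might look like the following. Here render_rows and display stand in for the application's own routines and are not part of the original example.

#include <omp.h>

void render_rows(unsigned char *buf, int height, int id, int nthreads);
void display(const unsigned char *buf);

void show_frame(unsigned char *frame_buffer, int height)
{
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        int nthreads = omp_get_num_threads();

        /* Each thread renders its own band of rows. */
        render_rows(frame_buffer, height, id, nthreads);

        /* No thread may display the frame until every band is finished. */
        #pragma omp barrier

        /* One thread presents the finished frame; the single construct
           has its own implicit barrier, so the others wait for it. */
        #pragma omp single
        display(frame_buffer);
    }
}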


Master and Single

Unfortunately, applications are not made up only of loops and parallel sections. Within a parallel region, especially one you have made as large as possible to reduce overhead, you will almost certainly need to execute some piece of code only once, by only one thread. For this, OpenMP provides the single construct: the enclosed code is executed exactly once, by whichever thread OpenMP chooses. When needed, the master construct specifies instead that only the master thread executes the code. Below is an example.

#pragma omp parallel
{
    do_multiple_times();    // every thread calls this function

    #pragma omp for
    for (i=0; i<100; i++)   // this loop is divided among the threads
        fn1();
    // implicit barrier at the end of the above loop
    // will cause all threads to synchronize here

    #pragma omp master
    fn_for_master_only();   // only the master thread calls this function

    #pragma omp for nowait
    for (i=0; i<100; i++)   // again, loop is divided among threads
        fn2();
    // The above loop does not have a barrier, and
    // threads will not wait for each other. One thread,
    // hopefully the first one done with the above loop,
    // will continue and execute the code below.

    #pragma omp single
    one_time_any_thread();  // any thread will execute this
}

 

Atomic Operations

When executing code in parallel, shared-memory synchronization is very likely to be needed. Atomic operations are, by definition, guaranteed not to be interrupted, and they are useful for statements that update a shared memory location, avoiding some race conditions. In the line of code below, the programmer has determined that the update of a[i] must not be interrupted partway through.

a[i] += x; // may be interrupted half-complete 

 

While execution of an individual assembly instruction is never interrupted, statements in high-level languages such as C/C++ may translate into multiple assembly instructions, making interruption possible. In the above example, the value of a[i] could change between the instructions that read the value, add the variable x, and write the result back to memory. The following OpenMP construct ensures the statement is executed atomically, without the possibility of interruption.

#pragma omp atomic
a[i] += x;   // never interrupted because defined atomic

 

The statement following an atomic pragma must take one of the following basic forms, using non-overloaded operators.

expr1++           ++expr1
expr1--           --expr1
expr1 += expr2    expr1 -= expr2
expr1 *= expr2    expr1 /= expr2
expr1 <<= expr2   expr1 >>= expr2
expr1 &= expr2    expr1 ^= expr2
expr1 |= expr2

 

OpenMP selects the most efficient method to implement the statement, given operating-system features and hardware capabilities.
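A common use is a shared histogram update, sketched below with made-up array sizes. Many threads may increment the same bin, and the atomic pragma makes each increment indivisible without paying for a full critical section.

#include <stdio.h>

#define N    100000
#define BINS 256

int main(void)
{
    static unsigned char data[N];   /* assume this is filled elsewhere */
    int hist[BINS] = {0};
    int i;

    #pragma omp parallel for
    for (i=0; i<N; i++)
    {
        #pragma omp atomic
        hist[data[i]]++;            /* indivisible read-modify-write */
    }

    printf("bin 0 holds %d samples\n", hist[0]);
    return 0;
}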


Critical Sections

A critical section protects a block of code against simultaneous access by multiple threads. When a thread encounters a critical section, it enters only when no other thread is inside any critical section, whether that one or another. The following example uses an unnamed critical section.

#pragma omp critical
{
    if (max < new_value)
        max = new_value;
}

 

Global, or unnamed, critical sections may unnecessarily impact performance because every thread is effectively competing for the same global critical section. For that reason, OpenMP has named critical sections. Named critical sections allow for more fine-grained synchronization so only the threads that need to block on a particular section will block. The following example improves on the previous example.

#pragma omp critical(maxvalue)
{
    if (max < new_value)
        max = new_value;
}

 

With named critical sections, applications can have multiple critical sections, and threads can be in more than one critical section at a time. It is important to remember that entering nested critical sections runs the risk of deadlock, which OpenMP does not detect. When using multiple critical sections, be extra careful to examine those that may be "hiding" in subroutines.
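As an illustration, two unrelated shared resources can each get their own named critical section, so a thread writing to one never blocks a thread updating the other. The log file and counter below are invented for this sketch.

#include <omp.h>
#include <stdio.h>

int main(void)
{
    FILE *log = fopen("run.log", "w");
    int completed = 0;

    #pragma omp parallel
    {
        /* Guard the log file with its own named critical section... */
        #pragma omp critical(logfile)
        fprintf(log, "thread %d starting\n", omp_get_thread_num());

        /* ...and the shared counter with a different one. */
        #pragma omp critical(stats)
        completed++;
    }

    fprintf(log, "%d threads completed\n", completed);
    fclose(log);
    return 0;
}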

Firstprivate and Lastprivate

Adding to your toolbox of OpenMP pragmas and clauses are firstprivate and lastprivate. These clauses copy values between the "master" copy of a variable and the private temporary copies that each thread uses, and vice versa.

The firstprivate clause tells OpenMP to initialize each private copy with the value of the master variable. Normally, private temporary variables start with an undefined value, which saves the performance overhead of the copy. The lastprivate clause copies the value computed in the sequentially-last iteration back to the master variable before the private copies are destroyed. A variable can be declared both firstprivate and lastprivate at the same time.
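A minimal sketch of lastprivate, with variable names invented for illustration, shows the copy-back: after the loop, last_i holds the value assigned by the sequentially-last iteration.

#include <stdio.h>

int main(void)
{
    int i, last_i = -1;

    #pragma omp parallel for lastprivate(last_i)
    for (i=0; i<100; i++)
        last_i = i;        /* each thread writes its private copy */

    printf("last_i = %d\n", last_i);   /* prints 99 */
    return 0;
}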

These clauses can usually be avoided with a small code change. In certain circumstances, however, they are quite useful. For example, the following code converts a color image to black and white.

for (row=0; row<height; row++)
{
    for (col=0; col<width; col++)
    {
        pGray[col] = (BYTE) (pRGB[col].red   * 0.299 +
                             pRGB[col].green * 0.587 +
                             pRGB[col].blue  * 0.114);
    }
    pGray += GrayStride;
    pRGB  += RGBStride;
}

 

But when the outer loop is threaded, each thread begins at some arbitrary row, so how do we move the pointers pGray and pRGB to the correct place within the bitmap? We could perform the address calculation for every iteration with the following code:

pDestLoc = pGray + col + row * GrayStride;
pSrcLoc  = pRGB  + col + row * RGBStride;

 

That code, however, executes a lot of extra math on each pixel. Instead, the firstprivate clause can be used as a kind of one-time, per-thread initialization, as shown below:

BOOL FirstTime = TRUE;

#pragma omp parallel for private(col) firstprivate(FirstTime, pGray, pRGB)
for (row=0; row<height; row++)
{
    if (FirstTime == TRUE)
    {
        // Executed once per thread: advance this thread's private
        // pointer copies to the first row it has been assigned.
        FirstTime = FALSE;
        pRGB  += (row * RGBStride);
        pGray += (row * GrayStride);
    }
    for (col=0; col<width; col++)
    {
        pGray[col] = (BYTE) (pRGB[col].red   * 0.299 +
                             pRGB[col].green * 0.587 +
                             pRGB[col].blue  * 0.114);
    }
    pGray += GrayStride;
    pRGB  += RGBStride;
}

 


A Tricky Example

Question: Describe the differences in execution between the following two loops.

Loop 1: (parallel for)

int i;
#pragma omp parallel for
for (i='a'; i<='z'; i++)
    printf("%c", i);

 

 

Loop 2: (just parallel)

int i;
#pragma omp parallel
for (i='a'; i<='z'; i++)
    printf("%c", i);

 

Answer: The first loop prints each letter of the alphabet exactly once, although not necessarily in order, because OpenMP divides the iterations among the threads. The second loop prints the alphabet an unpredictable number of times, because every thread executes the entire loop while racing on the shared variable i. It would print the alphabet once per thread if the variable i were declared private.
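That once-per-thread variant would look something like the following sketch:

int i;
#pragma omp parallel private(i)
for (i='a'; i<='z'; i++)
    printf("%c", i);   // each thread runs the whole loop with its own i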


Summary

OpenMP has a rich set of features that are designed to simplify the programming effort required to thread your applications. A combination of the techniques shown in this white paper and those from the first white paper in this series will parallelize most algorithms.


References

OpenMP Specification: http://www.openmp.org*


