How to sound like a Parallel Programming Expert Part 4: Writing parallel software


Introduction

The software world has been quite busy with parallel computing. On the MPP front, we developed a number of message-passing APIs in the mid-1990s, eventually converging on a single standard (MPI). For shared-memory multiprocessor computing, a similar process unfolded, with convergence around two standards:

  • pthreads for explicit control over threads and
  • OpenMP for directive-driven multithreading.

And while these standards-based approaches were becoming entrenched, every imaginable flavor of parallel programming was tried: data parallel (HPF), functional (SISAL), concurrent logic programming (Parlog), object-oriented programming (POOMA), and many others. In each case, the high-level goal was the same … to make parallel programming easier and less error prone.

Regardless of the parallel language, the basic idea behind parallel software is the same. We can think of it in terms of a consistent four-step program:

  1. Find the concurrency in your problem.
  2. Expose the concurrency, i.e. restructure code to make potentially concurrent code truly concurrent.
  3. Exploit the concurrency by adding parallel constructs to your program using a parallel programming notation.
  4. Execute the program in parallel and tune.

Books have been dedicated to these topics, and we can't begin to cover the full range of issues pertaining to parallel programming in a brief note. Instead, we will introduce a few key constructs from one of the simpler parallel programming notations (OpenMP) and then provide a couple of examples to make our four-step program clear.

OpenMP

OpenMP is a particularly simple parallel programming notation. It works on shared-memory computers, which means you can work with tasks without the need to explicitly decompose your data. However, it also means you can easily create a program with accidental data sharing, which can lead to the race conditions we discussed in Part 3.

To work with OpenMP, we only need to explain a few basic ideas. The constructs in OpenMP are for the most part expressed as hints to the compiler or "pragmas". If a compiler doesn't recognize a pragma, it ignores it. This means an OpenMP program will compile and run with a non-OpenMP compiler … one source code base serves for two different environments.
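
As a quick illustration of that last point, here is a minimal sketch of our own (not from the original article) using the standard _OPENMP macro, which OpenMP-aware compilers define; the program text and its messages are just an example.

#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>                     // declares the OpenMP runtime routines
#endif

int main(void)
{
    #pragma omp parallel
    {
#ifdef _OPENMP
        // compiled with OpenMP: every thread in the team reports in
        printf("hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
#else
        // compiled without OpenMP: the pragma is ignored and one thread runs
        printf("hello from the lone serial thread\n");
#endif
    }
    return 0;
}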

An OpenMP program starts as a serial program. At some point when additional threads are needed, these threads are created with a "parallel" pragma

#pragma omp parallel

The team of threads created by the parallel pragma executes the statement following the pragma (which of course can be a compound statement defined by multiple statements within a pair of curly brackets). By default, variables visible before the parallel pragma are shared by all the threads. To create a separate copy of one of these variables for each thread, we'd use the private clause

#pragma omp parallel private(list_of_variables)
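
For example, the following small sketch of our own gives each thread its own copy of the variable id; omp_get_thread_num() is the standard OpenMP runtime call that returns a thread's ID within the team.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int id;

    #pragma omp parallel private(id)   // every thread gets its own copy of id
    {
        id = omp_get_thread_num();     // each thread writes only its private copy
        printf("thread %d reporting\n", id);
    }
    return 0;
}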

If a parallel pragma is used alone, each thread executes the same code. You can split up iterations of a loop among a set of threads by using the "for" pragma

#pragma omp for

To save space in your source code, you can put the for and parallel pragma into a single pragma

#pragma omp parallel for
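
As a small sketch of our own (the array a, its size N, and the scale() function are made up for illustration), this combined pragma divides the iterations of a simple loop among the threads; since every iteration touches a different element of a, the threads never write to the same location.

#define N 100000

double a[N];

void scale(double factor)
{
    int i;
    #pragma omp parallel for        // create a team and divide the iterations
    for (i = 0; i < N; i++)
        a[i] = a[i] * factor;       // each iteration touches a distinct element
}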

This causes a team of threads to come into existence and split the iterations of the immediately following loop among those threads. Before we show examples of OpenMP code, there are two clauses we need to look at more closely. The first helps us manage data seen by a team of threads. It is quite common to want each thread to have its own copy of a temporary variable; as we saw above, the private clause makes this happen, and it can be added to a parallel or a for pragma

#pragma omp for private(list_of_variables)

Another common situation is a reduction. A reduction occurs when the threads carry out operations that are then accumulated into a single variable; in essence, the multiple values computed on each thread are reduced to a single global value

#pragma omp for reduction(+:sum)

All of this may seem a bit confusing at this point, but it should all become clear once we consider a couple of examples.

Examples

We will provide two examples in this note.

The first example is a numerical integration problem that should produce an estimate of the number pi. This example is grossly over-used in notes about parallel programming. We keep using it, however, since it exposes many of the key issues in parallel programming and it is one of the simplest examples that actually speeds up when run on a parallel computer.

The second example is much more complicated. In this example, we compute sequences of Fibonacci numbers for each item in a linked list.

We will cover the same material in each example. You can read either one or, if you prefer, both. (Note: all results reported in this note were for an Intel dual-core 1.83 GHz CPU with the Intel IA-32 compiler version 11.0.074).

Example 1: The simple pi program

The first is a trivial example of a numerical integration program. This example is described in detail elsewhere so we shall be brief in this note. The serial version of the program follows:

double x, pi, sum = 0.0;
double step = 1.0/(double) num_steps;   // width of each rectangle
int i;

for (i=1; i<= num_steps; i++){
    x = (i-0.5)*step;                   // midpoint of the i-th rectangle
    sum = sum + 4.0/(1.0+x*x);          // height of the curve 4/(1+x^2) at x
}
pi = step * sum;

This program sums up a series of narrow rectangles to estimate the area under a curve. The curve and the range of values computed are selected so the result should be equal to the number pi. To parallelize this program, we go through the four-step program we listed above:

  1. Find the concurrency in your problem. The iterations of the loop can in principle execute concurrently.
  2. Expose the concurrency: The concurrency can be exposed by giving each thread its own copy of the temporary variable x and by managing the reduction into the variable sum.
  3. Exploit the concurrency: We will parallelize the program with OpenMP by adding a single pragma
    #pragma omp parallel for private(x) reduction(+:sum)
    
  4. Execute in parallel. The complete program follows:
    double x, pi, sum = 0.0;
    double step = 1.0/(double) num_steps;
    int i;

    #pragma omp parallel for private(x) reduction(+:sum)
    for (i=1; i<= num_steps; i++){
        x = (i-0.5)*step;
        sum = sum + 4.0/(1.0+x*x);
    }
    pi = step * sum;
    

The loop control variable "i" is private to each thread by default. The OpenMP pragma creates a team of threads (due to the "parallel" construct) and divides the iterations of the immediately following loop among them (due to the "for" construct), with each thread having its own copy of the variable "x". This prevents a race condition as multiple threads read and write the "x" variable. The clause "reduction(+:sum)" tells OpenMP to create a separate copy of the variable "sum" for each thread and to initialize it to the identity of the indicated operation (zero for addition). The accumulation operations go into the thread's private copy. Once the loop is complete, the private, per-thread copies are combined into a single global copy.
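
To make the reduction mechanics concrete, here is a hand-coded sketch of our own showing roughly the same idea written out explicitly, with a per-thread partial sum and a critical section to combine the results. It is an illustration of the mechanism, not the code the compiler actually generates, and it assumes the same variables (x, i, step, num_steps, pi) as the program above.

sum = 0.0;
#pragma omp parallel private(x)
{
    double partial = 0.0;              // one private accumulator per thread,
                                       // starting at the identity of '+'
    #pragma omp for
    for (i=1; i<= num_steps; i++){
        x = (i-0.5)*step;
        partial = partial + 4.0/(1.0+x*x);
    }

    #pragma omp critical               // combine the partial sums one thread
    sum = sum + partial;               // at a time, avoiding a race on sum
}
pi = step * sum;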

We ran this program on a dual-core laptop to produce the following results:

Serial code:        Pi = 3.141593 in 1.73  seconds
OpenMP, 1 thread:   Pi = 3.141593 in 1.73  seconds
OpenMP, 2 threads:  Pi = 3.141593 in 0.877 seconds

Note that the OpenMP program with one thread runs in the same time as the original serial program. This is not always the case; in fact, it is common for the constructs added to support parallel execution to add overhead (which is why many parallel programmers don't like to show the runtime for the original serial code).
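
For completeness, nothing in the pragma above fixes the number of threads. Runs like these are typically controlled either from the environment or with a standard OpenMP runtime call; the sketch below is our own, and the count of two simply matches the dual-core laptop used here.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    // Ask for a team of two threads in subsequent parallel regions.
    // Setting the environment variable OMP_NUM_THREADS=2 before the run
    // has the same effect without touching the source.
    omp_set_num_threads(2);

    #pragma omp parallel
    {
        #pragma omp single
        printf("running with %d threads\n", omp_get_num_threads());
    }
    return 0;
}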

Example 2: parallel processing of linked lists

As you move away from scientific computing with its emphasis on solving partial differential equations (the classic "grid" codes), data structures and the associated control structures become much more complicated. Instead of arrays and simple for loops, you have linked lists and while loops.

Consider the following example:

p = head;
while (p != NULL) {
    processwork(p);
    p = p->next;
}

We don't care what the function processwork() actually computes. All we need is for the function to consume substantial amounts of CPU time so we have enough work to offset any parallel overhead. For this example, we let processwork() compute a different sequence of Fibonacci numbers each time it is called.
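
For concreteness, processwork() might look something like the sketch below. The node layout and the recursive fib() helper are our own guesses at one plausible shape for such a routine, not the code used for the timings in this note.

struct node {
    int data;              // which Fibonacci number this node asks for
    int fibdata;           // result filled in by processwork()
    struct node *next;
};

// Deliberately slow recursive Fibonacci: it recomputes the sequence over
// and over, which is exactly what gives each node enough CPU work.
static int fib(int n)
{
    if (n < 2) return n;
    return fib(n - 1) + fib(n - 2);
}

static void processwork(struct node *p)
{
    p->fibdata = fib(p->data);
}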

Let's apply our four-step program to this problem.

  1. Find the concurrency in your problem. Each call to processwork(p) can execute independently of the others; this is where the concurrency in the problem lies.
  2. Expose the concurrency: The concurrency is difficult to expose in this problem. Since we are traversing a linked list, reaching each node requires walking through the nodes before it. On the surface, this prevents any concurrency. We can get around it by making extra passes through the list: one to count the nodes, and one to collect a pointer to each node into an array, as shown below. Then we can process the elements of this array in parallel.
    // count the number of elements in the list
    p = head;
    while (p != NULL) {
        p = p->next;
        count++;
    }

    // put a pointer to each element of the list into an array
    p = head;
    for (i = 0; i < count; i++) {
        parr[i] = p;
        p = p->next;
    }
    
  3. Exploit the concurrency: Once we have the nodes of the list in an array, we can parallelize the program with a "parallel for"
        #pragma omp parallel for
    
  4. Execute in parallel.

The complete program follows:

count = 0;
p = head;
while (p != NULL) {             // first pass: count the nodes in the list
    p = p->next;
    count++;
}

p = head;
for (i = 0; i < count; i++) {   // second pass: copy node pointers into an array
    parr[i] = p;
    p = p->next;
}

#pragma omp parallel for        // process the array elements in parallel
for (i = 0; i < count; i++)
    processwork(parr[i]);

The original serial program ran in 26.0 seconds. The OpenMP version, with all the extra processing to build the array of pointers to the list elements, ran in 26.4 seconds using one thread. The fact that it took longer to run is not surprising given the extra work we did to expose concurrency. When we used two threads, the runtime decreased to 21.0 seconds.

That shows some speed-up, but not as much as we'd hoped. What is the problem? The "for" construct in OpenMP tries to pick the best schedule for mapping loop iterations onto threads. The compiler does the best job it can, but it knows nothing about the data. In particular, we expect that the effort to compute the Fibonacci numbers varies significantly from one element of the list to the next. A naive distribution of the work, where the first half of the iterations is given to one thread and the second half to the other, is unlikely to produce a well-balanced load.

We can interact with OpenMP to direct it to use a different schedule. In this case, we will tell it to deal out iterations to the threads in a round-robin fashion, i.e., just as you would deal out a deck of cards. This is done by adding the "schedule(static,1)" clause so our pragma reads

#pragma omp parallel for schedule(static, 1) 

With this new schedule, the runtime on two threads drops to 16.6 seconds.
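
For reference, OpenMP also offers a dynamic schedule, in which each thread grabs the next chunk of iterations as it finishes its previous one. It is another common remedy for unbalanced loops like this, at the cost of a little extra scheduling overhead; the sketch below is ours and was not one of the runs timed above.

#pragma omp parallel for schedule(dynamic, 1)
for (i = 0; i < count; i++)
    processwork(parr[i]);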

Unfortunately, we had to make way too many changes to the original program to support the parallelism. One of the goals of OpenMP is to support parallel programming with minimal changes to the original serial program. We addressed this problem in the latest version of OpenMP, OpenMP version 3.0, by adding an explicit task construct. The tasks are placed in a queue and a team of threads then processes tasks off the queue until the work is done. The code for the OpenMP 3.0 tasks version of our program follows:

#pragma omp parallel
{
    #pragma omp single
    {
        p = head;
        while (p) {
            #pragma omp task firstprivate(p)
                processwork(p);
            p = p->next;
        }
    }
}

There are a few extra details to understand. First we used a single construct. This tells the system that only one thread in the team will do the enclosed work. In other words, only one thread runs through the while loop to place tasks on the queue. Once all the tasks are enqueued, the thread then joins in with the rest of the team to compute tasks. The second detail is the task pragma

#pragma omp task firstprivate(p) 

This defines the immediately following statement (the processwork(p) function) as a task to be placed on the queue. The firstprivate(p) clause tells the system to save a copy of the indicated variable with its value at the time the pragma is encountered and to use that value within the task.

This simple code maintains the basic structure of the original program. Once you are used to thinking in terms of queues of explicit tasks, the program is easy to understand. How well does it run? In this case, the program ran in 15.8 seconds. It was not only easier to understand, it ran slightly faster than the other cases.

Conclusion

In these four notes, we have covered a wide range of issues. Starting with a high-level description of the parallel programming problem, we discussed parallel hardware in our second note and the key issues every parallel programmer needs to understand in our third note. This final note covered the basic ideas behind parallel programming. While the details can be extremely complicated, the central ideas are straightforward. Everything comes down to exploiting concurrency on parallel hardware to make a problem run in less time. These notes won't make you a parallel programming expert, but they will at least help you understand the key ideas behind this field. And they will help you to "sound like an expert".

 
 
 

About the Author

  Tim Mattson is a Principal Engineer at Intel working in the Microprocessor Technology Laboratory. He is also a parallel programmer. Over the last 20 years, he has used parallel computers to make chemicals react, shake up proteins, find oil, understand genes, and solve many other scientific problems. Tim's long-term research goal is to make sequential software rare. This means he has many years of hard work ahead of him.