Intel® Threading Tools and OpenMP*

Introduction

Find where parallelism can be implemented effectively within a serial application.

Explicit threading methods, such as Windows* threads or POSIX* threads, use library calls to create, manage, and synchronize threads. Use of explicit threads requires an almost complete restructuring of affected code. On the other hand, OpenMP* is a set of pragmas, API functions, and environment variables that enable you to incorporate threads into your applications at a relatively high level. The OpenMP pragmas are used to denote regions in the code that can be run concurrently. An OpenMP-compliant compiler transforms the code and inserts the proper function calls to execute these regions in parallel. In most cases, the serial logic of the original code can be preserved and is easily recovered by ignoring the OpenMP pragmas at compilation time.

To illustrate our points more concretely, we have chosen to analyze code that implements a brute force algorithm to find prime numbers within a user-defined range of integers. The serial code takes each potential prime (even numbers are not considered) and divides it by all the integers less than or equal to the square root of the number. If any of the test factors evenly divides the number under consideration, the number is composite; if none of the factors evenly divides the number, it is prime. Numbers found to be prime can optionally be printed out, but a count of the number of primes found is always computed. It is known that primes greater than 2 can be uniquely classified into two categories: those that can be factored into the form 4n+1 and those of the form 4n-1. In addition to counting the total number of primes found, a count for the relevant class-by finding the remainder of division by 4-of primes is incremented for each prime. The serial code, PrimeFinder, to be used is below.

 ```#include #include main(int argc, char *argv[]) { int i, j, limit; int start, end; /* range of numbers to search */ int number_of_prim es=0; /* number of primes found */ int number_of_41primes=0;/* number of 4n+1 primes found */ int number_of_43primes=0;/* number of 4n-1 primes found */ int prime; /* is the number prime? */ int print_primes=0; /* should each prime be printed? */ start = atoi(argv[1]); end = atoi(argv[2]); if (!(start % 2)) start++; if (argc == 4 && atoi(argv[3]) != 0) print_primes = 1; printf("Range to check for Primes: %d - %d ",start, end); for(i = start; i <= end; i += 2) { limit = (int) sqrt((float)i) + 1; prime = 1; /* assume number is prime */ j = 3; while (prime && (j <= limit)) { if (i%j == 0) prime = 0; j += 2; } if (prime) { if (print_primes) printf("%5d is prime ",i); number_of_primes++; if (i%4 == 1) number_of_41primes++; if (i%4 == 3) number_of_43primes++; } } printf(" Program Done. %d primes found ",number_of_primes); printf(" Number of 4n+1 primes found: %d ",number_of_41primes); printf(" Number of 4n-1 primes found: %d ",number_of_43primes); } ```

Intel® Thread Checker as OpenMP* Programming Assistant

Within the given code there is really only one logical place to insert OpenMP pragmas: the main computational for-loop. The modified code at the start of the for-loop is:

 ```#pragma omp parallel for for(i = start; i <= end; i += 2) { ```

By default, all variables, except the loop iteration variable, are shared. More often than not, some threads will need private copies of certain variables in order to avoid data races. In other cases, the logic of the program will be better served if access to some of these variables is synchronized. Before we can determine how best to protect shared variable access, we must recognize which variables need such protection.

With such a small example, we would expect programmers with even a modicum of OpenMP experience to take no more than 30 seconds to identify the variables that need to be protected; in another half minute, an adequate means of implementing that protection can be formulated. However, consider a much larger piece of code where the parallel region is hundreds or thousands of lines long or involves many different function calls whose parameters can be referenced through pointers and different names. Tracking down potential storage conflicts is now a much more daunting task.

Fortunately, Intel Thread Checker automatically identifies the variables that need some form of exclusive access. With only the addition of the pragma shown above and running the code t hrough Thread Checker, the variables limit, prime, j, number_of_primes, number_of_43primes, and number_of_41primes have all been found to cause storage conflicts in the absence of some form of synchronization. See the Intel Thread Checker screen shot below.

click image for larger view

By looking at the source code and the intended usage for each variable, we can determine how best to modify the original source code to implement the required scoping of variables.

Any variable that is written in the parallel region before being read, and whose value is not needed beyond the parallel region, should be private. For the PrimeFinder* example code, limit, prime and j are just such variables and are used only as workspace or temporary variables within the parallel region. Thus, we can distribute copies to each thread by use of the private clause on the OpenMP pragma. The remaining three counter variables are used to hold a global sum for printing after the parallel region. In this case, we shall leave them as shared, but perform the increments of these counters within a critical section. The resulting parallel region code is:

 ```#pragma omp parallel for private (limit, j, prime) for(i = start; i <= end; i += 2) { limit = (int) sqrt((float)i) + 1; prime = 1; /* assume number is prime */ j = 3; while (prime && (j <= limit)) { if (i%j == 0) prime = 0; j += 2; } if (prime) { if (print_primes) printf("%5d is prime ",i); #pragma critical { number_of_primes++; if (i%4 == 1) number_of_41primes++; if (i%4 == 3) number_of_43primes++; } } } ``` ` `

Running this code through Intel Thread Checker shows no more error diagnostics. We have created correctly threaded code. An alternative to using the private clause is to make the affected variables local to the for-loop and, consequently, the parallel region. This is a more elegant solution if these variables are not used anywhere else in the code. An added advantage to this alternative implementation is that the serial code more closely matches the parallel one with regard to the variables.

Besides finding those variables that need protection, Intel Thread Checker can also determine if a section of code is even a candidate for parallelization. Again, with long code sections or those that have a deep call stack, it is extremely tedious and time-consuming to determine if any dependencies exist within potential parallel loops. Dependencies such as an induction variable (a variable that is incremented at each iteration of a loop) or recurrence relation (accessing information computed on a previous loop iteration) inhibit correct parallelization without some algorithmic modifications to eliminate the dependence. Intel Thread Checker points out storage conflicts and programmer examination of the code confirms that the use of the noted variable does constitute a loop dependence.

Performance Tuning with Thread Profiler

After a correctly threaded code has been created, the performance of that code should be gauged. The serial and threaded execution times can easily be compared. If the threaded time is half that of the serial time when run with two threads on a dual-core system, you have done a great job implementing parallelism. If the threaded execution time is closer to (or even greater than) the serial time, certain questions arise. Do large segments of code still execute in serial? Does the required synchronization adversely affect the execution performance? Is the amount of work per thread properly balanced?

Thread Profiler for OpenMP can be used to answer these questions and guide the programmer to points within the code that could be improved and lead to better parallel performance. Because of the structured nature of OpenMP, Thread Profiler is able to make assumptions about the execution model for the application and point out very specific performance problems. Two such common problems are load imbalance and synchronization overhead. We shall look at how Thread Profiler identifies both of these and discuss some possible solutions.

An idle processor during parallel computation is a wasted resource. Similarly, idle threads are wasted resources and can adversely affect the overall runtime of parallel execution. By default, at the end of each OpenMP parallel region or worksharing region, threads wait at an implicit barrier until all threads have completed the assigned work of the region. When unequal amounts of computation are assigned to threads, threads with less to execute sit idle at the region barrier until those threads with more to do have finished.

Running the modified PrimeFinder code through Thread Profiler with four threads on a dual-core processor, with Hyper-Threading Technology (HT Technology), we find that a relatively significant portion of time (14%) was spent with idle threads. Since there is only a single parallel region in this example code, it is obvious where the imbalance occurs. However, when trying to track down the source of an imbalance in more complex codes, the Regions View is used to find the region(s) that assigns threads insufficient work. Selecting a bar from a parallel region, you can find the source code of the corresponding region.

To better determine the cause of the imbalance, consult the Threads View for the selected problem region. This view, from the example code, is given below.

click image for larger view

The stair-step pattern of decreasing imbalance time is common and indicates that the workload, though split evenly as far as the number of tasks per thread, is monotonically increasing the amount of computation required. In this case, the number of integers to be checked for primality is evenly divided by the default static distribution of iterations. However, the number of factors that need to be checked increases with the size of the integer. Thus, the thread that checks the last quarter of the integer range performs more computation than the previous three threads.

This pattern of imbalance also indicates that the sizes and order of tasks are fixed. We can correct the observed imbalance by setting up a more dynamic distribution of tasks, or integers to be checked. Adding a schedule clause to the parallel-for pragma can give you more control on how iterations are assigned to threads. Using a dynamic schedule with a large enough chunk size, for example, schedule(dynamic,100), disseminates iterations as needed for more even distribution of computation, but also gives enough work per chunk to keep the scheduling overhead low. A less obvious schedule for the imbalance of the example code is schedule(static,1). This schedule deals out iterations to threads like dealing out playing cards: one iteration per thread, cycling through the threads in a round-robin fashion, until all iterations have been assigned. Below is the Thread Profiler Summary View of the code using the dynamic scheduling of integers to be tested.

If the imbalance is more evenly spread across the threads, especially for regions that are performed multiple times during the course of execution, then the load is likely changing from one pass to the next. That is, there are some larger tasks being assigned to threads, but there may be no means to predict which tasks these will be. To remedy this situation, try smaller tasks that are dynamically assigned to threads.

Synchronization Impact

While it is very small in the new version of the test code, there still remains a portion of time that is spent by threads waiting for synchronization. As with load imbalance, the example code is simple enough to know exactly where the threads are contending for synchronization. For more complex situations, use the Regions View in Thread Profiler to determine which parallel region(s) contain such conflicts and focus your attention there.

Sections of code that are protected by synchronization should be as small as possible and still maintain correct code. Practice of this principle minimizes the idle time threads must spend waiting to gain access to the protected code sections. For the example code, there are no extraneous statements that do not need exclusive access by threads. Each statement could be placed into a separate critical section. In this case, though, we should use named critical sections since all unnamed critical sections are considered the same under OpenMP, even if they appear in completely different functions and source files. To use named critical sections with the example code, change the increments of the counters to be:

 ```#pragma omp critical (cs1) number_of_primes++; #pragma omp critical (cs2) if (i%4 == 1) number_of_41primes++; #pragma omp critical (cs3) if (i%4 == 3) number_of_43primes++; ```

With four threads and three critical sections, there is still a good chance that at least one thread will be waiting to enter one of the critical sections. Plus, the overhead of entering and exiting multiple critical sections has been tripled. Data from Thread Profiler indicated that the percentage of time with threads waiting for locks and parallel overhead doubled when using three named critical sections over the original three-line critical section.

Another rule of thumb to follow is to not place synchronizations within loops. There are three options to implement this advice for the example code: local variables; atomic operations; the OpenMP reduction clause.

Temporary variables for each counter, declared to create a private copy within each thread, are incremented within the loop. Before the loop is exited, the local values are added to the global values all within a single critical section after the worksharing construct. Thus, there is only one point in the code where threads may delay other threads. One drawback of this method is that the serial code needs to be altered to accommodate the parallel requirements.

 ```#pragma omp parallel { int numPrimes=0; /* local number of primes found */ int num41Primes=0; /* local number of 4n+1 primes found */ int num43Primes=0; /* local number of 4n-1 primes found */ #pragma omp for schedule(dynamic,100) for(i = start; i <= end; i += 2) { int limit, j, prime; limit = (int) sqrt((float)i) + 1; prime = 1; /* assume number is prime */ j = 3; while (prime && (j <= limit)) { if (i%j == 0) prime = 0; j += 2; } if (prime) { if (print_primes) printf("%5d is prime ",i); numPrimes++; if (i%4 == 1) num41Primes++; if (i%4 == 3) num43Primes++; } } // end for #pragma omp critical { // Update global counter values with local values number_of_primes += numPrimes; number_of_41primes += num41Primes; number_of_43primes += num43Primes; } } // end parallel region ```

A second implementation utilizes atomic execution to prevent storage conflicts and reduce impact of synchronization. An obvious means to do this is to use the OpenMP atomi c construct. The increment of the three counters is one of the accepted forms that can be used with atomic. In fact, the previous solution could also make use of atomic constructs to protect the global updates with the private, partial sums from each thread. On Windows, the InterlockedIncrement intrinsic could be used to perform the atomic increment of the three shared counters.

 ```#pragma omp parallel { int numPrimes=0; /* local number of primes found */ int num41Primes=0; /* local number of 4n+1 primes found */ int num43Primes=0; /* local number of 4n-1 primes found */ #pragma omp for schedule(dynamic,100) for(i = start; i <= end; i += 2) { int limit, j, prime; limit = (int) sqrt((float)i) + 1; prime = 1; /* assume number is prime */ j = 3; while (prime && (j <= limit)) { if (i%j == 0) prime = 0; j += 2; } if (prime) { if (print_primes) printf("%5d is prime ",i); #pragma omp atomic number_of_primes++; if (i%4 == 1) { #pragma omp atomic number_of_41primes++; } #pragma omp atomic if (i%4 == 3) { #pragma omp atomic number_of_43primes++; } } } // end for } // end parallel region ```

The third method also uses the power of OpenMP to perform these same operations described in the first method, but not alter any of the serial code. The OpenMP reduction clause creates private copies of the shared variables, computes with those private copies in each thread, and then combines all the partial results back into the original variables at the end of the parallel region. The syntax of the reduction clause requires an associative binary operator along with the name of variables to be combined with the operator at the completion of the parallel region. Thread Profiler results after making this change to the example code (see code below for details) yields an almost perfect (99.984%) parallel execution.

Summary

After following all the advice above, the final parallel region from the example code is:

 ```#pragma omp parallel for schedule(dynamic,100) reduction (+:number_of_primes,number_of_41primes,number_of_43primes) for (i = start; i <= end; i += 2) { int limit, k, prime; // locally declared for private limit = (int) sqrt((float )i) + 1; prime = 1; /* assume number is prime */ k = 3; while (prime && (k <= limit)) { if (i%k == 0) prime = 0; k += 2; } if (prime) { if (print_primes) printf("%5d is prime ",i); number_of_primes++; if (i%4 == 1) number_of_41primes++; if (i%4 == 3) number_of_43primes++; } } ```

To arrive at this final code, we used the Intel Thread Checker to first determine whether or not a loop in the serial code could be made to execute in parallel, and what variables would need to be made private or protected with exclusive access. After the initial code modifications were done, results from running the code through Thread Profiler allowed tuning of the parallel execution to make full use of the computational resources of our system.

The Intel Threading Tools are often touted as being able to find threading errors in code quickly and point out performance bottlenecks that may not be apparent. By using these tools earlier in the development cycle, you can automate some of the more tedious tasks required to find where parallelism can be implemented effectively within a serial application.