Using Tasking in OpenMP* 3.0 with the Intel® Fortran Compiler version 11
Earlier versions of OpenMP focused on data parallelism, with only limited support for task-level parallelism expressed statically via explicit code segments (parallel sections). More general support for unstructured parallelism via dynamic task generation was introduced in OpenMP 3.0. Although it is possible to use the new task construct in other contexts, its main value is for applications that do not exhibit data parallelism and that can benefit from the dynamic generation and scheduling of tasks. In this article, the steps necessary to thread such an application are described in detail for a specific example.
A Sample Application
The venerable Whetstone benchmark is used to illustrate how the task-level parallelism introduced in OpenMP 3.0 may be used to improve application performance on multi-core systems. The original version of the Whetstone benchmark was written in Algol (H.J. Curnow & B.A. Wichmann, “A Synthetic Benchmark”, The Computer Journal, Vol. 19, No. 1, pp. 43-49, February 1976). There are publicly available versions in various languages; the Fortran 77 double precision version discussed below was obtained from www.netlib.org as whetstoned.f. It does not use any Fortran 90 constructs such as MODULE; where the term “module” is used, it refers to the numbered sections of code described as “modules” in the comments:
- Module 1: Simple identifiers
- Module 2: Array elements
- Module 3: Array as parameter
- Module 4: Conditional jumps
- Module 5: Omitted
- Module 6: Integer arithmetic
- Module 7: Trigonometric functions
- Module 8: Procedure calls
- Module 9: Array references
- Module 10: Integer arithmetic
- Module 11: Standard functions
Source code changes
Computers have become orders of magnitude more powerful over the years; therefore, the following changes have been made to whetstoned.f to adapt it for modern compilers and multi-core processors:
The iteration count LOOP has been increased by a factor of 1000, to increase the running time, and the amount by which the scale factors T and T1 differ from 0.5 has been reduced by a factor of 1000 (to reduce the risk of underflows or overflows).
Modules 1 and 10 have been reactivated by changing the loop iteration count from 0 to 1000*LOOP. The following changes were made to subroutine POUT, which prints a result summary for each module:
The IMPLICIT statement has been moved before the first declarations, as required by the Fortran language standard. The PRINT statement has been changed to print times in milliseconds instead of seconds and to print out the module number and (when applicable) the OpenMP thread number. The format statement has been adjusted for this and to avoid field overflows.
Where the actual arguments of calls to POUT are undefined in the corresponding module, they have been replaced by the arguments M (=-1) or XM (=-1.0). This allows the computed results of different runs to be compared, whether serial or parallel, while avoiding the distraction of undefined variables taking different values in different runs. The source code after these changes may be found in the attached file whetstoned1.f.
Most of the computational modules are not amenable to threading or vectorization within the module itself, because of dependencies. For example, in module 1, each iteration of the loop depends on the result of the previous iteration. The natural way to thread the application is to try to execute the individual modules in parallel. In most cases, the individual module calculations are independent: data are initialized locally, and results are printed but not used elsewhere. The exception is that the results of module 2 are used as input for module 3. For this reason, modules 2 and 3 cannot be executed in parallel, and we have chosen to combine them sequentially into a single task for subsequent parallelization.
Variable names such as I, J, K, L, X, Y, Z, E and E1 (but not their values) are reused in multiple modules. Therefore, each module needs a private copy of these variables that cannot be overwritten by other modules running concurrently in a different thread. The variables J, K, L and E1 appear in blank common and cannot easily be made private to a thread. (Blank common may not appear in a THREADPRIVATE directive, and the earlier elements of the common block, T, T1 and T2, need to be shared between modules.) J, K, L and E1 have therefore been removed from common and made into local variables. They are passed into subroutine P0 as arguments instead of via common. The corresponding source file is whetstoned2.f.
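The change to the common block can be sketched as follows (the declarations and the argument order of P0 are illustrative; see whetstoned2.f for the actual code):

```fortran
C     Before: J, K, L and E1 were shared between modules via
C     blank common:
C           COMMON T, T1, T2, E1(4), J, K, L
C
C     After: only the shared scale factors remain in blank common;
C     J, K, L and E1 become local variables and are passed to P0
C     as arguments (argument order illustrative)
      COMMON T, T1, T2
      DOUBLE PRECISION T, T1, T2, E1(4)
      INTEGER J, K, L
C     ...
      CALL P0(E1, J, K, L)
```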
The source file may be built with the Intel Fortran compiler for Windows*, Linux* or Mac OS* X, version 11:

ifort whetstoned2.f
However, the compiler inlines most small subroutines and functions, which defeats the purpose of module 8, and the associated interprocedural optimizations allow the compiler to optimize away the loop in module 6. Therefore, to preserve the original intent of the benchmark, it is built with the following options:
ifort -no-ip -fno-inline whetstoned2.f    (Linux and Mac OS X)
ifort /Qip- /Ob0 whetstoned2.f            (Windows)
Although the benchmark contains some internal calls to ETIME for individual modules, overall benchmark performance is measured by wall clock time, e.g. by running under the Linux “time” command and noting the “real” time, or by running under a Windows equivalent such as “TIMETHIS”.
The benchmark may now be threaded using the tasking model in OpenMP 3.0. Each module is made into a separate task (except for modules 2 and 3, which are combined) by enclosing it in !$OMP TASK and !$OMP END TASK directives. A DEFAULT(SHARED) clause is added to ensure that read-only variables such as T, T1 and T2 are shared between tasks. A PRIVATE clause specifies those variables for which the task needs its own private copies, so that they are not overwritten by other threads.
These tasks will be executed in parallel only if they bind to an enclosing OpenMP parallel region and its team of threads. Therefore, !$OMP PARALLEL directives are inserted immediately before the first task and after the last task. However, this is not sufficient: since there is no OpenMP worksharing construct, each thread executes the entire parallel region, so each task would be executed once per thread rather than once in total. Evidently, this is not what we want. We therefore introduce a !$OMP SINGLE directive immediately after the PARALLEL directive, and !$OMP END SINGLE immediately before END PARALLEL. This ensures that all the tasks are created and queued exactly once, by a single thread; a task may then be executed by any thread as it becomes available. The code with the added directives is whetstone_task1.f. It should be built with OpenMP enabled, e.g.:
For Linux or Mac OS X:
$ ifort -no-ip -fno-inline -openmp -openmp-report2 whetstone_task1.f
$ time ./a.out

For Windows:
>> ifort /Qip- /Ob0 /Qopenmp /Qopenmp-report2 whetstone_task1.f
>> C:\"Program Files\Resource Kit"\timethis whetstone_task1
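Putting the pieces together, the directive structure of whetstone_task1.f looks roughly like this (module bodies and the private variable lists are abbreviated for illustration):

```fortran
!$OMP PARALLEL DEFAULT(SHARED)
!$OMP SINGLE
!$OMP TASK DEFAULT(SHARED) PRIVATE(I, X, Y, Z)
C        Module 1: simple identifiers
!$OMP END TASK
!$OMP TASK DEFAULT(SHARED) PRIVATE(I, J, K, L, E1)
C        Modules 2 and 3, combined into one task because
C        module 3 uses the results of module 2
!$OMP END TASK
C        ... one task per remaining module ...
!$OMP END SINGLE
!$OMP END PARALLEL
```

The SINGLE construct means one thread creates and queues all the tasks, while the whole team of threads is available to execute them.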
On a system with 4 cores, this typically results in about a factor of 2 speedup. The number of threads may be controlled by the environment variable OMP_NUM_THREADS; if it is not set, it defaults to the number of logical processors. It is also recommended to set the KMP_AFFINITY environment variable, especially on systems with hyperthreading, where not all logical processors may be kept busy, e.g.:
export KMP_AFFINITY=scatter    (Linux or Mac OS X)
set KMP_AFFINITY=scatter       (Windows)
Note that a program threaded with OpenMP can still be built and executed as a serial program. Explicit calls into the OpenMP runtime library must be protected, either by OpenMP conditional compilation using the “!$” sentinel, as here, or by linking against a library of function stubs, done by replacing -openmp with -openmp-stubs (or /Qopenmp with /Qopenmp-stubs).
OpenMP Parallel Sections
The code may be threaded in a similar fashion to the above using OpenMP parallel sections. The SINGLE and END SINGLE directives are replaced by SECTIONS and END SECTIONS directives. The !$OMP TASK directives are replaced by !$OMP SECTION directives, and the END TASK directives are simply removed. In this case, all the private variables for the whole parallel region must be declared on the !$OMP PARALLEL directive. The corresponding code may be seen in whetstone_sect.f. It may be built and executed in the same way as the TASK version of the code. The speedup is typically similar, but slightly less, since the TASK queueing mechanism gives more flexibility for load balancing between the different modules.
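The equivalent structure using parallel sections, as in whetstone_sect.f, might be sketched like this (module bodies and the private variable list are abbreviated for illustration):

```fortran
C     With SECTIONS, all private variables for the whole region
C     are declared on the PARALLEL directive
!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(I, J, K, L, X, Y, Z, E1)
!$OMP SECTIONS
!$OMP SECTION
C        Module 1: simple identifiers
!$OMP SECTION
C        Modules 2 and 3, combined
C        ... one section per remaining module ...
!$OMP END SECTIONS
!$OMP END PARALLEL
```

Each section is statically bound to its piece of source code, which is why sections offer less scheduling flexibility than tasks.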
OpenMP Tasking Revisited
OpenMP tasks are more powerful and flexible than parallel sections because they may be created dynamically, without being tied to a separate piece of source code. Threading the traversal of a linked list is an often-cited example. We can see a simple illustration in the Whetstone tasking example by changing the loop iteration counts from
LOOP = 1000000 and II = 1 to LOOP = 500000 and II = 2
(This, of course, changes the printed results, which correspond only to the last value of JJ.)
With the parallel region inside the loop over JJ, nothing changes. The parallel region with its tasks or sections is simply executed twice, with a synchronization point at the end of the parallel region each time. But if the PARALLEL and SINGLE directives (and the corresponding END PARALLEL and END SINGLE) are moved outside the loop over JJ, then all the tasks for both iterations are submitted to a single queue, which gives the runtime library more opportunities for load balancing, especially in cases when there are substantial variations in the durations of the different tasks. When the loop variable JJ is tested inside an individual task, it may no longer have the same value as when the task was spawned. This can be corrected by adding a FIRSTPRIVATE(JJ) clause to each TASK directive. The resulting code may be seen in whetstone_task2.f. When built and executed as before, it may show about a 10% speedup compared to the first tasking example, whetstone_task1.f.
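Schematically, whetstone_task2.f hoists the parallel region and the SINGLE construct out of the JJ loop and captures the loop variable with FIRSTPRIVATE (module bodies and private variable lists abbreviated for illustration):

```fortran
!$OMP PARALLEL DEFAULT(SHARED)
!$OMP SINGLE
      DO 100 JJ = 1, II
!$OMP TASK DEFAULT(SHARED) PRIVATE(I, X, Y, Z) FIRSTPRIVATE(JJ)
C        Module 1; FIRSTPRIVATE(JJ) captures the value of JJ at
C        the time the task is created, not the value it has when
C        the task eventually runs
!$OMP END TASK
C        ... one task per remaining module ...
  100 CONTINUE
!$OMP END SINGLE
!$OMP END PARALLEL
```

Because all tasks from both iterations of the JJ loop go into a single queue, the runtime can overlap work across iterations instead of synchronizing at the end of each one.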
The Intel Fortran Compiler, version 11, supports OpenMP 3.0, which provides more powerful, dynamic ways to thread applications that exhibit task parallelism but not data parallelism.