Threading Fortran Applications for Parallel Performance on Multi-Core Systems

Abstract

Most processors now come with multiple cores and future increases in performance are expected to come from increases in core count whether those are cores that process one instruction at a time or many simultaneously with SIMD (single instruction, multiple data) instructions. Performance sensitive applications that neglect the opportunities presented by additional cores will soon be left behind. This article discusses ways for an existing, serial Fortran application to take advantage of these opportunities on a single, shared memory system with multiple cores. Issues addressed include data layout, thread safety, performance and debugging. Intel® provides the compiler and software tools that can help in the development of robust, parallel applications that scale well.

Levels of Parallelism

Applications take advantage of parallelism at many levels. The most basic is at the processor level, often called instruction-level parallelism (ILP). The compiler and processor work together to find operations that can be performed simultaneously. A developer with advanced skills who wants to get every drop of performance from an application can influence ILP. Vectorization is a special kind of ILP where the compiler locates a single operation that can be performed simultaneously on multiple data values via special hardware instruction sets in the processor. This is referred to as SIMD processing. Intel processors with the SSE and AVX families of instruction sets are capable of SIMD processing.

Parallelism can also be implemented at the thread level using shared memory. Fortran compilers today will do what they can when the developer sets the compiler option and diligently create a report explaining which DO-loops were parallelized and which were not. To get more parallelism, the developer can use POSIX threads (Pthreads) or insert OpenMP* threading directives into the source to give the compiler explicit instructions. Using OpenMP directives is more straightforward.

Distributed memory parallelism enables a single program to execute multiple processes on a network of computers simultaneously. Nothing automatic about this level of parallelism using Message Passing Interface (MPI). It’s all up to the developer to add calls to routines that move the data between the computers. Coarray Fortran (CAF) as described in the Fortran 2008 standard supports distributed memory parallelism without the programmer inserting the calls to move the data between computers. The compiler takes care of that for you.

Each of these kinds of parallelization is independent and can be combined. For example, an application using OpenMP directives can get a performance benefit from ILP and vectorization, too. And the application that every developer dreams of is embarrassingly parallel. The separate tasks are easy to identify and there is little communication or coordination required between the threads that perform the tasks.

Ways to Introduce Parallelism

MKL

Before jumping in to parallelize your program, be aware that threaded libraries are available. Calling a threaded library is a whole lot easier than threading your own. The Intel® Math Kernel Library (Intel® MKL) contains many standard libraries implemented using OpenMP to take advantage of the Intel multi-core architecture. Basic Linear Algebra Subprograms (BLAS) Level 1, 2, and 3, Linear Algebra Package (LAPACK), Sparse BLAS, Fast Fourier Transforms (FFT), direct sparse solvers and vector math and random number functions are included. Get started here.

Vectorization

Review the Vectorization Cookbook for helpful information on getting started with vectorization.

POSIX Threads

POSIX threads (Pthreads) are language independent and mostly used for task-level parallelism. Pthreads are a challenge to program with and to debug. There is no direct interface from Fortran to the Pthreads API. Wrappers written in C and called from Fortran are required adding another layer of complexity.

Automatic Parallelization

The Intel Fortran compiler can thread simple loops automatically. Compile your code with -parallel -O2 (/Qparallel /O2 on Windows). A report listing each loop and whether it parallelized or not is quite useful. If the compiler decided to not parallelize a loop, there’s an explanation. Add -qopt-report-phase=par (/Qopt-report-phase:par on Windows) to generate the report.

Some tuning to the cost model for parallelization may be done with -par-threshold[n] (/Qpar-threshhold[[:]n] for Windows) where n=100 is the default. When n=100, the compiler will thread (parallelize) only if there is a high probability of performance improvement. When n=0, it will parallelize whenever it is safe to do so, irrespective of performance predictions (useful for testing).

Conditions for Auto-Parallelization

Today’s Intel compiler will automatically parallelize simple loops when -parallel -O2 (/Qparallel /O2 on Windows) compiler options are used. The requirements of automatically parallelizing a loop are similar to those for auto-vectorizing a loop.

DO-loops should be simple, no jumps into or out of a loop. The loop trip count needs to be known at runtime. DO WHILE loops are not eligible for auto-parallelization.

The loop iterations must be independent of each other with no constructs that show a possible data dependency like

 X(I+1) = Y(I+1) + X(I)

Simple loops with function calls that can be inlined by the compiler are eligible for auto-parallelization.

Be wary of array pointers and possible aliasing (accessing the same array element via different pointers). The compiler is conservative and won’t take any risk on overlapping memory locations.

The compiler will recognize reduction idioms like summation and parallelize those. Be aware that rounding differences may lead to slight differences in results when parallelizing floating-point summations.

There is overhead to start up and end parallel loops, so there must be enough work to do to make parallelization worthwhile.

DO-loops may be auto-parallelized AND auto-vectorized. The below Matrix Multiply subroutine is a good demonstration of this. The coder added the comments in the subroutine to show what the compiler did. Read farther for the key points from the optimization report.

subroutine matmul(a,b,c,n)
real(8) a(n,n),b(n,n),c(n,n)
c=0.d0
do i=1,n         ! Outer loop is parallelized.
   do j=1,n      ! inner loops are interchanged
      do k=1,n   ! new inner loop is vectorized 
         c(j,i)=c(j,i)+a(k,i)*b(j,k)
      enddo
   enddo
enddo
end

$ ifort -O2 -parallel -qopt-report -c matmul.f90
… snip…
LOOP BEGIN at matmul.f90(4,1)
   remark #25444: Loopnest Interchanged: ( 1 2 3 ) --> ( 1 3 2 )
   remark #17109: LOOP WAS AUTO-PARALLELIZED
   remark #15542: loop was not vectorized: inner loop was already vectorized

   LOOP BEGIN at matmul.f90(6,7)
      remark #15542: loop was not vectorized: inner loop was already vectorized

      LOOP BEGIN at matmul.f90(5,4)
      <Peeled loop for vectorization>
      LOOP END

      LOOP BEGIN at matmul.f90(5,4)
         remark #15301: PERMUTED LOOP WAS VECTORIZED
      LOOP END

      LOOP BEGIN at matmul.f90(5,4)
      <Alternate Alignment Vectorized Loop>
      LOOP END

      LOOP BEGIN at matmul.f90(5,4)
      <Remainder loop for vectorization>
         remark #15335: remainder loop was not vectorized: vectorization possible but seems inefficient. Use vector always directive or -vec-threshold0 to override
      LOOP END
   LOOP END
LOOP END 
…snip…

OpenMP

For additional parallelism, OpenMP directives can be added to your program. OpenMP facilitates data-level, including SIMD, parallelism with support for task parallelism. OpenMP is a standard API and portable. Intel endeavors to implement the latest standards from the OpenMP ARB (Architecture Review Board). With OpenMP directives, the sequential program still exists and is runnable. There is no need to maintain two versions of the program. The parts of the program that are performance critical are easily bracketed with OpenMP directives allowing incremental parallelism.

OpenMP works off of the fork-join model of parallelism. Threads are spawned by a master thread as needed, i.e. the sequential program evolves into a parallel program. When a task is finished, its thread is returned to the pool of threads ready for the next part of the program that runs in parallel.

Where to begin parallelizing? First, determine where your program spends most of its time. Use Intel® Advisor and/or Intel® Vtune™ Amplifier to gain insight. Amdahl’s Law applies to the speedup you can expect. Speedup is limited by the serial part of the program.

For best performance improvement, look at the high level for coarse grain parallelism. That is the outer loop of nested loops, the slowest varying grid coordinate, or high level driver routines. The high level parallelism reduces the overhead of starting and stopping threads and improves data locality and data reuse by each thread. Some loops are not good candidates due to the data dependence between iterations.

Look for areas of data parallelism. It’s easier to balance the computation across threads and to use more cores.

Examine the basic structure of this numerical integration algorithm. It calculates the electrostatic potential at a series of points in a plane due to a uniform square distribution of charge. Essentially, this is a two dimensional (2D) integral over the square.

DO Square_charge loops over points
   DO TwoD_int integrate over y 
      DO Trap_int integrate over x 
         sum = sum + Func(calculates 1/r potential)

Conceptually, one way to parallelize this algorithm is to first inline func(). Then vectorize the loop over x. At the next level you could thread over y or thread that outer loop over points.

OpenMP: How Do Threads Interact?

OpenMP is a shared memory model. Threads communicate by sharing variables. It is critical to identify which variables are shared between threads and which need a separate copy for each thread. A coding tip for Fortran developers for ease of maintenance is to make shared data explicitly global in MODULES or COMMON blocks and thread private data as local or automatic. Dynamically allocated data (malloc or ALLOCATE) can be used. Allocate shared data in the serial portion of the application. Data allocated inside a parallel region is available only to that thread. Data can also be declared THREADPRIVATE.

Unintended sharing of data between threads can cause a race condition, a state when the outcome of a program changes when the threads are scheduled differently. Race conditions can be eliminated with synchronization, but at a performance penalty. The best way is to eliminate the race condition is to change how the data is accessed. Each thread gets its own private stack, but the heap is shared by all threads. Data on the heap is expensive to keep thread-safe; locks are required.

Explicit parallel regions are self-documenting. They are bracketed with !$OMP PARALLEL and !$OMP END PARALLEL or, in the case of a work-sharing construct like !$OMP PARALLEL DO, the parallel region ends with the corresponding ENDDO. Functions and subroutines called from within a parallel region are harder to identify as they might not contain any OpenMP directives, but they still need to be thread-safe.

As defined by the OpenMP standard, inside a parallel region all data defaults to shared, except loop indices. Data that is really local and not shared need to be declared PRIVATE. Global data can also be declared as THREADPRIVATE. Another coding tip is to specifically declare each variable in the parallel region as shared or private and the code becomes self-documenting!

Often inside a parallel region initial values are undefined. Use FIRSTPRIVATE to initialize a variable to the value the variable had entering the loop or COPYIN for global variables to have meaningful data to compute with.

Conversely, after the end of a parallel region final values are undefined, LASTPRIVATE will carry the last value out of the parallel region.

Thread Safety

A thread-safe function can be called simultaneously from multiple threads and still give correct results. Potential conflicts must either be protected (synchronized) or avoided (by privatization).

Two concepts regarding data must be defined before going further.

  1. “Static local data” is shared by each thread, i.e. each thread may access the same data location! This is potentially unsafe.
  2. “Automatic data” is independent between each thread; it has its own, independent copy on its own stack.

With Intel Fortran when compiled and running serially, the defaults are local scalar variables are automatic and local arrays are static. But when compiling with -qopenmp, local arrays are automatic by default. It is the same as compiling with -auto. This may require an increase in the maximum stack size. If the stack size is too small, there will be a segmentation fault at runtime.

Making a Function or Subroutine Thread-Safe

To make a function thread-safe the developer has two options: add an option to the compiler command line or add clarification in the source code.

With the compiler command line, add -qopenmp or -auto (/Qopenmp or /auto for Windows). -auto may have less impact on serial optimization.

In the source, add the keyword AUTOMATIC to the necessary variable declarations, but that doesn’t cover compiler generated temporary variables. The best way to make a code safe other than using the command line is to declare the whole function or subroutine RECURSIVE. The whole function is covered then, including compiler-generated code.

Whichever way you make a routine thread-safe, don’t use the compiler option, -save, nor use the keyword SAVE. Those make all variables static. Don't initialize data with a Fortran DATA statement for a thread-safe routine either; that makes the variable static.

Avoid writing to global variables unless the writes are synchronized.

OpenMP has various synchronization constructs to protect operations that are potentially unsafe when run in parallel. To force allowing only one thread through a block of code at a time add

!$OMP CRITICAL -or- 
!$OMP SINGLE 

to the source as needed. When only a single statement should be executed asynchronously, the !$OMP ATOMIC directive may be useful.

Another often used OpenMP clause is the REDUCTION clause. The developer has the choice of several operators, but the variables in the REDUCTION clause must be shared. The compiler takes care of managing threads and any synchronization required.

Thread Safe Libraries

The Intel Math Kernel library is thread-safe, both the sequential and threaded versions. You can call the threaded version from a parallel region, but be careful how many threads you end up using. Oversubscribing the processor with too many threads will slow the performance.

If you compile the Fortran main program with -qopenmp (/Qopenmp for Windows), the thread-safe version is linked in automatically.

If the main program is written in C or C++ with Fortran routines, compile with -qopenmp (/Qopenmp for Windows) and call the routine FOR_RTL_INIT to initialize the Fortran runtime environment.

Performance Considerations

To get the best possible performance from your application start with optimized serial code with functions and subroutines inlined, inner loops vectorized, etc. For example, use -O3 -ipo.

Use the analysis tools to see where time is being consumed, the hot spots. Focus there to determine if there is sufficient parallel work.

Minimize the amount of data shared between threads, unless it is read only. The time to synchronize the threads can slow down performance.

False sharing of cache lines slows performance. In this code snippet,

!$OMP parallel do
  do i=1,nthreads
     do j=1,1000
       A(i,j) = A(i,j) + …
  enddo

each thread thinks its copy of A(i,j) may have been invalidated. Reversing subscripts of A improves data locality for each thread. Contiguous memory access may also improve vectorization of inner loop.

One more item to consider that may improve OpenMP performance is to determine if the workload is evenly distributed across loop iterations. If not, consider other scheduling options, like DYNAMIC or GUIDED. STATIC is the default and has lowest overhead.

Timers for Threaded Applications

Be aware that when it comes to CPU TIME, not all timers are created equal.

The Fortran standard timer, CPU_TIME(), returns “processor time”. That’s the sum of the time over all threads, just like the Linux “time” command. That makes it appear that threaded applications don’t run faster than serial ones.

The Fortran intrinsic subroutine SYSTEM_CLOCK() returns data from the real time clock and does not sum over all of the threads. It reports elapsed time like the Linux “time” command. SYSTEM_CLOCK can be used to measure time for a portion of code and with multiple timings with different numbers of threads, speed-up can be calculated.

             call system_clock(count1, count_rate)
                -- compute good stuff --
             call system_clock(count2, count_rate)
             time = (count2 - count1) / count_rate

DCLOCK(), an Intel-specific function, returns elapsed time in seconds since the current process started, can also be used. It is especially valuable for timing intervals longer than 24 hours. SECNDS() also measures elapsed time, but use it code segments with timing intervals of less than 24 hours.

Thread Affinity Interface

A consideration for performance of an application parallelized using OpenMP directives is placement of the threads on the physical processors, thread affinity. Memory placement and cache utilization can be key to a quicker solution.

The environment variable OMP_PROC_BIND can be set to

  • close      to assign threads to consecutive cores on same socket for a possible benefit from shared cache
  • spread   to place threads on separate cores until all cores have at least one thread

Other environment variables for more finely placing threads include KMP_AFFINITY, OMP_PLACES, and KMP_HW_SUBSET and are documented in the Intel Fortran Developer Guide.

Careful placement of threads, especially when hyperthreading is enabled, can improve performance.

NUMA Considerations

Computers with multiple, multi-core processors have Non-Uniform Memory Access (NUMA), i.e. memory access time depends on the location of physical memory for the process relative to the processor where it is running. For best performance the memory allocated by a process should be close to where it is used.

The “first touch” policy used by the operating system for memory management places the memory allocated by a process in physical memory on or near the processor where it is running. This reduces the time needed for subsequent accesses.

Especially for large arrays, initialize data in an OpenMP loop in the same way you plan to use it later:

!$OMP parallel do              
do i=1,n   
  do j=1,m
    A(j,i) = 0.0  
  enddo 
enddo 

!$OMP parallel do                           
do i=1,n                                
  do j=1,m                           
    dowork(A(j,i))               
  enddo                          
enddo

And remember to set OMP_PROC_BIND and/or other related environment variables appropriately.

Segmentation Fault - Common Runtime Problem with OpenMP Applications

This is the most frequently reported issue regarding OpenMP.

The symptom: Application barely begins to run and it fails with segmentation violation.

The reason: Insufficient stack size

The solution: Increase the stacksize for the whole program and for each thread.

There are two stacksizes to increase: Linux shell and OpenMP.

First try adjusting the Linux shell limit.  The typical OS default is about 8MB for the whole program (shared and local data). To increase this on Linux increase the shell limits. For Bash, use “ulimit -s unlimited”. For csh, use “limit stacksize unlimited.” On Windows, relinking the program with /F:100000000 sets the whole program stack to 100,000,000 bytes, 100MB.

If the program still fails with Segmentation Fault, try increasing the stack size for the data local to each thread by setting the environment variable OMP_STACKSIZE to the number of kilobytes for the private stack. Be aware that this is set for each thread and the memory is actually allocated.

Tips for Debugging OpenMP Applications

First, set the number of threads to 1 (set the environment variable OMP_NUM_THREADS to 1) and run it. If it works, try Intel® Inspector on the threaded code to check for threading errors like race conditions and other synchronization issues.

If it still fails, build with the compiler options: -qopenmp-stubs -auto (/Qopenmp-stubs /auto on Windows), and run it. With -qopenmp-stubs the run-time library (RTL) calls are resolved, but no threaded code is generated. -auto tells the compiler to be sure that local arrays are on the stack, the default for -qopenmp.

If that works, check for missing FIRSTPRIVATE and/or LASTPRIVATE clauses. FIRSTPRIVATE causes the value of the variable inside the parallel region to its last value before the parallel region. On exit from the parallel region, LASTPRIVATE sets the value of the specified variable to its value on exiting the parallel region.

If the program still fails, threaded code generation is not the cause.

If the application runs successfully without -auto, the changed memory model is implicated. Perhaps the stack size is insufficient or possibly values are not preserved between successive calls that should be.

A tip for debugging with PRINT statements, the internal I/O buffers are threadsafe (when the application is compiled with -qopenmp (/Qopenmp on Windows)), but the order of print statements from different threads is undetermined. It is handy to print the thread number with other debug information. Add USE OMP_LIB for OpenMP run-time library routines and call OMP_GET_THREAD_NUM() to get the thread number and print the integer result.

Debug with -O0 -qopenmp. Unlike most other optimizations, OpenMP threading is not disabled at -O0.

Floating-Point Reproducibility and OpenMP

If you run the same executable with different numbers of threads, you may get slightly different answers due to different work decomposition leading to tiny differences in arithmetic rounding. Most cases can be worked around by compiling with -fp-model consistent. “Consistency of Floating-Point Results using the Intel® Compiler” discusses many of the aspects of reproducibility.

Floating point reductions may still not be strictly reproducible in OpenMP, even for the same number of threads. The order in which contributions from different threads are added may vary from run to run. Set the environment variable, KMP_DETERMINISTIC_REDUCTION=true, and use static scheduling with a fixed number of threads. This should give run-to-run reproducibility

Intel Specific Environment Variables

Intel provides a number of environment variables that are Intel Only and recognized by the OpenMP runtime environment. Some of the more useful include:

  • KMP_SETTINGS = 0 | 1         Print environment variables or defaults at runtime
  • KMP_VERSION = off | on       Print runtime library version
  • KMP_LIBRARY = turnaround | throughput | serial
    • turnaround - idle threads, do not yield to other processes
    • throughput - idle threads sleep and yield after KMP_BLOCKTIME msec Thread wait time before going to sleep (default 200 msec)
  • KMP_AFFINITY                       See the Intel Fortran Developer Guide for a full description
  • KMP_CPUINFO_FILE             Use file for machine topology, instead of /proc/cpuinfo on Linux
  • KMP_DETERMINISTIC_REDUCTION         See Floating Point Reproducibility and OpenMP

Components from Intel for Debugging OpenMP Applications

Besides the GNU Project Debugger (GDB) with Intel extensions, these Intel tools can detect runtime issues with your application.

Intel® Advisor

  • guides you in adding parallelism to your application
  • allows experiments to estimate the potential speedup
  • helps make your application thread safe
  • part of Intel Parallel Studio XE

Intel® Inspector

Unified set of tools that pinpoint hard-to-find errors in multi-threaded applications

  • data races
  • deadlocks
  • memory defects
  • security issues
  • performs run-time analysis
  • part of Intel Parallel Studio XE
  • display results via a GUI

Intel® Vtune™ Amplifier

Performs concurrency and locks and waits analyses

Other features

  • View application concurrency level to ensure core utilization
  • Identify where thread and synchronization related overhead impacts performance
  • Identify objects impacting performance
  • Visualize the distribution of work to threads
  • Visualize when threads are active and inactive
  • Supports native threads, Intel® Threading Building Blocks or OpenMP* applications
  • Convenient GUI display
  • Data collection can be local or remote
  • part of Intel Parallel Studio XE

Summary

Intel software tools provide extensive support for threading applications to take advantage of multi-core architectures. Advice and background information are provided for a variety of issues that may arise when threading a Fortran application.

References

Intel Software Products information, evaluations, active user forums, http://software.intel.com

The Intel® Fortran Compiler Developer Guide and Reference and other documents for other tools mentioned are available at Documentation

Intel Advisor

Intel VTune Amplifier

What is Code Modernization

Distributed Memory Coarray Fortran with the Intel Fortran Compiler for Linux Essential

Coarray Tutorial for Windows

Vectorization Cookbook

Automatic Parallelization with Intel Compilers

OpenMP Standard

Instruction Level Parallelism (ILP), Wikipedia, July 30, 2018

Automatic Vectorization, Wikipedia

Amdahl's Law, Wikipedia, July 30, 2018



INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, reference www.intel.com/software/products.

Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.

*Other names and brands may be claimed as the property of others.

Copyright © 2018. Intel Corporation.

For more complete information about compiler optimizations, see our Optimization Notice.