Threading Fortran applications for parallel performance on multi-core systems

Most processors now come with multiple cores, and future increases in performance are expected to come mostly from increases in core count. Performance-sensitive applications that neglect the opportunities presented by additional cores will soon be left behind. This article discusses ways for an existing, serial Fortran application to take advantage of these opportunities on a single, shared-memory system with multiple cores. Issues addressed include data layout, thread safety, performance and debugging. Intel provides software tools that can help in the development of robust, parallel applications that scale well.


Levels of Parallelism

1 SIMD instructions
   – Compiler can vectorize loops automatically
2 Instruction level
   – Processor schedules, you don’t see it
3 Threading (usually shared memory)
   – Native Linux* or Windows* threads
   – OpenMP*
   – Simplest for multi-core
4 Distributed memory clusters
   – Message passing (various MPI implementations)
   – Coarray Fortran (part of Fortran 2008)
5 “Embarrassingly parallel” multiprocessing

Ways to Introduce Threading

1 Threaded libraries, e.g. Intel® MKL
   – Easy and effective, if it fits your problem
2 Auto-parallelization by the compiler
   – Easy to do, but limited in scope
   – “simple” loops where the compiler can prove it is safe
3 Asynchronous I/O (very specialized; see compiler documentation)
4 Native threads
   – Mostly used for task-level parallelism
   – Not so easy to program and debug
5 OpenMP
   – Designed to facilitate data-level parallelism
   – (relatively) easier programming and debugging
   – Some support for task parallelism, especially in OpenMP 3.0
   – portable
   – OpenMP 4.0 RC2 and the latest Intel compiler also support SIMD parallelism using OpenMP directives


Intel® Math Kernel Library

1 Many components of MKL have threaded versions
   – Based on the compiler’s OpenMP runtime library
   – Level 1,2 and 3 BLAS, LAPACK
   – Sparse BLAS
   – Discrete Fourier Transforms
   – Vector math and random number functions
   – Direct sparse solvers, e.g. PARDISO
2 Link threaded or non-threaded interface
   – libmkl_intel_thread.a or libmkl_sequential.a
   – Use the link line advisor at
      http://software.intel.com/en-us/articles/intel-mkl-link-line-advisor 
3 Set the number of threads
   – export MKL_NUM_THREADS or OMP_NUM_THREADS
   – Call mkl_set_num_threads or omp_set_num_threads
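
The thread count can also be set from inside the program. A minimal sketch, assuming the mkl_service module shipped with MKL is available on the include path (otherwise omp_set_num_threads from OMP_LIB can be used):

program mkl_threads
  use mkl_service, only: mkl_set_num_threads, mkl_get_max_threads
  implicit none
  call mkl_set_num_threads(4)      ! request up to 4 MKL threads
  print *, 'MKL may use up to', mkl_get_max_threads(), 'threads'
  ! ... call threaded MKL routines here, e.g. the level 3 BLAS ...
end program mkl_threads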

Example: PARDISO (Parallel Direct Sparse Solver)

1 Solver for large, sparse symmetric and nonsymmetric systems of linear equations on shared memory systems
   – Main solver is threaded using OpenMP
      – Just link to threading layer, libmkl_intel_thread
      – iparm(2)=3 will add threading for initial reordering phase
   – Good scaling for large problems
   – Fortran90, Fortran77 and C interfaces
      – F90 interface can catch many errors in calling sequences
   – Supports real, complex, single and double precision
   – Iterative refinement
   – Use MKL_NUM_THREADS or OMP_NUM_THREADS to control number of threads
      – Else defaults to number of processors (including hyperthreads)
2 For algorithms, see http://www.pardiso-project.org

Auto-parallelization

1 The compiler can thread simple loops automatically
   – Build with –parallel (Linux*) or /Qparallel (Windows*)
   – Needs at least –O2 (unlike OpenMP, which works even at –O0)
   – Loops must fulfill “simple” conditions
   – Reports which loops were parallelized and, if not, why not
      – -par-report2 etc.
   – tune parallelization cost model with -par-thresholdn
      – default is n=100; try n=99
2 Based on the same RTL threading calls as OpenMP:
   – call __kmpc_fork_call
   – These are wrappers around the low-level pthreads and Win32 threads libraries
   – Recognizes same OpenMP environment variables

Conditions for Auto-parallelization

1 Loop count known at entry (no DO WHILE)
   – But not necessarily at compile time
   – No jumps into or out of loop
2 Loop iterations are independent
   – No function calls (or prove no side effects)
      – except if inlined
   – No aliasing (accessing same variable via different pointers)
   – No recurrences such as X(I+1) = Y(I+1) + X(I)
   – Reductions are allowed
      – But partial sums may lead to rounding differences
3 Enough work to amortize parallel overhead
4 Conditions for OpenMP loops are similar
   – But you, not the compiler, are responsible for ensuring there are no dependencies!
5 Directives may be used to guide the compiler:
   – !DIR$ PARALLEL 
      - asserts no loop-carried dependencies
   – !DIR$ PARALLEL ALWAYS
      - Overrides cost model; threads loop even if compiler doesn’t think performance will improve (like –par-threshold0 for a single loop)
   – !DIR$ LOOP COUNT
      - estimate of typical number of iterations
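
A minimal sketch of how these directives might be applied (the loop itself is hypothetical):

subroutine scale_add(a, b, s, n)
  implicit none
  integer, intent(in)    :: n
  real(8), intent(in)    :: b(n), s
  real(8), intent(inout) :: a(n)
  integer :: i
!DIR$ LOOP COUNT(100000)    ! typical iteration count, helps the cost model
!DIR$ PARALLEL              ! asserts there are no loop-carried dependencies
  do i = 1, n
     a(i) = a(i) + s*b(i)
  end do
end subroutine scale_add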

Example: matrix multiply

subroutine matmul(a,b,c,n)
real(8) a(n,n),b(n,n),c(n,n)
c=0.d0
do i=1,n         ! Outer loop is parallelized.
   do j=1,n      ! inner loops are interchanged
      do k=1,n  ! new inner loop is vectorized
         c(j,i)=c(j,i)+a(k,i)*b(j,k)
      enddo
   enddo
enddo
end

$ ifort -O3 -parallel -par-report1 -c matmul.f90
matmul.f90(4) : (col. 0) remark: LOOP WAS AUTO-PARALLELIZED.

See also   http://software.intel.com/en-us/articles/automatic-parallelization-with-intel-compilers


OPENMP - advantages

1 Standardized API based on compiler directives
   – Latest version is 3.1,  but the draft version 4.0 is already at RC2
   – C, C++ and Fortran; Linux*, Windows* and Mac OS* X
   – Directives treated as comments by non-OpenMP compiler
   – Single source base for serial and parallel implementations
      – Helps debugging
   – Allows incremental parallelism
   – OpenMP rules make checking tools a little easier

OpenMP Programming Model

Fork-Join Parallelism:

1 Master thread spawns a team of threads as needed
   - Parallelism is added incrementally
      - the sequential program evolves into a parallel program.
   - Threads are not destroyed, but are returned to a pool.

Note that Intel’s implementation of OpenMP creates a separate monitor thread in addition to any user threads.
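
A minimal fork-join sketch: the master thread forks a team at the parallel construct, each thread prints its identity, and the threads join again at the end of the region.

program fork_join
  use omp_lib
  implicit none
!$omp parallel
  print *, 'Hello from thread', omp_get_thread_num(), &
           'of', omp_get_num_threads()
!$omp end parallel
  print *, 'Back on the master thread'
end program fork_join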


OPENMP – where to thread

1 Start by mapping out high level structure
2 Where does your program spend the most time?
   – If you don’t know, do quick performance analysis
      – Intel(R) VTune Amplifier XE,  gprof, …
   – If only a fraction p of your program runs in parallel, Amdahl’s law limits the overall speedup to 1/(1-p), however many cores and threads you use.
3 Prefer data parallelism
   – Easier to load balance
   – Easier to scale to more cores
4 Favor coarse grain (high level) parallelism
   – E.g. outer loop of nest, slowest varying grid coordinate, high level driver routines
   – Reduces overhead
   – Improves data locality and reuse for each thread
   – But inherently sequential loops, such as time stepping, cannot be parallelized.

Example: Square_Charge

1 Calculates the electrostatic potential at a series of points in a plane due to a uniform square distribution of charge
   – Essentially, a 2D integral over the square

Square_charge loops over points
   Twod_int integrate over y
      Trap_int integrate over x
          Func calculates 1/r potential

   – Inline func()
   – Vectorize loop over x
   – Thread loop over y
      – Avoid race conditions
   – Could instead thread the loop over points, or use MPI (a sketch of threading the y loop follows)
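
A hedged, self-contained sketch of this strategy (the real Square_charge routines and argument lists differ): the outer y loop is threaded, a REDUCTION clause avoids a race on the accumulated potential, and the inner x loop with the 1/r kernel inlined is left to the vectorizer.

function potential(px, py, pz, nx, ny) result(pot)
  implicit none
  real(8), intent(in) :: px, py, pz     ! field point
  integer, intent(in) :: nx, ny         ! quadrature points in x and y
  real(8) :: pot, x, y, dx, dy
  integer :: ix, iy
  dx = 1.0d0/nx                         ! unit square of charge
  dy = 1.0d0/ny
  pot = 0.0d0
!$omp parallel do private(ix, x, y) reduction(+:pot)
  do iy = 1, ny                         ! threaded loop over y
     y = (iy - 0.5d0)*dy
     do ix = 1, nx                      ! vectorizable loop over x
        x = (ix - 0.5d0)*dx
        pot = pot + dx*dy / sqrt((px-x)**2 + (py-y)**2 + pz**2)   ! 1/r potential, inlined
     end do
  end do
!$omp end parallel do
end function potential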

Openmp: How do threads interact?

1 OpenMP is a shared memory model
   – Threads communicate by sharing variables
2 Unintended sharing of data causes race conditions:
   – Race condition: the program’s outcome depends on how the threads happen to be scheduled
3 To control race conditions…
   – Use synchronization to protect data conflicts
4 Synchronization is expensive so…
   – Change how data is accessed to minimize the need for synchronization
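
A minimal sketch of that idea: the REDUCTION clause gives each thread a private partial sum, so the shared result needs no explicit synchronization inside the loop.

subroutine sum_squares(x, n, s)
  implicit none
  integer, intent(in)  :: n
  real(8), intent(in)  :: x(n)
  real(8), intent(out) :: s
  integer :: i
  s = 0.0d0
!$omp parallel do reduction(+:s)
  do i = 1, n
     s = s + x(i)*x(i)    ! safe: each thread accumulates into its own copy of s
  end do
!$omp end parallel do
end subroutine sum_squares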

OPENMP – data

1 Identify which data are shared between threads, which need a separate copy for each thread

2 It’s helpful (but not required) to make shared data explicitly global, in modules or common blocks, and thread-private data local and automatic.

3 Dynamic allocation is OK (malloc, ALLOCATE)
   – Allocate in serial region if shared
   – Allocate in parallel region if each thread needs its own copy (see the sketch after this list)
      - Or declare as THREADPRIVATE

4 Each thread gets its own private stack, but the heap is shared by all threads
– So heap allocation must be made threadsafe internally, using locks, which makes it more expensive than stack allocation
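
A minimal sketch of the allocation patterns in item 3 (hypothetical variable names): the shared array is allocated once in the serial region, while each thread allocates and frees its own private work array inside the parallel region.

program alloc_demo
  use omp_lib
  implicit none
  real(8), allocatable :: grid(:)     ! shared between threads
  real(8), allocatable :: work(:)     ! one copy per thread
  allocate(grid(1000))                ! shared: allocate in the serial region
  grid = 0.0d0
!$omp parallel private(work)
  allocate(work(100))                 ! private: allocate inside the parallel region
  work = real(omp_get_thread_num(), 8)
  ! ... per-thread work using work(:) and read-only access to grid(:) ...
  deallocate(work)
!$omp end parallel
  deallocate(grid)
end program alloc_demo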

OPENMP – data scoping

1 Distinguish lexically explicit parallel regions from the “dynamic extent”: functions or subroutines called from within an explicit parallel region, which may contain no OpenMP directives, or only “orphaned” OpenMP directives

2 Lexically explicit: !$OMP PARALLEL to !$OMP END PARALLEL
   – Everything defaults to SHARED (except loop index)
   – Local data: can change with PRIVATE clause
   – Global data: can declare common blocks, module variables as !$OMP THREADPRIVATE
   – Initial values inside parallel region undefined
      – unless use FIRSTPRIVATE (local variables)
      – or COPYIN (globals)
   – Final values after parallel region undefined
      – unless use LASTPRIVATE (local variables)
   – Functions called (the dynamic extent) must be threadsafe, even if they themselves do not contain explicit parallel regions
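
A minimal sketch of the scoping clauses above (hypothetical variables): t gets its serial value via FIRSTPRIVATE, the THREADPRIVATE module variable is seeded from the master copy with COPYIN, and x stays SHARED.

module globals
  integer :: counter = 0
!$omp threadprivate(counter)
end module globals

subroutine scoping_demo(x, n)
  use globals
  implicit none
  integer, intent(in)    :: n
  real(8), intent(inout) :: x(n)
  real(8) :: t
  integer :: i
  t = 2.0d0
!$omp parallel do firstprivate(t) copyin(counter)
  do i = 1, n
     counter = counter + 1     ! each thread updates its own THREADPRIVATE copy
     x(i) = t*x(i)             ! x is shared; each iteration writes a distinct element
  end do
!$omp end parallel do
end subroutine scoping_demo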

Thread Safety

1 A threadsafe function can be called simultaneously from multiple threads, and still give correct results
   – Potential conflicts must either be protected (synchronized) or avoided (by privatization)
   – Static local data: by default, each thread may access the same data location! Potentially unsafe.
   – Automatic data: each thread has its own, independent copy on its own stack

2 ifort serial defaults:
   – Local scalars are automatic
   – Local arrays are static 
3 When compiling with –openmp, default changes
   – Local arrays are automatic
   – Same as compiling with –auto
   – This may increase the required stack size
      – Beware segmentation faults

Making a function thread safe

1 With the compiler
   – Compile with –openmp or
   – Compile with –auto
      – May have less impact on serial optimization
2 In source code
   – Use AUTOMATIC keyword in declarations
      – But doesn’t cover compiler-generated temporaries
   – Declare function as RECURSIVE
      – Covers whole function, including compiler-generated code
      – Best way to make code thread safe if you don’t want to depend on build options
3 In either case:
   – don’t use –save or SAVE keyword
   – Avoid global variables,
      – Or don’t write to them unless synchronized

4 OpenMP has various synchronization constructs to protect operations that are potentially unsafe
   – REDUCTION clause
   – !$OMP CRITICAL
   – !$OMP SINGLE
   – etc
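
A minimal sketch pulling these points together (hypothetical names): RECURSIVE makes all local data automatic regardless of build options, and the only write to shared global data is protected by a CRITICAL section.

module shared_state
  integer :: ncalls = 0        ! global data shared by all threads
end module shared_state

recursive subroutine accumulate(x, n, total)
  use shared_state
  implicit none
  integer, intent(in)  :: n
  real(8), intent(in)  :: x(n)
  real(8), intent(out) :: total
  real(8) :: work(n)           ! automatic array: each call/thread gets its own copy
  work = x*x
  total = sum(work)
!$omp critical
  ncalls = ncalls + 1          ! synchronized update of shared global data
!$omp end critical
end subroutine accumulate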

Thread Safe Libraries

1 The Intel® Math Kernel library is threadsafe
   – Sequential version as well as threaded version

2 The Intel Fortran runtime library has two versions
   – The default is not threadsafe (libifcore)
      – Build with –threads to link to threadsafe version (libifcoremt)
   – If you link with –openmp, the threadsafe version is linked by default
      – But you must compile the Fortran main program with –openmp, to ensure threadsafe initialization of the library
      – If the main program is in C, it must call FOR_RTL_INIT

Performance considerations

1 Start with optimized serial code, vectorized inner loops, etc. (e.g., -O3 –ipo …)
2 Ensure sufficient parallel work
3 Minimize data sharing between threads
   – Unless read-only
4 Avoid false sharing of cache lines

!$OMP parallel do
  do i=1,nthreads
    do j=1,1000
      A(i,j) = A(i,j) + ..   ! threads writing A(i,j) and A(i+1,j) touch the same cache line
    enddo
  enddo

   – Each thread sees that its copy of the cache line holding A(i,j) may have been invalidated by another thread
   – Reversing the subscripts of A improves data locality for each thread (see the corrected sketch after this list)
      – Contiguous memory access also permits vectorization of inner loop
      – Helps performance a lot
5 Scheduling options
   – Consider DYNAMIC or GUIDED if work is unevenly distributed between loop iterations
      – else STATIC (the default) has the lowest overhead
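
A minimal sketch of the loop above with the subscripts of A reversed, as suggested: each thread now updates its own contiguous column, so cache lines are not shared between threads and the inner loop can be vectorized (the actual update is a placeholder).

subroutine update(A, nthreads)
  implicit none
  integer, intent(in)    :: nthreads
  real(8), intent(inout) :: A(1000, nthreads)
  integer :: i, j
!$omp parallel do schedule(static)
  do i = 1, nthreads
     do j = 1, 1000
        A(j,i) = A(j,i) + 1.0d0   ! placeholder update of a contiguous, per-thread column
     enddo
  enddo
!$omp end parallel do
end subroutine update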


Timers for threaded apps

1 The Fortran standard timer CPU_TIME returns “processor time”
   – It sums time over all threads/cores
   – Like the “User time” in the Linux “time” command
   – So it might appear that threaded apps don’t run faster than serial ones
2 The Fortran intrinsic subroutine SYSTEM_CLOCK returns data from the real time clock
   – Does not sum over cores
   – Like the “real time” in the Linux “time” command
      – Can be used to measure the speedup due to threading
            call system_clock(count1, count_rate)
            ……
            call system_clock(count2, count_rate)
            time = real(count2 - count1) / real(count_rate)

3 dclock (Intel-specific function) can also be used

Thread Affinity Interface

1 Allows OpenMP threads to be bound to physical or logical cores
   – export environment variable KMP_AFFINITY=...
      – physical: use all physical cores before assigning logical cores (hyperthreads)
      – compact: assign threads to consecutive cores on the same socket (e.g. to benefit from shared cache)
      – scatter: assign threads to cores on alternating sockets (e.g. to maximize channels to memory)
   – Helps optimize access to memory or cache
   – Particularly important if Hyperthreading is enabled
      – else some physical cores may be idle while others run multiple threads

   – See compiler documentation for much more detail

NUMA considerations

1 Want memory allocated “close” to where it will be used
   – “First touch” determines where memory is physically allocated
   – So initialize data in an OpenMP loop in the same way you plan to use it later:

      Initialization:
      !$OMP parallel do
        do i=1,n
          do j=1,m
            A(j,i) = 0.0
          enddo
        enddo

      Later use (same thread layout):
      !$OMP parallel do
        do i=1,n
          do j=1,m
            call dowork(A(j,i))
          enddo
        enddo

2 Remember to set KMP_AFFINITY


Common problems

1 Insufficient stack size
   – Most frequently reported OpenMP issue!
   – Typical symptom: seg fault during initialization
2 For whole program (shared + local data):
   – Increase shell limits to large value
      – (address space, memory allocated only if needed)
   – Bash : ulimit –s [value in KB] or [unlimited]
      – Can only increase once!
   – C shell: limit stacksize 1000000 (for 1 GB)
   – Windows*: /F:100000000 (value in bytes)
   – Typical OS defaults ~10 MB
3 For individual thread (thread local data only)
   – export OMP_STACKSIZE=[size], default 4m (4 MB) 
   – only needed for automatic data for which each thread has a private copy 
   – Actually allocates memory, so don’t make too large

Tips for Debugging OpenMP apps

1 Run with OMP_NUM_THREADS=1
   – Generate threaded code, but run with a single thread
   – If it works, try Intel(R) Inspector XE on the threaded code
   – If it still fails, that excludes race conditions or other synchronization issues as the cause
2 Build with -openmp-stubs -auto
    – RTL calls are resolved; but no threaded code is generated
    – allocate local arrays on the stack, as for OpenMP
     – If it works, check for missing FIRSTPRIVATE or LASTPRIVATE clauses
     – If it still fails, that eliminates threaded code generation as the cause
3 If it works without –auto, that implicates the changed memory model
   – Perhaps insufficient stack size
   – Perhaps values not preserved between successive calls
4 If debugging with PRINT statements
   – the internal I/O buffers are threadsafe (with –openmp), but the order of print statements from different threads is undetermined.
   – print out the thread number with omp_get_thread_num() (see the sketch after this list)
   – remember to USE OMP_LIB (which declares omp_get_thread_num as an integer function)
5 Debug with –O0 –openmp
   – Unlike most other optimizations, OpenMP threading is not disabled at –O0
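
A minimal sketch of the PRINT-based debugging in tip 4 (the check itself is hypothetical): each diagnostic line is tagged with the thread number so interleaved output can still be attributed.

subroutine debug_check(x, n)
  use omp_lib                   ! declares omp_get_thread_num
  implicit none
  integer, intent(in) :: n
  real(8), intent(in) :: x(n)
  integer :: i
!$omp parallel do
  do i = 1, n
     if (x(i) < 0.0d0) then
        print *, 'thread', omp_get_thread_num(), ': negative x at i =', i
     end if
  end do
!$omp end parallel do
end subroutine debug_check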

Floating-Point Reproducibility

1 Runs of the same executable with different numbers of threads may give slightly different answers
   – due to different work decomposition leading to tiny differences in rounding
   – Most cases worked around by -fp-model precise
   – See “Consistency of Floating-Point Results using the Intel® Compiler”, 
      /en-us/articles/consistency-of-floating-point-results-using-the-intel-compiler
2 Floating point reductions may still not be strictly reproducible in OpenMP, even for same number of threads
   – the order in which contributions from different threads are added may vary from run to run
   – -fp-model precise doesn’t help here
   – set the environment variable KMP_DETERMINISTIC_REDUCTION=true
        – in conjunction with static scheduling and a fixed number of threads
        – this should give run-to-run reproducibility

Intel-specific Environment Variables

1 KMP_SETTINGS = 0 | 1
   – Print environment variables or defaults at runtime
2 KMP_VERSION = off | on
   – Print runtime library version
3 KMP_LIBRARY = turnaround | throughput | serial
   – turnaround: idle threads do not yield to other processes
   – throughput: idle threads sleep and yield after KMP_BLOCKTIME msec
4 KMP_BLOCKTIME
   – Thread wait time before going to sleep (default 200 msec)
5 KMP_AFFINITY (See main documentation for full API)
6 KMP_MONITOR_STACKSIZE
   – Sets stack allocation for the monitor thread
7 KMP_CPUINFO_FILE
   – Use file for machine topology (e.g. instead of /proc/cpuinfo on Linux)
8 KMP_DETERMINISTIC_REDUCTION
   – See above

Tools for Debugging OpenMP apps

1 Static source checking (requires the compiler and Intel(R) Inspector XE)
   – ifort –openmp –diag-enable sc-parallel3
   – Generates error and warning diagnostics for some possible API violations, including
      – Dependencies or data races,
         – e.g. due to missing PRIVATE or REDUCTION declarations
      – Inconsistent parallel regions, including dynamic extents
   – Can analyze across source file boundaries
2 Intel Parallel Debugger, idb (Linux) and Intel Parallel Debugger Extension (on Windows)
   – Thread groups; lock-stepping; thread-specific break points
   – On-the-fly serialization; shared data access detection
   – OpenMP windows for threads, tasks, barriers, locks, …

Intel® Inspector XE

1 Unified set of tools that pinpoint hard-to-find errors in multi-threaded applications
   – data races
   – deadlocks
   – memory defects
   – security issues
   – performs run-time analysis
   – displays results of compiler-based static analysis
   – part of Intel(R) Parallel Studio XE

2 Display results via a GUI

Intel® VTune Amplifier XE - Concurrency and Locks & Waits analyses

1 Features & Benefits
   – View application concurrency level to ensure core utilization
   – Identify where thread and synchronization related overhead impacts performance
   – Identify objects impacting performance
   – Visualize the distribution of work to threads
   – Visualize when threads are active and inactive
   – Supports native threads, Intel® Threading Building Blocks and OpenMP* applications
   – Convenient GUI display
   – data collection can be local or remote
   – part of Intel(R) Parallel Studio XE

Intel Advisor XE

1 Features & Benefits
   – guides you in adding parallelism to your application
   – allows experiments to estimate the potential speedup
   – helps make your application threadsafe
   – part of Intel(R) Parallel Studio XE



Summary

Intel software tools provide extensive support for threading applications to take advantage of multi-core architectures.
Advice and background information are provided for a variety of issues that may arise when threading a Fortran application.

References

1 Intel Software Products information, evaluations, active user forums, see  http://www.intel.com/software/products
2 Parallelism tips at http://www.devx.com/go-parallel
3 Developing Multithreaded Applications: A Platform Consistent Approach http://software.intel.com/sites/products/documentation/doclib/stdxe/2013/composerxe/compiler/fortran-lin/index.htm 
4 The Intel® Fortran Compiler User and Reference Guides, available online via http://software.intel.com/sites/products/documentation/doclib/stdxe/2013/composerxe/compiler/fortran-lin/index.htm 



INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, reference www.intel.com/software/products.

Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.

*Other names and brands may be claimed as the property of others.

Copyright © 2013 Intel Corporation.

For more complete information about compiler optimizations, see our Optimization Notice.