Parallelism in the Intel® Math Kernel Library

## Abstract

Software libraries provide a simple way to get immediate performance benefits on multicore, multiprocessor, and cluster computing systems. The Intel® Math Kernel Library (Intel® MKL) contains a large collection of functions that can benefit math-intensive applications. This article describes how Intel MKL can help programmers achieve superb serial and parallel performance in common application areas. This material is applicable to IA-32 and Intel® 64 processors on Windows*, Linux*, and Mac OS* X operating systems.

This article is part of the larger series, "Intel Guide for Developing Multithreaded Applications," which provides guidelines for developing efficient multithreaded applications for Intel® platforms.

## Background

Optimal performance on modern multicore and multiprocessor systems is typically attained only when opportunities for parallelism are well exploited and the memory characteristics underlying the architecture are expertly managed. Sequential codes must rely heavily on instruction- and register-level SIMD parallelism and cache blocking to achieve the best performance. Threaded programs must employ advanced blocking strategies to ensure that multiple cores and processors are efficiently used and that the parallel tasks are evenly distributed. In some instances, out-of-core implementations can be used to deal with large problems that do not fit in memory.
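To make the idea of cache blocking concrete, here is a minimal sketch of a tiled matrix multiply next to the plain triple loop. It is an illustration only, not how Intel MKL is implemented: the function names `matmul_naive` and `matmul_blocked` and the `block` parameter are hypothetical, and pure Python will not show the cache benefit. The point is the loop structure, in which the product is computed one small tile at a time so that each tile could stay resident in cache.

```python
def matmul_naive(a, b):
    """Reference triple loop: C[i][j] = sum over k of A[i][k] * B[k][j]."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_blocked(a, b, block=2):
    """Same product, computed tile by tile (hypothetical block size)."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i0 in range(0, n, block):
        for j0 in range(0, p, block):
            for k0 in range(0, m, block):
                # Multiply one tile of A by one tile of B and accumulate into C.
                for i in range(i0, min(i0 + block, n)):
                    for j in range(j0, min(j0 + block, p)):
                        acc = c[i][j]
                        for k in range(k0, min(k0 + block, m)):
                            acc += a[i][k] * b[k][j]
                        c[i][j] = acc
    return c


# Small rectangular inputs (4x5 times 5x3) whose sizes are not multiples of the block.
a = [[float(i + j) for j in range(5)] for i in range(4)]
b = [[float(i * j + 1) for j in range(3)] for i in range(5)]
```

Both functions return the same result for these inputs; only the order in which memory is visited changes.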

## Advice

One of the easiest ways to add parallelism to a math-intensive application is to use a threaded, optimized library. Not only will this save the programmer a substantial amount of development time, it will also reduce the amount of test and evaluation effort required. Standardized APIs also help to make the resulting code more portable.

Intel MKL provides a comprehensive set of math functions that are optimized and threaded to exploit all the features of the latest Intel® processors. The first time a function from the library is called, a runtime check is performed to identify the hardware on which the program is running. Based on this check, a code path is chosen to maximize use of instruction- and register-level SIMD parallelism and to choose the best cache-blocking strategy. Intel MKL is also designed to be threadsafe, which means that its functions operate correctly when simultaneously called from multiple application threads.

Intel MKL is built using the Intel® C++ and Fortran Compilers and threaded using OpenMP*. Its algorithms are constructed to balance data and tasks for efficient use of multiple cores and processors. The following table shows the math domains that contain threaded functions (this information is based on Intel MKL 10.2 Update 3):

| Domain | Description |
|---|---|
| **Linear Algebra** | Used in applications from finite-element analysis engineering codes to modern animation |
| BLAS (Basic Linear Algebra Subprograms) | All matrix-matrix operations (level 3) are threaded for both dense and sparse BLAS. Many vector-vector (level 1) and matrix-vector (level 2) operations are threaded for dense matrices in 64-bit programs running on the Intel® 64 architecture. For sparse matrices, all level 2 operations except for the sparse triangular solvers are threaded. |
| LAPACK (Linear Algebra Package) | Several computational routines are threaded from each of the following types of problems: linear equation solvers, orthogonal factorization, singular value decomposition, and symmetric eigenvalue problems. LAPACK also calls BLAS, so even non-threaded functions may run in parallel. |
| ScaLAPACK (Scalable LAPACK) | A distributed-memory parallel version of LAPACK intended for clusters. |
| PARDISO | This parallel direct sparse solver is threaded in its three stages: reordering (optional), factorization, and solve (if solving with multiple right-hand sides). |
| **Fast Fourier Transforms** | Used for signal processing and applications that range from oil exploration to medical imaging |
| Threaded FFTs (Fast Fourier Transforms) | Threaded with the exception of 1D real and split-complex FFTs. |
| Cluster FFTs | Distributed-memory parallel FFTs intended for clusters. |
| **Vector Math** | Used in many financial codes |
| VML (Vector Math Library) | Arithmetic, trigonometric, exponential/logarithmic, rounding, etc. |

Because there is some overhead involved in the creation and management of threads, it is not always worthwhile to use multiple threads. Consequently, Intel MKL does not create threads for small problems. The size that is considered small is relative to the domain and function. For level 3 BLAS functions, threading may occur for a dimension as small as 20, whereas level 1 BLAS and VML functions will not thread for vectors much smaller than 1000.
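The size-based decision described above can be sketched as follows. This is a hypothetical illustration, not the library's actual heuristic: the names `plan_threads`, `scaled_copy`, and `THREADING_THRESHOLD` are invented here, and the cut-off of 1000 is borrowed loosely from the figure quoted for level 1 BLAS and VML. In Python the thread pool will not yield a real speedup; the sketch shows only the dispatch logic.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical cut-off, loosely modelled on the "about 1000 elements" figure
# quoted above; the real library uses per-domain, per-function heuristics.
THREADING_THRESHOLD = 1000


def plan_threads(n, max_threads, threshold=THREADING_THRESHOLD):
    """Return 1 for small problems, otherwise up to max_threads."""
    if n < threshold:
        return 1
    return max(1, min(max_threads, n // threshold))


def scaled_copy(x, alpha, max_threads=4):
    """Compute alpha * x, splitting the work only when it is large enough."""
    workers = plan_threads(len(x), max_threads)
    if workers == 1:
        return [alpha * v for v in x]
    chunk = -(-len(x) // workers)  # ceiling division
    parts = [x[i:i + chunk] for i in range(0, len(x), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves the order of the chunks.
        results = pool.map(lambda part: [alpha * v for v in part], parts)
    return [v for part in results for v in part]
```

A short vector is handled on the calling thread, while a vector of several thousand elements is split into contiguous chunks, one per worker.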

Intel MKL should run on a single thread when called from a threaded region of an application to avoid over-subscription of system resources. For applications that are threaded using OpenMP, this should happen automatically. If other means are used to thread the application, Intel MKL behavior should be set using the controls described below.

In cases where the library is used sequentially from multiple threads, Intel MKL may have functionality that can be helpful. As an example, the Vector Statistical Library (VSL) provides a set of vectorized random number generators that are not threaded, but which offer a means of dividing a stream of random numbers among application threads. The vslSkipAheadStream() function divides a random number stream into separate contiguous blocks, one for each thread. The vslLeapfrogStream() function divides a stream so that each thread gets an interleaved subsequence of the original stream. For example, to divide a stream between two threads, the leapfrog method would provide numbers with odd indices to one thread and even indices to the other.
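The difference between the two ways of splitting a stream can be shown with a toy generator. The sketch below does not call VSL: `lcg_stream`, `skip_ahead`, and `leapfrog` are hypothetical helpers, and the stream is materialized as a list purely for clarity, whereas a real generator would advance its internal state instead.

```python
def lcg_stream(seed, count, a=1103515245, c=12345, m=2**31):
    """Toy linear congruential generator; stands in for a real VSL stream."""
    values, state = [], seed
    for _ in range(count):
        state = (a * state + c) % m
        values.append(state)
    return values


def skip_ahead(stream, thread_id, block_size):
    """Block splitting: thread t gets elements [t*block_size, (t+1)*block_size)."""
    start = thread_id * block_size
    return stream[start:start + block_size]


def leapfrog(stream, thread_id, num_threads):
    """Interleaved splitting: thread t gets elements t, t+num_threads, ..."""
    return stream[thread_id::num_threads]


stream = lcg_stream(seed=42, count=8)
blocks = [skip_ahead(stream, t, 4) for t in range(2)]    # two blocks of four
strides = [leapfrog(stream, t, 2) for t in range(2)]     # even / odd positions
```

With two threads, `blocks` holds the first and second halves of the stream, while `strides` holds the elements at even and odd positions (counting from zero). In both cases every number in the original stream is assigned to exactly one thread.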

## Performance

Figure 1 provides an example of the kind of performance a user could expect from DGEMM, the double-precision general matrix-matrix multiply function included in Intel MKL. This BLAS function plays an important role in the performance of many applications. The graph shows the performance in Gflops for a variety of rectangular sizes. It demonstrates how performance scales across processors (speedups of up to 1.9x on two threads, 3.8x on four threads, and 7.9x on eight threads) and reaches 96.5 Gflops, about 94.3% of peak performance.
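The quoted figures can be turned into parallel efficiencies and an implied peak rate with a little arithmetic. The sketch below uses only the numbers given in the text; the variable names are illustrative, and the implied peak is derived here rather than stated by the source.

```python
# Figures quoted in the text for Figure 1.
speedups = {2: 1.9, 4: 3.8, 8: 7.9}   # threads -> speedup over one thread
achieved_gflops = 96.5
fraction_of_peak = 0.943

# Parallel efficiency = speedup / number of threads.
efficiency = {p: s / p for p, s in speedups.items()}

# Implied theoretical peak = achieved rate / fraction of peak.
implied_peak_gflops = achieved_gflops / fraction_of_peak

for p, e in efficiency.items():
    print(f"{p} threads: {e:.1%} parallel efficiency")
print(f"implied peak: {implied_peak_gflops:.1f} Gflops")
```

This works out to roughly 95% efficiency on two and four threads, close to 99% on eight, and an implied peak of a little over 102 Gflops.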

**Figure 1.** Performance and scalability of the BLAS matrix-matrix multiply function.

## Usage Guidelines

Since Intel MKL is threaded using OpenMP, its behavior can be affected by OpenMP controls. For added control over threading behavior, Intel MKL provides a number of service functions that mirror the OpenMP controls. These functions allow the user to control the number of threads the library uses, either as a whole or per domain (i.e., separate controls for BLAS, LAPACK, etc.). One application of these independent controls is the ability to allow nested parallelism. For example, the behavior of an application threaded using OpenMP can be set using the OMP_NUM_THREADS environment variable or the omp_set_num_threads() function, while Intel MKL threading behavior is set independently using the Intel MKL specific controls: MKL_NUM_THREADS or mkl_set_num_threads(), as appropriate. Finally, for those who must always run Intel MKL functions on a single thread, a sequential library is provided that is free of all dependencies on the threading runtime.
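One way to apply the environment-variable controls is to set them when launching the application. The sketch below is an assumption-laden illustration: `thread_env` is a hypothetical helper, and the child process is a stand-in that only echoes the settings it receives. A real application linked against Intel MKL would be launched the same way and would read these variables at startup.

```python
import os
import subprocess
import sys


def thread_env(app_threads, mkl_threads):
    """Copy the current environment and set the two thread-count variables."""
    env = dict(os.environ)
    env["OMP_NUM_THREADS"] = str(app_threads)   # OpenMP regions in the application
    env["MKL_NUM_THREADS"] = str(mkl_threads)   # threads used inside Intel MKL
    return env


# Stand-in child process: it only prints the settings it was launched with.
child = "import os; print(os.environ['OMP_NUM_THREADS'], os.environ['MKL_NUM_THREADS'])"
result = subprocess.run(
    [sys.executable, "-c", child],
    env=thread_env(app_threads=4, mkl_threads=1),
    capture_output=True,
    text=True,
    check=True,
)
settings = result.stdout.split()
```

Here the application is given four threads while the library is limited to one, which matches the earlier advice to run Intel MKL on a single thread when it is called from an already threaded region.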

Intel® Hyper-Threading Technology is most effective when each thread performs different types of operations and there are under-utilized resources on the processor. However, Intel MKL fits neither of these criteria, because the threaded portions of the library execute at high efficiency using most of the available resources and perform identical operations on each thread. Because of that, Intel MKL will by default use only as many threads as there are physical cores.

## Additional Resources

Parallel Programming Community

Intel® Math Kernel Library

Netlib: Information about BLAS, LAPACK, and ScaLAPACK

## Comments

thank you

I am puzzled by an apparent contradiction between the graph and the following comments from the last paragraph:

Based on the above quote, one would expect that hyper-threading will not improve performance. However, the graph for a 4-Core processor with 8 hyper-threads shows doubling of performance when going from 4 physical threads to 8 hyperthreads. Please clarify the apparent contradiction between the results shown in the graph and the above quote.

Govind,

Hyper-Threading can improve performance, depending on the application, but not in the same way as using additional physical cores: physical cores have dedicated resources, whereas the logical CPUs under HT only duplicate registers. In this case, it was a dual-socket machine with 8 threads. We are replacing this graph with a better one that has more detailed information about the environment.

--Vipin