OpenMP and the Intel® IPP Library

Introduction

The low-level primitives within the Intel IPP library generally represent basic atomic operations. This limits threading within the library to ~15-20% of the functions. OpenMP is enabled by default when you use one of the multi-threaded variants of the Intel IPP library. A list of the threaded primitives in the IPP library is provided in the ThreadedFunctionsList.txt file located in the library’s doc directory.

The quickest way to multi-thread an Intel IPP application is to use the built-in OpenMP threading of the library. There’s no significant code rework required on your part and, depending on the IPP primitives you use, it may provide additional performance improvements.

IPP Internal Threading Model

If you use multiple threads in your own application (i.e., above the Intel IPP library), we generally recommend that you disable the library’s built-in threading. Doing so eliminates competition between the library’s OpenMP threading and your application’s threading, and avoids oversubscribing the available hardware threads with software threads.

Disabling internal IPP library threading in a multi-threaded application is not a hard and fast rule. For example, if your application has just two threads (e.g., a GUI thread and a background thread) and the IPP library is only being used by the background thread, using the internal IPP threading probably makes sense.

For a quick summary of the differences between OpenMP and other threading technologies please read Intel® Threading Building Blocks, OpenMP, or native threads?

Controlling OpenMP Threading in the Intel IPP Primitives

The default maximum number of OpenMP threads used by the multi-threaded IPP primitives is equal to the number of hardware threads in the system, which is determined by the number and type of CPUs present. For example, a quad-core processor with Intel® Hyper-Threading Technology (Intel® HT) has eight hardware threads (four cores, each with two hardware threads), while a dual-core CPU without Intel HT has only two hardware threads.

There are two IPP primitives for control and status of the OpenMP threading used within the library: ippSetNumThreads() and ippGetNumThreads(). You call ippGetNumThreads to determine the current thread cap and ippSetNumThreads to change the thread cap. ippSetNumThreads will not allow you to set the thread cap beyond the number of available hardware threads. This thread cap is an upper bound on the number of threads that can be used within a multi-threaded primitive. Some IPP functions may use fewer threads than specified by the thread cap, but they will never use more than the thread cap.
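As a minimal sketch (error-status checks are omitted, and the exact umbrella header may vary with your IPP version), the following reads the current thread cap and then lowers it to two threads:

#include <stdio.h>
#include <ipp.h>                        /* Intel IPP umbrella header */

int main(void)
{
    int threadCap = 0;

    ippInit();                          /* select the optimized code path for this CPU */

    ippGetNumThreads(&threadCap);       /* query the current thread cap */
    printf("Default IPP thread cap: %d\n", threadCap);

    ippSetNumThreads(2);                /* limit threaded IPP primitives to at most 2 threads */
    ippGetNumThreads(&threadCap);
    printf("New IPP thread cap: %d\n", threadCap);

    return 0;
}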

To disable OpenMP threading within the library you need to call ippSetNumThreads(1) near the beginning of your application. Or, you can link your application with the single-threaded variant of the library.

The OpenMP library used by the IPP library references several configuration environment variables. In particular, OMP_NUM_THREADS sets the default number of threads (the thread cap) to be used by the OpenMP library at run time. However, the IPP library will override this setting by limiting the number of OpenMP threads used by your application to be either the number of hardware threads in the system, as described above, or the value specified by a call to ippSetNumThreads, whichever is smaller. OpenMP applications on your system that do not use the Intel IPP library might still be affected by the OMP_NUM_THREADS environment variable; likewise, any such OpenMP applications will not be affected by a call to the ippSetNumThreads function within your Intel IPP application.
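A quick way to see this interaction (a sketch for illustration only; the getenv() call simply echoes the environment) is to compare OMP_NUM_THREADS with the cap the library actually reports, which will never exceed the number of hardware threads:

#include <stdio.h>
#include <stdlib.h>
#include <ipp.h>

int main(void)
{
    const char *ompEnv = getenv("OMP_NUM_THREADS");   /* NULL if the variable is not set */
    int ippCap = 0;

    ippInit();
    ippGetNumThreads(&ippCap);                        /* the cap the IPP library will actually honor */

    printf("OMP_NUM_THREADS: %s\n", ompEnv ? ompEnv : "(not set)");
    printf("IPP thread cap : %d\n", ippCap);
    return 0;
}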

Nested OpenMP

If your application uses OpenMP for its own multi-threading in addition to calling the Intel IPP library, the threaded Intel IPP primitives it calls may execute as single-threaded primitives. This happens when an IPP primitive is called within an OpenMP-parallelized section of code and nested parallelization has been disabled, which is the default for the Intel OpenMP library.

By nesting parallel OpenMP regions you risk creating a large number of threads that can oversubscribe the available hardware threads. Creating a parallel region always incurs overhead, and the overhead associated with nesting parallel OpenMP regions may outweigh the benefit.

In general, OpenMP threaded applications that use the IPP library should disable multi-threading within the library, either by calling ippSetNumThreads(1) or by using the single-threaded static Intel IPP library.
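The sketch below illustrates this pattern (ippsAdd_32f is used only as a representative primitive, and the buffer length and chunk count are arbitrary): the application caps the library at one thread and performs its own OpenMP decomposition. Build with OpenMP enabled (e.g., -fopenmp or the equivalent flag for your compiler).

#include <ipp.h>
#include <ipps.h>

#define LEN    (1 << 20)
#define CHUNKS 8

int main(void)
{
    Ipp32f *a = ippsMalloc_32f(LEN);
    Ipp32f *b = ippsMalloc_32f(LEN);
    Ipp32f *c = ippsMalloc_32f(LEN);

    ippInit();
    ippSetNumThreads(1);                /* the application, not the library, owns the threading */

    ippsSet_32f(1.0f, a, LEN);
    ippsSet_32f(2.0f, b, LEN);

    /* Each OpenMP thread runs the (now single-threaded) primitive on its own chunk. */
    #pragma omp parallel for
    for (int i = 0; i < CHUNKS; ++i) {
        int chunk = LEN / CHUNKS;
        ippsAdd_32f(a + i * chunk, b + i * chunk, c + i * chunk, chunk);
    }

    ippsFree(a);
    ippsFree(b);
    ippsFree(c);
    return 0;
}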

Core Affinity

Some of the Intel IPP primitives in the signal processing domain are designed to execute parallel threads that exploit a merged L2 cache. These functions (single and double precision FFT, Div, Sqrt, etc.) need a shared cache to achieve their maximum multi-threaded performance. In other words, the threads within these primitives should, ideally, execute on CPU cores located on a single die with a shared or unified cache. To ensure this condition is met, the following OpenMP environment variable should be set before an application using the Intel IPP library runs:

KMP_AFFINITY=compact

On processors with two or more cores on a single die, this condition is satisfied automatically and the environment variable is superfluous. However, on systems with two or more dies (e.g., a Pentium D or a multi-socket motherboard), where the cache serving each die is not shared, failing to set this OpenMP environment variable can actually result in performance degradation for this class of multi-threaded Intel IPP primitives.
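If you cannot guarantee that the variable is exported by the launching shell, one option (a sketch assuming a POSIX environment; on Windows _putenv would be used instead) is to set it programmatically before the first threaded IPP call or OpenMP region, since the OpenMP runtime reads KMP_AFFINITY when it initializes:

#include <stdlib.h>
#include <ipp.h>

int main(void)
{
    /* Must run before the OpenMP runtime initializes, i.e., before the first
       threaded IPP call or OpenMP parallel region. The final argument of 0
       means an existing setting from the shell is not overwritten. */
    setenv("KMP_AFFINITY", "compact", 0);

    ippInit();
    /* ... threaded IPP work ... */
    return 0;
}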

Additionally, some IPP functions require that Intel Hyper-Threading Technology be disabled or not used by the multiple threads within the Intel IPP multi-threaded library. This has been seen to negatively impact, for example, the performance of the IPP cryptography sample based on OpenSSL. In that case you should follow the instructions in this KB article:

IPP Crypto Sample Performance for OpenSSL too Slow on Hyper-Threading Systems

Multi-threaded FFT Functions

The multi-threaded FFT functions were originally developed as part of the v8/u8 libraries (Core 2 with a shared cache architecture). These functions specifically exploit a shared-cache architecture in order to achieve higher performance in a multi-threaded environment. If this shared-cache condition is not met you may see a performance degradation.

As noted above, for processors that use libraries newer than the v8/u8 optimization (e.g., p8/y8), you must set the KMP_AFFINITY environment variable to "compact" (as shown above) to avoid this potential performance degradation.

Within the FFT functions, threading starts at an order of 12 (a transform length of 2^12 = 4096 points) for 64fc data types and an order of 13 (2^13 = 8192 points) for 32fc data types.
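As a rough sketch of where that threshold applies (the FFT specification/initialization API has changed across IPP versions; the GetSize/Init pattern below follows the more recent releases, and error checks are omitted), an order-13 complex single-precision FFT is the smallest 32fc transform that the library may thread:

#include <ipp.h>
#include <ipps.h>

int main(void)
{
    const int order = 13;                 /* 2^13 = 8192 points: threading threshold for 32fc FFTs */
    const int len   = 1 << order;
    int specSize = 0, specBufSize = 0, workBufSize = 0;

    ippInit();

    ippsFFTGetSize_C_32fc(order, IPP_FFT_DIV_INV_BY_N, ippAlgHintNone,
                          &specSize, &specBufSize, &workBufSize);

    Ipp8u   *specMem = ippsMalloc_8u(specSize);
    Ipp8u   *initBuf = specBufSize > 0 ? ippsMalloc_8u(specBufSize) : NULL;
    Ipp8u   *workBuf = workBufSize > 0 ? ippsMalloc_8u(workBufSize) : NULL;
    Ipp32fc *src     = ippsMalloc_32fc(len);
    Ipp32fc *dst     = ippsMalloc_32fc(len);

    IppsFFTSpec_C_32fc *spec = NULL;
    ippsFFTInit_C_32fc(&spec, order, IPP_FFT_DIV_INV_BY_N, ippAlgHintNone,
                       specMem, initBuf);

    ippsZero_32fc(src, len);              /* placeholder input signal */
    ippsFFTFwd_CToC_32fc(src, dst, spec, workBuf);

    ippsFree(dst);  ippsFree(src);
    ippsFree(workBuf);  ippsFree(initBuf);  ippsFree(specMem);
    return 0;
}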
