Threading Choices for Your Intel IPP Application
Source code for some multi-threaded IPP application examples is included in the free sample downloads. Several of these examples implement threading at the application level, and some use the OpenMP* threading that is built into the Intel IPP library. In most cases the performance gains due to multi-threading are substantial.
The quickest way to multi-thread an IPP application is to use the built-in OpenMP threading of the library. There’s no significant code rework required on your part and, depending on the IPP primitives you use, it can provide additional performance improvements.
Or, you can thread your application and call the primitives simultaneously from your application threads. This gives you more control over threading, since you can tune the threading to meet the needs of your application.
When you write a multi-threaded IPP application we generally recommend that you disable the library’s built-in threading (see part 3 of this series). Doing so eliminates competition between the library’s OpenMP threading and your application’s threading, and avoids oversubscription of software threads to the available hardware threads.
Disabling internal library threading in a multi-threaded application is not a hard and fast rule. For example, if your application has just two threads (e.g., a GUI thread and a background thread) and the IPP library is only being used by the background thread, using the internal IPP threading probably makes sense.
One way to see if you are succumbing to oversubscription problems is to analyze your application using the Inspector and Amplifier tools that are part of Intel Parallel Studio. These tools will help you identify multi-threading errors and performance bottlenecks.
Memory and Cache Alignment
If you work with large blocks of data, you might expect throughput to be impacted by improperly aligned data. The IPP library includes memory allocation and alignment functions to address this issue. And most compilers can pad structures for bus-efficient alignment of your data.
What may not be obvious is the importance of cache alignment and the spacing of data relative to cache lines when implementing parallel threads. If parallel threads frequently operate on coincident or shared data structures, the write operations of one thread may invalidate the cache lines holding the data structures of another thread.
When you use data decomposition to build parallel threads of identical IPP operations, be sure to consider the relative spacing of the decomposed data blocks being operated on by the parallel threads and the spacing of any control data structures used by the primitives within those threads. This matters especially if the control data structures hold state information that is updated on each iteration of the IPP function. If these data structures sit "too close" together, sharing a cache line, an update to one structure will invalidate the cache line holding its neighbor, which a parallel thread may be operating on at that very moment.
A simple way to avoid this problem is to allocate your data structures so they occupy cache-line multiples (typically 64 bytes). The few bytes wasted padding your data structures this way are more than repaid by the bus cycles saved in not having to refresh a cache line on each iteration of the parallel loops.
Pipelined Processing à la DMIP
In the ideal world your application would adapt at run-time to optimize its use of the SIMD instructions available, the number of hardware threads present, and the size of the high-speed cache. Given optimum use of these three key resources you might achieve near perfect parallel operation of your application! This is essentially the aim behind the DMIP library that is part of the Intel IPP library.
The DMIP approach to parallelization is a data decomposition method in which parallel sequences of Intel IPP primitives are executed on cache-optimal sized data blocks. With the right data sets, this can yield performance gains of several times over applying each operation sequentially across an entire data set.
For example, rather than operate over an entire image with sequential filtering functions, DMIP breaks the image into cacheable segments, or tiles, and performs multiple operations on each segment, while it remains in the cache. The sequence of operations is a calculation pipeline and is applied to each tile until the entire data set is processed. Multiple pipelines running in parallel can then be constructed to amplify the performance.
The following graph shows some performance gains possible using the DMIP parallel processing technique. In the graph, lower bars represent faster performance. (“CPE” represents cycles per element, which measures the number of clock cycles required per element of operation – fewer clocks equals less time to execute.)
YMMV! (Your Mileage May Vary!) The performance improvement you see is a function of which primitives you use, where they are used in your application, how often they are used, how your program is structured, the type of data you operate on, the processor you use, etc., etc., etc… See Benchmark Limitations for the legalese. :-)
The Intel® Integrated Performance Primitives (the Intel® IPP library) is a collection of highly optimized functions for frequently-used fundamental algorithms found in a variety of domains including signal processing, image/audio/video encode/decode, data compression, string processing, and encryption. The library takes advantage of the extensive SIMD (single instruction multiple data) instructions and multiple hardware execution threads available in modern Intel processors. These instructions are ideal for optimizing algorithms that operate on arrays and vectors of data.
The IPP library is available for use with applications built for the Windows, Linux, Mac OS X, and QNX operating systems and is compatible with the Intel C and Fortran Compilers, the Microsoft Visual Studio C/C++ compilers, and the gcc compilers found in most Linux distributions. The library is validated for use with multiple generations of Intel and compatible AMD* processors, including the Intel® Core™ and Intel® Atom™ processors. Both 32-bit and 64-bit operating systems and architectures are supported.
The Intel® IPP library is available as a standalone product or as a component in the Intel® Professional Edition compilers and Intel® Parallel Studio. Parallel Studio brings comprehensive parallelism to C/C++ Microsoft Visual Studio* application development and was created to ease the development of parallelism in your applications. Parallel Studio is interoperable with common parallel programming libraries and API standards, such as Intel® Threading Building Blocks (Intel® TBB) and OpenMP*, and provides an immediate opportunity to realize the benefits of multicore platforms.
* Other names and brands may be claimed as the property of others.
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804