by Noah Clemons, Jing Xu
Published: 11/14/2014 Last updated: 10/26/2017
Employing performance libraries can be a great way to streamline and unify the computational execution flow for data intensive tasks, thus minimizing the risk of data stream timing issues and heisenbugs. Here we will describe the two libraries that can be used for signal processing within Intel® System Studio.
Performance libraries such as the Intel® Integrated Performance Primitives (Intel® IPP) contain highly optimized algorithms and code for common functions, including signal processing, image processing, video/audio encode/decode, cryptography, data compression, speech coding, and computer vision. Advanced instruction sets help the developer take advantage of new processor features that are specifically tailored for certain applications. Developers call Intel IPP as a black box of computation on their low-power or embedded device: data flows in and the result flows out. In this fashion, Intel IPP can take the place of many processing units created for specific computational tasks. Intel IPP excels in a wide variety of domains where intelligent systems are utilized.
Without the benefit of highly optimized performance libraries, developers would need to carefully hand-optimize computationally intensive functions themselves to obtain adequate performance. This optimization process is complicated and time consuming, and its results must be updated with each new processor generation. Intelligent systems often have a long lifetime in the field, so maintaining hand-optimized functions demands a high effort.
Signal processing and advanced vector math are the two function domains most in demand across the different types of intelligent systems. Frequently, a digital signal processor (DSP) is employed to assist the general-purpose processor with these computational tasks. A DSP may come with its own well-defined application interface and library function set. However, it is usually poorly suited for general-purpose tasks; DSPs are designed to quickly execute basic mathematical operations (add, subtract, multiply, and divide). The DSP repertoire includes a set of very fast multiply-and-accumulate (MAC) instructions to address the matrix math evaluations that appear frequently in convolution, dot product, and other multi-operand math operations. The MAC instructions that make up much of the code in a DSP application are the equivalent of the Single Instruction Multiple Data (SIMD) instruction sets of a general-purpose processor. Like the MAC instructions on a DSP, SIMD instruction sets perform mathematical operations very efficiently on vectors and arrays of data. Unlike DSP code, SIMD instructions are easier to integrate into applications that use complex vector and array algorithms, since all computations execute on the same processor and are part of a unified logical execution stream.
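To make the multiply-and-accumulate pattern concrete, here is a minimal sketch in portable C of a dot product and a direct-form FIR filter (a convolution). It does not use Intel IPP or any SIMD intrinsics; the function names `dot_product_s16` and `fir_s16` are hypothetical and chosen only for illustration. The inner loops are the kind of code that a DSP's MAC instructions, or a SIMD-optimized library primitive, would be expected to accelerate.

```c
#include <stddef.h>
#include <stdint.h>

/* Dot product of two 16-bit sample buffers.
 * Each iteration is one multiply-accumulate (MAC) step; the 64-bit
 * accumulator leaves ample headroom for the 32-bit products. */
int64_t dot_product_s16(const int16_t *a, const int16_t *b, size_t n)
{
    int64_t acc = 0;
    for (size_t i = 0; i < n; ++i) {
        acc += (int32_t)a[i] * (int32_t)b[i];
    }
    return acc;
}

/* Direct-form FIR filter: y[k] = sum over j of h[j] * x[k - j].
 * Input samples before x[0] are treated as zero. */
void fir_s16(const int16_t *x, size_t n,
             const int16_t *h, size_t taps,
             int64_t *y)
{
    for (size_t k = 0; k < n; ++k) {
        int64_t acc = 0;
        for (size_t j = 0; j < taps && j <= k; ++j) {
            acc += (int32_t)h[j] * (int32_t)x[k - j];  /* MAC step */
        }
        y[k] = acc;
    }
}
```

Both routines are dominated by the same multiply-then-add step, which is why hardware that executes several such steps per instruction pays off for this class of algorithm.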
For example, an algorithm that changes image brightness by adding (or subtracting) a constant value to each pixel of an image must read the RGB values from memory, add (or subtract) the offset, and write the new pixel values back to memory. When using a DSP coprocessor, the image data must be packaged for the DSP (placed in a memory area that is accessible by the DSP), the DSP must be signaled to execute the transformation algorithm, and the result must finally be returned to the general-purpose processor. Using a general-purpose processor with SIMD instructions simplifies this process of packaging, signaling, and returning the data set. Intel IPP primitives are optimized to match each SIMD instruction set architecture, so multiple versions of each primitive exist in the library.
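The brightness example can be sketched in a few lines of portable C. This is a scalar reference version under stated assumptions, not Intel IPP code: it assumes 8-bit interleaved RGB data, processes the buffer in place, and saturates results to the 0..255 range. The names `clamp_u8` and `adjust_brightness_rgb8` are hypothetical. An optimized library primitive would be expected to perform the same read, add, and write sequence on many channel values per instruction.

```c
#include <stddef.h>
#include <stdint.h>

/* Clamp an int to the 0..255 range of an 8-bit channel. */
static uint8_t clamp_u8(int v)
{
    if (v < 0) return 0;
    if (v > 255) return 255;
    return (uint8_t)v;
}

/* Add a signed offset to every channel of an interleaved 8-bit RGB
 * buffer, in place. A positive offset brightens, a negative one darkens. */
void adjust_brightness_rgb8(uint8_t *pixels, size_t pixel_count, int offset)
{
    size_t channel_count = pixel_count * 3;  /* R, G, B per pixel */
    for (size_t i = 0; i < channel_count; ++i) {
        pixels[i] = clamp_u8((int)pixels[i] + offset);
    }
}
```

Because the whole operation runs on the general-purpose processor, the image buffer never needs to be copied to a coprocessor-visible memory area or handed back afterwards.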
Intel IPP can be reused across a wide range of Intel® architecture-based processors. Thanks to automatic dispatching, the developer's code always picks the execution flow optimized for the processor it runs on, without any change to the underlying function call (Figure 2). This is especially helpful if an embedded system employs both an Intel® Core™ processor for data analysis and aggregation and a series of Intel Atom® processor-based SoCs for data pre-processing and collection. In that scenario, parts of the same code base may run on both the Intel Atom® processor-based SoCs in the field and the Intel® Core™ processor at the central data aggregation point.
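The general idea behind dispatching can be illustrated with a function-pointer table. The sketch below is not how Intel IPP is implemented internally; it only shows the pattern under stated assumptions. All names (`cpu_level_t`, `init_dispatch`, `add_const`, and so on) are hypothetical, the feature level is passed in by the caller rather than detected from the processor, and the "SSSE3" variant is a plain unrolled loop standing in for a genuinely vectorized implementation.

```c
#include <stddef.h>

/* Hypothetical feature levels. A real library would detect the level
 * from the processor at startup instead of receiving it as an argument. */
typedef enum { CPU_GENERIC = 0, CPU_SSSE3 = 1, CPU_AVX2 = 2 } cpu_level_t;

typedef void (*add_const_fn)(float *data, size_t n, float c);

/* Baseline variant that runs on any processor. */
static void add_const_generic(float *data, size_t n, float c)
{
    for (size_t i = 0; i < n; ++i) data[i] += c;
}

/* Stand-in for a SIMD-tuned variant: unrolled by four to stay portable.
 * A real build would use vector instructions here. */
static void add_const_ssse3(float *data, size_t n, float c)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        data[i]     += c;
        data[i + 1] += c;
        data[i + 2] += c;
        data[i + 3] += c;
    }
    for (; i < n; ++i) data[i] += c;  /* remaining elements */
}

typedef struct {
    cpu_level_t  level;
    add_const_fn fn;
} impl_entry;

/* Available variants, ordered from least to most specialized. */
static const impl_entry impl_table[] = {
    { CPU_GENERIC, add_const_generic },
    { CPU_SSSE3,   add_const_ssse3   },
};

static impl_entry active = { CPU_GENERIC, add_const_generic };

/* Select the most specialized variant the given level supports. */
void init_dispatch(cpu_level_t detected)
{
    size_t count = sizeof impl_table / sizeof impl_table[0];
    active = impl_table[0];
    for (size_t i = 0; i < count; ++i) {
        if (impl_table[i].level <= detected) active = impl_table[i];
    }
}

/* Report which variant is currently selected. */
cpu_level_t active_level(void) { return active.level; }

/* The single entry point callers use, regardless of processor. */
void add_const(float *data, size_t n, float c)
{
    active.fn(data, n, c);
}
```

Callers only ever invoke `add_const`; which variant runs is decided once in `init_dispatch`. In this sketch a processor reporting `CPU_AVX2` falls back to the SSSE3 variant because no more specialized entry exists in the table.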
With specialized SoC components for data streaming and I/O handling combined with a limited user interface, one may think that there are not a lot of opportunities to take advantage of optimizations and/or parallelism, but that is not the case. There is room for
Both concepts often coexist in the same SoC. Code with failsafe real-time requirements is protected within its own wrapper managed by a modified round-robin real-time scheduler, while the rest of the operating system (OS) and the application layers are managed using standard symmetric multiprocessing (SMP) concepts. Intel Atom Processors contain two Intel Hyper-Threading Technology based cores and may contain an additional two physical cores, resulting in a quad-core system. In addition, Intel Atom Processors support the Intel SSSE3 instruction set. A wide variety of Intel IPP functions, listed in "Intel Atom® Processors support in the Intel® Integrated Performance Primitives (Intel® IPP) Library", are tuned to take advantage of Intel Atom Processor architecture specifics as well as Intel SSSE3.
Figure 1: Intel IPP is tuned to take advantage of the Intel Atom Processor and the Intel SSSE3 instruction set
Throughput-intensive applications can benefit from Intel SSSE3 vector instructions and from parallel execution of multiple data streams through extra-wide vector registers for SIMD processing. As just mentioned, modern Intel Atom Processor designs provide up to four virtual processor cores, which makes threading an interesting proposition. While there is no universal threading solution that is best for all applications, Intel IPP has been designed to be thread-safe.
Intel IPP provides flexibility in linkage models to strike the right balance between portability and footprint management.
Feature | Standard Dynamic | Custom Dynamic | Dispatched Static | Non-dispatched Static |
---|---|---|---|---|
Optimizations | All SIMD sets | All SIMD sets | All SIMD sets | Single SIMD set |
Distribution | Executable(s) and standard Intel IPP DLLs | Executable(s) and custom DLLs | Executable(s) only | Executable(s) only |
Library Updates | Redistribute as-is | Rebuild and redistribute | Recompile application and redistribute | Rebuild custom library, recompile application, and redistribute |
Executable Only Size | Small | Small | Large | Medium |
Total Binary Size | Large | Medium | Medium | Small |
Kernel Mode | No | No | Yes | Yes |
Table 1: Intel IPP Linkage Model Comparison
The standard dynamic and dispatched static models are the simplest options to use in building applications with the Intel IPP. The standard dynamic library includes the full set of processor optimizations and provides the benefit of runtime code sharing between multiple Intel IPP-based applications. Detection of the runtime processor and dispatching to the appropriate optimization layer is automatic.
If the number of Intel IPP functions used in your application is small, and the standard shared library objects are too large, using a custom dynamic library may be an alternative.
To optimize for minimal total binary footprint, linking against a non-dispatched static version of the library may be the approach to take. This approach yields an executable containing only the optimization layer required for your target processor. This model achieves the smallest footprint at the expense of restricting your optimization to one specific processor type and one SIMD instruction set. This linkage model is useful when a self-contained application running on only one processor type is the intended goal. It is also the recommended linkage model for use in kernel mode (ring 0) or device driver applications.
Intel IPP addresses the needs of both the native application developer in the personal computing world and the intelligent system developer, who must satisfy system requirements that involve interaction between the application layer and the software stack underneath. Because Intel IPP extends into the world of middleware, drivers, and OS interaction, it can be used for embedded devices. Its limited dependency on OS libraries and its support for flexible linkage models make it simple to add to embedded cross-build environments, including popular GNU*-based cross-build setups like Poky-Linux* or MADDE*.
Developing for intelligent systems and small form factor devices frequently means that native development is not a feasible option. Intel IPP can be easily integrated with a cross-build environment and be used with cross-build toolchains that accommodate the flow requirements of many of these real-time systems. Use of the Intel IPP allows embedded intelligent systems to take advantage of vector instructions and extra-wide vector registers on the Intel Atom Processor. Developers can also meet determinism requirements without increasing the risks associated with cross-architecture data handshakes of complex SoC architectures.
Developing for embedded small form factor devices also means that applications with deterministic execution flow requirements have to interface more directly with the system software layer and the OS scheduler. Software development utilities and libraries for this space need to work with the various layers of the software stack, whether that is the end-user application or the driver that assists with a particular data stream or I/O interface. Intel IPP has minimal OS dependencies and a well-defined ABI for working across these layers. Developers can apply highly optimized functions for embedded signal and multimedia processing across the platform software stack while taking advantage of the underlying application processor architecture and its strengths, all without redesigning and retuning the critical functions with each successive hardware platform upgrade.
The Intel® Math Kernel Library (Intel® MKL) includes routines and functions optimized for Intel® processor-based computers running operating systems that support multiprocessing. Intel MKL includes a C-language interface for the Discrete Fourier transform functions, as well as for the Vector Mathematical Library and Vector Statistical Library functions.
The Intel® Math Kernel Library includes the following groups of routines:
For more detailed information about choosing IPP or MKL for FFT, please visit Intel® MKL and Intel® IPP: Choosing a High Performance FFT.
Before making a decision, developers must understand the specific requirements and constraints of the application. Developers should consider these questions:
Performance varies by use, configuration, and other factors. Learn more at https://edc.intel.com/content/www/br/pt/products/performance/benchmarks/overview/.