Employing performance libraries can be a great way to streamline and unify the computational execution flow for data intensive tasks, thus minimizing the risk of data stream timing issues and heisenbugs. Here we will describe the two libraries that can be used for signal processing within Intel® System Studio.
Intel® Integrated Performance Primitives (Intel®IPP)
Performance libraries such as the Intel IPP contain highly optimized algorithms and code for common functions including as signal processing, image processing, video/audio encode/decode, cryptography, data compression, speech coding, and computer vision. Advanced instruction sets help the developer take advantage of new processor features that are specifically tailored for certain applications. One calls Intel IPP as if it is a black box pocket of computation for their low-power or embedded device – ‘in’ flows the data and ‘out’ receives the result. In this fashion, using the Intel IPP can take the place of many processing units created for specific computational tasks. Intel IPP excels in a wide variety of domains where intelligent systems are utilized.
Without the benefit of highly optimized performance libraries, developers would need to optimize computationally intensive functions themselves carefully to obtain adequate performance. This optimization process is complicated, time consuming, and must be updated with each new processor generation. Intelligent systems often have a long lifetime in the field and there is a high maintenance effort to hand-optimize functions.
Signal processing and advanced vector math are the two function domains that are most in demand across the different types of intelligent systems. Frequently, a digital signal processor (DSP) is employed to assist the general purpose processor with these types of computational tasks. A DSP may come with its own well-defined application interface and library function set. However, it is usually poorly suited for general purpose tasks; DSPs are designed to quickly execute basic mathematical operations (add, subtract, multiply, and divide). The DSP repertoire includes a set of very fast multiply and accumulate (MAC) instructions to address matrix math evaluations that appear frequently in convolution, dot product and other multi-operand math operations. The MAC instructions that comprise much of the code in a DSP application are the equivalent of SIMD instruction sets. Like the MAC instructions on a DSP, these instruction sets perform mathematical operations very efficiently on vectors and arrays of data. Unlike a DSP, the Single Instruction Multiple Data (SIMD) instructions are easier to integrate into applications using complex vector and array mathematical algorithms since all computations execute on the same processor and are part of a unified logical execution stream.
For example, an algorithm that changes image brightness by adding (or subtracting) a constant value to each pixel of that image must read the RGB values from memory, add (or subtract) the offset, and write the new pixel values back to memory. When using a DSP coprocessor, that image data must be packaged for the DSP (placed in a memory area that is accessible by the DSP), signaled to execute the transformation algorithm, and finally returned to the general-purpose processor. Using a general-purpose processor with SIMD instructions simplifies this process of packaging, signaling, and returning the data set. Intel IPP primitives are optimized to match each SIMD instruction set architecture so that multiple versions of each primitive exist in the library.
Intel IPP can be reused over a wide range of Intel® Architecture based processors, and due to automatic dispatching, the developer’s code base will always pick the execution flow optimized for the architecture in question without having to change the underlying function call (Figure 2). This is especially helpful if an embedded system employs both an Intel® Core™ processor for data analysis/aggregation as well as a series of Intel® Atom™ Processor based SoCs for data pre-processing/collection. In that scenario, the same code base may be used in part on both the Intel® Atom™ Processor based SoCs in the field and the Intel® Core™ processor in the central data aggregation point.
With specialized SoC components for data streaming and I/O handling combined with a limited user interface, one may think that there are not a lot of opportunities to take advantage of optimizations and/or parallelism, but that is not the case. There is room for
- heterogeneous asynchronous multi-processing (AMP) based on different architectures, and
- synchronous multi-processing (SMP) taking advantage of the Intel® Hyper-Threading Technology and dual-core design used with the latest generation of processors designed for low-power intelligent systems.
Both concepts often coexist in the same SoC. Code with failsafe real-time requirements is protected within its own wrapper managed by a modified round-robin real-time scheduler, while the rest of the operating system (OS) and application layers are managed using standard SMP multi-processing concepts. Intel Atom Processors contain two Intel Hyper-Threading Technology based cores and may contain an additional two physical cores resulting in a quad-core system. In addition Intel Atom Processors support the Intel SSSE3 instruction set. A wide variety of Intel IPP functions found at http://software.intel.com/en-us/articles/new-atom-support are tuned to take advantage of Intel Atom Processor architecture specifics as well as Intel SSSE3.
Figure 2: Intel IPP is tuned to take advantage of the Intel Atom Processor and the Intel SSSE3 instruction set
Throughput intensive applications can benefit from the use of use of Intel SSSE3 vector instructions and parallel execution of multiple data streams through the use of extra-wide vector registers for SIMD processing. As just mentioned, modern Intel Atom Processor designs provide up to four virtual processor cores. This fact makes threading an interesting proposition. While there is no universal threading solution that is best for all applications, the Intel IPP has been designed to be thread-safe.
Intel IPP provides flexibility in linkage models to strike the right balance between portability and footprint management.
Table 1: Intel IPP Linkage Model Comparison
The standard dynamic and dispatched static models are the simplest options to use in building applications with the Intel IPP. The standard dynamic library includes the full set of processor optimizations and provides the benefit of runtime code sharing between multiple Intel IPP-based applications. Detection of the runtime processor and dispatching to the appropriate optimization layer is automatic.
If the number of Intel IPP functions used in your application is small, and the standard shared library objects are too large, using a custom dynamic library may be an alternative.
To optimize for minimal total binary footprint, linking against a non-dispatched static version of the library may be the approach to take. This approach yields an executable containing only the optimization layer required for your target processor. This model achieves the smallest footprint at the expense of restricting your optimization to one specific processor type and one SIMD instruction set. This linkage model is useful when a self-contained application running on only one processor type is the intended goal. It is also the recommended linkage model for use in kernel mode (ring 0) or device driver applications.
Intel IPP addresses both the needs of the native application developer found in the personal computing world and the intelligent system developer who must satisfy system requirements with the interaction between the application layer and the software stack underneath. By taking the Intel IPP into the world of middleware, drivers and OS interaction, it can be used for embedded devices. The limited dependency on OS libraries and its support for flexible linkage models makes it simple to add to embedded cross-build environments with popular GNU* based cross-build setups like Poky-Linux* or MADDE*.
Developing for intelligent systems and small form factor devices frequently means that native development is not a feasible option. Intel IPP can be easily integrated with a cross-build environment and be used with cross-build toolchains that accommodate the flow requirements of many of these real-time systems. Use of the Intel IPP allows embedded intelligent systems to take advantage of vector instructions and extra-wide vector registers on the Intel Atom Processor. Developers can also meet determinism requirements without increasing the risks associated with cross-architecture data handshakes of complex SoC architectures.
Developing for embedded small form factor devices also means that applications with deterministic execution flow requirements have to interface more directly with the system software layer and the OS scheduler. Software development utilities and libraries for this space need to be able to work with the various layers of the software stack, whether it is the end-user application or the driver that assists with a particular data stream or I/O interface. The Intel IPP has minimal OS dependencies and a well-defined ABI to work with the various modes. One can apply highly optimized functions for embedded signal and multimedia processing across the platform software stack while taking advantage of the underlying application processor architecture and its strengths, all without redesigning and returning the critical functions with successive hardware platform upgrades.
Intel® Math Kernel Library (Intel® MKL)
IntelMKL includes routines and functions optimized for Intel® processor-based computers running operating systems that support multiprocessing. Intel MKL includes a C-language interface for the Discrete Fourier transform functions, as well as for the Vector Mathematical Library and Vector Statistical Library functions.
The Intel® Math Kernel Library includes the following groups of routines:
- Basic Linear Algebra Subprograms (BLAS):
- Vector operations
- Matrix-vector operations
- Matrix-matrix operations
- Sparse BLAS Level 1, 2, and 3 (basic operations on sparse vectors and matrices)
- LAPACK routines for solving systems of linear equations
- LAPACK routines for solving least squares problems, eigenvalue and singular value problems, and Sylvester's equations
- Auxiliary and utility LAPACK routines
- ScaLAPACK computational, driver and auxiliary routines
- PBLAS routines for distributed vector, matrix-vector, and matrix-matrix operation
- Direct and Iterative Sparse Solver routines
- Vector Mathematical Library (VML) functions for computing core mathematical functions on vector arguments
- Vector Statistical Library (VSL) functions for generating vectors of pseudorandom numbers with different types of statistical distributions and for performing convolution and correlation computations
- General Fast Fourier Transform (FFT) Functions, providing fast computation of Discrete Fourier Transform via the FFT algorithms
- Tools for solving partial differential equations - trigonometric transform routines and Poisson solver
- Optimization Solver routines for solving nonlinear least squares problems through the Trust-Region (TR) algorithms and computing Jacobi matrix by central differences
- Basic Linear Algebra Communication Subprograms (BLACS) that are used to support a linear algebra oriented message passing interface
- Data Fitting functions for spline-based approximation of functions, derivatives and integrals of functions, and search
Intel® System Studio for Signal Processing
For more detailed information about choosing IPP or MKL for FFT, please visit https://software.intel.com/en-us/articles/mkl-ipp-choosing-an-fft.
Choosing the Best FFT for Your Application
Before making a decision, developers must understand the specific requirements and constraints of the application. Developers should consider these questions:
- What are the performance requirements for the application? How is performance measured, and what is the measurement criteria? Is a specific benchmark being used? What are the known performance bottlenecks?
- What type of application is being developed? What are the main operations being performed and on what kind of data?
- What API is currently being used in the application for transforms? What programming language(s) is the application code written in?
- Does the FFT output data need to be scaled (normalized)? What type of scaling is required?
- What kind of input and output data does the transform process? What are the valid and invalid values? What type of precision is required?