By Noah Clemons (Intel)
Employing performance libraries can be a great way to streamline and unify the computational execution flow for data-intensive tasks, thus minimizing the risk of data stream timing issues and heisenbugs. Here we will describe the two libraries that can be used for signal processing within Intel® System Studio.
Intel® Integrated Performance Primitives (Intel® IPP)
Performance libraries such as Intel IPP contain highly optimized algorithms and code for common functions including signal processing, image processing, video/audio encode/decode, cryptography, data compression, speech coding, and computer vision. Advanced instruction sets help the developer take advantage of new processor features that are specifically tailored for certain applications. One calls Intel IPP as if it were a black-box pocket of computation for a low-power or embedded device: 'in' flows the data and 'out' comes the result. In this fashion, using Intel IPP can take the place of many processing units created for specific computational tasks. Intel IPP excels in a wide variety of domains where intelligent systems are utilized.
Without the benefit of highly optimized performance libraries, developers would need to carefully optimize computationally intensive functions themselves to obtain adequate performance. This optimization process is complicated, time-consuming, and must be updated with each new processor generation. Intelligent systems often have a long lifetime in the field, and there is a high maintenance effort to hand-optimize functions.
Signal processing and advanced vector math are the two function domains most in demand across the different types of intelligent systems. Frequently, a digital signal processor (DSP) is employed to assist the general-purpose processor with these computational tasks. A DSP may come with its own well-defined application interface and library function set. However, it is usually poorly suited for general-purpose tasks; DSPs are designed to quickly execute basic mathematical operations (add, subtract, multiply, and divide). The DSP repertoire includes a set of very fast multiply and accumulate (MAC) instructions to address the matrix math evaluations that appear frequently in convolution, dot product, and other multi-operand math operations. The MAC instructions that comprise much of the code in a DSP application are the equivalent of Single Instruction Multiple Data (SIMD) instruction sets. Like the MAC instructions on a DSP, these instruction sets perform mathematical operations very efficiently on vectors and arrays of data. Unlike on a DSP, SIMD instructions are easier to integrate into applications using complex vector and array mathematical algorithms, since all computations execute on the same processor and are part of a unified logical execution stream.
For example, an algorithm that changes image brightness by adding (or subtracting) a constant value to each pixel of that image must read the RGB values from memory, add (or subtract) the offset, and write the new pixel values back to memory. When using a DSP coprocessor, that image data must be packaged for the DSP (placed in a memory area that is accessible by the DSP), the DSP signaled to execute the transformation algorithm, and the result finally returned to the general-purpose processor. Using a general-purpose processor with SIMD instructions simplifies this process of packaging, signaling, and returning the data set. Intel IPP primitives are optimized to match each SIMD instruction set architecture, so that multiple versions of each primitive exist in the library.
Intel IPP can be reused over a wide range of Intel® architecture-based processors, and thanks to automatic dispatching, the developer's code base will always pick the execution flow optimized for the architecture in question without changing the underlying function call (Figure 2). This is especially helpful if an embedded system employs both an Intel® Core™ processor for data analysis/aggregation and a series of Intel® Atom™ processor-based SoCs for data preprocessing/collection. In that scenario, the same code base may be used in part on both the Intel® Atom™ processor-based SoCs in the field and the Intel® Core™ processor in the central data aggregation point.
With specialized SoC components for data streaming and I/O handling combined with a limited user interface, one may think that there are not many opportunities to take advantage of optimizations and/or parallelism, but that is not the case. There is room for:
- heterogeneous asynchronous multiprocessing (AMP) based on different architectures, and
- symmetric multiprocessing (SMP) taking advantage of Intel® Hyper-Threading Technology and the dual-core design used in the latest generation of processors designed for low-power intelligent systems.
Both concepts often coexist in the same SoC. Code with fail-safe real-time requirements is protected within its own wrapper managed by a modified round-robin real-time scheduler, while the rest of the operating system (OS) and application layers are managed using standard SMP multiprocessing concepts. Intel Atom processors contain two cores with Intel Hyper-Threading Technology and may contain an additional two physical cores, resulting in a quad-core system. In addition, Intel Atom processors support the Intel SSSE3 instruction set. A wide variety of Intel IPP functions, listed at http://software.intel.com/enus/articles/newatomsupport, are tuned to take advantage of Intel Atom processor architecture specifics as well as Intel SSSE3.
Figure 2: Intel IPP is tuned to take advantage of the Intel Atom Processor and the Intel SSSE3 instruction set
Throughput-intensive applications can benefit from the use of Intel SSSE3 vector instructions and parallel execution of multiple data streams through extra-wide vector registers for SIMD processing. As just mentioned, modern Intel Atom processor designs provide up to four virtual processor cores, which makes threading an interesting proposition. While there is no universal threading solution that is best for all applications, Intel IPP has been designed to be thread-safe.
Intel IPP provides flexibility in linkage models to strike the right balance between portability and footprint management.
Table 1: Intel IPP Linkage Model Comparison
The standard dynamic and dispatched static models are the simplest options to use in building applications with Intel IPP. The standard dynamic library includes the full set of processor optimizations and provides the benefit of runtime code sharing between multiple Intel IPP-based applications. Detection of the runtime processor and dispatching to the appropriate optimization layer is automatic.
If the number of Intel IPP functions used in your application is small, and the standard shared library objects are too large, using a custom dynamic library may be an alternative.
To optimize for minimal total binary footprint, linking against a non-dispatched static version of the library may be the approach to take. This approach yields an executable containing only the optimization layer required for your target processor. This model achieves the smallest footprint at the expense of restricting your optimization to one specific processor type and one SIMD instruction set. This linkage model is useful when a self-contained application running on only one processor type is the intended goal. It is also the recommended linkage model for use in kernel mode (ring 0) or device driver applications.
Intel IPP addresses both the needs of the native application developer in the personal computing world and those of the intelligent system developer who must satisfy system requirements in the interaction between the application layer and the software stack underneath. By taking Intel IPP into the world of middleware, drivers, and OS interaction, it can be used for embedded devices. Its limited dependency on OS libraries and its support for flexible linkage models make it simple to add to embedded cross-build environments with popular GNU*-based cross-build setups like Poky Linux* or MADDE*.
Developing for intelligent systems and small-form-factor devices frequently means that native development is not a feasible option. Intel IPP can be easily integrated into a cross-build environment and used with cross-build toolchains that accommodate the flow requirements of many of these real-time systems. Use of Intel IPP allows embedded intelligent systems to take advantage of vector instructions and extra-wide vector registers on the Intel Atom processor. Developers can also meet determinism requirements without increasing the risks associated with cross-architecture data handshakes in complex SoC architectures.
Developing for embedded small-form-factor devices also means that applications with deterministic execution flow requirements have to interface more directly with the system software layer and the OS scheduler. Software development utilities and libraries for this space need to work with the various layers of the software stack, whether it is the end-user application or the driver that assists with a particular data stream or I/O interface. Intel IPP has minimal OS dependencies and a well-defined ABI to work with the various modes. One can apply highly optimized functions for embedded signal and multimedia processing across the platform software stack while taking advantage of the underlying application processor architecture and its strengths, all without redesigning and re-tuning the critical functions with each successive hardware platform upgrade.
Intel® Math Kernel Library (Intel® MKL)
Intel MKL includes routines and functions optimized for Intel® processor-based computers running operating systems that support multiprocessing. Intel MKL includes a C-language interface for the discrete Fourier transform functions, as well as for the Vector Mathematical Library and Vector Statistical Library functions.
The Intel® Math Kernel Library includes the following groups of routines:
- Basic Linear Algebra Subprograms (BLAS):
  - Vector operations
  - Matrix-vector operations
  - Matrix-matrix operations
- Sparse BLAS Levels 1, 2, and 3 (basic operations on sparse vectors and matrices)
- LAPACK routines for solving systems of linear equations
- LAPACK routines for solving least squares problems, eigenvalue and singular value problems, and Sylvester's equations
- Auxiliary and utility LAPACK routines
- ScaLAPACK computational, driver, and auxiliary routines
- PBLAS routines for distributed vector, matrix-vector, and matrix-matrix operations
- Direct and Iterative Sparse Solver routines
- Vector Mathematical Library (VML) functions for computing core mathematical functions on vector arguments
- Vector Statistical Library (VSL) functions for generating vectors of pseudorandom numbers with different types of statistical distributions and for performing convolution and correlation computations
- General Fast Fourier Transform (FFT) functions, providing fast computation of the Discrete Fourier Transform via FFT algorithms
- Tools for solving partial differential equations: trigonometric transform routines and a Poisson solver
- Optimization Solver routines for solving nonlinear least squares problems through Trust-Region (TR) algorithms and computing the Jacobian matrix by central differences
- Basic Linear Algebra Communication Subprograms (BLACS), used to support a linear-algebra-oriented message passing interface
- Data Fitting functions for spline-based approximation of functions, derivatives and integrals of functions, and search
Intel IPP and Intel MKL for Signal Processing
The next question is when to use the Fourier transform from Intel IPP versus the one from Intel MKL.
DFT processing time can dominate a software application. Using a fast algorithm, the Fast Fourier Transform (FFT), reduces the number of arithmetic operations from O(N²) to O(N log₂ N). Intel MKL and Intel IPP are highly optimized for Intel architecture-based multicore processors, using the latest instruction sets, parallelism, and algorithms.
Read further to decide which FFT is best for your application.
Table 2: Comparison of Intel MKL and Intel IPP Functionality

Feature | Intel MKL | Intel IPP
Target applications | Mathematical applications for engineering, scientific, and financial applications | Media and communications applications for audio, video, imaging, speech recognition, and signal processing
Linkage models | Static, dynamic, custom dynamic | Static, dynamic, custom dynamic
Operating systems | Linux* | Linux*
Processor support | IA-32 and Intel® 64 architecture-based and compatible platforms; IA-64 | IA-32 and Intel® 64 architecture-based and compatible platforms; IA-64; Intel IXP4xx processors
Intel MKL and Intel IPP Fourier Transform Features
The Fourier transforms provided by Intel MKL and Intel IPP are targeted at the same application types as the libraries themselves: engineering and scientific computing for Intel MKL, and media and communications for Intel IPP. The table below highlights specifics of the two Fourier transform implementations that will help you decide which may be best for your application.
Table 3: Comparison of Intel MKL and Intel IPP DFT Features
Feature | Intel MKL | Intel IPP
API | DFT | FFT
Interfaces | C LP64 (64-bit long and pointer) | C
Dimensions | 1D up to 7D | 1D (signal processing)
Transform sizes | 32-bit platforms: maximum size 2^31 - 1 | FFT: powers of 2 only; DFT: maximum size 2^32 (*)
Mixed-radix support | 2, 3, 5, 7 kernels (**) | DFT: 2, 3, 5, 7 kernels (**)
Data types (see Table 4 for detail) | Real & complex | Real & complex
Scaling | Transforms can be scaled by an arbitrary floating-point number (with the same precision as the input data) | Integer ("fixed") scaling
Threading | Platform dependent; can use as many threads as needed on MP systems | 1D and 2D
Accuracy | | High accuracy
Data Types and Formats
The Intel MKL and Intel IPP Fourier transform functions support a variety of data types and formats for storing signal values. Mixed-type interfaces are also supported. Please see the product documentation for details.
Table 4: Comparison of Intel MKL and Intel IPP Data Types and Formats
Feature | Intel MKL | Intel IPP
Real FFTs | |
Precision | Single, double | Single, double
1D data types | Real for all dimensions | Signed short, signed int, float, double
2D data types | Real for all dimensions | Unsigned char, signed int, float
1D packed formats | CCS, Pack, Perm, CCE | CCS, Pack, Perm
2D packed formats | CCS, Pack, Perm, CCE | RCPack2D
3D packed formats | CCE | N/A
Format conversion functions | |
Complex FFTs | |
Precision | Single, double | Single, double
1D data types | Complex for all dimensions | Signed short, complex short, signed int, complex int, complex float, complex double
2D data types | Complex for all dimensions | Complex float
Formats Legend
CCE: stores the values of the first half of the output complex conjugate-even signal
CCS: same format as CCE for 1D; slightly different for multidimensional real transforms
Pack: compact representation of a complex conjugate-symmetric sequence
Perm: same as the Pack format for odd lengths; an arbitrary permutation of the Pack format for even lengths
RCPack2D: exploits the complex conjugate symmetry of the transformed data to store only half of the resulting Fourier coefficients
CCS, Pack, and Perm apply to 1D and 2D transforms; they are not supported for 3D and higher ranks
Performance
The Intel MKL and Intel IPP are optimized for current and future Intel® processors, and they are specifically tuned for two different usage areas:
- Intel MKL is suitable for large problem sizes.
- Intel IPP is specifically designed for smaller problem sizes, including those used in multimedia, data processing, communications, and embedded C/C++ applications.
Choosing the Best FFT for Your Application
Before making a decision, developers must understand the specific requirements and constraints of the application. Developers should consider these questions:
- What are the performance requirements for the application? How is performance measured, and what are the measurement criteria? Is a specific benchmark being used? What are the known performance bottlenecks?
- What type of application is being developed? What are the main operations being performed, and on what kind of data?
- What API is currently being used in the application for transforms? What programming language(s) is the application code written in?
- Does the FFT output data need to be scaled (normalized)? What type of scaling is required?
- What kind of input and output data does the transform process? What are the valid and invalid values? What type of precision is required?