Signal Processing Usage for Intel® System Studio – Intel® MKL vs. Intel® IPP

Employing performance libraries can be a great way to streamline and unify the computational execution flow for data-intensive tasks, thus minimizing the risk of data stream timing issues and heisenbugs. This article describes the two libraries that can be used for signal processing within Intel® System Studio.

Intel® Integrated Performance Primitives (Intel® IPP)

Performance libraries such as Intel IPP contain highly optimized algorithms and code for common functions, including signal processing, image processing, video/audio encode/decode, cryptography, data compression, speech coding, and computer vision. Advanced instruction sets help the developer take advantage of new processor features that are specifically tailored for certain applications. Intel IPP can be called as if it were a black box of computation for a low-power or embedded device: data flows in and the result flows out. Used in this fashion, Intel IPP can take the place of many processing units created for specific computational tasks. Intel IPP excels in a wide variety of domains where intelligent systems are utilized.

Without the benefit of highly optimized performance libraries, developers would need to carefully hand-optimize computationally intensive functions to obtain adequate performance. This optimization process is complicated and time consuming, and it must be repeated with each new processor generation. Intelligent systems often have a long lifetime in the field, so maintaining hand-optimized functions over that lifetime takes considerable effort.

Signal processing and advanced vector math are the two function domains that are most in demand across the different types of intelligent systems. Frequently, a digital signal processor (DSP) is employed to assist the general purpose processor with these types of computational tasks. A DSP may come with its own well-defined application interface and library function set. However, it is usually poorly suited for general purpose tasks; DSPs are designed to quickly execute basic mathematical operations (add, subtract, multiply, and divide). The DSP repertoire includes a set of very fast multiply and accumulate (MAC) instructions to address matrix math evaluations that appear frequently in convolution, dot product and other multi-operand math operations. The MAC instructions that comprise much of the code in a DSP application are the equivalent of SIMD instruction sets. Like the MAC instructions on a DSP, these instruction sets perform mathematical operations very efficiently on vectors and arrays of data. Unlike a DSP, the Single Instruction Multiple Data (SIMD) instructions are easier to integrate into applications using complex vector and array mathematical algorithms since all computations execute on the same processor and are part of a unified logical execution stream.
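To make the multiply-and-accumulate pattern concrete, here is a minimal sketch in plain C of a dot product and a direct-form FIR filter (a convolution). The function names are hypothetical and this is not code from Intel IPP; it only shows the scalar loop structure that MAC or SIMD instructions accelerate.

```c
#include <assert.h>
#include <stddef.h>

/* Dot product written as a chain of multiply-accumulate (MAC) steps:
   each iteration does one multiply and one add into the accumulator. */
float dot_product(const float *a, const float *b, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        acc += a[i] * b[i]; /* one MAC */
    }
    return acc;
}

/* Direct-form FIR filter (the fully overlapping part of a convolution):
   y[k] = sum over j of h[j] * x[k + taps - 1 - j], for k = 0 .. n - taps.
   The caller provides room for n - taps + 1 outputs in y. */
void fir_filter(const float *x, size_t n,
                const float *h, size_t taps, float *y)
{
    assert(taps > 0 && n >= taps);
    for (size_t k = 0; k + taps <= n; ++k) {
        float acc = 0.0f;
        for (size_t j = 0; j < taps; ++j) {
            acc += h[j] * x[k + taps - 1 - j]; /* one MAC per tap */
        }
        y[k] = acc;
    }
}
```

Every output sample costs one MAC per filter tap, which is why hardware that performs several of these operations per instruction pays off so quickly in signal processing code.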

For example, an algorithm that changes image brightness by adding (or subtracting) a constant value to each pixel of that image must read the RGB values from memory, add (or subtract) the offset, and write the new pixel values back to memory. When using a DSP coprocessor, that image data must be packaged for the DSP (placed in a memory area that is accessible by the DSP), signaled to execute the transformation algorithm, and finally returned to the general-purpose processor. Using a general-purpose processor with SIMD instructions simplifies this process of packaging, signaling, and returning the data set. Intel IPP primitives are optimized to match each SIMD instruction set architecture so that multiple versions of each primitive exist in the library.
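The brightness adjustment described above can be sketched in a few lines of plain C. This is an illustrative stand-in, not the Intel IPP implementation: the function names are hypothetical, the image is assumed to be interleaved 8-bit RGB, and results are clamped to the 0..255 range so that bright pixels do not wrap around.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp an intermediate result to the valid 8-bit range. */
uint8_t saturate_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Add a signed offset to every channel of an interleaved 8-bit RGB image,
   in place. The buffer holds 3 * pixel_count bytes. */
void adjust_brightness(uint8_t *rgb, size_t pixel_count, int offset)
{
    assert(rgb != NULL || pixel_count == 0);
    for (size_t i = 0; i < 3 * pixel_count; ++i) {
        rgb[i] = saturate_u8((int)rgb[i] + offset); /* read, add, write back */
    }
}
```

The whole operation is a read, an add, and a write on data that stays in the general-purpose processor's address space, so there is no packaging or signaling step. A SIMD-optimized primitive performs the same arithmetic on many channel values per instruction.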

Intel IPP can be reused over a wide range of Intel® Architecture based processors. Thanks to automatic dispatching, the developer's code always picks the execution flow optimized for the architecture in question without any change to the underlying function call (Figure 2). This is especially helpful if an embedded system employs both an Intel® Core™ processor for data analysis/aggregation and a series of Intel® Atom™ Processor based SoCs for data pre-processing/collection. In that scenario, the same code base may be used in part on both the Intel® Atom™ Processor based SoCs in the field and the Intel® Core™ processor in the central data aggregation point.
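The general idea behind dispatching can be sketched with a function pointer that is selected once at start-up. Everything here is hypothetical and simplified: the `cpu_level_t` enum, the function names, and the unrolled loop standing in for a SIMD-tuned variant are assumptions for illustration, and a real library would detect the processor itself rather than take the level as a parameter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical CPU feature levels; a real library queries the processor. */
typedef enum { CPU_GENERIC = 0, CPU_SSSE3 = 1 } cpu_level_t;

typedef long (*sum_fn)(const int *data, size_t n);

/* Baseline variant that runs on any processor. */
long sum_generic(const int *data, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; ++i) s += data[i];
    return s;
}

/* Stand-in for a tuned variant: four independent accumulators, similar in
   spirit to the lanes of a vector register. */
long sum_unrolled(const int *data, size_t n)
{
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += data[i];     s1 += data[i + 1];
        s2 += data[i + 2]; s3 += data[i + 3];
    }
    long s = s0 + s1 + s2 + s3;
    for (; i < n; ++i) s += data[i]; /* leftover elements */
    return s;
}

static sum_fn active_sum = sum_generic;

/* Called once at start-up to select the implementation. */
void dispatch_init(cpu_level_t level)
{
    active_sum = (level >= CPU_SSSE3) ? sum_unrolled : sum_generic;
}

/* Public entry point: the caller's code is identical on every processor. */
long vector_sum(const int *data, size_t n)
{
    return active_sum(data, n);
}
```

Application code only ever calls `vector_sum`, so the same source can run on different processors while still executing the variant selected for the hardware it lands on.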

With specialized SoC components for data streaming and I/O handling combined with a limited user interface, one may think that there are not a lot of opportunities to take advantage of optimizations and/or parallelism, but that is not the case. There is room for

-       heterogeneous asymmetric multi-processing (AMP) based on different architectures, and

-       symmetric multi-processing (SMP) taking advantage of the Intel® Hyper-Threading Technology and dual-core design used with the latest generation of processors designed for low-power intelligent systems.

Both concepts often coexist in the same SoC. Code with failsafe real-time requirements is protected within its own wrapper managed by a modified round-robin real-time scheduler, while the rest of the operating system (OS) and application layers are managed using standard SMP concepts. Intel Atom Processors contain two Intel Hyper-Threading Technology based cores and may contain an additional two physical cores, resulting in a quad-core system. In addition, Intel Atom Processors support the Intel SSSE3 instruction set. A wide variety of Intel IPP functions found at http://software.intel.com/en-us/articles/new-atom-support are tuned to take advantage of Intel Atom Processor architecture specifics as well as Intel SSSE3.

                                      

Figure 2: Intel IPP is tuned to take advantage of the Intel Atom Processor and the Intel SSSE3 instruction set

Throughput-intensive applications can benefit from Intel SSSE3 vector instructions and from parallel execution of multiple data streams through the use of extra-wide vector registers for SIMD processing. As just mentioned, modern Intel Atom Processor designs provide up to four virtual processor cores, which makes threading an interesting proposition. While there is no universal threading solution that is best for all applications, Intel IPP has been designed to be thread-safe.

Intel IPP provides flexibility in linkage models to strike the right balance between portability and footprint management.

Table 1: Intel IPP Linkage Model Comparison

The standard dynamic and dispatched static models are the simplest options to use in building applications with the Intel IPP. The standard dynamic library includes the full set of processor optimizations and provides the benefit of runtime code sharing between multiple Intel IPP-based applications. Detection of the runtime processor and dispatching to the appropriate optimization layer is automatic.

If the number of Intel IPP functions used in your application is small, and the standard shared library objects are too large, using a custom dynamic library may be an alternative.

To optimize for minimal total binary footprint, linking against a non-dispatched static version of the library may be the approach to take. This approach yields an executable containing only the optimization layer required for your target processor. This model achieves the smallest footprint at the expense of restricting your optimization to one specific processor type and one SIMD instruction set. This linkage model is useful when a self-contained application running on only one processor type is the intended goal. It is also the recommended linkage model for use in kernel mode (ring 0) or device driver applications.

Intel IPP addresses both the needs of the native application developer found in the personal computing world and the needs of the intelligent system developer, who must satisfy system requirements for the interaction between the application layer and the software stack underneath. Because Intel IPP extends into the world of middleware, drivers, and OS interaction, it can be used for embedded devices. Its limited dependency on OS libraries and its support for flexible linkage models make it simple to add to embedded cross-build environments with popular GNU* based cross-build setups like Poky-Linux* or MADDE*.

Developing for intelligent systems and small form factor devices frequently means that native development is not a feasible option. Intel IPP can be easily integrated with a cross-build environment and used with cross-build toolchains that accommodate the flow requirements of many of these real-time systems. Use of Intel IPP allows embedded intelligent systems to take advantage of vector instructions and extra-wide vector registers on the Intel Atom Processor. Developers can also meet determinism requirements without increasing the risks associated with cross-architecture data handshakes of complex SoC architectures.

Developing for embedded small form factor devices also means that applications with deterministic execution flow requirements have to interface more directly with the system software layer and the OS scheduler. Software development utilities and libraries for this space need to be able to work with the various layers of the software stack, whether it is the end-user application or the driver that assists with a particular data stream or I/O interface. Intel IPP has minimal OS dependencies and a well-defined ABI to work with the various modes. One can apply highly optimized functions for embedded signal and multimedia processing across the platform software stack while taking advantage of the underlying application processor architecture and its strengths, all without redesigning and retuning the critical functions with successive hardware platform upgrades.

Intel® Math Kernel Library (Intel® MKL)

Intel MKL includes routines and functions optimized for Intel® processor-based computers running operating systems that support multiprocessing. Intel MKL includes a C-language interface for the Discrete Fourier transform functions, as well as for the Vector Mathematical Library and Vector Statistical Library functions.

The Intel® Math Kernel Library includes the following groups of routines:

-       Basic Linear Algebra Subprograms (BLAS):

  • Vector operations
  • Matrix-vector operations
  • Matrix-matrix operations

-       Sparse BLAS Level 1, 2, and 3 (basic operations on sparse vectors and matrices)

-       LAPACK routines for solving systems of linear equations

-       LAPACK routines for solving least squares problems, eigenvalue and singular value problems, and Sylvester's equations

-       Auxiliary and utility LAPACK routines

-       ScaLAPACK computational, driver and auxiliary routines

-       PBLAS routines for distributed vector, matrix-vector, and matrix-matrix operations

-       Direct and Iterative Sparse Solver routines

-       Vector Mathematical Library (VML) functions for computing core mathematical functions on vector arguments

-       Vector Statistical Library (VSL) functions for generating vectors of pseudorandom numbers with different types of statistical distributions and for performing convolution and correlation computations

-       General Fast Fourier Transform (FFT) Functions, providing fast computation of Discrete Fourier Transform via the FFT algorithms

-       Tools for solving partial differential equations - trigonometric transform routines and Poisson solver

-       Optimization Solver routines for solving nonlinear least squares problems through the Trust-Region (TR) algorithms and computing Jacobi matrix by central differences

-       Basic Linear Algebra Communication Subprograms (BLACS) that are used to support a linear algebra oriented message passing interface

-       Data Fitting functions for spline-based approximation of functions, derivatives and integrals of functions, and search

Intel IPP and Intel MKL for Signal Processing

The next question is when to use the Fourier transform functions of Intel IPP and when to use those of Intel MKL.

Discrete Fourier transform (DFT) processing time can dominate a software application. Using a fast algorithm, the Fast Fourier transform (FFT), reduces the number of arithmetic operations from O(N^2) to O(N log2 N). Intel MKL and Intel IPP are highly optimized for Intel architecture-based multi-core processors using the latest instruction sets, parallelism, and algorithms.
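To give a feel for the size of this reduction, the sketch below counts complex multiplications for a direct DFT and for a textbook radix-2 FFT. The function names are hypothetical, and the counts assume the standard radix-2 decimation scheme with a power-of-two length; the optimized libraries use more elaborate algorithms, so treat this as an order-of-magnitude illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Integer log2 of a power of two. */
unsigned log2_pow2(uint64_t n)
{
    assert(n != 0 && (n & (n - 1)) == 0); /* must be a power of two */
    unsigned k = 0;
    while (n > 1) {
        n >>= 1;
        ++k;
    }
    return k;
}

/* Direct DFT: each of the N outputs is a sum of N complex products. */
uint64_t dft_multiplies(uint64_t n)
{
    return n * n;
}

/* Radix-2 FFT: log2(N) stages, each with N/2 butterflies that need
   one twiddle-factor multiplication. */
uint64_t fft_multiplies(uint64_t n)
{
    return (n / 2) * log2_pow2(n);
}
```

For N = 1024 this gives 1,048,576 multiplications for the direct DFT against 5,120 for the radix-2 FFT, roughly a 200-fold reduction.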

Read further to decide which FFT is best for your application.

Table 2: Comparison of Intel MKL and Intel IPP Functionality

Target Applications

  • Intel MKL: Mathematical applications for engineering, scientific and financial applications
  • Intel IPP: Media and communications applications for audio, video, imaging, speech recognition and signal processing

Library Structure

  • Intel MKL: Linear algebra; BLAS; LAPACK; ScaLAPACK; Fast Fourier transforms; Vector math; Vector statistics; Random number generators; Convolution and correlation; Partial differential equations; Optimization solvers
  • Intel IPP: Audio coding; Image processing, compression and color conversion; String processing; Cryptography; Computer vision; Data compression; Matrix math; Signal processing; Speech coding and recognition; Video coding; Vector math; Rendering

Linkage Models

  • Intel MKL: Static, dynamic, custom dynamic
  • Intel IPP: Static, dynamic, custom dynamic

Operating Systems

  • Intel MKL: Linux*
  • Intel IPP: Linux*

Processor Support

  • Intel MKL: IA-32 and Intel® 64 architecture-based and compatible platforms, IA-64
  • Intel IPP: IA-32 and Intel® 64 architecture-based and compatible platforms, IA-64, Intel IXP4xx Processors


Intel MKL and Intel IPP Fourier Transform Features

The Fourier transforms in each library are aimed at that library's target applications: engineering and scientific for Intel MKL, media and communications for Intel IPP. The table below highlights the specifics of the Intel MKL and Intel IPP Fourier transforms that will help you decide which may be best for your application.

Table 3: Comparison of Intel MKL and Intel IPP DFT Features

API

  • Intel MKL: DFT; Cluster FFT; FFTW 2.x and 3.x
  • Intel IPP: FFT; DFT

Interfaces

  • Intel MKL: C; LP64 (64-bit long and pointer); ILP64 (64-bit int, long, and pointer)
  • Intel IPP: C

Dimensions

  • Intel MKL: 1-D up to 7-D
  • Intel IPP: 1-D (Signal Processing); 2-D (Image Processing)

Transform Sizes

  • Intel MKL: 32-bit platforms - maximum size is 2^31-1; 64-bit platforms - maximum size is 2^64
  • Intel IPP: FFT - powers of 2 only; DFT - maximum size is 2^32 (*)

Mixed Radix Support

  • Intel MKL: 2, 3, 5, 7 kernels (**)
  • Intel IPP: DFT - 2, 3, 5, 7 kernels (**)

Data Types (see Table 4 for detail)

  • Intel MKL: Real & Complex; Single- & Double-Precision
  • Intel IPP: Real & Complex; Single- & Double-Precision

Scaling

  • Intel MKL: Transforms can be scaled by an arbitrary floating point number (with the same precision as the input data)
  • Intel IPP: Integer ("fixed") scaling: Forward 1/N; Inverse 1/N; Forward + Inverse SQRT(1/N)

Threading

  • Intel MKL: Platform dependent. IA-32: All (except 1D when performing a single transform and sizes are not a power of two). Intel® 64: All (except in-place power of two). IA-64: All. Can use as many threads as needed on MP systems.
  • Intel IPP: 1D and 2D

Accuracy

  • Intel MKL: High accuracy
  • Intel IPP: High accuracy


Data Types and Formats

The Intel MKL and Intel IPP Fourier transform functions support a variety of data types and formats for storing signal values. Mixed-type interfaces are also supported. Please see the product documentation for details.

Table 4: Comparison of Intel MKL and Intel IPP Data Types and Formats

Real FFTs

Precision

  • Intel MKL: Single, Double
  • Intel IPP: Single, Double

1D Data Types

  • Intel MKL: Real for all dimensions
  • Intel IPP: Signed short, signed int, float, double

2D Data Types

  • Intel MKL: Real for all dimensions
  • Intel IPP: Unsigned char, signed int, float

1D Packed Formats

  • Intel MKL: CCS, Pack, Perm, CCE
  • Intel IPP: CCS, Pack, Perm

2D Packed Formats

  • Intel MKL: CCS, Pack, Perm, CCE
  • Intel IPP: RCPack2D

3D Packed Formats

  • Intel MKL: CCE
  • Intel IPP: N/A

Format Conversion Functions

  • Intel MKL: (no entry)
  • Intel IPP: (no entry)

Complex FFTs

Precision

  • Intel MKL: Single, Double
  • Intel IPP: Single, Double

1D Data Types

  • Intel MKL: Complex for all dimensions
  • Intel IPP: Signed short, complex short, signed int, complex integer, complex float, complex double

2D Data Types

  • Intel MKL: Complex for all dimensions
  • Intel IPP: Complex float

Formats Legend
CCE - stores the values of the first half of the output complex conjugate-even signal
CCS - same as the CCE format for 1D; slightly different for multi-dimensional real transforms
Pack - compact representation of a complex conjugate-symmetric sequence
Perm - same as the Pack format for odd lengths; an arbitrary permutation of the Pack format for even lengths
RCPack2D - exploits the complex conjugate symmetry of the transformed data to store only half of the resulting Fourier coefficients
Note: CCS, Pack, and Perm are available for 2D transforms but are not supported for 3D and higher rank.
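All of these packed formats rely on the same property: the DFT of a real signal of length N is conjugate-even, X[N-k] = conj(X[k]), so only the first N/2 + 1 complex values need to be stored. The sketch below rebuilds the full spectrum from that stored half. The function name and the interleaved (re, im) float layout are assumptions made for illustration; the exact element order of each library format is defined in the product documentation.

```c
#include <assert.h>
#include <stddef.h>

/* half: the first n/2 + 1 complex values, interleaved as (re, im) floats.
   full: receives all n complex values, interleaved as (re, im) floats. */
void expand_conjugate_even(const float *half, size_t n, float *full)
{
    assert(n > 0);
    size_t stored = n / 2 + 1;

    /* Copy the stored half unchanged. */
    for (size_t k = 0; k < stored; ++k) {
        full[2 * k]     = half[2 * k];
        full[2 * k + 1] = half[2 * k + 1];
    }

    /* Rebuild the rest from the symmetry X[k] = conj(X[n - k]). */
    for (size_t k = stored; k < n; ++k) {
        size_t m = n - k; /* mirror index, always inside the stored half */
        full[2 * k]     = half[2 * m];
        full[2 * k + 1] = -half[2 * m + 1];
    }
}
```

For the real signal {1, 2, 3, 4}, the stored half is 10, -2+2i, -2, and the expansion recovers the missing fourth value -2-2i.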

Performance

The Intel MKL and Intel IPP are optimized for current and future Intel® processors, and they are specifically tuned for two different usage areas:

  • Intel MKL is suitable for large problem sizes
  • Intel IPP is specifically designed for smaller problem sizes including those used in multimedia, data processing, communications, and embedded C/C++ applications.

Choosing the Best FFT for Your Application

Before making a decision, developers must understand the specific requirements and constraints of the application. Developers should consider these questions:

  • What are the performance requirements for the application? How is performance measured, and what are the measurement criteria? Is a specific benchmark being used? What are the known performance bottlenecks?
  • What type of application is being developed? What are the main operations being performed and on what kind of data?
  • What API is currently being used in the application for transforms? What programming language(s) is the application code written in?
  • Does the FFT output data need to be scaled (normalized)? What type of scaling is required?
  • What kind of input and output data does the transform process? What are the valid and invalid values? What type of precision is required?