Questions and Answers from the Intel® Integrated Performance Primitives Webinar on November 18, 2010

The following Q&A session is from the webinar titled "Accelerate your Multimedia and Data Processing Applications with the Intel® IPP 7.0 Library" presented on November 18, 2010 by Paul Fischer of Intel Corporation.

You can download and view a recording of the webinar as well as a PDF file of the slides.

Q: I'm new to IPP, is there going to be a basic introduction?
A: From the Intel IPP home page, click the 'Resources' tab to locate the Intel IPP Documentation (User’s Guide and Programming Reference), 'Video' and 'Learning Lab Portal' links to get started. You can also review the webinar titled "Super Charge Applications with: Intel® Integrated Performance Primitives A Component of Intel® Parallel Studio" for a basic introduction to the Intel IPP library.
Q: Are we allowed to ship an open-source product that uses IPP?
A: The Intel IPP library can be used with open source projects. Users need to make note of the terms of the open source license associated with their open source project. Please refer to the Intel IPP end-user license agreement, redistribution details and the Intel IPP Licensing FAQs for more details.
Q: Are there any benchmarks for AVX on DFTs or Laplacian transforms using reals and not integers?
A: There is a performance test utility included with the Intel IPP library called perfsys which allows you to measure the performance of individual functions by clocks per element. It is located in the <install>/ipp/tools/<arch>/perfsys directory. You can use this to compare the performance of functions between different optimizations and different versions of the library. It uses a command-line interface.
Q: Are there existing benchmark comparisons between most popular FFT libraries (e.g.: FFTW) and IPP's FFT?
A: We don't have any benchmark comparisons between other FFT libs and IPP FFT, but we do provide comparisons between MKL FFT implementation and FFTW, which can be found on the MKL product website.
Q: Is Intel AVX a feature of a new processor architecture or is it an Intel IPP feature?
A: Intel AVX, or Intel Advanced Vector Extensions, is an extension to the SIMD instruction set starting with the Sandy Bridge microarchitecture. Intel IPP makes use of these instructions to speed up many functions in the library.
Q: Are any of the Intel AVX functions multi-threaded?
A: Yes, if a function appears both in the list of functions optimized for AVX and in the list of threaded functions, it is threaded (for example ippiFilter_32f_C1R or ippiCrossCorrValid_NormLevel_32f(8u32f,8u,etc)_C1R). The only exceptions to this rule are the 1D FFT functions, which have been threaded specifically for Intel Core 2 processors and do not utilize threading in Penryn and older CPUs. For additional details on Intel IPP threaded APIs, please check the "ThreadedFunctionsList.txt" file located in the Intel IPP \documentation\ folder.
Q: Will Intel AVX optimized code run on AMD processors when they introduce their new instructions?
A: Yes, if the AMD processor supports Intel AVX instructions and the CPUID bits in that AMD processor indicate support for Intel AVX in a manner that is compatible with an Intel processor, which Intel does not control and cannot guarantee.
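
The CPUID-based check described in the answers above can be sketched in C. This is a minimal illustration, not the library's dispatcher: the helper name avx_usable and the macro names are hypothetical, and the caller is assumed to have already read CPUID leaf 1 (ECX) and, if the OSXSAVE bit is set, XCR0 via XGETBV (pass 0 for xcr0 otherwise). The bit positions themselves are the architecturally documented ones.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions in CPUID leaf 1, register ECX. */
#define CPUID1_ECX_OSXSAVE (1u << 27) /* OS has enabled XSAVE/XGETBV */
#define CPUID1_ECX_AVX     (1u << 28) /* processor implements AVX */

/* Bits in XCR0 (read with XGETBV, index 0). */
#define XCR0_SSE_STATE (1u << 1) /* OS saves the XMM registers */
#define XCR0_AVX_STATE (1u << 2) /* OS saves the upper YMM halves */

/* Hypothetical helper: true only if both the CPU and the OS support AVX.
   The vendor string is deliberately not consulted. */
bool avx_usable(uint32_t cpuid1_ecx, uint64_t xcr0)
{
    const uint32_t need_cpu = CPUID1_ECX_OSXSAVE | CPUID1_ECX_AVX;
    const uint64_t need_os = XCR0_SSE_STATE | XCR0_AVX_STATE;

    if ((cpuid1_ecx & need_cpu) != need_cpu)
        return false;
    return (xcr0 & need_os) == need_os;
}
```

For example, avx_usable(0x18000000u, 0x6) reports true, while clearing either the OSXSAVE bit or the YMM state bit makes it report false.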


Additional questions and answers regarding the Intel IPP 7.0 library can be found in these two articles:

Intel IPP 7.0 Beta Webinar - Questions and Answers (FAQs)
Questions and Answers from the Intel® Integrated Performance Primitives Webinar on October 26, 2010

The slides and notes for those slides follow.

Slide1.jpg

Slide2.gif

Slide3.gif

Slide4.gif

notes for slide #4:

/en-us/articles/intel-ipp-functions-optimized-for-intel-avx-intel-advanced-vector-extensions
/en-us/articles/ipp-dispatcher-control-functions-ippinit-functions

Data compression high-level libraries: zlib, gzip, bzip2 and lzopack

UIC/JPEG performance improvement via multi-threading.

Windows Imaging Component (WIC)
- Standard Windows interface to codecs
- Easiest way for mainstream developers to evaluate and utilize IPP parallelized codecs
- Provide as layer atop existing cross-platform IPP Unified Image Codec API
- Primarily optimized for AVX machines
- Provided as part of IPP samples

Visual Studio 2010 support means support for VS2010 integrated help as well as VS2010 solution files for building high-level samples and libraries. VS2005 and VS2008 continue to be supported by the 7.0 release of the product.

A complete list of new functions added with the 7.0 release is located in the NewFunctionsList.txt file, which can be found in the ...\Documentation\en_US\ipp\ directory. You will also find the ThreadedFunctionsList.txt file in that same location, which lists those functions that are available in an internally threaded format. Threading, within the multi-threaded variants of the Intel IPP library, is accomplished by use of the Intel® OpenMP* library.

Slide5.gif

notes for slide #5:

3 pillars: performance, reliability, multi OS flexibility

Compiler/library: best optimizing compiler for IA, embodying the years of experience of Intel engineers. Higher programmability, Fortran.

Correctness tools: help find errors, error checking, single tool finding threading and memory errors.

Profilers: performance analysis. Detects hotspots in code to allow the developer to tune code for higher efficiency. Provides detailed insight into architecture.

All tools in a single suite, higher performance when compared to other comparable tools in the industry (Intel engineers know our CPUs). Code reliability, which reduces time to market. VTune gives you all details on what’s happening on your chip. Tap into Intel’s deep engineering expertise in simulation, video rendering, seismic analysis, financial, etc.

Slide6.jpg

Slide7.gif

notes for slide #7:

Intel® IPP Functions Optimized for Intel® AVX (Intel® Advanced Vector Extensions)
/en-us/articles/intel-ipp-functions-optimized-for-intel-avx-intel-advanced-vector-extensions

Those functions that have not been hand-optimized have been compiler-optimized using the Intel Compiler /QxG switch (enable AVX optimization) to take advantage of the AVX "3rd operand" feature. Further performance improvements are achieved by virtue of an AVX ABI (application binary interface) feature that inserts the special AVX "vzeroupper" instruction after any function with AVX code to eliminate AVX->SSE transition penalties.

IPP 7.0 enables AVX code by default. The 6.1 library releases require an “enable” function.
/en-us/articles/ipp-dispatcher-control-functions-ippinit-functions

Slide8.gif

Slide9.jpg 

notes for slide #9:

AES-NI is a set of new instructions for enhancing the performance for cryptography using the widely-accepted Advanced Encryption Standard (AES) algorithm.

There are seven new AES instructions that target some of the more complex and compute-expensive encryption, decryption, key expansion and multiplication steps (and there are multiple steps in every instance of working with encrypted data) and increase the performance and efficiency of these operations. Note, however, that the instructions do not implement the entire AES algorithm in silicon; only the most processor-intensive elements have been targeted. This provides more flexibility and balance between HW performance and SW extensibility.

Another benefit of the new instructions is that they also help protect the data better. The more efficient steps enabled by AES-NI make “side channel” snooping attacks more difficult. These attacks use SW agents to analyze how a system processes data, searching for cache and memory access patterns or other system data that help deduce elements of the cryptographic processing and therefore make it easier to “crack.” AES-NI helps hide critical elements such as table lookups, making it harder to determine what elements of crypto processing are happening.

Slide10.gif 

notes for slide #10:

AESENC, AESENCLAST, AESDEC, AESDECLAST and PCLMULQDQ are used in the implementation of the following AES modes: ECB, CBC, OFB, CFB, CTR, CCM and GCM
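
As background on what PCLMULQDQ computes: it performs a carry-less multiplication, in which partial products are combined with XOR instead of addition. The following portable C sketch shows that operation in software. It is only a reference for the arithmetic, not how the library is implemented; the function name clmul32 is hypothetical, and it uses 32-bit operands so the result fits a standard 64-bit type, whereas the instruction multiplies 64-bit operands into a 128-bit result.

```c
#include <stdint.h>

/* Portable reference for carry-less multiplication of two 32-bit values.
   Each set bit of b contributes a shifted copy of a, and the copies are
   combined with XOR rather than with carrying addition. */
uint64_t clmul32(uint32_t a, uint32_t b)
{
    uint64_t result = 0;

    for (int bit = 0; bit < 32; ++bit) {
        if ((b >> bit) & 1u)
            result ^= (uint64_t)a << bit;
    }
    return result;
}
```

For instance, clmul32(3, 3) yields 5 rather than 9, because (x + 1)(x + 1) = x^2 + 1 when coefficients are reduced modulo 2.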

In the Intel IPP library the AESKEYGENASSIST and AESIMC instructions are not used because IPP focuses on performance processing.

Slide11.gif 

notes for slide #11:

Intel® AES instructions are a new set of instructions available beginning with the all-new 2010 Intel® Core™ processor family based on the 32nm Intel® microarchitecture codename Westmere. These instructions enable fast and secure data encryption and decryption using the Advanced Encryption Standard (AES), which is defined by FIPS Publication number 197. Since AES is currently the dominant block cipher and is used in various protocols, the new instructions are valuable for a wide range of applications. The architecture consists of six instructions that offer full hardware support for AES. Four instructions support AES encryption and decryption, and the other two instructions support AES key expansion. The AES instructions have the flexibility to support all usages of AES, including all standard key lengths, standard modes of operation, and even some nonstandard or future variants. They offer a significant increase in performance compared to the current pure-software implementations. Beyond improving performance, the AES instructions provide important security benefits. By running in data-independent time and not using tables, they help eliminate the major timing and cache-based attacks that threaten table-based software implementations of AES. In addition, they make AES simple to implement, with reduced code size, which helps reduce the risk of inadvertently introducing security flaws, such as difficult-to-detect side channel leaks.

/en-us/articles/intel-advanced-encryption-standard-aes-instructions-set

Slide12.gif

Slide13.gif

Slide14.gif 

notes for slide #14:

No longer a sample. Now provided as an integral and supported part of the IPP library product, in the “interfaces” directory. Ready to link binaries provided, compiled with the Intel compiler. Provide high-level data compression APIs compatible with de-facto standards. Provide low-level IPP functions as building blocks for individual data compression schemes. Improve performance over third-party equivalent not only via SIMD but also via multi-threading optimizations.

Improve CRC32 calculations using SIMD architecture features:
SSE 4.2 optimized using CRC32 → v6.1
AES-NI optimized using PCLMULQDQ → v7.0
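
For reference, the checksum being accelerated can be written as a plain bit-by-bit loop. The sketch below is a generic textbook CRC-32 using the reflected polynomial 0xEDB88320 that zlib uses; the function name crc32_bitwise is hypothetical, and nothing here reflects how the Intel IPP functions are implemented internally. A slow reference like this is mainly useful for checking that an optimized routine returns the same values.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit-by-bit CRC-32 (zlib-style: reflected polynomial 0xEDB88320,
   initial value 0xFFFFFFFF, final bitwise inversion). */
uint32_t crc32_bitwise(const unsigned char *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int k = 0; k < 8; ++k) {
            if (crc & 1u)
                crc = (crc >> 1) ^ 0xEDB88320u;
            else
                crc >>= 1;
        }
    }
    return ~crc;
}
```

The standard check value for this CRC is 0xCBF43926 for the nine ASCII bytes "123456789".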

Low Level Data Compression Primitives:
- Burrows-Wheeler Transform (BWT) (bzip2);
- Move-To-Front Transform (MTF) (bzip2);
- Run Length Encoding (RLE) (bzip2);
- Huffman coding (zlib);
- Dictionary-based method with sliding window Lempel-Ziv-Storer-Szymanski (LZSS);
- Variable Length Coding (VLC);
- Hash functions for checksum calculation: CRC32 (zlib and bzip2 versions), CRC32C and Adler32
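
To give a feel for what one of these building blocks does, here is a minimal C sketch of the Move-To-Front transform over the full byte alphabet. It is a generic textbook version; the function name mtf_encode is hypothetical and it is not the Intel IPP API.

```c
#include <stddef.h>
#include <string.h>

/* Move-To-Front encoding: each input byte is replaced by its current
   position in a table of all 256 byte values, and that byte is then
   moved to the front of the table. Recently seen bytes therefore map
   to small numbers. The out buffer must hold len bytes. */
void mtf_encode(const unsigned char *in, unsigned char *out, size_t len)
{
    unsigned char table[256];

    for (int i = 0; i < 256; ++i)
        table[i] = (unsigned char)i;

    for (size_t n = 0; n < len; ++n) {
        unsigned char symbol = in[n];
        size_t pos = 0;

        while (table[pos] != symbol)
            ++pos;
        out[n] = (unsigned char)pos;

        /* Shift entries 0..pos-1 up by one slot, then put the symbol first. */
        memmove(table + 1, table, pos);
        table[0] = symbol;
    }
}
```

Encoding the six bytes of "banana" this way gives 98, 98, 110, 1, 1, 1: once the three distinct letters have been seen, the repeats collapse to small values that later entropy coding can exploit.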

Slide15.gif 

notes for slide #15:

Dynamic linkage by default for all libs and apps - see Makefile to change default:
!IF "$(LINKAGE)" == ""
# LINKAGE = dynamic
LINKAGE = static
!ENDIF

- zlib
www.zlib.net
based on 1.2.3 version, current version is 1.2.5
http://sourceforge.net/projects/gnuwin32/files/zlib/
threaded externally using Intel OpenMP library
ippsCRC32_8u optimized for SSE 4.1+
Default Compression Level Changed
now consistent with standard zlib distribution

- bzip2
www.bzip.org
based on the 1.0.4 version, current version is 1.0.6:
http://www.bzip.org/1.0.4/bzip2-1.0.4.tar.gz
 ippsCRC32_BZ2_8u optimized for SSE 4.1+
Define “IPP_PREFIX” to rename entry points
see bzlib.h include file
adds ipp_ prefix to library entry points (ipp_BZ2_name)
Multi-threaded by use of Intel OpenMP library
enabled by defining IPP_BZ_MT
defined by CFLAGS macro in Makefile
#if defined(IPP_BZ_MT)
int usr_limit_num_threads;
if( ippStsNoErr != ippGetNumThreads( &usr_limit_num_threads ) ) usr_limit_num_threads = 1;
#pragma omp parallel for num_threads(usr_limit_num_threads) reduction(|: ret)
#endif
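
The fragment above shows only the pragma. A self-contained sketch of the same pattern, a parallel loop whose per-iteration status codes are merged with an OR reduction, might look like the following. The functions process_block and process_all_blocks are hypothetical stand-ins, not bzip2 or Intel IPP code. When the compiler's OpenMP support is not enabled, the pragma is ignored and the loop simply runs serially with the same result.

```c
#include <stddef.h>

/* Hypothetical per-block worker: returns 0 on success, or nonzero
   error bits on failure. Here it just inverts the bytes of the block. */
static int process_block(unsigned char *block, size_t len)
{
    if (block == NULL || len == 0)
        return 1;
    for (size_t i = 0; i < len; ++i)
        block[i] ^= 0xFFu;
    return 0;
}

/* Processes num_blocks independent blocks of block_len bytes each.
   Every thread keeps a private copy of ret; the copies are OR-ed
   together at the end, so any failing block makes the result nonzero. */
int process_all_blocks(unsigned char *data, size_t block_len,
                       int num_blocks, int num_threads)
{
    int ret = 0;

    if (num_threads < 1)
        num_threads = 1;

    #pragma omp parallel for num_threads(num_threads) reduction(|: ret)
    for (int i = 0; i < num_blocks; ++i)
        ret |= process_block(data + (size_t)i * block_len, block_len);

    return ret;
}
```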

- lzopack
www.lzop.org
based on 2.03 library and 1.02rc1 utility
http://www.oberhumer.com/opensource/lzo/download/lzo-2.03.tar.gz (LZO library V2.03)
http://www.lzop.org/download/lzop-1.02rc1.tar.gz (LZOP utility V1.02rc1)

- gzip
www.gzip.org
ipp_gzip Functionality Equivalent to Gnu gzip
clone due to GPL license
no library in bin directory – just an application
ipp_gzip adds Multi-threading Capability
> ipp_gzip file1 file2 file3…
compresses one file per hardware thread
> ipp_gzip a-very-huge-file.dat
“chunks” a single file over multiple hardware threads
-m option controls number of threads deployed
Multi-threaded using native threads
Windows CreateThread() or Linux pthread_create()
see vm/include and vm/src directories
enabled by defining GZIP_VMTHREADS
defined within Makefile “.c.obj” rules
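
The notes do not say how ipp_gzip divides a large file among threads. As a rough illustration of the idea only, the following C sketch splits a file of a given size into nearly equal contiguous chunks, one per thread. The type chunk_t and the function chunk_for_thread are hypothetical, and the even-split policy is an assumption rather than the utility's documented behavior.

```c
#include <assert.h>
#include <stddef.h>

/* Byte range assigned to one thread. */
typedef struct {
    size_t offset; /* first byte of the chunk */
    size_t length; /* number of bytes in the chunk */
} chunk_t;

/* Splits file_size bytes into num_threads contiguous chunks whose
   lengths differ by at most one byte. The first (file_size % num_threads)
   threads receive the extra byte. */
chunk_t chunk_for_thread(size_t file_size, size_t num_threads,
                         size_t thread_index)
{
    assert(num_threads > 0 && thread_index < num_threads);

    size_t base = file_size / num_threads;
    size_t extra = file_size % num_threads;
    chunk_t c;

    c.length = base + (thread_index < extra ? 1 : 0);
    c.offset = thread_index * base
             + (thread_index < extra ? thread_index : extra);
    return c;
}
```

With a 10-byte file and 4 threads, the chunks start at offsets 0, 3, 6 and 8 with lengths 3, 3, 2 and 2.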

Slide16.gif 

notes for slide #16:

Lib paths assumed to be defined, otherwise path needs to be specified directly or relative to $IPPROOT macro.

#if !defined( _IPP_NO_DEFAULT_LIB )
#if defined( _IPP_PARALLEL_DYNAMIC )
#pragma comment( lib, "ippdc" )
#pragma comment( lib, "ippcore" )
#elif defined( _IPP_PARALLEL_STATIC )
#pragma comment( lib, "ippdc_t" )
#pragma comment( lib, "ipps_t" )
#pragma comment( lib, "ippcore_t" )
#elif defined( _IPP_SEQUENTIAL_STATIC )
#pragma comment( lib, "ippdc_l" )
#pragma comment( lib, "ipps_l" )
#pragma comment( lib, "ippcore_l" )
#endif
#endif

Slide17.gif

Slide18.gif

Slide19.gif

Slide20.gif 

notes for slide #20:

Higher compression

JPEG XR file format supports higher compression ratios in comparison to JPEG, approximately 100:1 vs. 10:1
JPEG XR offers improved efficiency compared to JPEG, and its compression artifacts are often less objectionable than typical JPEG compression artifacts. JPEG XR offers a very wide range of compression levels, including perceptually lossless or mathematically lossless compression.

Broader data ranges - More image formats

JPEG XR supports 8bpc (bits per channel), 16bpc and 32bpc, as well as several special bit depth formats. Pixel values can be stored as integers, scaled fixed point numbers or full floating point values; this provides full support for numerous high dynamic range (HDR) imaging scenarios, as well as support for wide gamut color spaces. In addition to 3-channel RGB, JPEG XR supports monochrome, CMYK and n-channel formats with up to 16 independent channels. Many of these formats also support an alpha channel. This wide range of image formats allows for dramatically better image quality and allows this single new file format to effectively replace many previous formats that were required for specific scenarios.

Advanced decoding features

JPEG XR provides progressive decoding, allowing lower resolution previews or specific cropped areas to be displayed without the need to decode the entire image. Additionally, JPEG XR images can be cropped, rotated, flipped and resized (within certain constraints) without ever needing to decode and then re-encode the image. That means these operations are much faster and no additional image quality is lost to extra encoding steps.
 
JPEG XR adds support for 48-bit integer RGB (also known as Deep Color): it stores the value of each of the three channels as a 16-bit integer from 0 to 65,535, where 0 denotes the least intensity and 65,535 the greatest. Each channel therefore stores a much finer gradation of intensity. JPEG XR also supports the 16-bit integer CMYK color model and 16-bit integer grayscale.
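
To make the 48-bit layout concrete, the C sketch below defines a pixel with three 16-bit channels and widens 8-bit samples to that range. The names rgb48_t, expand_8_to_16 and rgb24_to_rgb48 are hypothetical, and multiplying by 257 is simply one common way to map 0..255 onto 0..65,535 exactly; it is not taken from the JPEG XR specification or from the Intel IPP API.

```c
#include <stdint.h>

/* One 48-bit RGB pixel: three unsigned 16-bit channels. */
typedef struct {
    uint16_t r;
    uint16_t g;
    uint16_t b;
} rgb48_t;

/* Maps an 8-bit sample onto the full 16-bit range.
   0 stays 0 and 255 becomes 65535, because 255 * 257 = 65535. */
uint16_t expand_8_to_16(uint8_t v)
{
    return (uint16_t)(v * 257u);
}

/* Widens a 24-bit RGB triple to a 48-bit pixel. */
rgb48_t rgb24_to_rgb48(uint8_t r, uint8_t g, uint8_t b)
{
    rgb48_t px;

    px.r = expand_8_to_16(r);
    px.g = expand_8_to_16(g);
    px.b = expand_8_to_16(b);
    return px;
}
```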

JPEG-XR support in the Intel IPP library

UIC JPEG-XR sample encoder and decoder with optional tile support for RGB color images with and without alpha channel and grayscale images with the following bit depths: 8 bits unsigned integer (8u), 16 bits signed (16s) and unsigned (16u) integer, 32 bits signed integer (32s), 16 and 32 bits floating-point (16f and 32f).
IPP library support for JPEG-XR forward and inverse core transforms for 16s, 32s and 32f data types and Variable length code (VLC) encode and decode for 32s data types.  

Enhancements of image processing functions

Enhancement request for double precision support in image processing routines
Add new data type support for image processing functions
Much more new functionality introduced in the Image Processing domain.

JPEG XR Compression Rates

Lossless compression ratio of ~2.5x
Lossy compression ratios up to 100x

JPEG XR Improved Quality over JPEG

Comparable to JPEG 2000
High dynamic range up to 32 bits per color
Wide color spaces support: Gray, RGB, CMYK, YUV, RGBE

JPEG XR Computation Effectiveness

Faster and less complex than JPEG 2000
Even faster with sub-sampling 422/420
Reduced resolution for progressive decode
Native tiling allows for partial decode of images

Slide21.gif 

notes for slide #21:

- libjpeg
www.ijg.org
based on version 6b, current version is 8b
http://www.ijg.org/files/
http://packages.debian.org/source/lenny/libjpeg6b
http://ftp.de.debian.org/debian/pool/main/libj/libjpeg6b/libjpeg6b_6b.orig.tar.gz

Q: What threading mechanism was used to implement the UIC/JPEG performance improvements? OpenMP? TBB? Native? Other?
A: OpenMP is used in IPP 7.0 JPEG. JPEG-XR already uses TBB threading.

Q: Does “up to 6x faster on 8 cores” for improved UIC/JPEG performance mean an 8-core machine or 8 hardware threads? If an 8-core machine, what machine was used to establish this 6x number?
A: We did our measurements on a Nehalem system (4 hw threads plus 4 HT threads).

Q: What type of threading changes were made to UIC to yield such an impressive performance improvement?
A: The advantage comes from parallel processing of JPEG restart intervals (which can be processed independently). The limitation is that it only works for files compressed with the JPEG restart intervals option. A short KB article is also available.
JPEG new threading model in IPP 7.0
/en-us/
JPEG XR Codec support in Intel® IPP 7.0 Beta - an Introduction, features and advantages
/en-us/

Q: Is the new threading model used for encoding also used for decoding?
A: Yes, the new model will work for both encoder and decoder in baseline lossy compression mode. It has not yet been implemented for lossless mode.

Q: Does this multi-threading technique mean the JPEG encoder will add many restart markers (RSTm) into the JPEG file?
A: For best encoder efficiency, set the restart interval equal to the number of MCUs per image row (one RST marker per image row).
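
As a worked example of that recommendation, the number of MCUs per row follows from the image width and the chroma subsampling. The C sketch below assumes an interleaved baseline scan, where the MCU width in pixels is 8 times the largest horizontal sampling factor; the function name mcus_per_row is hypothetical.

```c
#include <assert.h>

/* Number of MCUs needed to cover one image row in an interleaved scan.
   max_h_sampling is the largest horizontal sampling factor among the
   components: 1 for 4:4:4, 2 for 4:2:2 and 4:2:0. Partial MCUs at the
   right edge count as whole MCUs, hence the rounding up. */
int mcus_per_row(int image_width, int max_h_sampling)
{
    assert(image_width > 0 && max_h_sampling > 0);

    int mcu_width = 8 * max_h_sampling;
    return (image_width + mcu_width - 1) / mcu_width;
}
```

A 1920-pixel-wide 4:2:0 image has 16-pixel-wide MCUs, so the suggested restart interval is 120; the same image stored as 4:4:4 would use 240.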

Slide22.gif 

notes for slide #22:

JPEG old threading model – IPP 6.1 and earlier
- Based on parallel processing of one row of MCUs by each thread
- Each thread performs the JPEG operations (CC, SS, DCT) on its own MCU row in parallel, except for the VLC step.
- VLC can only be done serially because of data dependencies between MCU blocks in this operation
VLC is the main challenge in parallel JPEG processing

IPP 7.0 JPEG new threading model
- The JPEG standard allows the data stream to be split into independently processed segments called Restart Intervals (RSTI).
Each restart interval contains a fixed number of MCUs
RSTIs are separated by restart markers (RSTm)
- The new threading model is based on parallel processing of these RSTIs
Using RSTIs resolves the main bottleneck of the old model: the serial VLC step in the JPEG pipeline.
This is possible because of the main property of RSTIs: the MCU blocks of one RSTI are independent of the MCU blocks of any other RSTI.
This property allows threads to perform all JPEG operations (CC, SS, DCT and VLC) for each RSTI in parallel
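
To illustrate how a decoder could locate the independently decodable segments, the C sketch below scans entropy-coded scan data for restart markers, which are the two-byte codes 0xFFD0 through 0xFFD7, and records where each interval begins. It is a simplified illustration, not the UIC implementation; the function name find_interval_starts is hypothetical, and the input is assumed to contain only the scan's entropy-coded bytes. Each recorded start could then be handed to a separate thread.

```c
#include <stddef.h>

/* Records the byte offset at which each restart interval begins.
   The first interval starts at offset 0; every later one starts right
   after a restart marker (0xFF followed by 0xD0..0xD7). A stuffed
   0xFF 0x00 pair is ordinary data and is skipped by the range test.
   Returns the number of starts written, at most max_starts. */
size_t find_interval_starts(const unsigned char *scan, size_t len,
                            size_t *starts, size_t max_starts)
{
    size_t count = 0;

    if (len == 0 || max_starts == 0)
        return 0;

    starts[count++] = 0;
    for (size_t i = 0; i + 1 < len && count < max_starts; ++i) {
        if (scan[i] == 0xFF && scan[i + 1] >= 0xD0 && scan[i + 1] <= 0xD7) {
            starts[count++] = i + 2; /* data resumes after the marker */
            ++i;                     /* skip the marker's second byte */
        }
    }
    return count;
}
```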

Slide23.jpg 

notes for slide #23:

The Yonah microarchitecture, released in early 2006, represented a significant shift for the Intel processor family. It included introduction of the SSSE3 SIMD instruction set (Supplemental Streaming SIMD Extensions 3), a follow-on to the SSE2 and SSE3 instruction sets. Merom processors are represented by the popular Core 2 processors, which also support the SSSE3 processor instructions. Additionally, all Intel® Atom™ processors support the SSSE3 instructions.

At the time of this presentation (2010) processors through the Westmere family were shipping from Intel. Processors employing the Intel AVX instruction set are not yet shipping at this time. Processors based on the “Sandy Bridge” microarchitecture will be the first to employ the AVX instruction set.

Slide24.gif 

notes for slide #24:

The px/mx and t7 optimizations have been removed from the core 7.0 library. The g9/e9 optimizations represent significant additions and are tuned for the Intel Advanced Vector Extensions (Intel AVX) SIMD instructions. Optimizations for the Itanium family have been removed from the 7.0 library. “Generic” packages equivalent to the px/mx optimization layers in the 6.1 version of the library will be released by the end of this year (2010). The “generic package” is expected to contain a PX/MX dynamic lib (can be dispatched) and a PX/MX static lib (no dispatching) for both Linux and Windows.

The Speech Recognition domain has been removed from the 7.0 library. Instructions will be provided to show how to make a custom dll/so library based on the 6.1 library for use with the 7.0 library.
 
AMD and other non-Intel 32-bit processors will generally use the w7 optimization (SSE2). AMD and other non-Intel 64-bit processors will generally use the m7 optimization (SSE3). The remaining optimizations will run only on Intel or fully compatible processors, but only if the processor supports the SIMD instructions required by those optimizations.
The library always attempts to dispatch to the highest optimization level possible, based on data reported by the processor’s CPUID instruction. When evaluating the CPUID return data, the library looks for indication of support for the respective SIMD instructions (SSE2, SSE3, etc.), not the processor's manufacturer code. If a non-Intel processor exists that also supports SSE4.1, for example, and reports it in an Intel-compatible way, then that processor will be dispatched to the p8 or y8 optimization layer.
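
The feature-based selection described above can be sketched as a small C function. This is a deliberately simplified illustration: it covers only the layers named in these notes (w7, m7, p8, y8, g9, e9), ignores intermediate layers, and uses hypothetical names (cpu_features_t, ipp_layer_t, select_layer). It is not the library's dispatcher, and the pairing of each code with a 32-bit or 64-bit build follows the ordering used in the notes. As in the notes, the decision depends only on reported features, never on the manufacturer.

```c
#include <stdbool.h>

/* Feature flags as they would be derived from CPUID. */
typedef struct {
    bool sse2;
    bool sse3;
    bool sse41;
    bool avx;
} cpu_features_t;

/* Optimization layers mentioned in the notes. */
typedef enum {
    LAYER_NONE, /* no layer from this simplified list applies */
    LAYER_W7,   /* 32-bit, SSE2 */
    LAYER_M7,   /* 64-bit, SSE3 */
    LAYER_P8,   /* 32-bit, SSE4.1 */
    LAYER_Y8,   /* 64-bit, SSE4.1 */
    LAYER_G9,   /* 32-bit, AVX */
    LAYER_E9    /* 64-bit, AVX */
} ipp_layer_t;

/* Picks the highest layer whose required instructions are reported. */
ipp_layer_t select_layer(cpu_features_t f, bool is_64bit)
{
    if (f.avx)
        return is_64bit ? LAYER_E9 : LAYER_G9;
    if (f.sse41)
        return is_64bit ? LAYER_Y8 : LAYER_P8;
    if (is_64bit)
        return f.sse3 ? LAYER_M7 : LAYER_NONE;
    return f.sse2 ? LAYER_W7 : LAYER_NONE;
}
```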

Slide25.gif 

notes for slide #25:

easier upgrade and integration with the compiler product
The Intel compiler is not required to use the Intel IPP library!
common library components (such as OpenMP) included in compiler directories
side-by-side installs still possible (more than one version of the library can be installed)
Linux layout is always side-by-side, uses soft-links into the “default” directory
may see directories from other products, such as MKL, TBB, and, of course, the compiler

Windows: User specifies upgrade or side-by-side (SxS) during installation
Linux: Always side-by-side
Symbolic links /opt/intel/ComposerXE-2011
Compiler -> /opt/intel/ComposerXE-2011/Compiler
MKL -> /opt/intel/ComposerXE-2011/MKL
Common, always installed compiler based runtime (OMP, SVML, LIBIRC).
Updates with compiler update

Slide26.gif 

notes for slide #26:

Align with Intel compiler and de-facto rules in use today.

/en-us/articles/ipp-70-beta-selecting-the-intelr-ipp-libraries-needed-by-your-application

Slide27.gif 

notes for slide #27:

A comparison of the Intel IPP library with the equivalent functions from the FrameWave open source library, both running on the identical AMD processor. Our goal is to be your number one choice for performance software on the x86 platform.

Slide28.gif 

notes for slide #28:

software.intel.com/en-us/articles/intel-xe-product-comparison

software.intel.com/en-us/articles/which-version-of-the-intel-ipp-intel-mkl-and-intel-tbb-libraries-are-included-in-the-intel-composer-bundles

Slide29.gif 

notes for slide #29:
www.intel.com/software/products/ipp
software.intel.com/en-us/articles/intel-ipp-kb/all
software.intel.com/en-us/forums/intel-integrated-performance-primitives
premier.intel.com
software.intel.com/en-us/articles/buy-or-renew
software.intel.com/en-us/articles/intel-software-evaluation-center

Slide30.gif 

For more complete information about compiler optimizations, see our Optimization Notice.