Image Processing Acceleration Techniques using Intel® Streaming SIMD Extensions and Intel® Advanced Vector Extensions

Introduction

Modern Intel processors feature acceleration through SIMD (Single Instruction Multiple Data) instructions, including the wide range of Intel® Streaming SIMD Extensions (Intel® SSE) instructions and the new Intel® Advanced Vector Extensions (Intel® AVX) instructions. Image processing data structures and algorithms are often suitable candidates for optimization with these instruction sets. Combined with the Intel® C++ Compiler's ability to autovectorize loops, they provide an efficient way to improve performance in image processing applications.
In this paper we detail some well-known transformation techniques, with code examples illustrating how to take advantage of Intel® SSE and Intel® AVX to transform image data, together with compiler autovectorization of image processing algorithms. The paper presents optimized implementations of data transformations and algorithms (using varying data types and sizes), along with an analysis that compares performance, giving measured speedups for Intel® SSE optimized code and estimated speedups for Intel® AVX optimized code.

Intel® AVX is a 256-bit instruction set extension to Intel® SSE, designed for applications that are floating-point intensive. Intel® AVX extends all 16 XMM registers to 256-bit YMM registers; doubling the register width leads to improved performance and power efficiency over the 128-bit SIMD instructions. Use of Intel® AVX also results in fewer register copies, more efficient register use and smaller code size.
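To make the effect of register width concrete, here is a minimal sketch of a float crossfade (out = a*alpha + b*(1 - alpha)) written with intrinsics. It is an illustration only, not the implementation from the paper: the function name `crossfade_f32`, its parameters, and the requirement that the element count be a multiple of 8 are assumptions made for this example. The Intel® SSE path handles 4 floats per instruction; the Intel® AVX path, compiled only when the compiler enables AVX code generation, handles 8.

```cpp
#include <cstddef>
#include <immintrin.h>  // Intel SSE intrinsics, plus Intel AVX intrinsics when enabled

// Hypothetical helper (not from the paper): blends two float buffers,
// out[i] = a[i] * alpha + b[i] * (1 - alpha).
// Assumptions: n is a multiple of 8 and all pointers are 32-byte aligned.
void crossfade_f32(const float* a, const float* b, float* out,
                   std::size_t n, float alpha)
{
#ifdef __AVX__
    // 256-bit YMM registers: 8 floats per iteration.
    const __m256 va = _mm256_set1_ps(alpha);
    const __m256 vb = _mm256_set1_ps(1.0f - alpha);
    for (std::size_t i = 0; i < n; i += 8) {
        const __m256 x = _mm256_mul_ps(_mm256_load_ps(a + i), va);
        const __m256 y = _mm256_mul_ps(_mm256_load_ps(b + i), vb);
        _mm256_store_ps(out + i, _mm256_add_ps(x, y));
    }
#else
    // 128-bit XMM registers: 4 floats per iteration.
    const __m128 va = _mm_set1_ps(alpha);
    const __m128 vb = _mm_set1_ps(1.0f - alpha);
    for (std::size_t i = 0; i < n; i += 4) {
        const __m128 x = _mm_mul_ps(_mm_load_ps(a + i), va);
        const __m128 y = _mm_mul_ps(_mm_load_ps(b + i), vb);
        _mm_store_ps(out + i, _mm_add_ps(x, y));
    }
#endif
}
```

Both paths perform the same arithmetic; the AVX loop simply runs half as many iterations. Whether that halving translates into a proportional speedup depends on memory bandwidth and the target processor.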

Using the proposed techniques we measured speedups of 1.9x to 2.7x with Intel® SSE and estimated speedups of 2.2x to 3.6x with Intel® AVX, as shown in the performance speedup summary below.

Filter                   Intel® SSE Speedup   Intel® AVX Speedup
Sepia (int base)         2.6x                 3.1x
Sepia (float base)       1.9x                 2.2x
Crossfade (int base)     2.7x                 3.6x
Crossfade (float base)   1.9x                 2.4x

Measured on an Intel® Core™ i7 processor with the recommended chunk size of ~50000 pixels. Note that Intel® AVX performance was estimated using a simulator and does not take into account future architecture improvements.
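For orientation, a float-based sepia filter of the kind named in the table can be sketched as a plain scalar loop. This is a sketch under stated assumptions, not the paper's code: the function name `sepia_f32` is hypothetical, and the coefficients are a commonly used sepia matrix chosen for illustration, since this section does not state the ones the paper uses. Each pixel is computed independently of its neighbors, which is the property that makes such a loop a candidate for SIMD optimization or compiler autovectorization.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical helper (not from the paper): applies a sepia tone in place to
// n_pixels RGBA pixels stored as interleaved floats (R, G, B, A), range 0..255.
// The coefficients are an assumed, commonly used sepia matrix.
void sepia_f32(float* rgba, std::size_t n_pixels)
{
    for (std::size_t i = 0; i < n_pixels; ++i) {
        float* p = rgba + 4 * i;
        const float r = p[0];
        const float g = p[1];
        const float b = p[2];
        // Weighted sums of the input channels, clamped to the 0..255 range.
        p[0] = std::min(255.0f, 0.393f * r + 0.769f * g + 0.189f * b);
        p[1] = std::min(255.0f, 0.349f * r + 0.686f * g + 0.168f * b);
        p[2] = std::min(255.0f, 0.272f * r + 0.534f * g + 0.131f * b);
        // p[3] (alpha) is left unchanged.
    }
}
```

A scalar version like this is the kind of baseline against which SIMD speedups are measured; whether a given compiler vectorizes it automatically depends on the compiler version and options.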

Overview


The code examples provided in this article assume use of the Intel® C++ Compiler and require basic knowledge of SIMD, Intel® SSE instruction intrinsics and how to perform autovectorization. Compiler features, options and pragmas apply to Intel® C++ Compiler 11.1.35 or later, which supports new instruction sets such as Intel® AVX.
The code examples are in C++ and were built and analyzed on Microsoft Windows* (Vista and XP).

Scope and assumptions:

  1. Images are represented by uncompressed RGBA pixel values where each color channel is represented by either an integer (8 bit) or a float (32 bit)
  2. To simplify conversions, color values are represented by a number from 0 to 255, stored in either 8 bit integer or 32 bit float
  3. Data is aligned to 16 bytes except for the processing involving use of Intel® AVX where 32 byte alignment is used
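The assumptions above can be sketched as data types and an aligned allocation. The type names `PixelU8` and `PixelF32` and the helper names are hypothetical, introduced only for this illustration; `_mm_malloc` and `_mm_free` are the aligned allocation routines that ship with the intrinsics headers.

```cpp
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h>  // _mm_malloc, _mm_free

// Hypothetical pixel types (not from the paper), one member per RGBA channel.
struct PixelU8  { std::uint8_t r, g, b, a; };  // 4 bytes per pixel
struct PixelF32 { float r, g, b, a; };         // 16 bytes per pixel, one XMM register

// Hypothetical helper: allocates n float pixels on the requested boundary.
// Pass 16 for Intel SSE processing or 32 for Intel AVX processing.
inline PixelF32* alloc_pixels_f32(std::size_t n, std::size_t alignment)
{
    return static_cast<PixelF32*>(_mm_malloc(n * sizeof(PixelF32), alignment));
}

// Memory obtained from _mm_malloc must be released with _mm_free.
inline void free_pixels(void* p)
{
    _mm_free(p);
}
```

With this layout one float pixel fills a 128-bit register exactly and two fill a 256-bit register, so aligned loads and stores can be used as long as the buffer starts on the matching boundary.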

Notes on performance/speedup:

  • Performance figures for functions using Intel® AVX are estimates, since Intel processors featuring Intel® AVX are not yet available. An architecture emulator (Intel® Software Development Emulator) and a simulator (Intel® Architecture Code Analyzer) were used to verify behavior and estimate Intel® AVX performance
  • The actual performance depends on processor architecture, cache configuration and size, frequency etc.
  • Only a few image processing filters (algorithms) are presented in the paper. Performance speedup applicability to other filters depends on the filter complexity and inter pixel dependencies. There are no guarantees of improved performance utilizing the discussed techniques on other filters

Download PDF

To read the rest of this article, download by clicking: here (pdf size: 1MB)
