Deferred Mode Image Processing Framework: Simple and efficient use of Intel® multi-core technology and many-core architectures with Intel® Integrated Performance Primitives

Introduction

In recent years, the resolution of image sensors has increased significantly, but image processing algorithms have not improved in efficiency. The bottleneck for nearly all image processing algorithms is memory access, and access time increases significantly when image data resides outside the L2 cache. Even with more computing power available, the increased size of the images impacts the ability to achieve high performance due to the increased number of cache misses. To improve performance of image processing applications, developers need to ensure that data is kept in the L2 cache as long as possible, and to parallelize image processing execution using a slice size comparable to the L2 cache size. The Deferred Mode Image Processing (DMIP) framework, available with the Intel® Integrated Performance Primitives (Intel® IPP) product, provides an efficient software solution for this dilemma.
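
The following minimal sketch (plain C++, with hypothetical placeholder operations rather than IPP calls) illustrates the cache-blocking idea behind this approach: running a chain of operations slice by slice keeps the intermediate data of one operation hot in cache for the next one, instead of streaming the whole image through the cache once per operation.

// Sketch only: contrasts whole-image processing with slice-based (cache-blocked)
// processing. opA/opB are hypothetical placeholders, not IPP primitives.
#include <algorithm>
#include <cstddef>
#include <vector>

static void opA(unsigned char* p, size_t n) { for (size_t i = 0; i < n; ++i) p[i] /= 2; }
static void opB(unsigned char* p, size_t n) { for (size_t i = 0; i < n; ++i) p[i] = 255 - p[i]; }

// Whole-image approach: each pass streams the entire image through the cache,
// so the data written by opA has usually been evicted before opB reads it back.
void processWholeImage(std::vector<unsigned char>& img) {
    opA(img.data(), img.size());
    opB(img.data(), img.size());
}

// Slice-based approach: pick a slice comparable to the L2 cache size and run the
// whole operation chain on it before moving on, so intermediate data stays in cache.
void processBySlices(std::vector<unsigned char>& img, size_t sliceBytes) {
    for (size_t off = 0; off < img.size(); off += sliceBytes) {
        size_t n = std::min(sliceBytes, img.size() - off);
        opA(img.data() + off, n);
        opB(img.data() + off, n);
    }
}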

DMIP provides both data-parallel and task-parallel frameworks built on Intel® Integrated Performance Primitives, enabling image processing tasks to fully utilize the computational power of modern multi-core and many-core Intel architectures.


DMIP in IPP 6.1

In Intel® IPP version 6.1, the DMIP component supports the following image processing operations:
  • Basic arithmetic unary and binary operations (+, -, *, /, Abs, Ln, Min, Max, Sqrt, etc.)
  • Logical operations (&, |, ~, etc.)
  • Thresholding operations
  • Image type and channels conversion
  • Color conversion
  • Statistical operations
  • Image filters:
    • General filters
    • Special filters (Box, Min, Max, Median, etc.)
    • Fixed kernel filters (Sobel, Prewitt, Scharr, etc.)
    • Morphology (erosion, dilation)
  • Linear transform DFT/FFT
  • Polyadic operations with up to 5 arguments
DMIP has more than thirty built-in nodes to support these operations, which translates to around 1800 atomic operations (IPP functions) across data types and color channels for image processing. In addition to the built-in nodes, developers can add user-defined nodes to DMIP.

To see how DMIP works, let's consider a simple image processing task, edge detection using differentiation operators:
e(x,y) = |∂I(x,y)/∂x| + |∂I(x,y)/∂y|

where I(x,y) is the source image and e(x,y) is the destination image with detected edges. In this case the source image is a 3-channel 8-bit image and the output image is an 8-bit grayscale image, which requires an additional color conversion of the source image. For the calculation of the partial derivatives, we will use the Sobel operators.

As a first step, consider the following IPP code for the Sobel operators' edge-based detector:

IppiSize roi; // Input and output image sizes
Ipp8u* pSrcImg; int srcImgStep; // Input C3, 8u image in IPP image format
Ipp8u* pDstImg; int dstImgStep; // Output C1, 8u image in IPP image format

Ipp8u* pGrayData; int grayDataStep; // Input grayscale image

Ipp16s* pIntBuffer; int IntBufferStep; // Intermediate buffer

Ipp8u* pSBuffer; int SBufferSize, SBufferSize1; // Sobel operators temporary memory

// Allocate memory for grayscale converted image
pGrayData = ippiMalloc_8u_C1(roi.width,roi.height,&grayDataStep);

// Calculate and allocate temporary filters memory
ippiFilterSobelHorizGetBufferSize_8u16s_C1R(roi,ippMskSize3x3,&SBufferSize);
ippiFilterSobelVertGetBufferSize_8u16s_C1R(roi,ippMskSize3x3,&SBufferSize1);

SBufferSize = SBufferSize < SBufferSize1 ? SBufferSize1 : SBufferSize;

pSBuffer = ippsMalloc_8u(SBufferSize);

// Allocate memory for intermediate buffer. 
// Its size is twice the image size and is used for dx and dy storage.
pIntBuffer = ippiMalloc_16s_C1(roi.width,roi.height*2,&IntBufferStep);

// Convert input image into grayscaled image
ippiRGBToGray_8u_C3C1R((const Ipp8u*)pSrcImg, srcImgStep, pGrayData, grayDataStep, roi);

// Calculate dx
ippiFilterSobelHorizBorder_8u16s_C1R((const Ipp8u*) pGrayData, grayDataStep,
               pIntBuffer, IntBufferStep, roi, ippMskSize3x3, ippBorderRepl, 0, pSBuffer);
// Calculate dy
ippiFilterSobelVertBorder_8u16s_C1R((const Ipp8u*) pGrayData, grayDataStep,
               pIntBuffer+roi.height*(IntBufferStep/2), IntBufferStep, roi, ippMskSize3x3,
               ippBorderRepl, 0, pSBuffer);

// Take absolute values of dx and dy.
ippiAbs_16s_C1IR(pIntBuffer, IntBufferStep, roi);
ippiAbs_16s_C1IR(pIntBuffer+roi.height*(IntBufferStep/2), IntBufferStep, roi);

//Add them. Edges are detected.
ippiAdd_16s_C1IRSfs((const Ipp16s*) pIntBuffer+roi.height*(IntBufferStep/2), IntBufferStep,
               pIntBuffer, IntBufferStep, roi, 0);

// Convert image with edges to dst image format
ippiConvert_16s8u_C1R( pIntBuffer, IntBufferStep, (Ipp8u*)pDstImg, dstImgStep, roi );

// Free resources
ippiFree(pGrayData);
ippiFree(pIntBuffer);
ippsFree(pSBuffer);

Figure 1. IPP Sobel edge detector code

This approach uses the highly optimized, parallelized IPP operations to process the whole image. Since two intermediate buffers are required for storing the dx and dy partial derivatives, the intermediate memory requirement grows in proportion to the image size. Each operation is performed independently, and the internal threads in IPP are synchronized after each operation completes. In the Sobel edge detector of Figure 1, there are seven sync points per task and a significant number of cache misses.

To minimize cache misses and increase performance, we will execute the same task with DMIP. Figure 2 shows a task graph of the Sobel edge detector. The circles in the graph designate operations, while the edges indicate data flow.



Figure 2. Sobel edge detector task graph

Figure 3 shows the DMIP code in the symbolic API.

Image Src(pSrcImg, Ipp8u, IppC3…); // Source image in DMIP format
Image Dst(pDstImg, Ipp8u, IppC1…); // Destination image in DMIP format

Kernel Kh(idmFilterSobelHoriz,ippMskSize3x3,ipp8u,ipp16s); // Dx operator
Kernel Kv(idmFilterSobelVert,ippMskSize3x3,ipp8u,ipp16s); // Dy operator

Graph O = ToGray(Src); // So that ToGray(Src) is detected as a common subexpression.

// Compile and execute task
Dst = To8u(Abs(O*Kh)+ Abs(O*Kv));


Figure 3. The Sobel edge detector, DMIP implementation

By default, the DMIP engine determines the optimal slice size to fit into the L2 cache. In some cases, the optimal slice size can be changed according to preferences set by the application developer. Figure 4 shows DMIP code in the symbolic API with separate compilation and execution steps, which allows the application to control the slice size and limit the number of threads when the input image varies.

Image Src(pSrcImg, Ipp8u, IppC3…); // Source image in DMIP format
Image Dst(pDstImg, Ipp8u, IppC1…); // Destination image in DMIP format

Kernel Kh(idmFilterSobelHoriz,ippMskSize3x3,ipp8u,ipp16s); // Dx operator
Kernel Kv(idmFilterSobelVert,ippMskSize3x3,ipp8u,ipp16s); // Dy operator

Graph O = ToGray(Src); // So that ToGray(Src) is detected as a common subexpression.

// Build task graph
Graph G = (To8u(Abs(O*Kh)+ Abs(O*Kv))) >> Dst;

// Compile
G.Compile(sliceSize);
// Execute task
G.Execute(threadsNumber);


Figure 4. The modified DMIP Sobel edge detector

DMIP uses a compiler to split an image into small data portions (slices), then calls the same IPP functions as in Figure 1. The compilation and execution processes are described in the paper Deferred Image Processing in Intel® IPP Library (presented at the 2008 Computer Graphics and Imaging Conference) and in the article A Landmark in Image Processing: DMIP.

Let us review the image processing task. Processing begins with a filter operation (Flt), followed by conversion of the filtered data type to unsigned char (the To8u operation). "Src" refers to the operation of reading from the source image buffer, and "Dst" refers to the operation of writing to the destination image buffer. The parallel execution algorithm splits a slice into sub-slices so that the sub-slices can be processed in parallel on the available CPUs, as shown in Figure 5. DMIP does not analyze thread dependencies, so the threads must be synchronized after each operation to avoid data races.



Figure 5. Parallel slice execution.
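
A minimal sketch of this execution pattern is shown below, using OpenMP as a stand-in for DMIP's internal threading (the operations are hypothetical placeholders): each operation is applied to all sub-slices in parallel, and the implicit barrier at the end of each parallel region is the sync point described above.

// Sketch only: one slice split into sub-slices processed in parallel, with a
// synchronization point (the implicit barrier) after every operation, as in
// normal mode. opA/opB stand in for IPP primitives applied to a sub-slice.
#include <algorithm>
#include <cstddef>

static void opA(unsigned char* p, size_t n) { for (size_t i = 0; i < n; ++i) p[i] /= 2; }
static void opB(unsigned char* p, size_t n) { for (size_t i = 0; i < n; ++i) p[i] = 255 - p[i]; }

void runSliceNormalMode(unsigned char* slice, size_t sliceSize, int numThreads) {
    const size_t sub = (sliceSize + numThreads - 1) / numThreads;

    #pragma omp parallel for num_threads(numThreads)   // operation 1 on sub-slices
    for (int t = 0; t < numThreads; ++t) {
        size_t off = (size_t)t * sub;
        if (off < sliceSize) opA(slice + off, std::min(sub, sliceSize - off));
    }
    // Implicit barrier above = sync point: operation 2 starts only after all
    // threads finish operation 1 on their sub-slices.
    #pragma omp parallel for num_threads(numThreads)   // operation 2 on sub-slices
    for (int t = 0; t < numThreads; ++t) {
        size_t off = (size_t)t * sub;
        if (off < sliceSize) opB(slice + off, std::min(sub, sliceSize - off));
    }
}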


Graph optimization technique

Performance analysis of this DMIP algorithm showed that the parallel slice execution shown in Figure 5 worked well on systems with 2 cores. However, due to the number of sync points, it did not perform well on systems with more than 4 cores. In the previous example, parallel slice execution requires 9 sync points per slice (8 intermediate and 1 between slices), i.e., 9 × the number of slices per task. By using DMIP's graph optimization technique to analyze the whole algorithm, sync points were reduced or eliminated, improving performance significantly on multi-core and many-core systems.

DMIP uses an internal representation of the graphs to analyze a whole task and provides four levels of graph optimization. Each level is designed for specific workload characteristics and, in general, is targeted at reducing the number of sync points. DMIP was designed to execute an arbitrary image processing task represented as a directed acyclic graph, and the optimization modes were developed without placing restrictions on the task graph.

DMIP graph optimization supports several modes:

0. Normal mode - simple parallel execution. See Figure 5.

1. Light optimization mode - extraction of a branch with linear chains of nodes.

2. Medium optimization mode - translation to the graph multilayer form representation to extract independent or locally independent operations. This mode provides operations for task parallelization.

3. High optimization mode - merging of graph layers.

4. Aggressive optimization mode - removing sync points between slices.

In the normal mode (0), a task is executed "as is", i.e., sequentially with synchronization after each operation.

In the light optimization mode (1), the DMIP graph compiler extracts chains of nodes and combines each of them into a container, which we will call a 'SuperNode'. The nodes in an extracted linear chain are executed without intermediate synchronization, reducing both the number of nodes and the number of sync points. For the Sobel edge detector example, four linear chains are extracted:

1. Src->ToGray
2. Sv->Abs
3. Sh->Abs
4. Add->To8u->Dst

There is one restriction: a filter operation must be the first node in a linear chain. This is needed so that the border pixels can be prepared before the filter processes its data slice at the sub-slice level.
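
The following sketch illustrates how such linear chains could be extracted from a task DAG; the node/graph representation is a hypothetical simplification, not DMIP's internal data structure. Chain growth stops at nodes with multiple predecessors or successors, and a filter node is never appended to an existing chain, reflecting the restriction above.

// Sketch only: extracting linear chains ("SuperNodes") from a task DAG.
// Nodes are assumed to be stored in topological order.
#include <vector>

struct Node {
    std::vector<int> succ;   // indices of successor nodes
    int numPred = 0;         // number of predecessors
    bool isFilter = false;   // a filter may only start a chain, never continue one
};

std::vector<std::vector<int>> extractChains(const std::vector<Node>& g) {
    std::vector<std::vector<int>> chains;
    std::vector<bool> used(g.size(), false);
    for (int start = 0; start < (int)g.size(); ++start) {
        if (used[start]) continue;
        std::vector<int> chain{start};
        used[start] = true;
        int cur = start;
        // Extend while the link is strictly linear: one successor that has
        // exactly one predecessor, is unused, and is not a filter node.
        while (g[cur].succ.size() == 1) {
            int next = g[cur].succ[0];
            if (used[next] || g[next].numPred != 1 || g[next].isFilter) break;
            chain.push_back(next);
            used[next] = true;
            cur = next;
        }
        chains.push_back(chain);
    }
    return chains;
}

Applied to the Sobel graph of Figure 2 (with nodes stored in topological order), this procedure yields exactly the four chains listed above.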

In the medium optimization mode (2), the DMIP graph compiler uses the linear chains to extract locally independent operations, i.e., a layer. All nodes in the same layer are independent, so they can be executed either 1) in parallel (task parallelism) or 2) sequentially, but without sync points between them. The task execution strategy depends on the target micro-architecture. Mode (2) introduces two major task characteristics: task width and task height.

Definition. Task height - number of layers.
Definition. Task width - maximum number of the nodes inside a layer.
Definition. Thin Task - task width is equal to 1.

A wider, shorter graph is more parallel and contains less sequential code. Thin tasks cannot be accelerated using mode (2) because they contain no parallel regions. For the Sobel edge detector example, the task height is 3 and the task width is 2. The first SuperNode, Src->ToGray, belongs to the first layer. The second and third SuperNodes, Sv->Abs and Sh->Abs, belong to the same layer, forming two task-parallel regions (see Figure 2). The last SuperNode, Add->To8u->Dst, is in the third layer.
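
A sketch of how task height and width could be computed from the layered form follows; again, the SuperNode representation is hypothetical. Each node's layer is one more than the maximum layer of its predecessors, the height is the number of layers, and the width is the largest layer population.

// Sketch only: computing task height (number of layers) and task width
// (maximum number of nodes inside a layer) for a DAG of SuperNodes.
#include <algorithm>
#include <map>
#include <vector>

struct SuperNode {
    std::vector<int> pred;   // indices of predecessor SuperNodes
};

void taskShape(const std::vector<SuperNode>& g, int& height, int& width) {
    std::vector<int> layer(g.size(), 0);
    std::map<int, int> perLayer;                      // layer index -> node count
    for (size_t i = 0; i < g.size(); ++i) {           // assumes topological order
        for (int p : g[i].pred) layer[i] = std::max(layer[i], layer[p] + 1);
        ++perLayer[layer[i]];
    }
    height = (int)perLayer.size();
    width = 0;
    for (const auto& kv : perLayer) width = std::max(width, kv.second);
}

Applied to the four SuperNodes of the Sobel example, this gives a height of 3 and a width of 2, matching the values above.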

The other two graph optimization modes (high and aggressive) are generally targeted at accelerating "thin" tasks. During execution in these modes, the use of filter operations in the task is analyzed to decide which layers can be merged and whether the slices can be executed without intermediate thread synchronization.

DMIP API to control graph optimization mode:

// Set the desired optimization mode
idmStatus Control::SetGraphOptLevel(int level);
// Retrieve current graph optimization mode
int Control::GetGraphOptLevel(void);
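
A minimal usage sketch of this API is shown below; how the Control object is obtained for a concrete task is not covered in this article and is assumed here.

Control ctl;                              // assumed: Control object for the task
int previous = ctl.GetGraphOptLevel();    // remember the current mode
ctl.SetGraphOptLevel(4);                  // request the aggressive mode (4)
// ... build, compile, and execute the graph as in Figure 4 ...
ctl.SetGraphOptLevel(previous);           // restore the original mode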


The performance benefit that can be achieved in each DMIP mode depends on the workload's graph configuration. In some cases, especially in pixel-wise workloads, the task can be executed without any intermediate sync points. Figure 6 compares the numbers of synchronization points in the Sobel edge detector implemented for each DMIP optimization mode. Figure 7 shows speed-up of the Sobel edge detector implemented with the aggressive optimization mode.

Mode          Number of sync points
Normal        9 per slice
Light         4 per slice
Medium        3 per slice
High          2 per slice
Aggressive    1 per slice

Figure 6. Intermediate sync points per mode



Figure 7. Speed-up of the Sobel edge detector using aggressive optimization mode

In the case of a grayscale input image we see improved performance in the Sobel-based edge detector if we prepare the image border pixels before graph execution, as shown below.

Image SrcI(pSrcImg, Ipp8u, IppC1…); // Source image in DMIP format
Image DstI(pDstImg, Ipp8u, IppC1…); // Destination image in DMIP format

Kernel Kh(idmFilterSobelHoriz,ippMskSize3x3,ipp8u,ipp16s); // Dx operator
Kernel Kv(idmFilterSobelVert,ippMskSize3x3,ipp8u,ipp16s); // Dy operator

SrcI.CopyBorder(…);// Create image border.
Graph O = Src(SrcI); // So that the source read is detected as a common subexpression.

// Compile and execute task
DstI = To8u(Abs(O*Kh)+ Abs(O*Kv));

In this example, the aggressive optimization mode executes the entire task without any intermediate sync points because the filter node, which contains pre-calculated border pixels, executes immediately after SrcNode.


DMIP Task examples

The IPP DMIP component includes a simple console-based example, "dmip_bench", which demonstrates the benefits of DMIP over a traditional task implementation with IPP. One of the DMIP workloads is considered below from the optimization point of view.

Harmonization filter

D = max(min(A - c·(A - A*K), Tmax), Tmin)

where A is the source image, A*K denotes filtering of A with a box filter kernel K, Tmax and Tmin are thresholds, and c is a constant.

DMIP code:

Image A(…); // Source image in DMIP format
Image D(…); // Destination image in DMIP format

Kernel K(idmFilterBox,…); // Box filter kernel
Ipp32f c;
Ipp8u Tmax, Tmin;

Graph O = To32f(A); // So that To32f(A) is detected as a common subexpression.

// Compile and execute task
D = Max(Min(To8u(O-(O-O*K)*c),Tmax),Tmin);


This task has 3 linear chains; the longest linear chain contains 5 operations. The task width is 1 and the task height is 4, so this is a thin task with no locally independent operations, and the medium optimization mode provides no benefit. Since the box filter resides in the second layer, inter-slice optimization cannot be done. Using the high optimization mode, the sync points after the box filter operation were removed, leaving 1 sync point per slice and giving the task speed-up shown in Figure 8.



Figure 8. Speed-up of the Harmonization filter with high optimization mode

Exponential brightness correction

The destination pixel value is computed as an exponential function of the corresponding source pixel value, parameterized by constant coefficients.

This task has 1 linear chain that incorporates all operations in the graph. The task width is 1 and the task height is 1; as in the previous task, this is a thin task that sees no benefit from the medium optimization mode. It is a pixel-wise task, so there are no inter-slice dependencies. Using the aggressive optimization mode, there were 0 sync points per slice, with the task speed-up shown in Figure 9.
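
Since the article does not list the code for this workload, here is a plain per-pixel reference sketch of an exponential brightness correction; the exact formula and the constants a and b are illustrative assumptions, not the dmip_bench values.

// Sketch only: per-pixel exponential brightness correction with illustrative
// constants a and b (not the actual dmip_bench formula).
#include <algorithm>
#include <cmath>
#include <cstddef>

void exponentialCorrection(const unsigned char* src, unsigned char* dst,
                           size_t count, float a, float b) {
    for (size_t i = 0; i < count; ++i) {
        // Pixel-wise: every output pixel depends only on the corresponding input
        // pixel, which is why the task has no inter-slice dependencies.
        float v = 255.0f * a * (std::exp(b * (src[i] / 255.0f)) - 1.0f);
        dst[i] = (unsigned char)std::min(std::max(v, 0.0f), 255.0f);
    }
}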



Figure 9. Speed-up of exponential brightness correction operation using aggressive optimization mode

Sepia toner

A sepia filter operation calculates the pixel values according to the following formulae:

Gray = cR·R + cG·G + cB·B
R' = Gray·SR,   G' = Gray·SG,   B' = Gray·SB

where R, G, B is a pixel color triplet; cR, cG, cB are color transformation coefficients; and SR, SG, SB are sepia tone colors.



Figure 10. Original image (left) and sepia-toned image (right).

This task has 1 linear chain that incorporates all 3 operations in the graph. The task width is 1 and the task height is 1, so this thin task cannot benefit from the medium optimization mode. It is a pixel-wise task, so there are no inter-slice dependencies, and the aggressive optimization mode results in zero sync points per slice. In this task, the slice size (in bytes) changes from operation to operation as the image is transformed from RGB to grayscale and then back to RGB color format; for each slice, half of the operations use one-third of the available cache size.
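
As with the previous workload, a plain per-pixel reference sketch makes the structure concrete; the grayscale weights and sepia tone values below are common illustrative choices, not necessarily the coefficients used in dmip_bench.

// Sketch only: per-pixel sepia toner. Coefficients are illustrative.
#include <algorithm>
#include <cstddef>

void sepiaTone(const unsigned char* src, unsigned char* dst, size_t pixelCount) {
    const float cR = 0.299f, cG = 0.587f, cB = 0.114f;  // RGB-to-gray coefficients
    const float sR = 1.07f,  sG = 0.74f,  sB = 0.43f;   // sepia tone colors
    for (size_t i = 0; i < pixelCount; ++i) {
        const unsigned char* p = src + 3 * i;
        float gray = cR * p[0] + cG * p[1] + cB * p[2];               // RGB -> grayscale
        dst[3 * i + 0] = (unsigned char)std::min(gray * sR, 255.0f);  // grayscale -> RGB
        dst[3 * i + 1] = (unsigned char)std::min(gray * sG, 255.0f);
        dst[3 * i + 2] = (unsigned char)std::min(gray * sB, 255.0f);
    }
}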



Figure 11. Speed-up of Sepia toner using aggressive optimization mode


Separable 2D FFT

Not all image operations can be processed efficiently in the DMIP framework "as is"; DMIP supports various techniques for restructuring such operations. For example, histogram calculation requires extraction of the prologue, accumulation, and epilogue phases. The 2D FFT transform in DMIP is implemented in a separable way, consisting of three parts: 1D FFT calculation of the rows, transposition, and 1D FFT calculation of the columns as transposed rows. This structure makes the computation pipeline-friendly and reduces jumps in memory during the column FFT calculations. Figure 12 shows the performance of the DMIP 2D FFT transform in comparison with the threaded IPP primitive.
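
The sketch below shows only the separable structure described above: row transforms, a transposition, and row transforms again on the transposed data. A naive O(N²) DFT stands in for the optimized IPP 1D FFT; it illustrates the data-access pattern, not the performance.

// Sketch only: separable 2D transform via row transforms and transposition.
#include <cmath>
#include <complex>
#include <vector>

using cpx = std::complex<float>;

static void dft1d(cpx* row, int n) {                 // stand-in for a 1D FFT
    std::vector<cpx> out(n);
    for (int k = 0; k < n; ++k)
        for (int j = 0; j < n; ++j)
            out[k] += row[j] * std::polar(1.0f, -2.0f * 3.14159265f * k * j / n);
    for (int k = 0; k < n; ++k) row[k] = out[k];
}

static void transpose(std::vector<cpx>& a, int rows, int cols) {
    std::vector<cpx> t(a.size());
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c) t[c * rows + r] = a[r * cols + c];
    a.swap(t);
}

void fft2dSeparable(std::vector<cpx>& img, int rows, int cols) {
    for (int r = 0; r < rows; ++r) dft1d(&img[r * cols], cols);  // 1D transform of rows
    transpose(img, rows, cols);                                  // transposition
    for (int c = 0; c < cols; ++c) dft1d(&img[c * rows], rows);  // columns as rows
    transpose(img, cols, rows);                                  // restore layout
}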



Figure 12. Speed-up of the 2D forward FFT transform, IPP and DMIP+IPP versions, CPU clocks per element (fewer clocks = better performance)


Architecture topology utilization

For efficient operation, DMIP utilizes the information detected about the microprocessor, i.e., CPU type, number of physical cores, whether Intel® Hyper-Threading Technology (Intel® HT Technology) is enabled, number of cores sharing the same cache, etc. For example, the slice size is configured using knowledge of the cache size and the number of threads that share the same cache. DMIP constantly balances keeping data from being evicted from the cache against full cache and core utilization. Thread affinity is used to minimize the cost of thread context save/restore and OS thread migration: each working thread is assigned to a corresponding core to minimize data migration from cache to cache. For example, a dual-socket Intel® Core™2 Quad processor-based system has 8 physical cores, so DMIP working threads are assigned to the cores in consecutive order.
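
The effect described here, pinning worker threads to consecutive cores, can be sketched as follows using the Linux-specific pthread affinity call; DMIP's own mechanism is internal and may differ, and on Windows a SetThreadAffinityMask-based variant would play the same role.

// Sketch only: assigning worker thread i to core i (consecutive order), using
// the Linux/glibc-specific pthread affinity call.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <pthread.h>
#include <sched.h>
#include <thread>
#include <vector>

void pinWorkersToCores(std::vector<std::thread>& workers) {
    for (size_t i = 0; i < workers.size(); ++i) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET((int)i, &set);                       // worker i -> core i
        pthread_setaffinity_np(workers[i].native_handle(), sizeof(set), &set);
    }
}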

The Intel® Core™ i7 microprocessor has 4 physical cores with Intel® HT Technology enabled, so 8 hardware threads are supported. However, on this microprocessor DMIP uses only 4 threads: DMIP is already highly optimized, and the additional Intel® HT Technology threads do not increase performance.

Intel plans to provide an API for topology extraction and utilization in a future release of Intel® IPP.


Many-Core Architecture Challenges

Future Intel® architectures will have more than 8 cores. Video cards with many cores and threads currently under development, as well as other many-core architectures, will require a slightly different parallelization approach, since they offer significantly improved thread synchronization and a choice of DMIP internal processing data unit: slice or tile. When pipelining with slice processing is not suitable for a task on a many-core architecture, DMIP can extract a parallel region in the task and utilize all threads for execution of that region to achieve the best performance.

For thin, high tasks, it is more efficient to transition from slice-based pipelining to tile-based pipelining, since the number of sync points is then minimal and equal to the task height. A tile is an atomic data unit; using tiles increases the number of independent units to process and requires implementation of dynamic thread load balancing.
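
A minimal sketch of dynamic load balancing over tiles is shown below: a shared atomic counter hands out tile indices, so faster threads naturally process more tiles. The per-tile work is passed in as a callback; nothing here is DMIP-specific.

// Sketch only: dynamic thread load balancing over an arbitrary number of tiles.
#include <atomic>
#include <functional>
#include <thread>
#include <vector>

void runTiles(int tileCount, int threadCount,
              const std::function<void(int)>& processTile) {
    std::atomic<int> next(0);
    std::vector<std::thread> workers;
    for (int t = 0; t < threadCount; ++t) {
        workers.emplace_back([&]() {
            // Fetch-and-increment hands out tile indices one at a time.
            for (int i = next.fetch_add(1); i < tileCount; i = next.fetch_add(1))
                processTile(i);
        });
    }
    for (auto& w : workers) w.join();
}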


Conclusion

The Intel IPP DMIP library provides a solution for effective multi-core and many-core architecture utilization for image processing tasks. It incorporates algorithm analysis for task acceleration on many-core architectures, utilizing all available cores in the most efficient way. DMIP also provides an intuitive API that simplifies development, optimization, and parallelization of image processing applications.


Online Resources

Optimization Notice in English

About the Authors

Igor Belyakov is a software engineer in Intel's Integrated Performance Primitives engineering team within the Software and Services Group (SSG). Igor is the technical leader of the DMIP project, specializing in automatic optimization and parallelization of image processing and computer vision algorithms on multi-core and many-core architectures. Igor also worked in the Speech Coding team and is an expert in speech coding and in the real-time media data transmission protocol stack. He holds an MS in Mathematics from Moscow State University, Department of Mechanics and Mathematics.

Bonnie Aona is a software engineer in the Intel Compilers and Languages Group within the Software and Services Group (SSG), working on optimizing and testing applications to take advantage of the latest Intel software and hardware innovations to achieve high performance and parallelism. Bonnie's career leverages software quality assurance and program management with software design for complex high-performance applications in computer graphics, real-time systems, scientific research, manufacturing, e-commerce, aerospace, and healthcare. She holds Master's degrees in Electrical and Computer Engineering from the University of California, Davis.




For complete information about compiler optimizations, see the Optimization Notice.