# Improving the Compute Performance of Video Processing Software Using AVX (Advanced Vector Extensions) Instructions

由 Eli Hernandez 发布于 2012 年 2 月 8 日

## Download Article

Download Improving the Compute Performance of Video Processing Software Using AVX (Advanced Vector Extensions) Instructions [PDF 311KB]

## Abstract

Modern x86 CPUs permit instruction level parallelism (e.g. SIMD) on register vectors at most 128-bits. Second Generation Intel® Core™ Processors include the first generation of AVX (256-bit operators), which permits increased parallel processing. This paper outlines a case study in which AVX instructions are used to improve the compute performance of a de-saturation algorithm. The paper also discusses how future integer based AVX instructions might be used to further enhance SIMD optimizations and achieve even greater performance benefits on video processing algorithms.

## 1. Introduction

Modern x86 CPUs permit Instruction Level Parallelism (ILP), such as Single Instruction Multiple Data (SIMD), on vectors at most 128-bit. These register vectors can be used to process multiple data elements with fewer instructions. Second Generation Intel® Core™ Processors (codenamed Sandy Bridge) included the first generation of AVX, which is a 256-bit instruction set extension to the Intel® Streaming SIMD Extensions (Intel® SSE).

The first generation of AVX included a wide range of instructions designed primarily to accelerate compute intensive algorithms performing arithmetic operations on floating point data. However, even if an algorithm is integer based, using AVX instructions could potentially increase an algorithm’s performance without sacrificing accuracy of the results. In video processing algorithms, the pixel channels are often stored as 8-bit unsigned integers (bytes) and processed as 32-bit or larger format integer values. Therefore, most video algorithms require conversion of pixels to and from a format. Wider bit widths are used for calculation accuracy and smaller formats are used to save space. Typically, floating-point units are not used because the extra conversion costs do not significantly improve accuracy. However, AVX is capable of greatly improving the runtime performance of video processing software and a vast number of other software applications by the increased parallelism.

This paper describes a case study in which AVX instructions are used to enhance the performance of a de-saturation algorithm (a common video filter). The case study takes the algorithm from a non-SIMD state to AVX based SIMD. The paper also discusses how future generations of AVX may be able to further aid performance optimization and enable greater performance of video processing.

## 2. Intel SIMD Overview

On Intel SIMD architectures, a vector register can store a group of data elements of a single data type (e.g. floats or integers). The vector registers of Sandy Bridge are 256 bits wide whereas all other processors since Intel® Pentium III were 128 bits wide. Each vector (called YMM in Sandy Bridge) register can store 8 floats, 8 32-bit integers, 32 chars, etc. AVX instructions operate on the full 256 bits, but SSE can only operate on 128 bits.

A SIMD enabled-processor can execute a single operation on multiple data. An operation performed simultaneously on multiple data elements is a vector process. SIMD vectorization is the process of converting an algorithm from a scalar to a vector implementation. The multiply function in sample code below is used to illustrate the difference between the scalar and SIMD vector process.

**Figure 1:**

*This illustrates the difference between scalar and vector processes. The scalar version would have 16 loads, 8 multiplications and 8 stores. SSE can potentially have 4 loads, 2 vector multiplications and 2 stores. AVX would use 2 loads, 1 large vector multiplication and 1 store. The labels with VMUL were shortened to hide the distinction between various versions of vector multiplication instructions. VMUL performs multiplication on vectors A and B for each element pair and stores the results in another vector. Let us suppose for simplicity that loads and stores cost 3 cycles, all multiplication costs 1 cycle and we are ignoring pipelining. Then the scalar version spends 80 cycles to compute 8 elements while the AVX version spends 10 cycles, yielding a theoretical speedup of 8x. This clearly illustrates why SIMD vectorization has become a very important aspect to optimize application performance. Also given observed performance benefits with SIMD, automatic SIMD vectorization has become as keystone feature in advanced compilers.*

## 3. Video Processing Code

Typical video processing algorithms calculate pixel values using a triple for-loop (for each frame, for each X, for each Y). This typically is seen as an area of high CPU utilization (i.e. hotspot). Video processing application hotspots are excellent candidates to optimize with AVX.

A simple approach to optimize using SIMD involves taking advantage of the latest processor technology features, such as AVX. The following sections describe the optimization process using AVX instructions to enhance the performance of a de-saturation algorithm. The serial code implementation is briefly discussed and AVX-based SIMD instructions are used to optimize the de-saturation algorithm. Finally, this chapter ends with our performance results of the optimize code.

## 3.1. Desaturation - Sample Code

The typical implementation of the Desaturation algorithm uses the incoming pixel values to compute a luminance value. The luminance value is applied to all outgoing pixels to de-saturate images as part of processing video for output.

As you can see in the sample code below the algorithm traverses row by row to get pixel data, which channel values (blue, green, and red) are used to calculate the luminance value. In other to achieve high accuracy the algorithm converts the one-byte channel values to single precision floating point. The floating-point values are used in a dot product type of operation to compute the luminance value. The Desaturation sample algorithm uses the fLuminace(…) function to convert pixel channel values from byte to float. The conversion to float is achieved implicitly by typecasting each channel value to float and with weights as constants for Red, Green and Red, the fLuminance(…) function uses the float values to compute luminance which value is applied to the video output.

Note that the conversion of channel data from byte to float occurs implicitly by typecasting to float. Although the scalar code looks simple and trivial, the assembly code generated by the compiler is much more complex. In analysis of the generated assembly code, the implicit byte to float conversion can be performed with fewer instructions by using the more efficient AVX instructions. As we have observed, the serial code calculates one channel and one pixel at a time. Nothing is computed in parallel (ignoring pipelining and reordering). Refer to **Appendix A** for the assembly code.

## 3.2. Desaturation - Optimization with AVX

This section outlines the transformation of the serial code and describes how AVX, SSE4.1 and SSE2 instructions optimize the de-saturation algorithm. As illustrated in Chapter 2 with SIMD, we can work on many items at once. Therefore, the load, store, conversion and math operations can be done in parallel. The algorithm below describes how we can use instruction level parallelism (via AVX instructions) to significantly improve performance. Note that the algorithm is written with the restriction that we could only use available instructions, not idealistic for future instructions as we discuss later. Therefore, lines 19, 20 and 21 involve an intermediate step to convert 32-bit integers back down to 8-bit unsigned and etcetera.

**Figure 2.**

*De-saturation algorithm*

With the Figure 2 as the backbone of de-saturate, we can implement the real code. The motivations for using a procedure similar to Figures 2 and 3 are that:

- AVX provides greater throughput for parallel processing of single-precision floating- point units than any past Intel SIMD x86 extension (MMX, SSE, SSE2, SSE3, SSE4.1, SSE4.2).
- The cost (SIMD) to cast byte (8-bit unsigned char) to integer (32-bit signed integer) to single precision floating point (32-bit float) and back is less than using multiple calls of the equivalent code (scalar) using just bytes or integers.
- Using byte based SIMD with this procedure gives poor precision. Parallel performance is not considered.
- Using integer based SIMD with this procedure gives acceptable precision. Current AVX instructions for integer arithmetic do not exist and therefore cannot take full advantage of the 256-bit registers.
- Using float based SIMD with this procedure gives very good precision and offers higher performance than those described above.

**Figure 3. ***De-saturation code optimzed AVX ^{1}*

The algorithm and AVX code shown in Figures 2 and 3 convey the same exact process line-for-line. Notice that only lines 9 and 16 involve doing the real work. Theses lines each process 8 single precision floating point multiplications in parallel, totalling 16 multiplications for 2 instructions versus 16 individual multiplication instructions. Everything else is unnecessary overhead to make use of the parallel instructions or to increase precision.

Despite the overhead, this code still improves performance by 1.45x . If integer based instructions existed with equivalent parallelism to that of single precision floating point, we could further increase performance. In such case, lines 6, 8, 10, 11, 14, 15, 17 and 18 could be eliminated. Lines 9 and 16 would operate on integers instead. Lines 19, 20 and 21 could require a single pack instruction (integer to byte). Of course, there are other hypothetical instructions that could be introduced with future AVX generations. The potential performance gain is left as an exercise for the reader. For assembly instructions generated by intrinsic functions used in the inner [ix] loop, refer to

**Appendix B**.

## 3.3 Desaturation - Performance Test Results^{3}

^{3}

Performance assessment of the de-saturation algorithm optimized with AVX in this study observed a 1.45x speedup when compared to the serial code. To gather performance data the de-saturation algorithm was applied to a 1440x1080 image and was looped 100 times. Performance was measured in elapsed time (milliseconds) taken to de-saturate the image, the following performance numbers were consistently observed:

Serial Code: | 1264 milliseconds |

Code with AVX: | 873 milliseconds |

Performance Scaling: | 1.45x or 1264ms/873ms |

A kernel (small application program) was used to run the algorithm. A kernel with a 1.45x scaling typically translates to a performance improvement of 10% to 15% when measured at the workload level. However, for video processing this rule of thumb does not apply. Consider you are applying the de-saturation algorithm to a one-minute or longer video clip. In that case, there will be more than 100 frames (images) to process. In theory, since more data has be processed, the performance boost potential could be more than 1.45x especially if or when processing full High-Definition (e.i.,1920x1080) video.

## 4. Packed Integer Conversion Instructions

Since our optimized de-saturation algorithm uses one of the SSE4.1 instructions, we will give an overview of SSE4.1 because other SSE4.1 instructions may be applicable for the optimization of other video processing algorithms. The Packed Integer Conversion instruction set contains 12 instructions for packed integer bit width conversions. Any of which can be utilized to optimize code where bit width is to be increased for integer data.

The table in **Figure 5** lists the SSE4.1 instructions for packed integer conversions. The instructions support sign extension and zero extension conversions of byte to word, byte to double word, byte to quad-word, word to double word, word to quad-word, and double word to quad-word. Additionally, the chart shows a comparison of SSE2 vs SSE4.1 instructions needed to convert four (4) one-byte integers to four (4) 32-bit integers.

The *pmovzxbd (byte to double word)* instruction was utilized a total of four (4) times in the de-saturate optimization. When/if these instructions include support for full 256-bit register, the use of this instruction in the optimized algorithm will be reduced to two (2). Thereby further improving the loop performance.**Figure 4. ***Instructions for bit width conversions of packed integers*

The source operand to packed integer conversion instructions is from either an XMM register or memory. The destination is always an XMM register. When accessing memory, no alignment is required, unless alignment checking is enabled. In which case, all conversions must be aligned to the width of the memory being referenced. The number of elements that can be converted and width of memory reference is illustrated in **Figure 5**. The alignment requirement is shown in parenthesis.**Figure 5. Number of elements to process.***P is Packed. MOV is Move (copy register). ZX is Zero Extend. SX is Sign Extend. B is Byte. W is Word. D is Double Word. Q is Quad-Word.*

## 5. Conclusion

This paper has discussed how Second Generation Intel® Core™ Processors could increase parallel processing via AVX instructions and 256 bit registers. This paper outlined a case study in which AVX instructions were used to improve the compute performance of a de-saturation algorithm. The paper also discussed how future integer based AVX instructions could be used to further enhance SIMD optimizations and achieve even greater performance benefits on video processing algorithms. The procedure described demonstrated how AVX instructions or their intrinsic functions could be utilized to improve the runtime performance of video processing applications. The paper documented that despite some overhead incurred to setup for SIMD processing, the video de-saturation still achieved excellent performance benefits.

## About the Authors

Eli Hernandez is an Application Engineer in the Consumer Client and Power Enabling Group at Intel Corporation where he works with customers to optimize their software for power efficiency and to run best on Intel hardware and software technologies. Eli joined Intel in August of 2007 with over 12 years of experience in software development for the telecom and the chemical industry. He received his B.S. in Electrical Engineering in 1989 and completed Master Studies in Computer Science in 1991-1992 from the DePaul University of Chicago.

In 2008, Larry Moore graduated from Saint Petersburg College with Honors. He received a Who's Who Among Students Award and was a member of Phi Theta Kappa Honor Society. In 2011, he spent 8 months at Intel as an application engineer intern, in DuPont, Washington. Currently, he is attending the University of South Florida at Tampa, Florida in an accelerated graduate program, pursuing both a Bachelor of Science and Master of Science in Computer Engineering. His current research involves computer aided verification of real-time systems and model checking. Larry is also a member of IEEE Computer Society.

## Appendix A: Inner loop quivalent assembly of the serial code

Roughly 45 instructions to proccess an iteration of the algorithm inner loop. With throughput of 1 pixels processed per iteration.

## Appendix B: Equivalent assembly of inner loop optimized with AVX

Roughly 30 instructions to proccess an iteration of the algorithm inner loop. With throughput of 4 pixels processed per iteration.^{1} Load and store operations on optimized code assumes data is aligned.^{2} Please see footnote 3 and section 3.3.^{3} The performance measurements in this section are the actual numbers from real tests. However, we do not guarantee you will achieve as good of a performance.