Modern x86 CPUs permit instruction level parallelism (e.g. SIMD) on register vectors at most 128-bits. Second Generation Intel® Core™ Processors include the first generation of Intel® Advanced Vector Extensions (Intel® AVX) (256-bit operators), which permits increased parallel processing. This paper outlines a case study in which Intel AVX instructions are used to improve the compute performance of a de-saturation algorithm. The paper also discusses how future integer based Intel AVX instructions might be used to further enhance SIMD optimizations and achieve even greater performance benefits on video processing algorithms.
Modern x86 CPUs permit Instruction Level Parallelism (ILP), such as Single Instruction Multiple Data (SIMD), on vectors at most 128-bit. These register vectors can be used to process multiple data elements with fewer instructions. Second Generation Intel® Core™ Processors (codenamed Sandy Bridge) included the first generation of AVX, which is a 256-bit instruction set extension to the Intel® Streaming SIMD Extensions (Intel® SSE).
The first generation of Intel AVX included a wide range of instructions designed primarily to accelerate compute intensive algorithms performing arithmetic operations on floating point data. However, even if an algorithm is integer based, using Intel AVX instructions could potentially increase an algorithm’s performance without sacrificing accuracy of the results. In video processing algorithms, the pixel channels are often stored as 8-bit unsigned integers (bytes) and processed as 32-bit or larger format integer values. Therefore, most video algorithms require conversion of pixels to and from a format. Wider bit widths are used for calculation accuracy and smaller formats are used to save space. Typically, floating-point units are not used because the extra conversion costs do not significantly improve accuracy. However, Intel AVX is capable of greatly improving the runtime performance of video processing software and a vast number of other software applications by the increased parallelism.
This paper describes a case study in which Intel AVX instructions are used to enhance the performance of a de-saturation algorithm (a common video filter). The case study takes the algorithm from a non-SIMD state to Intel AVX based SIMD. The paper also discusses how future generations of Intel AVX may be able to further aid performance optimization and enable greater performance of video processing.
On Intel SIMD architectures, a vector register can store a group of data elements of a single data type (e.g. floats or integers). The vector registers of Sandy Bridge are 256 bits wide whereas all other processors since Intel® Pentium III were 128 bits wide. Each vector (called YMM in Sandy Bridge) register can store 8 floats, 8 32-bit integers, 32 chars, etc. Intel AVX instructions operate on the full 256 bits, but SSE can only operate on 128 bits.
A SIMD enabled-processor can execute a single operation on multiple data. An operation performed simultaneously on multiple data elements is a vector process. SIMD vectorization is the process of converting an algorithm from a scalar to a vector implementation. The multiply function in sample code below is used to illustrate the difference between the scalar and SIMD vector process.
Typical video processing algorithms calculate pixel values using a triple for-loop (for each frame, for each X, for each Y). This typically is seen as an area of high CPU utilization (i.e. hotspot). Video processing application hotspots are excellent candidates to optimize with Intel AVX.
A simple approach to optimize using SIMD involves taking advantage of the latest processor technology features, such as Intel AVX. The following sections describe the optimization process using Intel AVX instructions to enhance the performance of a de-saturation algorithm. The serial code implementation is briefly discussed and Intel AVX-based SIMD instructions are used to optimize the de-saturation algorithm. Finally, this chapter ends with our performance results of the optimize code.
The typical implementation of the Desaturation algorithm uses the incoming pixel values to compute a luminance value. The luminance value is applied to all outgoing pixels to de-saturate images as part of processing video for output.
As you can see in the sample code below the algorithm traverses row by row to get pixel data, which channel values (blue, green, and red) are used to calculate the luminance value. In other to achieve high accuracy the algorithm converts the one-byte channel values to single precision floating point. The floating-point values are used in a dot product type of operation to compute the luminance value. The Desaturation sample algorithm uses the fLuminace(…) function to convert pixel channel values from byte to float. The conversion to float is achieved implicitly by typecasting each channel value to float and with weights as constants for Red, Green and Red, the fLuminance(…) function uses the float values to compute luminance which value is applied to the video output.
Note that the conversion of channel data from byte to float occurs implicitly by typecasting to float. Although the scalar code looks simple and trivial, the assembly code generated by the compiler is much more complex. In analysis of the generated assembly code, the implicit byte to float conversion can be performed with fewer instructions by using the more efficient Intel AVX instructions. As we have observed, the serial code calculates one channel and one pixel at a time. Nothing is computed in parallel (ignoring pipelining and reordering). Refer to Appendix A for the assembly code.
This section outlines the transformation of the serial code and describes how Intel AVX, SSE4.1 and SSE2 instructions optimize the de-saturation algorithm. As illustrated in Chapter 2 with SIMD, we can work on many items at once. Therefore, the load, store, conversion and math operations can be done in parallel. The algorithm below describes how we can use instruction level parallelism (via Intel AVX instructions) to significantly improve performance. Note that the algorithm is written with the restriction that we could only use available instructions, not idealistic for future instructions as we discuss later. Therefore, lines 19, 20 and 21 involve an intermediate step to convert 32-bit integers back down to 8-bit unsigned, et cetera.
With the Figure 2 as the backbone of de-saturate, we can implement the real code. The motivations for using a procedure similar to Figures 2 and 3 are that:
Figure 3. De-saturation code optimized Intel AVX1
The algorithm and Intel AVX code shown in Figures 2 and 3 convey the same exact process line-for-line. Notice that only lines 9 and 16 involve doing the real work. Theses lines each process 8 single precision floating point multiplications in parallel, totalling 16 multiplications for 2 instructions versus 16 individual multiplication instructions. Everything else is unnecessary overhead to make use of the parallel instructions or to increase precision.
Despite the overhead, this code still improves performance by 1.45x . If integer based instructions existed with equivalent parallelism to that of single precision floating point, we could further increase performance. In such case, lines 6, 8, 10, 11, 14, 15, 17 and 18 could be eliminated. Lines 9 and 16 would operate on integers instead. Lines 19, 20 and 21 could require a single pack instruction (integer to byte). Of course, there are other hypothetical instructions that could be introduced with future Intel AVX generations. The potential performance gain is left as an exercise for the reader. For assembly instructions generated by intrinsic functions used in the inner [ix] loop, refer to Appendix B.
Performance assessment of the de-saturation algorithm optimized with Intel AVX in this study observed a 1.45x speedup when compared to the serial code. To gather performance data the de-saturation algorithm was applied to a 1440x1080 image and was looped 100 times. Performance was measured in elapsed time (milliseconds) taken to de-saturate the image, the following performance numbers were consistently observed:
|Serial Code:||1264 milliseconds|
|Code with Intel AVX:||873 milliseconds|
|Performance Scaling:||1.45x or 1264ms/873ms|
A kernel (small application program) was used to run the algorithm. A kernel with a 1.45x scaling typically translates to a performance improvement of 10% to 15% when measured at the workload level. However, for video processing this rule of thumb does not apply. Consider you are applying the de-saturation algorithm to a one-minute or longer video clip. In that case, there will be more than 100 frames (images) to process. In theory, since more data has be processed, the performance boost potential could be more than 1.45x especially if or when processing full High-Definition (e.i.,1920x1080) video.
Since our optimized de-saturation algorithm uses one of the SSE4.1 instructions, we will give an overview of SSE4.1 because other SSE4.1 instructions may be applicable for the optimization of other video processing algorithms. The Packed Integer Conversion instruction set contains 12 instructions for packed integer bit width conversions. Any of which can be utilized to optimize code where bit width is to be increased for integer data.
The table in Figure 5 lists the SSE4.1 instructions for packed integer conversions. The instructions support sign extension and zero extension conversions of byte to word, byte to double word, byte to quad-word, word to double word, word to quad-word, and double word to quad-word. Additionally, the chart shows a comparison of SSE2 vs SSE4.1 instructions needed to convert four (4) one-byte integers to four (4) 32-bit integers.
The pmovzxbd (byte to double word) instruction was utilized a total of four (4) times in the de-saturate optimization. When/if these instructions include support for full 256-bit register, the use of this instruction in the optimized algorithm will be reduced to two (2). Thereby further improving the loop performance.
Figure 4. Instructions for bit width conversions of packed integers
The source operand to packed integer conversion instructions is from either an XMM register or memory. The destination is always an XMM register. When accessing memory, no alignment is required, unless alignment checking is enabled. In which case, all conversions must be aligned to the width of the memory being referenced. The number of elements that can be converted and width of memory reference is illustrated in Figure 5. The alignment requirement is shown in parenthesis.
Figure 5. Number of elements to process. P is Packed. MOV is Move (copy register). ZX is Zero Extend. SX is Sign Extend. B is Byte. W is Word. D is Double Word. Q is Quad-Word.
This paper has discussed how Second Generation Intel® Core™ Processors could increase parallel processing via Intel AVX instructions and 256 bit registers. This paper outlined a case study in which Intel AVX instructions were used to improve the compute performance of a de-saturation algorithm. The paper also discussed how future integer based intel AVX instructions could be used to further enhance SIMD optimizations and achieve even greater performance benefits on video processing algorithms. The procedure described demonstrated how Intel AVX instructions or their intrinsic functions could be utilized to improve the runtime performance of video processing applications. The paper documented that despite some overhead incurred to setup for SIMD processing, the video de-saturation still achieved excellent performance benefits.
Eli Hernandez is an Application Engineer in the Consumer Client and Power Enabling Group at Intel Corporation where he works with customers to optimize their software for power efficiency and to run best on Intel hardware and software technologies. Eli joined Intel in August of 2007 with over 12 years of experience in software development for the telecom and the chemical industry. He received his B.S. in Electrical Engineering in 1989 and completed Master Studies in Computer Science in 1991-1992 from the DePaul University of Chicago.
In 2008, Larry Moore graduated from Saint Petersburg College with Honors. He received a Who's Who Among Students Award and was a member of Phi Theta Kappa Honor Society. In 2011, he spent 8 months at Intel as an application engineer intern, in DuPont, Washington. Currently, he is attending the University of South Florida at Tampa, Florida in an accelerated graduate program, pursuing both a Bachelor of Science and Master of Science in Computer Engineering. His current research involves computer aided verification of real-time systems and model checking. Larry is also a member of IEEE Computer Society.
Roughly 45 instructions to process an iteration of the algorithm inner loop. With throughput of 1 pixels processed per iteration.
Roughly 30 instructions to process an iteration of the algorithm inner loop. With throughput of 4 pixels processed per iteration.
1 Load and store operations on optimized code assumes data is aligned.
2 Please see footnote 3 and section 3.3.
3 The performance measurements in this section are the actual numbers from real tests. However, we do not guarantee you will achieve as good of a performance.
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804