Intel® Advanced Vector Extensions: Pixel Format Conversions
Introduction
Intel® Advanced Vector Extensions (Intel® AVX) is a 256-bit instruction set extension to Intel® Streaming SIMD Extensions (Intel® SSE), designed for floating-point-intensive applications. Intel® AVX extends all 16 XMM registers to 256 bits (the YMM registers), doubling the width of the existing XMM registers, which leads to improved performance and power efficiency over 128-bit SIMD instructions. Intel® AVX introduces a distinct destination operand, which results in fewer register copies, better register use, and smaller code size, among other benefits. Intel® AVX also introduces several new instructions for blending and rearranging data in the YMM registers.
This document describes techniques to optimize pixel format conversion routines (commonly used in image processing applications) using the new Intel® AVX extensions. The two conversions demonstrated here are RGB to RGBA and RGBA to RGB. Although the R, G, B, and A components can have different data types in different applications, we discuss only single-precision floating-point (SP FP) components. The Intel® AVX performance is compared against a scalar version of the conversion routines running on the same Intel® AVX simulator. The Intel® AVX versions are implemented with compiler intrinsics, and the code was compiled using the Intel® C Compiler, which supports Intel® AVX intrinsics.
This paper will describe only the Intel® AVX implementation of the format conversions.
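For orientation, the scalar baseline that the Intel® AVX versions are measured against might look like the sketch below. The article does not show its scalar code, so the function names, the parameter lists, and the NULL checks here are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scalar baseline: copy R, G, B and append a constant alpha. */
void rgb_to_rgba_scalar(const float *src, float *dst, size_t n_pixels, float alpha)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n_pixels; ++i) {
        dst[4 * i + 0] = src[3 * i + 0]; /* R */
        dst[4 * i + 1] = src[3 * i + 1]; /* G */
        dst[4 * i + 2] = src[3 * i + 2]; /* B */
        dst[4 * i + 3] = alpha;          /* A */
    }
}

/* Hypothetical scalar baseline: copy R, G, B and drop the alpha component. */
void rgba_to_rgb_scalar(const float *src, float *dst, size_t n_pixels)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n_pixels; ++i) {
        dst[3 * i + 0] = src[4 * i + 0]; /* R */
        dst[3 * i + 1] = src[4 * i + 1]; /* G */
        dst[3 * i + 2] = src[4 * i + 2]; /* B */
    }
}
```

Both loops touch one component at a time, which is the work the 256-bit versions below replace with eight-wide loads, shuffles, and blends.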
The RGB-to-RGBA and RGBA-to-RGB conversion algorithms use the Intel® AVX instructions VPERMILPS, VPERM2F128, and VBLENDPS (exposed through the intrinsics _mm256_permutevar_ps(), _mm256_permute2f128_ps(), and _mm256_blend_ps()) to rearrange and mask off data when copying from the source to the destination buffers.
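The control masks used throughout this document are easier to follow with a plain C model of what each of the three operations does to an eight-float register. The following is a minimal sketch of their documented behavior, not a replacement for the intrinsics; the model_* names are hypothetical, and element 0 is the lowest-addressed float.

```c
#include <assert.h>
#include <stdint.h>

/* Model of _mm256_permutevar_ps (VPERMILPS): each output element picks one
   of the four floats in its own 128-bit lane, using the low two bits of the
   matching control element. Data never crosses the lane boundary. */
void model_permutevar_ps(const float a[8], const int32_t ctrl[8], float out[8])
{
    float tmp[8];
    assert(a && ctrl && out);
    for (int i = 0; i < 8; ++i) {
        int lane_base = (i / 4) * 4;
        tmp[i] = a[lane_base + (ctrl[i] & 3)];
    }
    for (int i = 0; i < 8; ++i) out[i] = tmp[i];
}

/* One 4-bit selector of VPERM2F128: bit 1 chooses source a or b,
   bit 0 chooses that source's low or high lane, bit 3 zeroes the lane. */
static void select_lane(const float a[8], const float b[8], int sel, float out[4])
{
    const float *src = (sel & 2) ? b : a;
    int base = (sel & 1) ? 4 : 0;
    for (int i = 0; i < 4; ++i)
        out[i] = (sel & 8) ? 0.0f : src[base + i];
}

/* Model of _mm256_permute2f128_ps: imm bits [3:0] build the low output lane,
   bits [7:4] build the high output lane. */
void model_permute2f128_ps(const float a[8], const float b[8], int imm, float out[8])
{
    float tmp[8];
    select_lane(a, b, imm & 0xF, tmp);
    select_lane(a, b, (imm >> 4) & 0xF, tmp + 4);
    for (int i = 0; i < 8; ++i) out[i] = tmp[i];
}

/* Model of _mm256_blend_ps (VBLENDPS): bit i of imm set means element i
   comes from b, otherwise from a. */
void model_blend_ps(const float a[8], const float b[8], int imm, float out[8])
{
    for (int i = 0; i < 8; ++i)
        out[i] = ((imm >> i) & 1) ? b[i] : a[i];
}
```

Read this way, a permute2f128 mask of 33 (0x21) returns the high lane of the first source followed by the low lane of the second, and a blend mask of 136 replaces elements 3 and 7, which are the two alpha slots of a pair of RGBA pixels.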
RGB to RGBA
The following figure depicts the arrangement of the source and destination buffers in memory for n pixels. In this figure R0 is at a lower address than G0, and so on. To use the aligned load and store instructions, which perform better than their unaligned counterparts, the Intel® AVX conversion routines assume that both the source and destination pixel buffers are aligned on 32-byte boundaries in memory.
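One way to obtain buffers that satisfy the 32-byte alignment requirement is the C11 aligned_alloc() function, sketched below. The helper names are hypothetical, and the article does not say how its buffers were allocated; compilers that ship the Intel intrinsics headers also offer _mm_malloc() and _mm_free() for the same purpose.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Returns nonzero when p sits on a 32-byte boundary. */
int is_aligned32(const void *p)
{
    return ((uintptr_t)p & 31u) == 0;
}

/* Hypothetical helper: allocate n_floats floats on a 32-byte boundary.
   aligned_alloc expects a size that is a multiple of the alignment, so the
   byte count is rounded up to the next multiple of 32. Release with free(). */
float *alloc_pixels_aligned32(size_t n_floats)
{
    size_t bytes = n_floats * sizeof(float);
    size_t rounded = (bytes + 31u) & ~(size_t)31u;
    if (rounded == 0)
        rounded = 32;
    float *p = (float *)aligned_alloc(32, rounded);
    assert(p == NULL || is_aligned32(p));
    return p;
}
```

A caller would allocate 3n floats for the RGB buffer and 4n floats for the RGBA buffer, and can check any externally supplied pointer with is_aligned32() before handing it to the conversion routines.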

Figure 1: Arrangement of source and destination pixels in memory
Each YMM register is 256 bits wide, which allows us to load and store eight SP FP values at a time. In each iteration of the loop we load multiple source values, rearrange the data, insert the alpha value (1.0 in this example), and store the result to the destination address.
Since the conversion is from a 3-channel pixel to a 4-channel pixel, we could have loaded twelve SP FP values from the source (four RGB pixels) and written sixteen SP FP values (four RGBA pixels) per iteration. Doing so would force us to use unaligned loads, because the next iteration would have to load pixels from an offset of twelve SP FP values (48 bytes) from the source address, which is not a multiple of 32 bytes. Unaligned accesses incur severe performance penalties when they cross cache-line boundaries. We therefore avoid unaligned loads altogether by unrolling the loop twice, so that each iteration consumes eight RGB pixels (twenty-four SP FP values, or 96 bytes) and every load stays on a 32-byte boundary.
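The alignment argument above comes down to simple arithmetic on the per-iteration stride. The small helper below, whose name is hypothetical, makes it concrete under the assumption that an SP FP value occupies four bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Distance in bytes past the previous 32-byte boundary of the first load in
   a given iteration, when the source pointer starts aligned and advances by
   floats_per_iter floats per iteration. Zero means the load is aligned. */
size_t load_misalignment(size_t floats_per_iter, size_t iteration)
{
    assert(sizeof(float) == 4); /* assumption: 4-byte SP FP values */
    return (floats_per_iter * iteration * sizeof(float)) % 32;
}
```

With a stride of twelve floats the second iteration starts 16 bytes past a boundary, so every other iteration would need unaligned loads; with a stride of twenty-four floats the result is zero for every iteration.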
The algorithm is implemented in four steps, each computing two destination pixels. We first load eight SP FP values starting from the source address using the _mm256_load_ps() aligned-load intrinsic. The values are then shuffled into a temporary YMM register using the _mm256_permutevar_ps() intrinsic with a control mask of {0,1,2,0,0,0,1,0} (listed from the lowest element to the highest), so that R0, G0, B0, G1, and B1 are copied to their corresponding locations in the destination. Next, R1 is broadcast to a temporary YMM register using _mm256_broadcast_ss(), and the result is blended with the output of the shuffle operation using a mask of 16 (00 01 00 00). Finally, the alpha value (1.0) is blended with the result of the previous blend operation using a mask of 136 (10 00 10 00) to produce destination pixels zero and one. The result is written to memory starting at the destination address using _mm256_store_ps(). The following figure illustrates this step (Step1).

Figure 2: RGB to RGBA Step1
The next eight FP values are loaded and shuffled with the eight values previously loaded, using the intrinsic _mm256_permute2f128_ps() with a control mask of 33 (00 10 00 01), to produce an intermediate result. This intermediate result is shuffled using the _mm256_permutevar_ps() intrinsic with a control mask of {2,3,0,0,1,2,3,0}, then blended with B2 and the alpha value to get destination pixels two and three. These steps are illustrated below (Step2).

Figure 3: RGB to RGBA Step2
The next eight FP values are loaded from an offset of sixteen from the source address and shuffled with the eight FP values loaded in Step2, again using _mm256_permute2f128_ps() with a control mask of 33. The resulting values are shuffled with the same control mask as in Step1 and blended with R5 and the alpha value, producing destination pixels four and five as illustrated below (Step3).

Figure 4: RGB to RGBA Step3
The final step needs no new load: the eight FP values loaded in Step3 (from an offset of sixteen from the source address) already contain source pixels six and seven. These values are shuffled with the same control mask as in Step2, then blended with B6 and the alpha value to produce destination pixels six and seven. These steps are illustrated below (Step4).

Figure 5: RGB to RGBA Step4
The source and destination addresses are then incremented by twenty-four and thirty-two SP FP values respectively, and Step1, Step2, Step3, and Step4 are repeated for the remaining pixels.
The figure below shows the source code that demonstrates the above steps.
// 8 RGB ==> 8 RGBA conversion per iteration
// [G2 R2 B1 G1 , R1 B0 G0 R0]
__m256 pixel23 = _mm256_load_ps((float *)(srcPix));
// [* B1 G1 *, * B0 G0 R0], ctrl = [0,1,0,0, 0,2,1,0]
__m256 pixel01 = _mm256_permutevar_ps(pixel23, ctrl);
// [R1 R1 R1 R1 , R1 R1 R1 R1]
__m256 pixelTemp = _mm256_broadcast_ss((float *)(srcPix+3));
// [* B1 G1 R1 , * B0 G0 R0], mask = 00 01 00 00
pixel01 = _mm256_blend_ps(pixel01, pixelTemp, 16);
// [1. B1 G1 R1 , 1. B0 G0 R0], mask = 10 00 10 00
pixel01 = _mm256_blend_ps(pixel01, alphaOne, 136);
_mm256_store_ps((float *)(dstPix), pixel01);
// [R5 B4 G4 R4 , B3 G3 R3 B2]
__m256 pixel45 = _mm256_load_ps((float *)(srcPix+8));
// [B3 G3 R3 B2 , G2 R2 B1 G1] mask = 00 10 00 01
pixel23 = _mm256_permute2f128_ps(pixel23, pixel45, 33);
// [* B3 G3 R3, * * G2 R2], ctrl2 = [0,3,2,1, 0,0,3,2]
pixel23 = _mm256_permutevar_ps(pixel23, ctrl2);
// [B2 B2 B2 B2 , B2 B2 B2 B2]
pixelTemp = _mm256_broadcast_ss((float *)(srcPix+8));
// [* B3 G3 R3 , * B2 G2 R2], mask = 00 00 01 00
pixel23 = _mm256_blend_ps(pixel23, pixelTemp, 4);
pixel23 = _mm256_blend_ps(pixel23, alphaOne, 136);
_mm256_store_ps((float *)(dstPix+8), pixel23);
// [B7 G7 R7 B6, G6 R6 B5 G5]
__m256 pixel67 = _mm256_load_ps((float *)(srcPix+16));
// [G6 R6 B5 G5, R5 B4 G4 R4] mask = 00 10 00 01
pixel45 = _mm256_permute2f128_ps(pixel45, pixel67, 33);
// [* B5 G5 *, * B4 G4 R4]
pixel45 = _mm256_permutevar_ps(pixel45, ctrl);
// [R5 R5 R5 R5 , R5 R5 R5 R5]
pixelTemp = _mm256_broadcast_ss((float *)(srcPix+15));
// [* B5 G5 R5 , * B4 G4 R4], mask = 00 01 00 00
pixel45 = _mm256_blend_ps(pixel45, pixelTemp, 16);
pixel45 = _mm256_blend_ps(pixel45, alphaOne, 136);
_mm256_store_ps((float *)(dstPix+16), pixel45);
// [* B7 G7 R7, * * G6 R6]
pixel67 = _mm256_permutevar_ps(pixel67, ctrl2);
// [B6 B6 B6 B6 , B6 B6 B6 B6]
pixelTemp = _mm256_broadcast_ss((float *)(srcPix+20));
// [* B7 G7 R7, * B6 G6 R6]
pixel67 = _mm256_blend_ps(pixel67, pixelTemp, 4);
pixel67 = _mm256_blend_ps(pixel67, alphaOne, 136);
_mm256_store_ps((float *)(dstPix+24), pixel67);
Figure 6: Intel® AVX RGB to RGBA conversion code
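The listing in Figure 6 leaves ctrl, ctrl2, alphaOne, and the surrounding loop undefined. The sketch below fills them in from the control masks quoted in the text, under several assumptions that are not from the article: the function name rgb_to_rgba_avx, float pointers for both buffers, a pixel count that is a multiple of eight, and a hypothetical AVX_FN macro that lets GCC or Clang build the function without a global AVX compiler flag.

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: per-function AVX code generation on GCC/Clang when AVX is not
   enabled for the whole translation unit. */
#if defined(__GNUC__) && !defined(__AVX__)
#define AVX_FN __attribute__((target("avx")))
#else
#define AVX_FN
#endif

/* Converts n_pixels RGB pixels to RGBA with alpha = 1.0.
   Assumes n_pixels is a multiple of 8 and both buffers are 32-byte aligned. */
AVX_FN void rgb_to_rgba_avx(const float *srcPix, float *dstPix, size_t n_pixels)
{
    assert(n_pixels % 8 == 0);
    assert(((uintptr_t)srcPix & 31u) == 0 && ((uintptr_t)dstPix & 31u) == 0);

    /* _mm256_setr_epi32 lists elements from lowest to highest. */
    const __m256i ctrl  = _mm256_setr_epi32(0, 1, 2, 0, 0, 0, 1, 0);
    const __m256i ctrl2 = _mm256_setr_epi32(2, 3, 0, 0, 1, 2, 3, 0);
    const __m256 alphaOne = _mm256_set1_ps(1.0f);

    for (size_t i = 0; i < n_pixels; i += 8, srcPix += 24, dstPix += 32) {
        /* Step1: destination pixels 0 and 1 */
        __m256 pixel23 = _mm256_load_ps(srcPix);
        __m256 pixel01 = _mm256_permutevar_ps(pixel23, ctrl);
        __m256 pixelTemp = _mm256_broadcast_ss(srcPix + 3);   /* R1 */
        pixel01 = _mm256_blend_ps(pixel01, pixelTemp, 16);
        pixel01 = _mm256_blend_ps(pixel01, alphaOne, 136);
        _mm256_store_ps(dstPix, pixel01);

        /* Step2: destination pixels 2 and 3 */
        __m256 pixel45 = _mm256_load_ps(srcPix + 8);
        pixel23 = _mm256_permute2f128_ps(pixel23, pixel45, 33);
        pixel23 = _mm256_permutevar_ps(pixel23, ctrl2);
        pixelTemp = _mm256_broadcast_ss(srcPix + 8);          /* B2 */
        pixel23 = _mm256_blend_ps(pixel23, pixelTemp, 4);
        pixel23 = _mm256_blend_ps(pixel23, alphaOne, 136);
        _mm256_store_ps(dstPix + 8, pixel23);

        /* Step3: destination pixels 4 and 5 */
        __m256 pixel67 = _mm256_load_ps(srcPix + 16);
        pixel45 = _mm256_permute2f128_ps(pixel45, pixel67, 33);
        pixel45 = _mm256_permutevar_ps(pixel45, ctrl);
        pixelTemp = _mm256_broadcast_ss(srcPix + 15);         /* R5 */
        pixel45 = _mm256_blend_ps(pixel45, pixelTemp, 16);
        pixel45 = _mm256_blend_ps(pixel45, alphaOne, 136);
        _mm256_store_ps(dstPix + 16, pixel45);

        /* Step4: destination pixels 6 and 7, reusing the Step3 load */
        pixel67 = _mm256_permutevar_ps(pixel67, ctrl2);
        pixelTemp = _mm256_broadcast_ss(srcPix + 20);         /* B6 */
        pixel67 = _mm256_blend_ps(pixel67, pixelTemp, 4);
        pixel67 = _mm256_blend_ps(pixel67, alphaOne, 136);
        _mm256_store_ps(dstPix + 24, pixel67);
    }
}
```

Each iteration issues three aligned loads and four aligned stores. An image whose pixel count is not a multiple of eight would need a scalar tail loop, which this sketch leaves out.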
RGBA to RGB
The destination and source pixel buffers are aligned to 32-byte boundaries and the conversion routines expect them to be so. The following figure depicts the arrangement of the source and destination buffers in memory for n pixels. In this figure R0 is at a lower address than G0, and so on.

Figure 7: Arrangement of source and destination pixels in memory
In each iteration of the loop we load multiple source pixels, rearrange the data to remove the alpha values, and store the result to the destination address.
Since the conversion is from a 4-channel pixel to a 3-channel pixel, we could have loaded sixteen SP FP values from the source (four RGBA pixels) and written twelve values (four RGB pixels) per iteration. Doing so would force us to use unaligned stores, because the next iteration would have to write its result at an offset of twelve from the destination address. As explained before, we avoid all unaligned accesses by unrolling the loop twice, thus writing twenty-four values (eight RGB pixels) per iteration.
We first load sixteen SP FP values starting from the source address by invoking the _mm256_load_ps() aligned-load intrinsic twice. The pixels are then rearranged using a combination of the _mm256_permutevar_ps() and _mm256_permute2f128_ps() intrinsics, and the intermediate results are blended using an appropriate mask to produce the first set of eight destination FP values. The following figure illustrates this step (Step1).

Figure 8: RGBA to RGB Step1
The next set of eight FP values is loaded and combined with the previously loaded values, using a series of _mm256_permute2f128_ps(), _mm256_permutevar_ps(), _mm256_blend_ps(), and _mm256_broadcast_ss() intrinsics, to produce the next set of eight destination values, as illustrated below (Step2).

Figure 9: RGBA to RGB Step2
In the third step (Step3), source RGBA pixels six and seven are loaded from an offset of twenty-four from the source address, then shuffled and blended with the previously loaded pixels four and five using a series of _mm256_permute2f128_ps(), _mm256_permutevar_ps(), and _mm256_blend_ps() intrinsics to produce the last set of destination values for the current iteration. The following figure depicts this step.

Figure 10: RGBA to RGB Step3
The source and destination addresses are then incremented by thirty-two and twenty-four SP FP values respectively, and Step1, Step2, and Step3 are repeated for the remaining pixels.
The figure below shows the source code that demonstrates the above steps.
// 8 RGBA ==> 8 RGB conversion per iteration
// [A1 B1 G1 R1 , A0 B0 G0 R0]
__m256 pixel01 = _mm256_load_ps((float *)(srcPix));
// [* * B1 G1 , * B0 G0 R0]
__m256 pixelTmp = _mm256_permutevar_ps(pixel01, ctrl1);
// [A3 B3 G3 R3 , A2 B2 G2 R2]
__m256 pixel23 = _mm256_load_ps((float *)(srcPix)+8);
// [A2 B2 G2 R2 , A1 B1 G1 R1], 0x21 = 00 10 00 01
__m256 pixel12 = _mm256_permute2f128_ps(pixel01, pixel23, 0x21);
// [G2 R2 * * , R1 * * * ]
pixel12 = _mm256_permutevar_ps(pixel12, ctrl2);
// [G2 R2 B1 G1 , R1 B0 G0 R0], 0xC8 = 11 00 10 00
pixel01 = _mm256_blend_ps(pixelTmp, pixel12, 0xC8);
_mm256_store_ps((float *)(dstPix), pixel01);
// [B2 B2 B2 B2 , B2 B2 B2 B2]
pixelTmp = _mm256_broadcast_ss((float *)(srcPix)+10);
// [A5 B5 G5 R5 , A4 B4 G4 R4]
__m256 pixel45 = _mm256_load_ps((float *)(srcPix)+16);
// [A4 B4 G4 R4 , A3 B3 G3 R3]
__m256 pixel34 = _mm256_permute2f128_ps(pixel23, pixel45, 0x21);
// [* B4 G4 R4 , B3 G3 R3 * ]
pixel23 = _mm256_permutevar_ps(pixel34, ctrl3);
// [* B4 G4 R4 , B3 G3 R3 B2], 0x1 = 00 00 00 01
pixel23 = _mm256_blend_ps(pixel23, pixelTmp, 0x1);
// [R5 R5 R5 R5 , R5 R5 R5 R5]
pixelTmp = _mm256_broadcast_ss((float *)(srcPix)+20);
// [R5 B4 G4 R4 , B3 G3 R3 B2], 0x80 = 10 00 00 00
pixel23 = _mm256_blend_ps(pixel23, pixelTmp, 0x80);
_mm256_store_ps((float *)(dstPix)+8, pixel23);
// [A7 B7 G7 R7 , A6 B6 G6 R6]
__m256 pixel67 = _mm256_load_ps((float *)(srcPix)+24);
// [A6 B6 G6 R6 , A5 B5 G5 R5]
__m256 pixel56 = _mm256_permute2f128_ps(pixel45, pixel67, 0x21);
// [* * * B6 , * * B5 G5]
pixel56 = _mm256_permutevar_ps(pixel56, ctrl4);
// [B7 G7 R7 * , G6 R6 * * ]
pixel67 = _mm256_permutevar_ps(pixel67, ctrl5);
// [B7 G7 R7 B6 , G6 R6 B5 G5], 0xEC = 11 10 11 00
pixel56 = _mm256_blend_ps(pixel56, pixel67, 0xEC);
_mm256_store_ps((float *)(dstPix)+16, pixel56);
Figure 11: Intel® AVX RGBA to RGB conversion code
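Figure 11 uses five control vectors, ctrl1 through ctrl5, whose values the article does not list. The sketch below derives one possible set from the register-layout comments in the listing; those values, the function name rgba_to_rgb_avx, the float pointer types, the multiple-of-eight pixel count, and the hypothetical AVX_FN macro are all assumptions rather than the original code. Control elements that feed "don't care" positions are set to 0.

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: per-function AVX code generation on GCC/Clang when AVX is not
   enabled for the whole translation unit. */
#if defined(__GNUC__) && !defined(__AVX__)
#define AVX_FN __attribute__((target("avx")))
#else
#define AVX_FN
#endif

/* Converts n_pixels RGBA pixels to RGB by dropping alpha.
   Assumes n_pixels is a multiple of 8 and both buffers are 32-byte aligned. */
AVX_FN void rgba_to_rgb_avx(const float *srcPix, float *dstPix, size_t n_pixels)
{
    assert(n_pixels % 8 == 0);
    assert(((uintptr_t)srcPix & 31u) == 0 && ((uintptr_t)dstPix & 31u) == 0);

    /* Derived from the layout comments; listed lowest element first. */
    const __m256i ctrl1 = _mm256_setr_epi32(0, 1, 2, 0, 1, 2, 0, 0);
    const __m256i ctrl2 = _mm256_setr_epi32(0, 0, 0, 0, 0, 0, 0, 1);
    const __m256i ctrl3 = _mm256_setr_epi32(0, 0, 1, 2, 0, 1, 2, 0);
    const __m256i ctrl4 = _mm256_setr_epi32(1, 2, 0, 0, 2, 0, 0, 0);
    const __m256i ctrl5 = _mm256_setr_epi32(0, 0, 0, 1, 0, 0, 1, 2);

    for (size_t i = 0; i < n_pixels; i += 8, srcPix += 32, dstPix += 24) {
        /* Step1: R0 G0 B0 R1 G1 B1 R2 G2 */
        __m256 pixel01 = _mm256_load_ps(srcPix);
        __m256 pixelTmp = _mm256_permutevar_ps(pixel01, ctrl1);
        __m256 pixel23 = _mm256_load_ps(srcPix + 8);
        __m256 pixel12 = _mm256_permute2f128_ps(pixel01, pixel23, 0x21);
        pixel12 = _mm256_permutevar_ps(pixel12, ctrl2);
        pixel01 = _mm256_blend_ps(pixelTmp, pixel12, 0xC8);
        _mm256_store_ps(dstPix, pixel01);

        /* Step2: B2 R3 G3 B3 R4 G4 B4 R5 */
        pixelTmp = _mm256_broadcast_ss(srcPix + 10);          /* B2 */
        __m256 pixel45 = _mm256_load_ps(srcPix + 16);
        __m256 pixel34 = _mm256_permute2f128_ps(pixel23, pixel45, 0x21);
        pixel23 = _mm256_permutevar_ps(pixel34, ctrl3);
        pixel23 = _mm256_blend_ps(pixel23, pixelTmp, 0x1);
        pixelTmp = _mm256_broadcast_ss(srcPix + 20);          /* R5 */
        pixel23 = _mm256_blend_ps(pixel23, pixelTmp, 0x80);
        _mm256_store_ps(dstPix + 8, pixel23);

        /* Step3: G5 B5 R6 G6 B6 R7 G7 B7 */
        __m256 pixel67 = _mm256_load_ps(srcPix + 24);
        __m256 pixel56 = _mm256_permute2f128_ps(pixel45, pixel67, 0x21);
        pixel56 = _mm256_permutevar_ps(pixel56, ctrl4);
        pixel67 = _mm256_permutevar_ps(pixel67, ctrl5);
        pixel56 = _mm256_blend_ps(pixel56, pixel67, 0xEC);
        _mm256_store_ps(dstPix + 16, pixel56);
    }
}
```

Each iteration issues four aligned loads and three aligned stores, the mirror image of the RGB-to-RGBA loop.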
Results
Two implementations of each conversion, a scalar C++ implementation and the 256-bit Intel® AVX implementation, were compared for performance on the Intel® AVX simulator. The runtime of each implementation was averaged over three runs before comparison. The following table shows the speedup achieved by the 256-bit version.
References and Resources