Vector programming. SSE4.2 to AVX2 conversion examples.

By Evgeny V Stupachenko, published on January 15, 2015

In this post I show how to convert SSE4.2 assembly to AVX2 (using the schemes from the blog post “Programming using AVX2”) and how the conversion affects performance.

  • Easy case: it is enough to add the “v” prefix and replace “xmm” with “ymm”.

Consider we have the following loop:

for (i = 0; i < 1024; i++)
  a[i] += b[i];

where “a” and “b” are unsigned char arrays.

The SSE4.2 assembly for this loop looks like:

.L2:
        movdqa a(%rax), %xmm0
        paddb b(%rax), %xmm0
        addq $16, %rax
        movaps %xmm0, a-16(%rax)
        cmpq $1024, %rax
        jne .L2

Converting to AVX2 is easy (add the “v” prefix, replace “xmm” with “ymm”, and extend the load/store step from 16 to 32 bytes):

.L2:
        vmovdqa a(%rax), %ymm0
        vpaddb b(%rax), %ymm0, %ymm0
        addq $32, %rax
        vmovaps %ymm0, a-32(%rax)
        cmpq $1024, %rax
        jne .L2

The gain is about 1.9 times. The same technique works for pmul, psub, pmin, …
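The same loop can also be sketched in C with intrinsics instead of hand-written assembly. This is only an illustration under a few assumptions: the function name `add_bytes` is hypothetical, unaligned loads and stores are used so the arrays need no 32-byte alignment (the listings above use aligned moves), and the vector path is compiled only when AVX2 is enabled (for example with `-mavx2`), with a scalar loop otherwise.

```c
#include <stddef.h>
#include <stdint.h>
#ifdef __AVX2__
#include <immintrin.h>
#endif

/* a[i] += b[i] for n bytes; the sum wraps modulo 256, like paddb/vpaddb.
   add_bytes is a hypothetical name used only for this sketch. */
void add_bytes(uint8_t *a, const uint8_t *b, size_t n)
{
    size_t i = 0;
#ifdef __AVX2__
    /* 32 bytes per iteration: load, vpaddb, store. */
    for (; i + 32 <= n; i += 32) {
        __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
        _mm256_storeu_si256((__m256i *)(a + i), _mm256_add_epi8(va, vb));
    }
#endif
    /* Scalar tail; also the whole loop when AVX2 is not enabled. */
    for (; i < n; i++)
        a[i] = (uint8_t)(a[i] + b[i]);
}
```

Calling `add_bytes(a, b, 1024)` corresponds to the loop above; the scalar tail is never entered for that size when the vector path is active.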

  • Permutations (schemes from the blog post “Programming using AVX2”).

VPACK

Consider we have the following loop:

for (i = 0; i < 1024; i++)
  a[i] = b[2 * i];

where “a” and “b” are unsigned char arrays.

Then one possible solution for SSE4.2 (%xmm1 holds 0x00ff in every 16-bit word, i.e. 0x00ff00ff00ff00ff00ff00ff00ff00ff):

.L2:
        movdqa b(%rax,%rax), %xmm0
        movdqa b+16(%rax,%rax), %xmm2
        addq $16, %rax
        pand %xmm1, %xmm0
        pand %xmm1, %xmm2
        packuswb %xmm2, %xmm0
        movaps %xmm0, a-16(%rax)
        cmpq $1024, %rax
        jne .L2

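To convert to AVX2, make the same replacements as in the easy case and add a “vpermq” instruction (%ymm1 holds the same constant extended to 256 bits). The extra instruction is needed because vpackuswb packs each 128-bit lane independently, so the packed quadwords come out interleaved between the two sources and must be put back in order: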

.L2:
        vmovdqa b(%rax,%rax), %ymm0
        vmovdqa b+32(%rax,%rax), %ymm2
        addq $32, %rax
        vpand %ymm1, %ymm0, %ymm0
        vpand %ymm1, %ymm2, %ymm2
        vpackuswb %ymm2, %ymm0, %ymm0
        vpermq $216, %ymm0, %ymm0       # 216 = 0xD8: take quadwords 0,2,1,3
        vmovaps %ymm0, a-32(%rax)
        cmpq $1024, %rax
        jne .L2

The gain is about 1.5 times.
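The pack-then-permute scheme can be written with intrinsics as in the sketch below. The function name `take_even_bytes` is hypothetical, unaligned loads and stores are an assumption made for convenience, and the vector path is compiled only when AVX2 is enabled (for example with `-mavx2`). The immediate 0xD8 is the same value as the 216 used in the listing.

```c
#include <stddef.h>
#include <stdint.h>
#ifdef __AVX2__
#include <immintrin.h>
#endif

/* a[i] = b[2 * i] for n output bytes (b must hold at least 2 * n bytes).
   take_even_bytes is a hypothetical name used only for this sketch. */
void take_even_bytes(uint8_t *a, const uint8_t *b, size_t n)
{
    size_t i = 0;
#ifdef __AVX2__
    /* 0x00ff in every 16-bit word keeps the even (low) byte of each pair. */
    const __m256i mask = _mm256_set1_epi16(0x00ff);
    for (; i + 32 <= n; i += 32) {
        __m256i lo = _mm256_loadu_si256((const __m256i *)(b + 2 * i));
        __m256i hi = _mm256_loadu_si256((const __m256i *)(b + 2 * i + 32));
        lo = _mm256_and_si256(lo, mask);
        hi = _mm256_and_si256(hi, mask);
        /* Per-lane pack: quadwords come out as lo0, hi0, lo1, hi1. */
        __m256i packed = _mm256_packus_epi16(lo, hi);
        /* 0xD8 selects quadwords 0,2,1,3, giving lo0, lo1, hi0, hi1. */
        packed = _mm256_permute4x64_epi64(packed, 0xD8);
        _mm256_storeu_si256((__m256i *)(a + i), packed);
    }
#endif
    /* Scalar tail; also the whole loop when AVX2 is not enabled. */
    for (; i < n; i++)
        a[i] = b[2 * i];
}
```

Masking before the pack matters: packuswb saturates, so any word above 255 would become 0xff instead of keeping its low byte.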

VPUNPCK{L,H}

Consider we have the following loop:

for (i = 0; i < 1024; i++)
    {
       b[2 * i] = a[i];
       b[2 * i + 1] = 2 * a[i];
    }

Then one possible solution for SSE4.2:

.L2:
        movdqa  a(%rax), %xmm0
        movdqa  %xmm0, %xmm1
        movdqa  %xmm0, %xmm2
        paddb   %xmm0, %xmm1
        punpcklbw       %xmm1, %xmm2
        punpckhbw       %xmm1, %xmm0
        movaps  %xmm2, b(%rax,%rax)
        movaps  %xmm0, b+16(%rax,%rax)
        addq    $16, %rax
        cmpq    $1024, %rax
        jne     .L2

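To convert to AVX2, make the same replacements as in the easy case and add two “vperm2i128” instructions. They are needed because vpunpcklbw and vpunpckhbw interleave each 128-bit lane independently, so the lanes of the two results must be recombined before storing: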

.L2:
        vmovdqa a(%rax), %ymm0
        vmovdqa %ymm0, %ymm1
        vmovdqa %ymm0, %ymm2
        vpaddb  %ymm0, %ymm1, %ymm1
        vpunpcklbw      %ymm1, %ymm2, %ymm2
        vpunpckhbw      %ymm1, %ymm0, %ymm0
        vperm2i128 $32, %ymm0, %ymm2, %ymm1     # 0x20: low lanes of ymm2 and ymm0
        vperm2i128 $49, %ymm0, %ymm2, %ymm0     # 0x31: high lanes of ymm2 and ymm0
        vmovaps %ymm1, b(%rax,%rax)
        vmovaps %ymm0, b+32(%rax,%rax)
        addq    $32, %rax
        cmpq    $1024, %rax
        jne     .L2

The performance is almost flat (~3% gain) in this case.
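An intrinsics version of the unpack scheme might look like the sketch below. The function name `interleave_with_double` is hypothetical, unaligned loads and stores are an assumption, and the vector path is compiled only when AVX2 is enabled (for example with `-mavx2`). The immediates 0x20 and 0x31 equal the 32 and 49 used in the listing.

```c
#include <stddef.h>
#include <stdint.h>
#ifdef __AVX2__
#include <immintrin.h>
#endif

/* b[2 * i] = a[i]; b[2 * i + 1] = 2 * a[i] (mod 256), for n input bytes.
   interleave_with_double is a hypothetical name used only for this sketch. */
void interleave_with_double(uint8_t *b, const uint8_t *a, size_t n)
{
    size_t i = 0;
#ifdef __AVX2__
    for (; i + 32 <= n; i += 32) {
        __m256i v  = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i v2 = _mm256_add_epi8(v, v);           /* 2 * a[i], wrapping */
        /* Per-lane interleave: lo covers input bytes 0..7 and 16..23,
           hi covers input bytes 8..15 and 24..31. */
        __m256i lo = _mm256_unpacklo_epi8(v, v2);
        __m256i hi = _mm256_unpackhi_epi8(v, v2);
        /* 0x20: low lanes of lo and hi; 0x31: high lanes of lo and hi. */
        __m256i first  = _mm256_permute2x128_si256(lo, hi, 0x20);
        __m256i second = _mm256_permute2x128_si256(lo, hi, 0x31);
        _mm256_storeu_si256((__m256i *)(b + 2 * i), first);
        _mm256_storeu_si256((__m256i *)(b + 2 * i + 32), second);
    }
#endif
    /* Scalar tail; also the whole loop when AVX2 is not enabled. */
    for (; i < n; i++) {
        b[2 * i] = a[i];
        b[2 * i + 1] = (uint8_t)(2 * a[i]);
    }
}
```

Two cross-lane permutes per 32 input bytes is a plausible reason this conversion gains so little compared with the easy case, although the measurements above do not break the cost down.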

Processor used for the measurements: "Haswell: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz"

Note that these conversions do not always produce optimal AVX2 code. However, the converted code is generally faster. Moreover, when we add more calculations (without permutations) to a loop, the performance gain gets closer to 2 times.
