Vector programming. SSE4.2 to AVX2 conversion examples.

By Evgeny V Stupachenko

Published: 01/15/2015   Last Updated: 01/15/2015

In this blog I will show how to convert SSE4.2 assembly to AVX2 (using the schemes from the blog Programming using AVX2) and how the conversion affects performance.

  • Easy case: when it is enough to add the “v” prefix and replace “xmm” with “ymm”.

Suppose we have the following loop:

for (i = 0; i < 1024; i++)
  a[i] += b[i];

where “a” and “b” are unsigned char arrays.

The SSE4.2 assembly for it looks like this:

.L2:
        movdqa a(%rax), %xmm0
        paddb b(%rax), %xmm0
        addq $16, %rax
        movaps %xmm0, a-16(%rax)
        cmpq $1024, %rax
        jne .L2
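
The same SSE4.2 loop can also be written with compiler intrinsics. A minimal sketch (the function name and the assumption that “a” and “b” are 16-byte aligned are mine, not from the original code):

#include <emmintrin.h>

/* a[i] += b[i] for 1024 unsigned chars; assumes 16-byte aligned arrays */
void add_bytes_sse(unsigned char *a, const unsigned char *b)
{
    int i;
    for (i = 0; i < 1024; i += 16) {
        __m128i va = _mm_load_si128((const __m128i *)(a + i)); /* movdqa load   */
        __m128i vb = _mm_load_si128((const __m128i *)(b + i));
        va = _mm_add_epi8(va, vb);                             /* paddb         */
        _mm_store_si128((__m128i *)(a + i), va);               /* aligned store */
    }
}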

Converting to AVX2 is easy (add the “v” prefix, replace “xmm” with “ymm”, and extend the load/store step from 16 to 32 bytes):

.L2:
        vmovdqa a(%rax), %ymm0
        vpaddb b(%rax), %ymm0, %ymm0
        addq $32, %rax
        vmovaps %ymm0, a-32(%rax)
        cmpq $1024, %rax
        jne .L2

The gain is about 1.9 times. The technique is the same for pmul, psub, pmin, and so on.
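
The AVX2 intrinsics version follows the same pattern. A minimal sketch, assuming 32-byte aligned arrays (the function name is mine):

#include <immintrin.h>

/* a[i] += b[i] for 1024 unsigned chars; assumes 32-byte aligned arrays */
void add_bytes_avx2(unsigned char *a, const unsigned char *b)
{
    int i;
    for (i = 0; i < 1024; i += 32) {
        __m256i va = _mm256_load_si256((const __m256i *)(a + i)); /* vmovdqa load  */
        __m256i vb = _mm256_load_si256((const __m256i *)(b + i));
        va = _mm256_add_epi8(va, vb);                             /* vpaddb        */
        _mm256_store_si256((__m256i *)(a + i), va);               /* aligned store */
    }
}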

  • Permutations (from the blog Programming using AVX2).

VPACK

Suppose we have the following loop:

for (i = 0; i < 1024; i++)
  a[i] = b[2 * i];

where “a” and “b” are unsigned char arrays.

One possible SSE4.2 solution, where %xmm1 holds the mask 0x00ff in every 16-bit element (0x00ff00ff00ff00ff00ff00ff00ff00ff), so that only the even-indexed bytes of “b” survive the pack:

.L2:
        movdqa b(%rax,%rax), %xmm0
        movdqa b+16(%rax,%rax), %xmm2
        addq $16, %rax
        pand %xmm1, %xmm0
        pand %xmm1, %xmm2
        packuswb %xmm2, %xmm0
        movaps %xmm0, a-16(%rax)
        cmpq $1024, %rax
        jne .L2
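
In intrinsics the same mask-and-pack idea looks roughly like this (a sketch; the function name and alignment assumptions are mine):

#include <emmintrin.h>

/* a[i] = b[2 * i] for 1024 unsigned chars; assumes 16-byte aligned arrays */
void pick_even_sse(unsigned char *a, const unsigned char *b)
{
    const __m128i mask = _mm_set1_epi16(0x00FF); /* keep the low byte of each word */
    int i;
    for (i = 0; i < 1024; i += 16) {
        __m128i lo = _mm_load_si128((const __m128i *)(b + 2 * i));      /* movdqa */
        __m128i hi = _mm_load_si128((const __m128i *)(b + 2 * i + 16));
        lo = _mm_and_si128(lo, mask);                                   /* pand */
        hi = _mm_and_si128(hi, mask);
        _mm_store_si128((__m128i *)(a + i), _mm_packus_epi16(lo, hi));  /* packuswb */
    }
}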

To convert to AVX2 we make the same replacements as in the easy case and add a “vpermq” instruction (%ymm1 holds the same mask extended to 256 bits):

.L2:
        vmovdqa b(%rax,%rax), %ymm0
        vmovdqa b+32(%rax,%rax), %ymm2
        addq $32, %rax
        vpand %ymm1, %ymm0, %ymm0
        vpand %ymm1, %ymm2, %ymm2
        vpackuswb %ymm2, %ymm0, %ymm0
        vpermq $216, %ymm0, %ymm0
        vmovaps %ymm0, a-32(%rax)
        cmpq $1024, %rax
        jne .L2

The gain is about 1.5 times.
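
A corresponding AVX2 intrinsics sketch (names and alignment assumptions are mine); note the extra _mm256_permute4x64_epi64 (vpermq) that restores the element order after the in-lane pack:

#include <immintrin.h>

/* a[i] = b[2 * i] for 1024 unsigned chars; assumes 32-byte aligned arrays */
void pick_even_avx2(unsigned char *a, const unsigned char *b)
{
    const __m256i mask = _mm256_set1_epi16(0x00FF);
    int i;
    for (i = 0; i < 1024; i += 32) {
        __m256i lo = _mm256_load_si256((const __m256i *)(b + 2 * i));
        __m256i hi = _mm256_load_si256((const __m256i *)(b + 2 * i + 32));
        lo = _mm256_and_si256(lo, mask);                  /* vpand */
        hi = _mm256_and_si256(hi, mask);
        __m256i packed = _mm256_packus_epi16(lo, hi);     /* vpackuswb, per 128-bit lane */
        packed = _mm256_permute4x64_epi64(packed, 0xD8);  /* vpermq $216: qwords 0,2,1,3 */
        _mm256_store_si256((__m256i *)(a + i), packed);
    }
}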

VPUNPCK{L,H}

Suppose we have the following loop:

for (i = 0; i < 1024; i++)
  {
    b[2 * i] = a[i];
    b[2 * i + 1] = 2 * a[i];
  }

One possible SSE4.2 solution (“a” and “b” are again unsigned char arrays):

.L2:
        movdqa  a(%rax), %xmm0
        movdqa  %xmm0, %xmm1
        movdqa  %xmm0, %xmm2
        paddb   %xmm0, %xmm1
        punpcklbw       %xmm1, %xmm2
        punpckhbw       %xmm1, %xmm0
        movaps  %xmm2, b(%rax,%rax)
        movaps  %xmm0, b+16(%rax,%rax)
        addq    $16, %rax
        cmpq    $1024, %rax
        jne     .L2
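
The same interleave in SSE4.2 intrinsics could be sketched as follows (the function name and the assumption that “a” and “b” are 16-byte aligned are mine):

#include <emmintrin.h>

/* b[2*i] = a[i], b[2*i+1] = 2*a[i] for 1024 unsigned chars; 16-byte aligned arrays */
void interleave_sse(const unsigned char *a, unsigned char *b)
{
    int i;
    for (i = 0; i < 1024; i += 16) {
        __m128i x  = _mm_load_si128((const __m128i *)(a + i));
        __m128i x2 = _mm_add_epi8(x, x);                     /* paddb: 2*a[i] (mod 256) */
        _mm_store_si128((__m128i *)(b + 2 * i),
                        _mm_unpacklo_epi8(x, x2));           /* punpcklbw */
        _mm_store_si128((__m128i *)(b + 2 * i + 16),
                        _mm_unpackhi_epi8(x, x2));           /* punpckhbw */
    }
}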

To convert to AVX2 we make the same replacements as in the easy case and add two “vperm2i128” instructions:

.L2:
        vmovdqa a(%rax), %ymm0
        vmovdqa %ymm0, %ymm1
        vmovdqa %ymm0, %ymm2
        vpaddb  %ymm0, %ymm1, %ymm1
        vpunpcklbw      %ymm1, %ymm2, %ymm2
        vpunpckhbw      %ymm1, %ymm0, %ymm0
        vperm2i128 $32, %ymm0, %ymm2, %ymm1
        vperm2i128 $49, %ymm0, %ymm2, %ymm0
        vmovaps %ymm1, b(%rax,%rax)
        vmovaps %ymm0, b+32(%rax,%rax)
        addq    $32, %rax
        cmpq    $1024, %rax
        jne     .L2

The performance is almost flat (~3% gain) in this case.
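
An AVX2 intrinsics sketch of the same conversion (names and alignment assumptions are mine); the two _mm256_permute2x128_si256 (vperm2i128) calls put the 128-bit lanes back into memory order:

#include <immintrin.h>

/* b[2*i] = a[i], b[2*i+1] = 2*a[i] for 1024 unsigned chars; 32-byte aligned arrays */
void interleave_avx2(const unsigned char *a, unsigned char *b)
{
    int i;
    for (i = 0; i < 1024; i += 32) {
        __m256i x  = _mm256_load_si256((const __m256i *)(a + i));
        __m256i x2 = _mm256_add_epi8(x, x);            /* vpaddb: 2*a[i] (mod 256) */
        __m256i lo = _mm256_unpacklo_epi8(x, x2);      /* vpunpcklbw, per 128-bit lane */
        __m256i hi = _mm256_unpackhi_epi8(x, x2);      /* vpunpckhbw, per 128-bit lane */
        _mm256_store_si256((__m256i *)(b + 2 * i),
                           _mm256_permute2x128_si256(lo, hi, 0x20)); /* low lanes  */
        _mm256_store_si256((__m256i *)(b + 2 * i + 32),
                           _mm256_permute2x128_si256(lo, hi, 0x31)); /* high lanes */
    }
}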

Processor used for the measurements: "Haswell: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz"

Note that these conversions do not always produce optimal AVX2 code. However, the converted code is generally faster, and when we add more computation (without permutations) to a loop, the gain moves closer to 2 times.
