AVX512 auto-vectorization on i9-7900X

Hello,

I wrote a small piece of code to test the auto-vectorization options of the icc compiler. The code sums two arrays (vectors) of doubles and stores the result in a third array. I compiled it with both -xCORE-AVX2 and -xCORE-AVX512 and was expecting a good fraction of the 2x theoretical maximum speedup. Instead, I saw almost the same execution time for both versions (sometimes the AVX512 build was even worse). At first I thought the cause was the array size, which doesn't fit in L1, but when I repeated the runs with arrays of only 256 elements, I got the same lack of speedup. I noticed the same effect with a series of other scientific apps, and I just can't accept that for every single app I tried I get no speedup at all (unless the app was precompiled somewhere else).

So I checked the objdump of the AVX512-generated binary for my toy program and noticed that there was no usage of zmm registers in the hot loop. However, when I used an online compiler explorer (https://godbolt.org/#) I could see the beautiful AVX512 code I expected to see on my machine as well. The funny thing is that I even use a more recent version of the icc compiler, and I still can't get it to produce good code. What could be the issue here? Are there any hints I have to give the compiler? That doesn't seem to bother the online compiler I tested. I tried different pragmas, the trip counts are known in advance, the arrays are aligned, and -qopt-report says it vectorized the loop.

Any suggestion would be helpful.

Here is the C code for the toy app.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <immintrin.h>  /* _mm_malloc, _mm_free */

#define N 32768
#define MRUNS 100000000

void init_arrays(double *a, double *b, int n)
{
    int i;
    for(i=0; i<n; i++)
    {
        a[i] = rand();
        b[i] = rand();
    }
}

int main(int argc, char* argv[])
{
    double *a, *b, *c;
    int i, j;
    double sum = 0.0;
    srand(time(NULL));
    a = (double*) _mm_malloc(N * sizeof(double), 64);
    b = (double*) _mm_malloc(N * sizeof(double), 64);
    c = (double*) _mm_malloc(N * sizeof(double), 64);
    init_arrays(a, b, N);

    for(i=0; i<MRUNS; i++)
    {
        for(j=0; j<N; j++)
        {
            c[j] = a[j]+b[j];
        }
    }

    for(i=0; i<N; i++)
        sum += c[i];
    printf("%f\n", sum);  /* keep the result live so the loops are not optimized away */
    _mm_free(a);
    _mm_free(b);
    _mm_free(c);
    return 0;
}
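For reference, this is roughly how I build the binary and check it for zmm usage (the source file name toy.c is just a placeholder):

```shell
# Roughly how I build and inspect the binary; toy.c is a placeholder name
icc -O3 -xCORE-AVX512 -qopt-report=5 -qopt-report-phase=vec toy.c -o toy
objdump -d toy | grep -c zmm    # 0 means no 512-bit registers in the binary
```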

Here is a portion of the assembly generated by ICC on my machine:

..B1.12:                        # Preds ..B1.12 ..B1.11
                                # Execution count [3.28e+10]
..L12:
                # optimization report
                # LOOP WAS UNROLLED BY 4
                # LOOP WAS VECTORIZED
                # VECTORIZATION SPEEDUP COEFFECIENT 6.402344
                # VECTOR TRIP COUNT IS KNOWN CONSTANT
                # VECTOR LENGTH 4
                # MAIN VECTOR TYPE: 64-bits floating point
        vmovupd   (%r12,%rcx,8), %ymm0                          #36.20
        vmovupd   32(%r12,%rcx,8), %ymm2                        #36.20
        vmovupd   64(%r12,%rcx,8), %ymm4                        #36.20
        vmovupd   96(%r12,%rcx,8), %ymm6                        #36.20
        vaddpd    (%rbx,%rcx,8), %ymm0, %ymm1                   #36.25
        vaddpd    32(%rbx,%rcx,8), %ymm2, %ymm3                 #36.25
        vaddpd    64(%rbx,%rcx,8), %ymm4, %ymm5                 #36.25
        vaddpd    96(%rbx,%rcx,8), %ymm6, %ymm7                 #36.25
        vmovupd   %ymm1, (%rax,%rcx,8)                          #36.13
        vmovupd   %ymm3, 32(%rax,%rcx,8)                        #36.13
        vmovupd   %ymm5, 64(%rax,%rcx,8)                        #36.13
        vmovupd   %ymm7, 96(%rax,%rcx,8)                        #36.13
        addq      $16, %rcx                                     #34.9
        cmpq      $32768, %rcx                                  #34.9
        jb        ..B1.12       # Prob 99%                      #34.9

And here is the assembly for the same loop generated by the online ICC compiler.

vmovups   zmm0, ZMMWORD PTR [r12+rcx*8]
vmovups   zmm2, ZMMWORD PTR [r12+rcx*8+0x40]
vmovups   zmm4, ZMMWORD PTR [r12+rcx*8+0x80]
vmovups   zmm6, ZMMWORD PTR [r12+rcx*8+0xc0]
vaddpd    zmm1, zmm0, ZMMWORD PTR [rbx+rcx*8]
vaddpd    zmm3, zmm2, ZMMWORD PTR [rbx+rcx*8+0x40]
vaddpd    zmm5, zmm4, ZMMWORD PTR [rbx+rcx*8+0x80]
vaddpd    zmm7, zmm6, ZMMWORD PTR [rbx+rcx*8+0xc0]
vmovupd   ZMMWORD PTR [rax+rcx*8], zmm1
vmovupd   ZMMWORD PTR [rax+rcx*8+0x40], zmm3
vmovupd   ZMMWORD PTR [rax+rcx*8+0x80], zmm5
vmovupd   ZMMWORD PTR [rax+rcx*8+0xc0], zmm7
add       rcx, 0x20
cmp       rcx, 0x8000
jb        400bf0 <main+0xd0>

I use ICC 17.0.3 and compiler explorer uses 17.0.0.
My CPU is a 10-core i9-7900X. Linux kernel 4.10.0-35-generic.

Best Reply

If I understand correctly, there was a change between compiler versions 17.0.0 and 17.0.3 that caused the compiler to be less aggressive about using 512-bit registers.   This is a challenging problem for compiler heuristics, because code using 256-bit registers can run at higher frequencies than code using 512-bit registers, and the compiler is making guesses about the relative sizes of the performance impacts of increased frequency vs increased vector width.

One workaround with version 17.0.3 is to use -xCOMMON-AVX512 instead of -xCORE-AVX512.   The COMMON-AVX512 target does not include the new heuristic trade-off code and will use zmm registers whenever it is possible to do so.

Starting with 17.0.5 (and 18), there is a new compiler flag "-qopt-zmm-usage=high" to override the default heuristic (which corresponds to "-qopt-zmm-usage=low").  For the Intel 18 compiler, this option is described at https://software.intel.com/en-us/cpp-compiler-18.0-developer-guide-and-r...
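Putting both workarounds in one place (the file name toy.c is a placeholder):

```shell
# 17.0.3: COMMON-AVX512 skips the frequency heuristic and always uses zmm
icc -O3 -xCOMMON-AVX512 toy.c -o toy

# 17.0.5 and 18.0: keep CORE-AVX512 but override the default zmm heuristic
icc -O3 -xCORE-AVX512 -qopt-zmm-usage=high toy.c -o toy
```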

 

"Dr. Bandwidth"

Thank you John,

this helped a lot.

I decided to use 18.0 and -xCORE-AVX512 with -qopt-zmm-usage=high.

 
