AVX is slower than serial execution?

I wrote a simple program and built it with icpc to examine the performance of AVX on my machine. The code snippet is as follows:

  #define T 2000000
  #define X 16
  #define Y 16
  #define Z 16

  for(int t=0;t<T;t++)
  for(int k=0;k<Z;k++)
  for(int j=0;j<Y;j++)
  for(int i=0;i<X;i++)
    A[k][j][i]=B[k][j][i]+C[k][j][i];

The configuration is as follows:

            icpc version 13.1.0 (gcc version 4.6.1 compatibility)

            FFLAGS="-O3 -xhost "

            Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz

            Red Hat Enterprise Linux Server release 6.3 (Santiago)

The experimental results are as follows:

niterator   size       serial time (s)   avx time (s)
2000000     12*12*12   1.09918           1.71405
2000000     16*16*16   2.58384           4.01935
200000      32*32*32   2.99971           5.18318

As the table shows, the AVX version always takes more time than the serial version.

Does anybody know why?

Thanks in advance!!!

Did you check if the AVX code was vectorized?

Yes, I have read the assembly:

        vmovupd   B(,%rdx,8), %ymm0                             #35.16
        vmovupd   32+B(,%rdx,8), %ymm2                          #35.16
        vmovupd   64+B(,%rdx,8), %ymm4                          #35.16
        vmovupd   96+B(,%rdx,8), %ymm6                          #35.16
        vaddpd    C(,%rdx,8), %ymm0, %ymm1                      #35.27
        vaddpd    32+C(,%rdx,8), %ymm2, %ymm3                   #35.27
        vaddpd    64+C(,%rdx,8), %ymm4, %ymm5                   #35.27
        vaddpd    96+C(,%rdx,8), %ymm6, %ymm7                   #35.27
        vmovupd   %ymm1, A(,%rdx,8)                             #35.5
        vmovupd   %ymm3, 32+A(,%rdx,8)                          #35.5
        vmovupd   %ymm5, 64+A(,%rdx,8)                          #35.5
        vmovupd   %ymm7, 96+A(,%rdx,8)                          #35.5
        addq      $16, %rdx                                     #32.3
        cmpq      $4096, %rdx                                   #32.3
        jb        ..B1.9        # Prob 99%                      #32.3

 

Can you post the serial version of the code (disassembly)?

OK, this is the assembly,

..B1.7:
        vmovsd    B(,%rdx,8), %xmm0                             #35.16
        vaddsd    C(,%rdx,8), %xmm0, %xmm1                      #35.27
        vmovsd    %xmm1, A(,%rdx,8)                             #35.5
        incq      %rdx                                          #32.3
        cmpq      $4096, %rdx                                   #32.3
        jb        ..B1.7        # Prob 99%                      #32.3

 

Are you hiding all those levels of loops in the disassembly, and the details of how you tested, for a reason? Maybe more shortcuts were taken in the comparison. Then again, maybe it doesn't matter for a compiler that old.

Hi Zhang,

A small runnable test case would help other users figure out what went wrong for you.

Thanks.

 

It looks like the 4x-unrolled version is executed 500k times, and for each iteration of the outer loop over T there are 4096 cycles in which the inner loops are executed. Each such inner-loop cycle consists of 12 AVX instructions. This could be the reason for the slower execution of the unrolled version.

Hi iliyapolak.

I can't understand what you said. Could you explain it in more detail?

@zhang

Sorry, I probably formulated my answer badly.

I meant that the unrolled version executes more AVX instructions in total than the serial version.

 

A small correction to post #3: the outermost loop is not unrolled. Looking at the posted disassembly, it seems that the k and j loops were collapsed or fused into one loop, which is unrolled 4x. This unrolling probably contributed to the worse performance of the "vector" version by inserting more machine-code instructions.

Ensure that arrays A, B, and C are cache-line aligned.

Jim Dempsey
