Does ICC support __attribute__((aligned(16))) in argument declarations?

zhangxiuxia

I wrote the following function:
int product(double *__attribute__((aligned(16))) A, double *__attribute__((aligned(16))) x, double *__attribute__((aligned(16))) y, int n)
{
    int i, j;
    for (j = 0; j < 100000; j++)
    {
        for (i = 0; i < n; i++)
        {
            y[i] = y[i] + A[i] * x[i];
        }
    }
    return 0;
}

It fails to compile with icc, but compiles fine with gcc. Why?

When I declare a variable using __attribute__((aligned(16))) inside a function, it does compile with icc.
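
For comparison, here is a minimal sketch of the local-variable form that icc accepts. The helper name local_is_aligned is hypothetical (it is not part of the original code) and it only checks the address at run time:

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 if a local array declared with
   aligned(16) really starts on a 16-byte boundary, 0 otherwise. */
int local_is_aligned(void)
{
    /* Alignment attribute on a local variable, not on an argument. */
    double buf[4] __attribute__((aligned(16))) = {1.0, 2.0, 3.0, 4.0};

    return ((uintptr_t)buf % 16) == 0;
}
```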

Aubrey W. (Intel)

Hello,

I will move this to our Intel C++ Compiler forum where one of our engineers can assist you.

Best regards,

==
Aubrey W.
Intel Software Network Support

BradleyKuszmaul

Here are some ways I found to make icc understand that my arguments are aligned. Here is some code:

void slowproduct (double *A, double *B, double  * C) {
    #pragma intel simd
    for (int i=0; i<4; i++) {
        C[i] += A[i]  * B[i];
    }
}

The slowproduct code vectorizes, but on my i7-2640M, which has AVX, it doesn't use the 256-bit registers, probably because it doesn't know that A, B, and C are aligned. It produces:
        vmovupd   (%rdi), %xmm0                                 #4.18
        vmulpd    (%rsi), %xmm0, %xmm1                          #4.18
        vaddpd    (%rdx), %xmm1, %xmm2                          #4.2
        vmovupd   %xmm2, (%rdx)                                 #1.6
        vmovupd   16(%rdi), %xmm3                               #4.18
        vmulpd    16(%rsi), %xmm3, %xmm4                        #4.18
        vaddpd    16(%rdx), %xmm4, %xmm5                        #4.2
        vmovupd   %xmm5, 16(%rdx)                               #1.6

But the following version produces really good code:
struct d4 {
    double d[4] __attribute__((aligned(16)));
};
void product (struct d4 * A, struct d4 * B, struct d4 *__restrict__ C) {
    for (int i=0; i<4; i++) {
        C->d[i] += A->d[i]  * B->d[i];
    }
}

It produces this for the whole loop:

        vmovupd   (%rdi), %ymm0                                 #15.24
        vmulpd    (%rsi), %ymm0, %ymm1                          #15.24
        vaddpd    (%rdx), %ymm1, %ymm2                          #15.2
        vmovupd   %ymm2, (%rdx)                                 #15.2

That is a vector load, a vector multiply, a vector add, and a vector store. I used the following to compile it:
icc -O2 -std=c99 -xHost  -S -o slowprod.S slowprod.c
There are several relevant issues:
1) By putting the array into a struct, I was able to make the struct properly aligned. If you want an array of 100000 doubles, you may want to declare A as "struct d4 A[25000]" and then write a doubly nested loop.
2) I declared C to be __restrict__ so that the compiler would understand that it can vectorize the code. You can also use #pragma intel simd, which tells the compiler to vectorize.
3) I used -xHost to make the compiler produce the fastest code it can for my particular machine. Sandy Bridge has AVX with 256-bit vector registers (4 doubles). Furthermore, Sandy Bridge can issue 8 double-precision floating-point operations per cycle per core, so my dual-core 2.8GHz laptop can peak at 44.8 GFLOPS (8 x 2.8 x 2, using both cores, with Turbo Boost disabled).
Here is another way to get the compiler to generate good code using pragmas.
void fastproduct (double *A, double *B, double  * C) {
    #pragma vector aligned
    #pragma intel simd
    for (int i=0; i<4; i++) {
        C[i] += A[i]  * B[i];
    }
}
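
The struct-of-four idea can also be stretched over a long array, as suggested in point 1 above. What follows is only a sketch under assumptions: the function name product_blocks and the nblocks parameter are hypothetical, and the code icc actually generates for it was not checked in this thread.

```c
#include <stddef.h>

/* Four doubles per block, aligned the same way as struct d4 above. */
struct d4 {
    double d[4] __attribute__((aligned(16)));
};

/* Hypothetical name: y += A * x, element-wise, over nblocks blocks of four.
   The outer loop walks the blocks, the inner loop is the 4-wide kernel. */
void product_blocks(const struct d4 *A, const struct d4 *x,
                    struct d4 *__restrict__ y, size_t nblocks)
{
    for (size_t b = 0; b < nblocks; b++) {
        for (int i = 0; i < 4; i++) {
            y[b].d[i] += A[b].d[i] * x[b].d[i];
        }
    }
}
```

With "struct d4 A[25000]" (and likewise for x and y), a call with nblocks = 25000 covers the 100000 doubles from the original question.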
Putting it all together, if I write this code:
void bigproduct(double *A, double *x, double * y, int     n)
{
    #pragma vector aligned
    #pragma intel simd
    for(int i=0;i<n;i++) {
        y[i]=y[i]+A[i]*x[i];
    }
}

It produces really nice AVX instructions for the inner loop, and it unrolls the loop 4 times, producing this inner loop:

..B4.12:                        # Preds ..B4.12 ..B4.11
        vmovupd   (%rdi,%rax,8), %ymm0                          #32.17
        vmulpd    (%rsi,%rax,8), %ymm0, %ymm1                   #32.17
        vaddpd    (%rdx,%rax,8), %ymm1, %ymm2                   #32.17
        vmovupd   %ymm2, (%rdx,%rax,8)                          #27.6
        vmovupd   32(%rdi,%rax,8), %ymm3                        #32.17
        vmulpd    32(%rsi,%rax,8), %ymm3, %ymm4                 #32.17
        vaddpd    32(%rdx,%rax,8), %ymm4, %ymm5                 #32.17
        vmovupd   %ymm5, 32(%rdx,%rax,8)                        #27.6
        vmovupd   64(%rdi,%rax,8), %ymm6                        #32.17
        vmulpd    64(%rsi,%rax,8), %ymm6, %ymm7                 #32.17
        vaddpd    64(%rdx,%rax,8), %ymm7, %ymm8                 #32.17
        vmovupd   %ymm8, 64(%rdx,%rax,8)                        #27.6
        vmovupd   96(%rdi,%rax,8), %ymm9                        #32.17
        vmulpd    96(%rsi,%rax,8), %ymm9, %ymm10                #32.17
        vaddpd    96(%rdx,%rax,8), %ymm10, %ymm11               #32.17
        vmovupd   %ymm11, 96(%rdx,%rax,8)                       #27.6
        addq      $16, %rax                                     #31.5
        cmpq      %rcx, %rax                                    #31.5
        jb        ..B4.12       # Prob 82%                      #31.5
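
For a compiler that does not understand "#pragma vector aligned", a similar alignment promise can be sketched with the GCC/clang builtin __builtin_assume_aligned (icc also offers its own __assume_aligned(ptr, n)). This builtin is not used anywhere in this thread, the function name bigproduct_assume is hypothetical, and the 32-byte figure is an assumption chosen to match the 256-bit AVX registers:

```c
#include <stddef.h>

/* Sketch: y[i] += A[i] * x[i], promising the compiler 32-byte alignment.
   Behavior is undefined if a pointer is not actually 32-byte aligned. */
void bigproduct_assume(const double *A, const double *x,
                       double *__restrict__ y, size_t n)
{
    /* Each builtin call returns the same pointer, tagged as aligned. */
    const double *a  = __builtin_assume_aligned(A, 32);
    const double *xx = __builtin_assume_aligned(x, 32);
    double *yy = __builtin_assume_aligned(y, 32);

    for (size_t i = 0; i < n; i++) {
        yy[i] += a[i] * xx[i];
    }
}
```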

I hope these ideas help. -Bradley

Tim Prince

The 64-bit ABIs provide for default 16-byte alignment of objects large enough to need it (in contexts where the compiler is free to choose alignment). When icc doesn't give alignments as strict as gcc does, it seems to be a compatibility bug. The compiler already gives 32-byte alignments in some situations.
I've heard that a command-line option will come with the 13.0 compiler that will allow specification of default alignments, at least up to the alignment required by future architectures, including cache-line alignment.
The AVX code where the compiler chooses AVX-128 on account of not knowing alignment is better than the corresponding SSE4 code, but you may have to specify aligned(32) to get the best AVX alignment. Sandy Bridge has some severe performance issues at cache line boundaries with unaligned AVX-256 data. Ivy Bridge is supposed to correct these, but I haven't seen any distinction between them in the compiler.
In my experience, -xHost doesn't produce the fastest code for architectures prior to Sandy Bridge when it translates to -xSSE4.2. That's one of the reasons why some of us don't like -xHost.
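
The aligned(32) attribute covers static and stack objects. For heap data, one way to get the same 32-byte alignment is C11 aligned_alloc, as in the sketch below. It assumes a C11 library (posix_memalign or _mm_malloc are alternatives), and the helper name alloc_doubles_32 is hypothetical:

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: allocate room for n doubles on a 32-byte boundary.
   Returns NULL on failure; release the memory with free(). */
double *alloc_doubles_32(size_t n)
{
    size_t bytes = n * sizeof(double);

    /* C11 aligned_alloc expects the size to be a multiple of the alignment,
       so round up to the next multiple of 32 (at least one 32-byte unit). */
    size_t rounded = (bytes + 31) / 32 * 32;
    if (rounded == 0) {
        rounded = 32;
    }
    return aligned_alloc(32, rounded);
}
```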

BradleyKuszmaul
  1. I agree that the original post exposes an icc compatibility bug. Icc should accept aligned attributes on array arguments.
  2. Are you saying that I should align things on 32-byte boundaries to get the best AVX-256 performance on Sandy Bridge?
  3. Can you recommend a better compiler flag than -xHost for Nehalem? Is -xHost good for Sandy Bridge, or is there a better choice there too?
-Bradley
Tim Prince

2. Yes, if the compiler can be assured of 32-byte alignment, it should give optimum AVX performance. Ivy Bridge should be less critical, but the current compiler tends not to differentiate between them.
3. Depending on the application, I've found either -xSSE4.1 or (less often) -xSSE2 (the default) giving best results on Nehalem and Westmere. -xSSE4.1 uses SSSE3 code in some places where it's beneficial (and SSE4.2 doesn't) on those platforms. According to the published hardware optimization guides, the compiler is doing the right thing with -xHost, but it doesn't always work out. So you may as well use an option which works better on a wider range of architectures.
On Sandy Bridge, -xHost has to be the same as -xAVX, which is the only option for generating AVX code. I haven't tracked all the situations where I've seen AVX code slower than SSE2 in the past; the 12.1 compilers made big improvements there. I still don't have direct access to any Sandy Bridge. Some people have noticed that Ivy Bridge options run OK on Sandy Bridge, but it's likely to be because the code happened to come out identical.

Judith Ward (Intel)

I entered this in our bug tracking database as DPD200283679.

Here is the test case I used:

// currently gets a compilation error with icpc but passes with g++

extern "C" int printf(const char*,...);

void foo(double *__attribute__ ((aligned (16)))A)
{
    if (__alignof(A) == 16)
        printf("PASSED\n");
    else
        printf("FAILED\n");
}

int main() {
    double* A = new double;
    foo(A);
    return 0;
}

Thank you for reporting this.

Judy

jeff_keasler

Issue #672743 has been tested, and allows you to do this:

typedef double * __restrict__ __attribute__((align_value (32))) Real_ptr ;

Creating such a typedef in a header file allows vectorization without cluttering your core code with compiler directives and restrict keywords.
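
As a rough sketch of how such a typedef might be used in practice: the preprocessor guard, the fallback typedef, and the kernel name multiply_accumulate below are all assumptions added for illustration. On compilers that do not know align_value, the fallback keeps only the restrict qualifier and drops the alignment hint.

```c
#include <stddef.h>

/* align_value is an Intel extension (clang accepts it as well); other
   compilers fall back to a plain restrict-qualified pointer. */
#if defined(__INTEL_COMPILER) || defined(__clang__)
typedef double *__restrict__ __attribute__((align_value(32))) Real_ptr;
#else
typedef double *__restrict__ Real_ptr;
#endif

/* Hypothetical kernel: y[i] += a[i] * x[i]. The signature carries the
   alignment and no-alias promises, so the loop body needs no pragmas.
   Callers must pass 32-byte-aligned, non-overlapping arrays. */
void multiply_accumulate(Real_ptr y, Real_ptr a, Real_ptr x, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        y[i] += a[i] * x[i];
    }
}
```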

If there is another standardized way of declaring this alignment attribute (as the rest of this thread implies), then perhaps that would be better syntax for this functionality? At any rate, I'm just happy that it works now. :)

BTW, fixing issue 682457 would extend the scope where optimizations for the new typedef introduced in issue #672743 would apply.

Jennifer J. (Intel)

The issue reported by zhangxiuxia (internal tracker DPD200283679) has been fixed in 13.0, which is available for download from the Intel Registration Center.

Jennifer
