can icc generate permutation instructions without using intrinsics

Is there any way I can get icc to generate the permutation instructions without using intrinsics?
For example, icc -xAVX vectorizes the block in f1, but not the one in f2:

struct d4 {
    double d[4] __attribute__((aligned(32)));
};
void f1 (struct d4 *a, struct d4 *b, struct d4 *__restrict__ c) {
    c[0].d[0]+=a[0].d[0]*b[0].d[0];
    c[0].d[1]+=a[0].d[1]*b[0].d[1];
    c[0].d[2]+=a[0].d[2]*b[0].d[2];
    c[0].d[3]+=a[0].d[3]*b[0].d[3];
}
void f2 (struct d4 *a, struct d4 *b, struct d4 *__restrict__ c) {
    c[0].d[0]+=a[0].d[1]*b[0].d[0];
    c[0].d[1]+=a[0].d[0]*b[0].d[1];
    c[0].d[2]+=a[0].d[3]*b[0].d[2];
    c[0].d[3]+=a[0].d[2]*b[0].d[3];
}

f1 comes out as
        vmovupd   (%rdi), %ymm0                                 #33.26
        vmulpd    (%rsi), %ymm0, %ymm1                          #33.26
        vaddpd    (%rdx), %ymm1, %ymm2                          #33.5
        vmovupd   %ymm2, (%rdx)                                 #33.5

And I'd like f2 to come out as
        vmovupd   (%rdi), %ymm0                                 #33.26
        vpermilpd $0x5, %ymm0, %ymm0
        vmulpd    (%rsi), %ymm0, %ymm1                          #33.26
        vaddpd    (%rdx), %ymm1, %ymm2                          #33.5
        vmovupd   %ymm2, (%rdx)                                 #33.5

but I cannot figure out how to get the compiler to vectorize it. I can do it with intrinsics, as follows, but I'd rather not, since I'm trying to explain this to someone without first having to teach them the intrinsics.
#include <immintrin.h>

void f3 (struct d4 *a, struct d4 *b, struct d4 *__restrict__ c, int n) {
    __m256d *av = (__m256d*)a;
    __m256d *bv = (__m256d*)b;
    __m256d *cv = (__m256d*)c;
    *cv = _mm256_add_pd(*cv,
                        _mm256_mul_pd(_mm256_permute_pd(*av, 5),
                                      *bv));
}


I'd appreciate any help from people who have experience convincing icc to generate these instructions from ordinary C code with no intrinsics.
-Bradley

Georg Zitzlsberger (Intel):

Hello,

unfortunately there's no way to force the compiler to (auto-)vectorize using certain instructions. The only way to do so is to either use intrinsics or write the assembly manually.

The reason our compiler is not using packed AVX instructions (like the permutations from your example) here is that it would be much slower. I've verified that by writing the assembly manually and measuring the execution time with "rdtsc"; there is roughly a 30% difference in runtime.

The reason for this is that the execution units of the CPU are best utilized by the code the compiler generates now (software pipelining). The throughput is higher than for the packed version, and the interleaved load/operate/store sequence allows the computations to start earlier, keeping in mind the super-scalar, out-of-order architecture of the CPU.

Neither compact code nor the use of packed instructions guarantees the best performance per se. Our compiler has many heuristics for finding the best instruction sequences; some of their choices might be the opposite of what one would expect, but they pay off.
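
For reference, a minimal sketch of the kind of rdtsc measurement mentioned above could look like the following (an illustration only, not the exact harness used here; serializing instructions, warm-up, and statistics over repeated runs are omitted, and the header providing __rdtsc() may differ between compilers):

#include <stdio.h>
#include <x86intrin.h>   /* provides __rdtsc(); header name may differ by compiler */

struct d4 { double d[4] __attribute__((aligned(32))); };
void f2 (struct d4 *a, struct d4 *b, struct d4 *__restrict__ c);  /* defined as above */

int main (void) {
    struct d4 a = {{1.0, 2.0, 3.0, 4.0}};
    struct d4 b = {{5.0, 6.0, 7.0, 8.0}};
    struct d4 c = {{0.0, 0.0, 0.0, 0.0}};
    const long iters = 100000000L;
    unsigned long long t0 = __rdtsc();
    for (long i = 0; i < iters; i++)
        f2(&a, &b, &c);
    unsigned long long t1 = __rdtsc();
    /* average cycles per call; only meaningful relative to another
       variant measured the same way */
    printf("cycles per call: %.2f\n", (double)(t1 - t0) / (double)iters);
    return 0;
}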

We appreciate your feedback and kindly ask everyone to let us know about inefficient patterns you might encounter.

Thank you & best regards,

Georg Zitzlsberger

P.S.: This is what we're currently creating for "f2(...)":

f2:
        vmovsd    8(%rdi), %xmm0                                #13.16
        vmovsd    (%rdi), %xmm3                                 #14.16
        vmovsd    24(%rdi), %xmm6                               #15.16
        vmovsd    16(%rdi), %xmm9                               #16.16
        vmulsd    (%rsi), %xmm0, %xmm1                          #13.26
        vmulsd    8(%rsi), %xmm3, %xmm4                         #14.26
        vmulsd    16(%rsi), %xmm6, %xmm7                        #15.26
        vaddsd    (%rdx), %xmm1, %xmm2                          #13.5
        vmulsd    24(%rsi), %xmm9, %xmm10                       #16.26
        vaddsd    8(%rdx), %xmm4, %xmm5                         #14.5
        vaddsd    16(%rdx), %xmm7, %xmm8                        #15.5
        vaddsd    24(%rdx), %xmm10, %xmm11                      #16.5
        vmovsd    %xmm2, (%rdx)                                 #13.5
        vmovsd    %xmm5, 8(%rdx)                                #14.5
        vmovsd    %xmm8, 16(%rdx)                               #15.5
        vmovsd    %xmm11, 24(%rdx)                              #16.5
        ret                                                     #17.1

The software-pipelining pattern is clearly visible:
  • The %xmm[0|3|6|9] registers are loaded independently.
  • The operations (vmulsd) are done on each of the %xmm[0|3|6|9] registers.
  • Because the computations are independent, they can take place in an interleaved way
    (e.g. while %xmm[3|6|9] are still loading, the vmulsd using %xmm0 can already be executed by another execution unit).
Georg Zitzlsberger (Intel):

Hello,

I'd like to add that using Intel Cilk Plus Array Notations here can provide faster code for you:

struct d4 {
    double d[4] __attribute__((aligned(32)));
};
void f1 (struct d4 *a, struct d4 *b, struct d4 *__restrict__ c) {
    c[0].d[0]+=a[0].d[0]*b[0].d[0];
    c[0].d[1]+=a[0].d[1]*b[0].d[1];
    c[0].d[2]+=a[0].d[2]*b[0].d[2];
    c[0].d[3]+=a[0].d[3]*b[0].d[3];
}
void f1_cilk (struct d4 * a, struct d4 * b, struct d4 *__restrict__ c) {
    c[0].d[:]+=a[0].d[:]*b[0].d[:];
}
void f2 (struct d4 * a, struct d4 * b, struct d4 *__restrict__ c) {
    c[0].d[0]+=a[0].d[1]*b[0].d[0];
    c[0].d[1]+=a[0].d[0]*b[0].d[1];
    c[0].d[2]+=a[0].d[3]*b[0].d[2];
    c[0].d[3]+=a[0].d[2]*b[0].d[3];
}
void f2_cilk (struct d4 * a, struct d4 * b, struct d4 *__restrict__ c) {
    unsigned int perm[4] = {1, 0, 3, 2};
    c[0].d[:]+=a[0].d[perm[:]]*b[0].d[:];
}

The *_cilk versions make use of the Array Notations. Both "f1(...)" and "f1_cilk(...)" produce the same assembly. However, the assembly produced for "f2_cilk(...)" is more efficient (~2% on my system; yours might differ) than the one for "f2(...)".

The assembly created for "f2_cilk(...)" is this:

f2_cilk:
        vmovsd    8(%rdi), %xmm0
        vmovsd    24(%rdi), %xmm1
        vmovhpd   (%rdi), %xmm0, %xmm2
        vmovhpd   16(%rdi), %xmm1, %xmm3
        vinsertf128 $1, %xmm3, %ymm2, %ymm4
        vmulpd    (%rsi), %ymm4, %ymm5
        vaddpd    (%rdx), %ymm5, %ymm6
        vmovupd   %ymm6, (%rdx)
        vzeroupper
        ret
It's still not using the permutation operation, though.

In any case, using the Array Notations gives you certainty that the underlying code is vectorized.
A nice side effect is that the implementations are much easier to read now.

Best regards,

Georg Zitzlsberger

Cilk+ array notation apparently implies -ansi-alias, __restrict__, and #pragma vector always, so it gives the compiler extra shots at finding optimizations. A few compiler experts consider it a bug if the compiler can't optimize equivalent plain C for() loop code with the aid of those options. In this case, introducing the perm[] vector should help the compiler with plain C as well as with Cilk+ code (see the sketch below).
Array notation is like a foot in the door toward having the compiler require, and take advantage of, the standard-compliance aspects of -ansi-alias.
The compiler likely would not use the AVX-256 instructions without the __attribute__((aligned(32))) qualifier. The code may be valid regardless of alignment, but could be much slower on Sandy Bridge than AVX-128 code if it were not aligned. The compiler never uses vmovapd, even though it would be valid in the cases where it chooses vmovupd %ymm...
From what I've seen so far, the compiler doesn't choose AVX-256 over AVX-128 on account of Ivy Bridge compile options.
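
A plain-C counterpart of f2_cilk along those lines might look like the sketch below (illustration only; the name f2_perm is made up here, and there is no guarantee a given icc version vectorizes it with a permute):

void f2_perm (struct d4 *a, struct d4 *b, struct d4 *__restrict__ c) {
    /* same swap pattern as f2_cilk, expressed as a plain C loop over an
       index array; the compiler may or may not turn this into vpermilpd */
    static const unsigned int perm[4] = {1, 0, 3, 2};
    int i;
    for (i = 0; i < 4; i++)
        c[0].d[i] += a[0].d[perm[i]] * b[0].d[i];
}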

Are there any cases where the compiler *does* produce the vector permutation instructions? I'm working on more complex code than this, and I'd rather not have to write intrinsics for the permutations.

Georg Zitzlsberger (Intel):

Hello,

there are, e.g.:

void perm(double * __restrict__ dp, double *sp, int n)
{
    int i;
    __assume_aligned(dp, 32);
    __assume_aligned(sp, 32);
    for(i = 0; i < n; i++){
        dp[2 * i]     = sp[2 * i + 1];
        dp[2 * i + 1] = sp[2 * i];
    }
}

...produces this for the loop (-xAVX):

..B2.4:                         # Preds ..B2.4 ..B2.3
        lea       (%rcx,%rcx), %r8d                             #17.21
        addl      $16, %ecx                                     #16.5
        movslq    %r8d, %r8                                     #18.21
        vmovupd   (%rsi,%r8,8), %ymm0                           #12.6
        vmovupd   32(%rsi,%r8,8), %ymm1                         #12.6
        vmovupd   64(%rsi,%r8,8), %ymm4                         #12.6
        vmovupd   96(%rsi,%r8,8), %ymm5                         #12.6
        vperm2f128 $32, %ymm1, %ymm0, %ymm2                     #17.21
        vperm2f128 $49, %ymm1, %ymm0, %ymm3                     #17.21
        vunpcklpd %ymm3, %ymm2, %ymm9                           #17.21
        vunpckhpd %ymm3, %ymm2, %ymm8                           #17.21
        vperm2f128 $32, %ymm5, %ymm4, %ymm6                     #17.21
        vperm2f128 $49, %ymm5, %ymm4, %ymm7                     #17.21
        vmovupd   128(%rsi,%r8,8), %ymm4                        #12.6
        vmovupd   160(%rsi,%r8,8), %ymm5                        #12.6
        vunpcklpd %ymm9, %ymm8, %ymm10                          #18.9
        vunpckhpd %ymm9, %ymm8, %ymm11                          #18.9
        vmovupd   192(%rsi,%r8,8), %ymm8                        #12.6
        vmovupd   224(%rsi,%r8,8), %ymm9                        #12.6
        vunpcklpd %ymm7, %ymm6, %ymm15                          #17.21
        vunpckhpd %ymm7, %ymm6, %ymm14                          #17.21
        vperm2f128 $32, %ymm11, %ymm10, %ymm12                  #18.9
        vperm2f128 $49, %ymm11, %ymm10, %ymm13                  #18.9
        vunpcklpd %ymm15, %ymm14, %ymm0                         #18.9
        vunpckhpd %ymm15, %ymm14, %ymm1                         #18.9
        vperm2f128 $32, %ymm5, %ymm4, %ymm6                     #17.21
        vperm2f128 $49, %ymm5, %ymm4, %ymm7                     #17.21
        vperm2f128 $32, %ymm9, %ymm8, %ymm10                    #17.21
        vperm2f128 $49, %ymm9, %ymm8, %ymm11                    #17.21
        vmovupd   %ymm12, (%rdi,%r8,8)                          #12.6
        vmovupd   %ymm13, 32(%rdi,%r8,8)                        #12.6
        vperm2f128 $32, %ymm1, %ymm0, %ymm2                     #18.9
        vperm2f128 $49, %ymm1, %ymm0, %ymm3                     #18.9
        vunpcklpd %ymm7, %ymm6, %ymm13                          #17.21
        vunpckhpd %ymm7, %ymm6, %ymm12                          #17.21
        vunpcklpd %ymm11, %ymm10, %ymm0                         #17.21
        vunpckhpd %ymm11, %ymm10, %ymm10                        #17.21
        vmovupd   %ymm2, 64(%rdi,%r8,8)                         #12.6
        vmovupd   %ymm3, 96(%rdi,%r8,8)                         #12.6
        vunpcklpd %ymm13, %ymm12, %ymm14                        #18.9
        vunpckhpd %ymm13, %ymm12, %ymm15                        #18.9
        vunpcklpd %ymm0, %ymm10, %ymm1                          #18.9
        vunpckhpd %ymm0, %ymm10, %ymm2                          #18.9
        vperm2f128 $32, %ymm15, %ymm14, %ymm12                  #18.9
        vperm2f128 $49, %ymm15, %ymm14, %ymm13                  #18.9
        vperm2f128 $32, %ymm2, %ymm1, %ymm3                     #18.9
        vperm2f128 $49, %ymm2, %ymm1, %ymm4                     #18.9
        vmovupd   %ymm12, 128(%rdi,%r8,8)                       #12.6
        vmovupd   %ymm13, 160(%rdi,%r8,8)                       #12.6
        vmovupd   %ymm3, 192(%rdi,%r8,8)                        #12.6
        vmovupd   %ymm4, 224(%rdi,%r8,8)                        #12.6
        cmpl      %eax, %ecx                                    #16.5
        jb        ..B2.4        # Prob 82%                      #16.5

I got this example from our compiler engineers. This kind of reverse search for specific patterns that produce certain instructions should not be taken as a guarantee; the compiler's optimization algorithms might change with every (bigger) update.
I'm only providing it as a demonstration that the permute instructions are in fact used by the compiler.
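
If you want to experiment, one might try recasting f2 in the same loop form (a sketch only; the name f2_swapped is made up here, and there is no guarantee the compiler emits a permute for such a short trip count):

void f2_swapped (struct d4 *a, struct d4 *b, struct d4 *__restrict__ c) {
    /* same even/odd swap as f2, written as a loop in the style of perm();
       the compiler may or may not vectorize this with a permute */
    int i;
    for (i = 0; i < 2; i++) {
        c[0].d[2 * i]     += a[0].d[2 * i + 1] * b[0].d[2 * i];
        c[0].d[2 * i + 1] += a[0].d[2 * i]     * b[0].d[2 * i + 1];
    }
}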

Does this answer your question?

Best regards,

Georg Zitzlsberger
