Vectorization: code produces 2 load instructions where 1 would suffice

I have been examining some of the output for loop code that should (and does) vectorize using icc 8.1 (Linux, EM64T version). The source looks like this:

void vadd_double(int vl, int k, U64 *vi, U64 *vj, U64 *vk){
    double *vi_d = (double *) vi;
    double *vj_d = (double *) vj;
    double *vk_d = (double *) vk;
    int e;
    if(k == 0){
        for(e = 0; e < vl; e++)
            *vi_d++ = *vj_d++ + vk_d[0];
    } else if(k == 1){
        for(e = 0; e < vl; e++)
            vi_d[e] = vj_d[e] + vk_d[e];
    }
}

According to -vec_report, both loops vectorize. I would expect that in both the k==0 and the k==1 loops, packed load and store instructions (the MOVAPD instruction) would be generated to move the array data to and from the 128-bit XMM registers. This is true for the first case, but for the loop under k==1, the high and low doubles are loaded separately using MOVSD and MOVHPD instructions. It looks like this:

movsd (%rsi,%rcx), %xmm1
movhpd 8(%rsi,%rcx), %xmm1
movsd (%rsi,%r8), %xmm0
movhpd 8(%rsi,%r8), %xmm0
addpd %xmm0, %xmm1
movapd %xmm1, (%rsi,%rdx)
addq $8, %rax
movsd 16(%rsi,%rcx), %xmm3
movhpd 24(%rsi,%rcx), %xmm3
movsd 16(%rsi,%r8), %xmm2
movhpd 24(%rsi,%r8), %xmm2
addpd %xmm2, %xmm3
movapd %xmm3, 16(%rsi,%rdx)

I modified this by hand to use a single MOVAPD instruction to load each XMM register (the stores already use this instruction). The performance seems to be about the same between the compiler-generated and hand-modified versions, perhaps slightly favoring the compiler-generated code. Is there more latency when MOVAPD is used as a load (which would motivate the compiler's choice of two load instructions), or is this a code-generation error?
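For concreteness, the hand modification described above presumably collapses each MOVSD/MOVHPD pair into one MOVAPD, roughly:

```
movapd (%rsi,%rcx), %xmm1
movapd (%rsi,%r8), %xmm0
addpd  %xmm0, %xmm1
movapd %xmm1, (%rsi,%rdx)
movapd 16(%rsi,%rcx), %xmm3
movapd 16(%rsi,%r8), %xmm2
addpd  %xmm2, %xmm3
movapd %xmm3, 16(%rsi,%rdx)
```

Note this is only legal if (%rsi,%rcx) and (%rsi,%r8) are 16-byte aligned, since MOVAPD faults on unaligned operands; the fact that the stores already use MOVAPD suggests the destination, at least, is aligned.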

Chris Kauffman

Message Edited by on 06-13-2005 03:00 PM
