I am trying to understand how icc decides whether an array access is aligned. Here is a sample of the code:
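    #include <stdlib.h>

    #define LEN 10000000

    int main(){
      /* request 32-byte alignment; LEN*sizeof(double) is a multiple of 32 */
      double* a = aligned_alloc(32,LEN*sizeof(double));
      double* b = aligned_alloc(32,LEN*sizeof(double));
      double* c = aligned_alloc(32,LEN*sizeof(double));
      int k;
      /* fill the inputs */
      for(k = 0; k < LEN; k++){
        a[k] = rand();
        b[k] = rand();
      }
      /* the loop the vectorization report refers to */
      for(k = 0; k < LEN; k++)
        c[k] = a[k] * b[k];
      return 0;
    }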
Compiled with icc -xAVX -O2 vec.c -o vect -qopt-report-phase=vec -qopt-report=5, the vectorization report gives the following:
    LOOP BEGIN at vec.c(27,3)
      remark #15388: vectorization support: reference c[k] has aligned access   [ vec.c(28,5) ]
      remark #15389: vectorization support: reference a[k] has unaligned access   [ vec.c(28,12) ]
      remark #15389: vectorization support: reference b[k] has unaligned access   [ vec.c(28,19) ]
      remark #15381: vectorization support: unaligned access used inside loop body
However, if I use the following instead:
    double* a = _mm_malloc(LEN*sizeof(double),32);
    double* b = _mm_malloc(LEN*sizeof(double),32);
    double* c = _mm_malloc(LEN*sizeof(double),32);
the report changes to:
    LOOP BEGIN at vec.c(27,3)
      remark #15388: vectorization support: reference c[k] has aligned access   [ vec.c(28,5) ]
      remark #15388: vectorization support: reference a[k] has aligned access   [ vec.c(28,12) ]
      remark #15388: vectorization support: reference b[k] has aligned access   [ vec.c(28,19) ]
1- Why does this happen? What is the difference between aligned_alloc and _mm_malloc here, given that both request 32-byte alignment?
2- How do I know that I am using the best possible alignment for my architecture? I am testing this on my desktop (Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz).