Hmm, when you comment out the writes, the compiler should be smart enough to figure out that you are not using the result, and it should eliminate the loop completely. I'm therefore surprised that the loop takes any time at all. Are you sure that your measurements are precise enough?
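One way to check both points is to repeat the kernel many times and make sure the output is really consumed. The sketch below is only an illustration under some assumptions: multiply and time_multiply are hypothetical helper names, a plain scalar loop stands in for the AVX loop, and the empty asm statement is a GCC/Clang extension rather than standard C.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Hypothetical stand-in for the AVX loop: c[i] = a[i] * b[i]. */
static void multiply(const double *a, const double *b, double *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] * b[i];
}

/* Hypothetical helper: runs the kernel `reps` times, returns the average
   time per pass in seconds, and stores the sum of c in *checksum so that
   the results are actually used. */
static double time_multiply(const double *a, const double *b, double *c,
                            size_t n, int reps, double *checksum)
{
    struct timeval t0, t1;
    assert(n > 0 && reps > 0);

    gettimeofday(&t0, NULL);
    for (int r = 0; r < reps; r++) {
        multiply(a, b, c, n);
        /* GCC/Clang extension (assumption: not portable to other compilers).
           The empty asm with a "memory" clobber tells the compiler that
           memory may be read here, so the stores to c must be performed. */
        __asm__ volatile("" : : "r"(c) : "memory");
    }
    gettimeofday(&t1, NULL);

    /* Consume the output after the timed region. */
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += c[i];
    *checksum = sum;

    double total = (t1.tv_sec - t0.tv_sec)
                 + (t1.tv_usec - t0.tv_usec) * 1.0e-6;
    return total / reps;
}
```

Calling something like time_multiply(a, b, c, SIZE, 1000, &sum) and printing sum afterwards spreads the microsecond resolution of gettimeofday over many passes, which should make a difference of a few hundred microseconds easier to trust.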
Storing data is bottleneck?
Hi,
I'm writing some AVX example code like the following:
double a[SIZE] __attribute__((aligned(32)));
double b[SIZE] __attribute__((aligned(32)));
double c[SIZE] __attribute__((aligned(32)));
srand(time(NULL));
for(int i=0; i<SIZE; i++) {
a[i] = rand()/(double)RAND_MAX + rand()/(double)RAND_MAX * pow(10,-8);
b[i] = rand()/(double)RAND_MAX + rand()/(double)RAND_MAX * pow(10,-8);
}
__m256d ymm0, ymm1, ymm2;
gettimeofday(&t0,NULL);
for(int i=0; i<SIZE; i+=4) {
ymm0 = _mm256_load_pd(a+i);
ymm1 = _mm256_load_pd(b+i);
ymm2 = _mm256_mul_pd(ymm0, ymm1);
_mm256_store_pd(c+i, ymm2);
}
gettimeofday(&t1,NULL);
double time1;
time1 = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec)*1.0E-6;
double sum = 0.0; /* must be initialized before accumulating */
for(int i=0; i<SIZE; i++) {
sum += c[i];
}
The measured time1 for this code was 6.750000e-04 sec.
That is slower than the scalar version, which took around 5.0e-04 sec.
Then I found that if I comment out the store (_mm256_store_pd(c+i, ymm2);), the loop gets much faster (time1 becomes 1.9300e-04 sec).
According to these results, I think that storing data from a ymm register to memory is the bottleneck, but is that right?
Is there a good way to store the data without increasing the execution time?
(The actual code was attached.)
OS: Mac OS X 10.8.2
CPU: 2GHz Intel Core i7
Compiler: gcc 4.8
Compiler-options: -mavx (AVX version only)
Thanks.