Is storing data the bottleneck?

Arthur U.

Hi,

I'm writing some AVX example code, shown below:

   double a[SIZE] __attribute__((aligned(32)));
   double b[SIZE] __attribute__((aligned(32)));
   double c[SIZE] __attribute__((aligned(32)));

   srand(time(NULL));

   // Fill the inputs with random values plus a small perturbation.
   for (int i = 0; i < SIZE; i++) {
       a[i] = rand()/(double)RAND_MAX + rand()/(double)RAND_MAX * pow(10,-8);
       b[i] = rand()/(double)RAND_MAX + rand()/(double)RAND_MAX * pow(10,-8);
   }
   __m256d ymm0, ymm1, ymm2;

   gettimeofday(&t0, NULL);
   // Timed region: multiply four doubles per iteration and store the result.
   for (int i = 0; i < SIZE; i += 4) {
       ymm0 = _mm256_load_pd(a+i);
       ymm1 = _mm256_load_pd(b+i);
       ymm2 = _mm256_mul_pd(ymm0, ymm1);
       _mm256_store_pd(c+i, ymm2);
   }
   gettimeofday(&t1, NULL);

   // Elapsed time in seconds.
   double time1;
   time1 = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec)*1.0E-6;

   // Sum the results; sum must start at zero.
   double sum = 0.0;
   for (int i = 0; i < SIZE; i++) {
       sum += c[i];
   }

With this code, time1 was 6.750000e-04 sec.
That is slower than the scalar version, which took around 5.0e-04 sec.

Then I found that if I comment out the store (_mm256_store_pd(c+i, ymm2);), the loop gets much faster: time1 drops to 1.9300e-04 sec.

According to these results, I think that storing data from the ymm registers to memory is the bottleneck. Is that right?
Is there a good way to store the data without increasing the execution time?

(The actual code was attached.)
OS: Mac OS X 10.8.2
CPU: 2GHz Intel Core i7
Compiler: gcc 4.8
Compiler options: -mavx (AVX version only)

Thanks.

Attachments:
simpletest-avx.cpp (1.09 KB)
simpletest.cpp (820 bytes)
Thomas Willhalm (Intel)

Hmm, when you comment out the writes, the compiler should be smart enough to figure out that you are not using the result, and it should eliminate the loop completely. I'm therefore surprised that the loop takes any time at all. Are you sure that your measurements are precise enough?
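To make the point about unused results concrete, here is a minimal sketch of a store-free variant whose result stays live, so the compiler cannot legally discard the multiplications. The helper name multiply_and_sum is hypothetical and not from the attached files. It accumulates the products instead of writing them to c, uses unaligned loads so it accepts any pointer, and falls back to a scalar loop when AVX is not enabled at compile time.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Hypothetical helper (not from the original post): multiplies a[i] * b[i]
// without storing the products, but returns their sum so the work is
// observably used and cannot be removed as dead code.
double multiply_and_sum(const double* a, const double* b, std::size_t n) {
    assert(a != nullptr && b != nullptr);
    double sum = 0.0;
    std::size_t i = 0;
#if defined(__AVX__)
    // Process four doubles per iteration, accumulating in a vector register.
    __m256d acc = _mm256_setzero_pd();
    for (; i + 4 <= n; i += 4) {
        __m256d va = _mm256_loadu_pd(a + i);
        __m256d vb = _mm256_loadu_pd(b + i);
        acc = _mm256_add_pd(acc, _mm256_mul_pd(va, vb));
    }
    // Reduce the four lanes to a single scalar.
    double lanes[4];
    _mm256_storeu_pd(lanes, acc);
    sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];
#endif
    // Scalar tail (or the whole array when AVX is not enabled).
    for (; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

Timing a call such as multiply_and_sum(a, b, SIZE) and then printing the returned value would give a "no store" measurement in which the loop is known to have run. Note that the extra additions mean it is not exactly the original loop minus the store.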

Sergey Kostrov

>>...Are you sure that your measurements are precise enough?

>>...
>>gettimeofday( &t0, NULL );
>>for(int i=0; i<SIZE; i+=4)
>>{
>>...
>>}
>>gettimeofday( &t1, NULL );
>>...

Even though the CRT function gettimeofday is not the fastest compared to RDTSC or gettime, I don't see any problem with how it is used. However, it makes sense to try other functions.
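As one possible alternative timer, the sketch below uses std::chrono::steady_clock from the C++11 standard library and repeats the measurement, keeping the shortest run. The helper name best_time_seconds and the idea of taking the minimum over several repeats are assumptions added for illustration, not something proposed in this thread. A single sub-millisecond measurement is easily disturbed by cache warm-up and scheduling, which repetition helps to expose.

```cpp
#include <cassert>
#include <chrono>
#include <limits>

// Hypothetical helper (not from the original post): runs `kernel` `repeats`
// times and returns the shortest elapsed time in seconds, measured with a
// monotonic clock.
template <typename Kernel>
double best_time_seconds(Kernel kernel, int repeats) {
    assert(repeats > 0);
    double best = std::numeric_limits<double>::infinity();
    for (int r = 0; r < repeats; ++r) {
        auto t0 = std::chrono::steady_clock::now();
        kernel();
        auto t1 = std::chrono::steady_clock::now();
        double seconds = std::chrono::duration<double>(t1 - t0).count();
        if (seconds < best) {
            best = seconds;
        }
    }
    return best;
}
```

The timed loop would be passed in as a lambda, for example best_time_seconds([&] { /* AVX loop over a, b, c */ }, 10), with the contents of c read afterwards so that the stores are used.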
