  • gkli, October 17, 2008 10:56 AM PDT
    Stream benchmark performance

    I ran McCalpin's STREAM benchmark on a 3.4 GHz Xeon and got 4759 MB/s for Triad. The original code was inlined and operated directly on the global arrays. I added a similar function that takes the arrays as arguments instead of using the globals directly. The bandwidth for the new Triad function was only 3400 MB/s. Why did I lose so much bandwidth? The opt-report indicated that both functions were inlined.

    #define N  2000000

    static double *a;
    static double *b;
    static double *c;

    int main()
    {
    ...

        times[3][k] = mysecond();
        tuned_STREAM_Triad(a,b,c,scalar);  // original function, inlined  4759 MB/s
        times[3][k] = mysecond() - times[3][k];

        times[4][k] = mysecond();
        tuned_STREAM_Triad_Arg(a,b,c,scalar);  // new function, inlined  3400 MB/s
        times[4][k] = mysecond() - times[4][k];

        return 0;
    }

    void tuned_STREAM_Triad(double* aa,double* bb,double* cc,double scalar)
    {
        int j;
    #pragma omp parallel for
        for (j=0; j<N; j++)
            a[j] = b[j]+scalar*c[j];
    }

    void tuned_STREAM_Triad_Arg(double* restrict aa,double* bb,double* cc,double scalar)
    {
        int j;
    #pragma omp parallel for
    #pragma ivdep
        for (j=0; j<N; j++)
            aa[j] = bb[j]+scalar*cc[j];
    }

    Compiled with

    icc -openmp -restrict



    TimP (Intel), October 17, 2008 11:14 AM PDT
    Re: Stream benchmark performance

    Does -opt-report shed any light on the differences?

    I don't know whether it makes a difference, but when I use restrict, I apply it to all the function parameter pointers.

    These functions are simple enough that you could compare the generated code you get with -S.
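
    As a rough illustration of the suggestion to qualify every pointer parameter, a variant of the second function might look like the sketch below. The name triad_restrict_all and the explicit length parameter n are assumptions made for this example, and the OpenMP and ivdep pragmas are left out so that it builds with any C99 compiler.

```c
#include <stddef.h>

/* Hypothetical variant of tuned_STREAM_Triad_Arg: every pointer parameter is
   restrict-qualified, which promises the compiler that the three arrays do
   not overlap. The name and the length parameter n are assumptions. */
void triad_restrict_all(double *restrict aa, const double *restrict bb,
                        const double *restrict cc, double scalar, size_t n)
{
    for (size_t j = 0; j < n; j++)
        aa[j] = bb[j] + scalar * cc[j];
}
```

    Whether this changes the generated code is something only the opt-report or the assembly listing can confirm.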



    jimdempseyatthecove, October 17, 2008 12:01 PM PDT
    Re: Stream benchmark performance

    A couple of comments:

    First, can we assume the proper amount of memory was allocated for a, b, c? (N doubles each)

    Second, you might check the disassembly to see whether register pressure causes the pointers for one or more of aa, bb, cc to be refetched from memory rather than kept in registers.

    Jim Dempsey



    Blog: The Parallel Void
    www.quickthreadprogramming.com

    gkli, October 20, 2008 3:24 PM PDT
    Re: Stream benchmark performance

    Yes, each array was allocated with

    a = (double *) malloc (N*sizeof(double)); 

    likewise for b and c.

    I ran the benchmark for N=2000000.  Would it be reasonable to assume that there is sufficient work in the loop that whether the pointer is cached would not significantly affect performance?
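
    To put the two figures in perspective: STREAM counts three arrays of 8-byte doubles per Triad pass (two read, one written), so N=2000000 moves 48,000,000 bytes per call. The helper names below are hypothetical. The sketch only converts a reported MB/s figure into time per call, assuming STREAM's convention of 1 MB = 10^6 bytes and an 8-byte double.

```c
#include <stddef.h>

/* Bytes STREAM attributes to one Triad pass: read b and c, write a. */
double triad_bytes(size_t n)
{
    return 3.0 * (double)sizeof(double) * (double)n;
}

/* Seconds per Triad call implied by a bandwidth in MB/s (1 MB = 1e6 bytes). */
double triad_seconds(size_t n, double mb_per_s)
{
    return triad_bytes(n) / (mb_per_s * 1.0e6);
}
```

    With the numbers from this thread, triad_seconds(2000000, 4759.0) comes to roughly 10 ms and triad_seconds(2000000, 3400.0) to roughly 14 ms, a gap of about 4 ms per call.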

     



    jimdempseyatthecove, October 21, 2008 7:08 AM PDT
    Re: Stream benchmark performance

    What happens when you remove the #pragma ivdep from the 2nd loop?
    What happens when you add #pragma vector always to the 2nd loop?

    In the static memory array loop the addresses of the arrays require no registers, as they can be expressed using the base part of the addressing mode.

    In the dynamic array loop the addresses of the arrays are on the stack, so an instruction is required to copy each array pointer into a register. If sufficient registers are available, the compiler can move the base address fetch outside the loop. Fully registerizing the bases of the three arrays in the second loop requires three more registers than the first loop requires.
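
    One way to encourage the compiler to keep the base addresses in registers is to copy the pointers into local variables once, before the loop. This is only a sketch: g_a, g_b, g_c, triad_set_arrays and triad_hoisted are hypothetical stand-ins for the code in this thread, and whether it changes anything depends on the compiler and should be checked in the assembly.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the thread's global pointers a, b, c. */
static double *g_a, *g_b, *g_c;

/* Point the globals at caller-provided storage (helper for this sketch). */
void triad_set_arrays(double *a, double *b, double *c)
{
    g_a = a;
    g_b = b;
    g_c = c;
}

/* Load each global pointer once into a local; the loop then indexes through
   the locals, so the pointers need not be reloaded from memory per iteration. */
void triad_hoisted(double scalar, size_t n)
{
    double *restrict pa = g_a;
    const double *restrict pb = g_b;
    const double *restrict pc = g_c;

    for (size_t j = 0; j < n; j++)
        pa[j] = pb[j] + scalar * pc[j];
}
```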

    An additional matter that could influence performance is the alignment of the arrays. If the static arrays are aligned on a 16-byte boundary and the dynamic ones are not, the second loop (when vectorized) might be performing split loads/stores.
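
    A quick way to test the alignment theory is to print or assert the alignment of each pointer, and to allocate on a 16-byte boundary explicitly. The sketch below is an assumption-laden example rather than the code from this thread: is_aligned and alloc_doubles_16 are hypothetical helper names, and aligned_alloc requires a C11 library. Plain malloc only guarantees alignment suitable for any standard type, which may be less than 16 bytes on some platforms.

```c
#include <stdint.h>
#include <stdlib.h>

/* Returns 1 if p lies on an alignment-byte boundary (alignment must be nonzero). */
int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p % alignment) == 0;
}

/* Allocate n doubles on a 16-byte boundary using C11 aligned_alloc.
   The size is rounded up to a multiple of 16, as aligned_alloc expects.
   Release the result with free(). */
double *alloc_doubles_16(size_t n)
{
    size_t bytes = n * sizeof(double);
    bytes = (bytes + 15u) & ~(size_t)15u;
    return aligned_alloc(16, bytes);
}
```

    Calling is_aligned(a, 16) on the malloc'ed pointers would show whether the dynamic arrays start on a 16-byte boundary.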

    Jim Dempsey



