STREAM in malloc-ed arrays

Hello,

I am working on the optimization of a bandwidth-bound application for the MIC architecture and would like to understand the optimization for bandwidth.

1) I was able to reproduce Intel's STREAM benchmark results (http://software.intel.com/en-us/articles/optimizing-memory-bandwidth-on-...). I compiled the original STREAM benchmark for the coprocessor with the arguments "-openmp -O3 -mmic -DSTREAM_ARRAY_SIZE=64000000 -opt-prefetch-distance=64,8 -opt-streaming-cache-evict=0 -ffreestanding". This is what I obtained on my system with two 8-core Xeon E5-2670 CPUs and a 5110P Xeon Phi coprocessor, ECC on:

* Host, triad, 16 threads, KMP_AFFINITY=scatter: 67 GB/s

* Coprocessor, triad, 60 threads, KMP_AFFINITY=scatter: 163 GB/s

2) This is all nice and dandy. However, I cannot translate this method to my application. My application uses a large amount of data, which it allocates on the heap using malloc(). In the STREAM benchmark, by contrast, the arrays are declared as global variables in the following way:

static STREAM_TYPE a[STREAM_ARRAY_SIZE+OFFSET], 
  b[STREAM_ARRAY_SIZE+OFFSET], 
  c[STREAM_ARRAY_SIZE+OFFSET];

So the next thing I did was to modify the STREAM benchmark code so that the arrays are pointer-based:

STREAM_TYPE *a, *b, *c;
// ...
int main() {
  a = (STREAM_TYPE*)malloc(sizeof(STREAM_TYPE)*(STREAM_ARRAY_SIZE+OFFSET));
  b = (STREAM_TYPE*)malloc(sizeof(STREAM_TYPE)*(STREAM_ARRAY_SIZE+OFFSET));
  c = (STREAM_TYPE*)malloc(sizeof(STREAM_TYPE)*(STREAM_ARRAY_SIZE+OFFSET));
  // ... then doing the STREAM benchmark as usual...
  free(a);
  free(b);
  free(c);
}

The results of STREAM with arrays allocated in this way are essentially the same on the CPU, but very different on the coprocessor:

* Host, triad, 16 threads, KMP_AFFINITY=scatter: 66 GB/s

* Coprocessor, triad, 60 threads, KMP_AFFINITY=scatter: 56 GB/s

* Coprocessor, triad, 120 threads, KMP_AFFINITY=scatter: 93 GB/s (best case)

That is, the bandwidth on the coprocessor dropped from 163 GB/s to at most 93 GB/s when the arrays are allocated using malloc().

3) I tried to recover the performance by dropping the tuning flags "-opt-prefetch-distance=64,8 -opt-streaming-cache-evict=0 -ffreestanding". I compiled STREAM with "-openmp -O3 -mmic -DSTREAM_ARRAY_SIZE=64000000". Results:

* Host, triad, 16 threads, KMP_AFFINITY=scatter: 66 GB/s

* Coprocessor, triad, 60 threads, KMP_AFFINITY=scatter: 118 GB/s

* Coprocessor, triad, 120 threads, KMP_AFFINITY=scatter: 133 GB/s

* Coprocessor, triad, 240 threads, KMP_AFFINITY=scatter: 136 GB/s (best case)

The bottom line is: I need to malloc() my arrays. With malloc()-ed arrays, I have to drop the compiler optimization flags, and I get 136 GB/s, which is 1.2x slower than with the standard STREAM with optimization flags. Could somebody please explain what is happening here, and whether I should be able to get the optimal 163 GB/s with malloc-ed arrays?

Many thanks!

Andrey


I too am working on the optimization of bandwidth-bound applications for the MIC architecture, and likewise I am playing with STREAM to figure out what is happening in application kernels.

I am not seeing this particular issue.  That is, when I change STREAM from static allocation to either malloc() or aligned malloc, _mm_malloc(n,64), bandwidth remains the same.

icc -O3 -mmic -opt-prefetch-distance=64,8 -opt-streaming-cache-evict=0 -opt-streaming-stores always -DOFFSET=27 -DSTREAM_ARRAY_SIZE=64000000 -openmp stream.c

* Coprocessor, triad, 60 threads, KMP_AFFINITY=scatter: 168 GB/s

* Coprocessor, triad, 120 threads, KMP_AFFINITY=scatter: 159 GB/s

* Coprocessor, triad, 240 threads, KMP_AFFINITY=scatter: 141 GB/s

 

Hi Gregg,

Thank you for your response! After digging around some more, I have found how to fix my problem. Initially, I declared the arrays like this (in the global scope):

STREAM_TYPE *a, *b, *c;

The fix was to declare the arrays like this (also in the global scope):

static STREAM_TYPE *a, *b, *c;

The addition of the keyword "static" brings back the 165 GB/s bandwidth. It does not matter in this case whether I use malloc() or _mm_malloc(n, 64), and whether I use my original optimization flags or the flags that you are using. 

So, I am a happy camper now because I get good bandwidth. However, I do not understand why omitting "static" on the global pointers has such a tremendous effect on performance.

Andrey

When I ran across this problem I originally assumed it was due to array alignment issues, but I was wrong.

This was a case where the compiler's optimization report was very helpful -- the "-vec-report6" option told me that the problem was that the STREAM loops could not be vectorized because of potential aliasing between *a, *b, and *c.  

This can be fixed via the "-fno-alias" compiler option, but that is dangerous.  It works for STREAM because the C version is a fairly direct port of the Fortran version, where aliasing is not allowed, so the code never relies on it.  It is probably not a good idea for more general code.

This can also be fixed by adding IVDEP directives to each loop, but I find that approach ugly and tedious.
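For readers who have not used it, here is a minimal sketch of the IVDEP approach applied to a Triad-style loop. The function name `triad_ivdep` is made up for illustration and is not part of stream.c. `#pragma ivdep` is the Intel compiler spelling; the `#pragma GCC ivdep` branch is an assumption for anyone building with GCC instead. The pragma is a promise made by the programmer, so the loop is only correct if the arrays really do not overlap.

```c
#include <stddef.h>

/* Hypothetical helper, not taken from stream.c: Triad with a per-loop
 * hint telling the vectorizer to ignore assumed (unproven) dependencies. */
void triad_ivdep(double *a, const double *b, const double *c,
                 double scalar, size_t n)
{
    size_t j;
#if defined(__INTEL_COMPILER)
#pragma ivdep
#elif defined(__GNUC__) && !defined(__clang__)
#pragma GCC ivdep
#endif
    for (j = 0; j < n; j++)
        a[j] = b[j] + scalar * c[j];
}
```

Each of the four STREAM kernels would need the same annotation, which is the tedium mentioned above.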

I am pleased to see that the use of the "static" keyword is enough to convince the compiler that the pointers don't alias, though I can't find any place in the standard where this is guaranteed.

John D. McCalpin, PhD
"Dr. Bandwidth"

Interestingly, for the host version of the benchmark, the compiler is not so suspicious, and the loops get vectorized with or without the "static" keyword.

I looked into this some more and learned why the "static" keyword helps....

In the context of global variables, the "static" keyword tells the compiler that no routines external to the current file can modify those variables.  Without "static", any routine called by STREAM (such as "printf") could theoretically modify the a, b, c pointers and cause them to be aliased.
With the "static" keyword, the compiler knows that all references are in the current file, so it can prove that the pointers are never modified, ensuring that array references using the pointers will not alias.
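To see why the compiler has to be careful when it cannot rule out aliasing, here is a small sketch (not from stream.c; all names are illustrative). It runs a Triad-style update once element by element and once in a way that mimics a packed vector loop, which loads a whole block of inputs before storing any results. When the output pointer overlaps an input pointer shifted by one element, the two versions produce different arrays, so the compiler may not use the packed form unless it can prove the pointers are distinct.

```c
#include <stddef.h>
#include <string.h>

#define VLEN 8  /* doubles per 512-bit vector on the coprocessor */

/* Element-at-a-time Triad: follows the source order exactly, even if
 * a overlaps b or c. */
static void triad_scalar(double *a, const double *b, const double *c,
                         double s, size_t n)
{
    for (size_t j = 0; j < n; j++)
        a[j] = b[j] + s * c[j];
}

/* Mimics a packed vector loop: compute VLEN results from the inputs
 * first, then store them all. n must be a multiple of VLEN. */
static void triad_packed(double *a, const double *b, const double *c,
                         double s, size_t n)
{
    for (size_t j = 0; j < n; j += VLEN) {
        double tmp[VLEN];
        for (size_t k = 0; k < VLEN; k++)
            tmp[k] = b[j + k] + s * c[j + k];
        for (size_t k = 0; k < VLEN; k++)
            a[j + k] = tmp[k];
    }
}

/* Returns 1 if the two loops disagree when a is b shifted by one element. */
int aliasing_changes_result(void)
{
    double x[VLEN + 1], y[VLEN + 1], c[VLEN];
    for (size_t i = 0; i < VLEN; i++)
        c[i] = 1.0;
    for (size_t i = 0; i <= VLEN; i++)
        x[i] = y[i] = 1.0;

    triad_scalar(x + 1, x, c, 1.0, VLEN);  /* each store feeds the next load */
    triad_packed(y + 1, y, c, 1.0, VLEN);  /* all loads happen before stores */
    return memcmp(x, y, sizeof x) != 0;
}
```

In the overlapping case the scalar loop builds a running sum, while the packed loop writes the same value everywhere, which is why a compiler that cannot exclude this situation falls back to one-element-at-a-time code.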

I made four versions to test this:

  1. Original version using statically allocated (global) arrays:   double a[STREAM_ARRAY_SIZE], etc
  2. Dynamically allocated global arrays: double *a, *b, *c (global), later allocated with malloc().
  3. Dynamically allocated static global arrays: static double *a, *b, *c (global), later allocated with malloc().
  4. Dynamically allocated *private* arrays -- double *a, *b, *c (inside main()), allocated with malloc().
    This last case required changing checkSTREAMresults() to use passed arguments rather than globals.

Results:

  • Cases 1 and 4 gave excellent results -- Triad values of over 170 GB/s using 60 threads (KMP_AFFINITY=scatter).
  • Case 2 gave terrible results -- Triad values of ~60 GB/s.  Inspection of the assembly code showed that the vector arithmetic instructions were using masks to operate on one element at a time.  This is necessary to get correct results in the case where the pointers are actually aliased.  It is roughly equivalent to loading each data item from the L1 cache 8 times instead of once.
  • Case 3 gave intermediate performance -- Triad values of ~132 GB/s.  Inspection of the assembly code showed that packed vector arithmetic was being used, so the aliasing problem was fixed.  Unfortunately the compiler declined to generate streaming stores in this case, so the performance was reduced due to the extra read (allocate) traffic. 
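The size of the Case 3 penalty can be estimated with a little arithmetic. STREAM counts three transfers per Triad iteration (read b, read c, write a), but without streaming stores the line of a is read into cache before it is written, so the hardware moves four. The sketch below is a rough estimate under the assumption that the memory system sustains the same total traffic in both cases; the function name is illustrative.

```c
/* Bandwidth STREAM would report for Triad without streaming stores,
 * given the rate it reports with streaming stores. Assumes the total
 * sustained memory traffic is the same in both cases. */
double triad_rate_with_write_allocate(double streaming_rate_gbs)
{
    const double counted_transfers = 3.0; /* read b, read c, write a */
    const double actual_transfers  = 4.0; /* plus the allocate read of a */
    return streaming_rate_gbs * counted_transfers / actual_transfers;
}
```

For a streaming-store rate of 170 GB/s this gives 127.5 GB/s, which is in the same range as the ~132 GB/s observed for Case 3.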

I don't know why Andrey Vladimirov got full speed with code that was essentially the same as my version 3.  Probably a difference in compilers?   My compiler says it is:
          icc (ICC) 13.1.0 20130121

John D. McCalpin, PhD
"Dr. Bandwidth"

With the latest compiler, I found that all four versions get more than 170 GB/s.

Intel(R) C Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 13.1.2.183 Build 20130514

Surely the right answer here is to use the "restrict" type qualifier of C99 to explain to the compiler that the targets of the pointers do not alias.
http://en.wikipedia.org/wiki/Restrict 

So write the declarations as 

STREAM_TYPE * restrict a;
STREAM_TYPE * restrict b;
STREAM_TYPE * restrict c;
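The same qualifier can also be applied to function parameters, which may be the more common place to meet it in application code. The following is a minimal sketch rather than code from stream.c; the function name `triad_restrict` is made up for illustration. Note that restrict is a promise, not a check: if the arrays do overlap, the behavior is undefined.

```c
#include <stddef.h>

/* Hypothetical kernel, not taken from stream.c. The restrict qualifiers
 * promise that a, b and c refer to non-overlapping storage for the
 * duration of the call, so the loop can be vectorized without alias checks. */
void triad_restrict(double * restrict a,
                    const double * restrict b,
                    const double * restrict c,
                    double scalar, size_t n)
{
    for (size_t j = 0; j < n; j++)
        a[j] = b[j] + scalar * c[j];
}
```

A caller would pass three separately allocated arrays, for example three buffers returned by malloc().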

Quote:

John D. McCalpin wrote:

I don't know why Andrey Vladimirov got full speed with code that was essentially the same as my version 3.  Probably a difference in compilers?   My compiler says it is:

          icc (ICC) 13.1.0 20130121

I checked it again. I do get full speed for "triad" with version 3 when I do not use the compiler argument "-ffreestanding". If I do use "-ffreestanding", then I get the same intermediate bandwidth as you (~120 GB/s). However, without "-ffreestanding", the result of "triad" is good, but "copy" is slightly degraded; I did not spot it yesterday.

So here is version 3 without "-ffreestanding":

static STREAM_TYPE *a;
static STREAM_TYPE *b;
static STREAM_TYPE *c;

icpc -O3 -openmp -mmic -opt-prefetch-distance=64,8 -opt-streaming-cache-evict=0  -DSTREAM_ARRAY_SIZE=64000000 -o stream-static-MIC stream-malloc-static.c

Function    Best Rate MB/s  Avg time   Min time   Max time
Copy:       138489.3        0.007482   0.007394   0.007626
Scale:      152748.0        0.006794   0.006704   0.006891
Add:        163058.7        0.009538   0.009420   0.009682
Triad:      163335.7        0.009470   0.009404   0.009538
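As a cross-check on how to read this table, the "Best Rate" column follows from the "Min time" column: STREAM divides the bytes it counts for each kernel by the minimum time. The sketch below reproduces that calculation, assuming STREAM_TYPE is double (8 bytes) and STREAM_ARRAY_SIZE is 64000000 as in the command line above; the function name is illustrative.

```c
#include <stddef.h>

/* MB/s as STREAM reports it: 1e-6 * bytes counted / minimum time.
 * arrays_touched is 2 for Copy and Scale, 3 for Add and Triad. */
double stream_best_rate_mbs(int arrays_touched, size_t elem_bytes,
                            size_t array_size, double min_time_s)
{
    double bytes = (double)arrays_touched * (double)elem_bytes
                 * (double)array_size;
    return 1.0e-6 * bytes / min_time_s;
}
```

Calling `stream_best_rate_mbs(3, sizeof(double), 64000000, 0.009404)` gives a value that matches the Triad row to within the rounding of the printed minimum time.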

At the same time, I confirmed that the "restrict" keyword gives vectorization and full bandwidth in all tests with "-ffreestanding". Here is "version 5"  (it requires a compiler argument -restrict):

STREAM_TYPE * restrict a;
STREAM_TYPE * restrict b;
STREAM_TYPE * restrict c;

icpc -O3 -openmp -DSTREAM_ARRAY_SIZE=64000000 -mmic -opt-prefetch-distance=64,8 -opt-streaming-cache-evict=0 -ffreestanding -restrict -o stream-restrict-MIC stream-restrict.c

Function    Best Rate MB/s  Avg time   Min time   Max time
Copy:       159149.5        0.006626   0.006434   0.006744
Scale:      153523.3        0.006762   0.006670   0.006837
Add:        163788.3        0.009455   0.009378   0.009580
Triad:      164226.7        0.009408   0.009353   0.009460

I get the same results with the latest compiler "Intel(R) C Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 13.1.3.192 Build 20130607"
