Dissecting the STREAM benchmark with Intel® Performance Counter Monitor

Intel® Performance Counter Monitor (Intel® PCM) is an API and a set of tools that help developers understand how their applications utilize the underlying compute platform. In this blog post I will explain how to instrument the well-known STREAM benchmark with Intel® PCM library functions that read statistics directly from the integrated memory controllers available on the latest Intel® Xeon® 5500, 5600, 7500 and Core™ processor series. With this instrumentation we will be able to measure the actual memory traffic while the “Triad” part of the benchmark is running.

STREAM is a synthetic benchmark that mostly exercises the memory subsystem. Its source code is freely available from http://www.cs.virginia.edu/stream/. For this exercise I took the OpenMP version. The most popular STREAM sub-benchmark, Triad, computes A[i] = B[i] + q*C[i], where A, B and C are vectors and q is a constant. To estimate the amount of data read from and written to memory during the Triad, I added these few lines to the original code:
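
Before looking at the instrumentation, it may help to see the kernel and the byte count behind a Triad bandwidth score. The following is a minimal C++ sketch, assuming double-precision arrays and an accounting that charges one pass over each of the three arrays (two read, one written). The function names `triad`, `nominal_triad_bytes` and `nominal_triad_mbps` are mine for illustration and are not taken from the STREAM source.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Triad kernel as described in the text: a[i] = b[i] + q * c[i].
// (Illustrative helper, not the STREAM source itself.)
void triad(std::vector<double>& a, const std::vector<double>& b,
           const std::vector<double>& c, double q) {
    assert(a.size() == b.size() && a.size() == c.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        a[i] = b[i] + q * c[i];
    }
}

// Bytes charged to one Triad pass under the assumed accounting:
// two arrays read plus one array written, n doubles each.
double nominal_triad_bytes(std::size_t n) {
    return 3.0 * sizeof(double) * static_cast<double>(n);
}

// Bandwidth in MB/s (10^6 bytes per second) for one pass taking `seconds`.
double nominal_triad_mbps(std::size_t n, double seconds) {
    assert(seconds > 0.0);
    return 1.0e-6 * nominal_triad_bytes(n) / seconds;
}
```

With N = 60000000 elements of 8 bytes each, one pass is charged 1.44 GB under this accounting, so a pass that takes one second would be reported as 1440 MB/s.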

diff -u stream_omp_original.c stream_omp.c
--- stream_omp_original.c 2010-11-04 14:10:00.385521739 +0100
+++ stream_omp.c 2010-11-04 14:13:53.255521728 +0100
@@ -45,6 +45,8 @@
# include <float.h>
# include <limits.h>
# include <sys/time.h>
+# include <omp.h>
+# include <cpucounters.h>

@@ -54,8 +56,8 @@
* that should be good to about 5% precision.
-# define N 2000000
-# define NTIMES 10
+# define N 60000000 // make sure the arrays do not fit into caches
+# define NTIMES 100 // increase the number of repetitions to increase measurement accuracy

# define OFFSET 0

@@ -179,6 +181,9 @@
printf("For best results, please be sure you know the\n");
printf("precision of your system timer.\n");
+ Monitor::getInstance()->program();
+ long long unsigned readBytes = 0, writtenBytes = 0;

/* --- MAIN LOOP --- repeat test cases NTIMES times --- */

@@ -216,6 +221,7 @@
times[2][k] = mysecond() - times[2][k];

times[3][k] = mysecond();
+ SystemCounterState begin = getSystemCounterState();
#ifdef TUNED
@@ -224,6 +230,12 @@
a[j] = b[j]+scalar*c[j];
times[3][k] = mysecond() - times[3][k];
+ SystemCounterState after = getSystemCounterState();
+ if(k>0) { // skip the first iteration
+ readBytes += getBytesReadFromMC(begin,after);
+ writtenBytes += getBytesWrittenToMC(begin,after);
+ }


/* --- SUMMARY --- */
@@ -238,6 +250,8 @@
+ printf("TRIAD read bw: %11.4f write bw: %11.4f\n",1.0E-06*readBytes/avgtime[3],1.0E-06*writtenBytes/avgtime[3]);
+ Monitor::getInstance()->cleanup();

printf("Function Rate (MB/s) Avg time Min time Max time\n");
for (j=0; j<4; j++) {
avgtime[j] = avgtime[j]/(double)(NTIMES-1);
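
One detail of the added printf is easy to miss: it runs before avgtime[3] is divided by NTIMES-1, so at that point avgtime[3] appears to hold the summed Triad time of the iterations after the first, which matches the iterations whose bytes were accumulated. The sketch below illustrates that arithmetic in isolation, under that assumption. `Sample` and `average_mbps_skip_first` are hypothetical names and are not part of STREAM or Intel® PCM.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One measured iteration: elapsed time and bytes seen at the memory controller.
// (Hypothetical type for illustration only.)
struct Sample {
    double seconds;
    unsigned long long bytes;
};

// Average bandwidth in MB/s over all iterations except the first (warm-up) one:
// total bytes divided by total time, both summed over the same iterations.
double average_mbps_skip_first(const std::vector<Sample>& samples) {
    assert(samples.size() >= 2);
    double total_seconds = 0.0;
    unsigned long long total_bytes = 0;
    for (std::size_t k = 1; k < samples.size(); ++k) {  // k = 0 is skipped
        total_seconds += samples[k].seconds;
        total_bytes += samples[k].bytes;
    }
    return 1.0e-6 * static_cast<double>(total_bytes) / total_seconds;
}
```

Dividing the byte total by the time total, rather than averaging per-iteration rates, weights every iteration by how long it actually ran.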

Since Intel® PCM is written in C++, I compiled the binary with a C++ compiler instead of a C compiler and linked in the object files from Intel® PCM. Here are the additional compilation options I used on Linux:

-I<path_to_Intel_PCM> <path_to_Intel_PCM>/cpucounters.o <path_to_Intel_PCM>/msr.o <path_to_Intel_PCM>/pci.o -lpthread

For the experiment I used a system with two Intel® Xeon® 5600 series processors running Linux SLES10SP3. I ran the benchmark and, conveniently, it reported not only the Triad benchmark score but also the actual memory traffic measured at the memory controllers during the test. When I first looked at the results, I quickly realized that the Triad bandwidth score was about 25% lower than the measured memory controller bandwidth. Why is that? That is a huge difference…
To shed more light on this, I looked at how the measured reads and writes contribute to the total memory controller bandwidth, and the situation became a bit clearer…

Measured memory traffic split for STREAM Triad running on Intel® Xeon® 5600 series processor.

Is this distribution expected? Should it not be 67% reads and 33% writes? According to the “A[i] = B[i] + q*C[i]” Triad algorithm the application performs two array reads per write… But not according to the hardware! Because of the cache coherency protocol, the processor first needs to read the cache line containing the old data before (over-)writing it. This is done to make sure that no one else has a copy of this cache line in its local caches and that the processor writing it has ownership of the line. For more details see this white paper. Knowing this, I could understand why Intel® PCM reports three physical reads per write (a 75%/25% ratio): the Triad score does not count the extra read of A, which accounts for 25% of the total traffic. This explained the 25% lower score.
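
The percentages above follow from simple counting, which the short C++ sketch below reproduces. It is a simplified model that treats each array as one stream of equal size and assumes the Triad score charges exactly three streams. `TriadTraffic`, `triad_traffic`, `read_share` and `counted_fraction` are illustrative names of my own. The flag also lets the same model cover a store that does not read the line first.

```cpp
#include <cassert>

// Memory traffic of one Triad pass, in units of "one full array".
// (Hypothetical model for illustration only.)
struct TriadTraffic {
    double read_streams;   // arrays read from memory
    double write_streams;  // arrays written to memory
};

// B and C are always read. A is additionally read when the store has to
// fetch the old cache line first (read for ownership).
TriadTraffic triad_traffic(bool store_reads_line_first) {
    return TriadTraffic{store_reads_line_first ? 3.0 : 2.0, 1.0};
}

// Share of reads in the total memory controller traffic.
double read_share(const TriadTraffic& t) {
    const double total = t.read_streams + t.write_streams;
    assert(total > 0.0);
    return t.read_streams / total;
}

// Fraction of the real traffic captured by an accounting that charges
// three streams per pass (two reads plus one write).
double counted_fraction(const TriadTraffic& t) {
    const double total = t.read_streams + t.write_streams;
    assert(total > 0.0);
    return 3.0 / total;
}
```

With `triad_traffic(true)` the model gives a read share of 0.75 and a counted fraction of 0.75, that is, a score 25% below the measured traffic. With `triad_traffic(false)` the read share drops to two thirds and the counted fraction becomes 1.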

Experienced programmers know that processors also support so-called “non-temporal” stores, which bypass the caches. These instructions may improve performance by alleviating high cache pressure and memory traffic, but the drawback of this approach is that cache coherency must then be guaranteed by the application and/or compiler.
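
For readers who have not used such stores, here is a hand-written Triad sketch in C++ using the SSE2 intrinsic `_mm_stream_pd`. It is not the code the compiler generates and not part of STREAM. The function name `triad_nt` is hypothetical, the destination array is assumed to be 16-byte aligned, and on targets without SSE2 the sketch falls back to ordinary stores. Because non-temporal stores are weakly ordered, the sketch issues `_mm_sfence()` before returning.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE2 intrinsics (_mm_stream_pd, _mm_loadu_pd, ...)
#define TRIAD_HAVE_SSE2 1
#endif

// Triad a[i] = b[i] + q * c[i] with non-temporal stores to a.
// Assumption: a is 16-byte aligned; b and c may be unaligned.
void triad_nt(double* a, const double* b, const double* c,
              double q, std::size_t n) {
#ifdef TRIAD_HAVE_SSE2
    assert(reinterpret_cast<std::uintptr_t>(a) % 16 == 0);
    const __m128d vq = _mm_set1_pd(q);
    std::size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        const __m128d vb = _mm_loadu_pd(b + i);
        const __m128d vc = _mm_loadu_pd(c + i);
        // Streaming store: writes two doubles without first pulling
        // the destination cache line into the cache.
        _mm_stream_pd(a + i, _mm_add_pd(vb, _mm_mul_pd(vq, vc)));
    }
    for (; i < n; ++i) {
        a[i] = b[i] + q * c[i];  // scalar tail for odd n
    }
    _mm_sfence();  // order the streaming stores before any later stores
#else
    // Portable fallback with ordinary stores (no cache bypass).
    for (std::size_t i = 0; i < n; ++i) {
        a[i] = b[i] + q * c[i];
    }
#endif
}
```

A caller would typically allocate `a` with an aligned allocator or declare it with `alignas(16)`. Whether this hand-written version helps on a given system is something to measure rather than assume.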

Fortunately, the Intel compiler can make use of these instructions with little effort from the programmer, and in certain situations it does so automatically when one compiles with the highest optimization options. I decided to try this feature by compiling the code with the Intel® C++ compiler version 11.1. The data from Intel® PCM showed that with this optimized binary the Triad score matched the total measured memory controller traffic, and that about two reads per write (68%/32%) were issued, as expected.

Measured memory traffic split for STREAM Triad with non-temporal stores running on Intel® Xeon® 5600 series processor.

This was a small example of how Intel® Performance Counter Monitor helped me understand the impact of cache coherency on the STREAM benchmark. Maybe you will have your own successes if you try our tool: for example, you can quickly find out whether your software is memory bandwidth-limited or NUMA-(un)friendly. By the way, if you cannot or do not want to modify your application, Intel® PCM can also run in the background as a stand-alone monitor in text or graphical mode while your program is running.


Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit http://www.intel.com/performance/resources/limits.htm Copyright © 2010, Intel Corporation.
Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance.