Dissecting STREAM benchmark with Intel® Performance Counter Monitor

Intel® Performance Counter Monitor (Intel® PCM) is an API and a set of tools that help developers understand how their applications utilize the underlying compute platform. In this blog I will explain how to instrument the well-known STREAM benchmark with Intel® PCM library functions that read statistics directly from the integrated memory controllers available on the latest Intel® Xeon® 5500, 5600, 7500 and Core™ processor series. With this instrumentation we will be able to measure the actual memory traffic while running the “Triad” part of the benchmark.


STREAM is a synthetic benchmark that mostly exercises the memory subsystem. Its source code is freely available from http://www.cs.virginia.edu/stream/. For this exercise I took the OpenMP version. The most popular STREAM sub-benchmark, Triad, computes A[i] = B[i] + q*C[i], where A, B and C are vectors and q is a constant. To estimate the amount of data read from and written to memory during Triad, I added these few lines to the original code:


diff -u stream_omp_original.c stream_omp.c
--- stream_omp_original.c 2010-11-04 14:10:00.385521739 +0100
+++ stream_omp.c 2010-11-04 14:13:53.255521728 +0100
@@ -45,6 +45,8 @@
# include <float.h>
# include <limits.h>
# include <sys/time.h>
+# include <omp.h>
+# include <cpucounters.h>

/* INSTRUCTIONS:
*
@@ -54,8 +56,8 @@
* that should be good to about 5% precision.
*/
-# define N 2000000
-# define NTIMES 10
+# define N 60000000 // make sure the arrays do not fit into the caches
+# define NTIMES 100 // increase the number of repetitions to increase measurement accuracy

# define OFFSET 0

/*
@@ -179,6 +181,9 @@
printf("For best results, please be sure you know the\n");
printf("precision of your system timer.\n");
printf(HLINE);
+ Monitor::getInstance()->program();
+ long long unsigned readBytes = 0, writtenBytes = 0;


/* --- MAIN LOOP --- repeat test cases NTIMES times --- */

@@ -216,6 +221,7 @@
times[2][k] = mysecond() - times[2][k];

times[3][k] = mysecond();
+ SystemCounterState begin = getSystemCounterState();
#ifdef TUNED
tuned_STREAM_Triad(scalar);
#else
@@ -224,6 +230,12 @@
a[j] = b[j]+scalar*c[j];
#endif
times[3][k] = mysecond() - times[3][k];
+ SystemCounterState after = getSystemCounterState();
+ if(k>0) { // skip the first iteration
+ readBytes += getBytesReadFromMC(begin,after);
+ writtenBytes += getBytesWrittenToMC(begin,after);
+ }

}

/* --- SUMMARY --- */
@@ -238,6 +250,8 @@
}
}
+ printf("TRIAD read bw: %11.4f write bw: %11.4f\n",1.0E-06*readBytes/avgtime[3],1.0E-06*writtenBytes/avgtime[3]);
+ Monitor::getInstance()->cleanup();


printf("Function Rate (MB/s) Avg time Min time Max time\n");
for (j=0; j<4; j++) {
avgtime[j] = avgtime[j]/(double)(NTIMES-1);
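Putting the added pieces together, the measurement pattern boils down to the stand-alone sketch below. This is my own condensed illustration rather than part of the original patch; it uses the same PCM calls as the patch above, measured_kernel() is just a placeholder for the code region of interest, and the name of the singleton class (Monitor here) may differ in other PCM versions.

#include <cstdio>
#include <cpucounters.h>   // Intel PCM header

// placeholder workload standing in for the code region to be measured
void measured_kernel()
{
    static double a[1000000], b[1000000];
    for (int i = 0; i < 1000000; ++i)
        a[i] = 2.0 * b[i];
}

int main()
{
    Monitor::getInstance()->program();            // program the memory controller counters

    SystemCounterState before = getSystemCounterState();
    measured_kernel();                            // region of interest
    SystemCounterState after = getSystemCounterState();

    printf("bytes read from memory controllers:  %llu\n",
           (long long unsigned) getBytesReadFromMC(before, after));
    printf("bytes written to memory controllers: %llu\n",
           (long long unsigned) getBytesWrittenToMC(before, after));

    Monitor::getInstance()->cleanup();            // release the counters
    return 0;
}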


Since Intel® PCM is written in C++, I compiled the binary with a C++ compiler instead of a C compiler and also linked in the object files from Intel® PCM. Here are the additional compilation options I used on Linux:


-I<path_to_Intel_PCM> <path_to_Intel_PCM>/cpucounters.o <path_to_Intel_PCM>/msr.o <path_to_Intel_PCM>/pci.o -lpthread
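For reference, a complete command line could look roughly like the following. This is my own example assuming g++ with OpenMP; adapt the compiler, the OpenMP flag and the paths to your environment.

g++ -O3 -fopenmp stream_omp.c -I<path_to_Intel_PCM> <path_to_Intel_PCM>/cpucounters.o <path_to_Intel_PCM>/msr.o <path_to_Intel_PCM>/pci.o -lpthread -o stream_omp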


For the experiment I had a system with two Intel® Xeon® 5600 series processors running Linux SLES10 SP3. I ran the benchmark and, conveniently, it reported not only the Triad benchmark score but also the real memory traffic measured at the memory controllers during the test. The first time I looked at the results I quickly realized that the Triad bandwidth score was about 25% lower than the measured memory controller bandwidth. Why is that? That is a huge difference…
To shed more light on this I looked at the contribution of the measured reads and writes to the total memory controller bandwidth, and the situation became a bit clearer…

[Figure: Measured memory traffic split for STREAM Triad running on an Intel® Xeon® 5600 series processor.]

Is this distribution expected? Shouldn’t it be 66% reads and 33% writes? According to the “A[i] = B[i] + q*C[i]” Triad algorithm the application performs two array reads per write… but not according to the hardware! Because of the cache coherency protocol, the processor first needs to read the cache line containing the old data before (over-)writing it. This is done to make sure that nobody else has a copy of this cache line in its local caches and that the writing processor has ownership of the line. For more details see this white paper. Knowing this, I could understand why Intel® PCM reports three physical reads per write (a 75%/25% ratio) and why Triad’s own accounting does not include this additional read traffic. This explained the 25% lower score.
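To put numbers on it: per Triad pass over N double-precision elements, STREAM counts 2 reads + 1 write = 3 * 8 * N bytes, while the memory controller actually transfers 3 reads + 1 write = 4 * 8 * N bytes. Reads therefore make up 3/4 = 75% of the measured traffic, and the bandwidth STREAM reports corresponds to only 3/4 of what the hardware really moves, i.e. 25% less.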

Experienced programmers know that processors also support so-called “non-temporal” stores which bypass the caches. These instructions may improve performance by relieving cache pressure and reducing memory traffic, but the drawback is that cache coherency must then be guaranteed by the application and/or the compiler.
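To illustrate what such a store looks like in code, here is a sketch of a hand-written Triad kernel using SSE2 streaming-store intrinsics. This is my own illustration, not the code the compiler generates; it assumes 16-byte-aligned arrays and an even element count, and the function name triad_nontemporal is hypothetical.

#include <emmintrin.h>   /* SSE2 intrinsics, including the _mm_stream_pd streaming store */

/* a[j] = b[j] + scalar*c[j], writing a[] with non-temporal stores */
void triad_nontemporal(double *a, const double *b, const double *c,
                       double scalar, long n)
{
    __m128d q = _mm_set1_pd(scalar);
    long j;
#pragma omp parallel for
    for (j = 0; j < n; j += 2) {
        __m128d r = _mm_add_pd(_mm_load_pd(&b[j]),
                               _mm_mul_pd(q, _mm_load_pd(&c[j])));
        _mm_stream_pd(&a[j], r);   /* write-combining store: bypasses the caches,
                                      so no read-for-ownership of the target line */
    }
    _mm_sfence();                  /* make the streaming stores globally visible */
}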

Fortunately, the Intel compiler can make use of these instructions without much effort, and in certain situations it does so automatically when compiling with the highest optimization options. I decided to try this feature by compiling the code with the Intel® C++ compiler version 11.1. The data from Intel® PCM showed that with this optimized binary the Triad score matched the total measured memory controller traffic, and that about two reads per write (68%/32%) were issued, as expected!
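For reference, that kind of build could look roughly like the following. This is my own sketch; the exact option names are from memory and may differ between compiler versions (-openmp enables OpenMP, and -opt-streaming-stores always forces streaming stores instead of leaving the decision to the compiler):

icpc -O3 -openmp -opt-streaming-stores always stream_omp.c -I<path_to_Intel_PCM> <path_to_Intel_PCM>/cpucounters.o <path_to_Intel_PCM>/msr.o <path_to_Intel_PCM>/pci.o -lpthread -o stream_omp_nt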
[Figure: Measured memory traffic split for STREAM Triad with non-temporal stores running on an Intel® Xeon® 5600 series processor.]

This was a small example of how Intel® Performance Counter Monitor helped me understand the impact of cache coherency on the STREAM benchmark. Maybe you will have your own successes if you try our tool: for example, you can quickly find out whether your software is memory bandwidth-limited or NUMA-(un)friendly. By the way, if you cannot or do not want to modify your application, Intel® PCM can also be executed in the background as a stand-alone monitor in text or graphical mode while your program is running.
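For example, with the PCM version I used, the stand-alone text-mode monitor could be started roughly like this (the binary name pcm.x and its arguments are as I remember them for that release and may differ in newer versions; "./your_app" is a placeholder):

./pcm.x 1              # print system-wide counter statistics every second
./pcm.x "./your_app"   # or monitor the counters while an external program runs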


------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit http://www.intel.com/performance/resources/limits.htm Copyright © 2010, Intel Corporation.
Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance.

 

For more complete information about compiler optimizations, see our Optimization Notice.

8 comments

Shiva A.:

Hi Roman,

Please comment: I have memory with a 64-bit wide bus on a system running at 1333 MT/s. The STREAM Triad score is 6000.

Now, is this 6000MB/s?

Expected BW = 1333MT/s * 64/8 = 10664MB/s.

So, does this mean the memory efficiency is 60%?

Thanks!

Ingo M.:

Hi Roman,

Thanks for the interesting post! I just brought your patch up to date. You can download it here: http://pastebin.com/eecP2J6u. It applies like this:

patch -p1 < pcm.patch

Also, compilation changed with the new version of PCM. Now the command is:

-I<path_to_Intel_PCM> <path_to_Intel_PCM>/cpucounters.o <path_to_Intel_PCM>/msr.o <path_to_Intel_PCM>/pci.o -lpthread

Best regards,

Ingo

Ron B.:

Roman,

I found an updated document from Scott Huck that answers my first question. I had forgotten that Stream throughput now varies with the model number and clock speed thanks to the synchronous memory controller on Sandy Bridge.

I am still interested in coding or compilation techniques to improve "Stream over QPI" performance.

Ron

Ron B.:

Roman,

Using your recommendations and those in Scott Huck's paper I was able to get Stream Triad performance on Sandy Bridge EP servers reasonably close to Intel's published spec of 79.6 GB/s. I profiled Stream under VTune using the SNB-EP "Bandwidth" analysis, which also reports GB/s of total read+write, and generally speaking I got good agreement between the Stream Triad value and VTune's measurement, which is shown as a time-series graph.

Additional VTune tests running stream under numactl showed that the per-package Stream and VTune bandwidths were 50% of the 2p total, as you might expect. (Using "numactl --membind=0 --cpunodebind=0 stream" for example.)

When fully optimized, Stream throughput approaches 80% of the IMC spec. For example, ~80 GB/s on SNB-EP compared to the HW spec of 102 GB/s for both packages (or half of those values for one package).

An interesting additional test is to measure QPI bandwidth with Stream by using memory on package_0 and threads on package_1. (Using "numactl --membind=0 --cpunodebind=1 stream") Actual QPI throughput when measured this way is not as close to the HW spec as is IMC throughput to local memory. For either NHM-EP or SNB-EP, stream throughput is roughly one third of the QPI spec: for example 11 to 12 GB/s out of 32 GB/s for SNB-EP.

The VTune bandwidth timelines for QPI vs IMC data paths are interesting. IMC data throughput is relatively constant over time with some small repetitive variation. QPI throughput is erratic.

So, I have two questions:
1. Do you or Scott have updated guidelines for optimal Stream Triad performance on SNB-EP? I would be interested in seeing the configuration used to reach 79.6 GB/s.
2. Do you have time or interest in looking at the erratic VTune bandwidth for Stream over QPI? The intermittent data flow rate suggests some lack of optimization in the HW. I don't expect you to comment on that, but I would be curious if Stream configuration changes can be made to improve QPI throughput.

I support applications with algorithms that are "stream-like" in nature and cannot guarantee optimal data locality due to the size of data arrays, so this question is of practical import.

Thanks,
Ron Bennett

Roman Dementiev (Intel):

anum,

PMU stands for Performance Monitoring Unit (a piece of hardware).

PCM in this context stands for Performance Counter Monitor, a software tool to access PMUs.

There is a forum where you can ask questions regarding PMUs and Intel PCM: http://software.intel.com/en-us/forums/platform-monitoring/

Roman

anonymous:

What is the difference between PMU and PCM?

marshal-singh:

Yeah, this is true. I tried this and found it correct.
