How Do I Measure Memory Bandwidth on an Intel® Core™ i7 or Xeon® 5500 Series Platform Using Intel® VTune™ Performance Analyzer?

Updated!  2/3/2011

The new Intel® Core™ i7 and Xeon® 5500 series processors have a different architecture than previous processors, particularly when it comes to the uncore. The “uncore” is the part of the processor that is external to all the individual cores. For example, on the Core™ i7 there are 4 cores, and these share an L3 cache and a memory interface; the L3 and memory interface are considered uncore. VTune™ Performance Analyzer does not support sampling of events that are triggered in the uncore of the processor.

However, due to popular demand, we have created and documented a way for VTune analyzer users who have Core i7 or Xeon 5500 series processor-based platforms to measure memory bandwidth. This is not measurable by default since the events needed are in the uncore. Here is the process to enable bandwidth measurement using the program sep. Sep is a utility that provides the sampling functionality used by VTune analyzer and Intel® Performance Tuning Utility (PTU).

Note that with this method the bandwidth events are counted using time-based sampling, not the event-based sampling that VTune analyzer normally uses. This means that you can determine a bandwidth for your whole system over a designated time range, but you won’t be able to see how much of the bandwidth used came from various functions/processes/modules. You can only see the total bandwidth for the system. Please adjust your application testing accordingly by running only the target application while measuring bandwidth.

1. Download the Intel® Performance Tuning Utility 3.2 update 1. The version of sep needed for this method is only available in this release of PTU. If you have a 32-bit operating system, get the IA-32 version, and if you have a 64-bit operating system, get the Intel® 64 version. PTU is available for both Windows* and Linux*.
2. Un-compress the package and follow the instructions in INSTALL.txt to install PTU. Make sure to install the sampling driver!
3. Download the appropriate Uncore Measurement package and uncompress it into a directory of your choice. To download the package, go to http://premier.intel.com, log in, and select File Downloads from the menu on the left. Select either VTune™ Performance Analyzer for Linux* or VTune™ Performance Analyzer for Windows* and click Display File List. The package will be named lin_measurebw.tar.gz for Linux* or win_measurebw.zip for Windows*.
4. Run the bandwidth measurement script (uncore.bat for Windows*, uncore.sh for Linux*) from the uncore directory. This script sets up the environment needed to measure bandwidth, and then uses sep to measure it. It is important that you measure bandwidth using this script to avoid unstable configuration changes to your VTune™ analyzer or PTU installations! If you run this script from a command prompt (instead of double-clicking it), close the command window afterwards. A short sketch that automates this step and the next follows this list.
5. Once the bandwidth measurement script has finished executing, open the bandwidth.txt file in the same directory. This file contains the results of bandwidth measurement, and will be overwritten each time you run the bandwidth measurement script. See the Interpreting Bandwidth section below to analyze the data.
6. Now that PTU is installed, you may use it for your sampling needs, or you can use VTune analyzer. PTU will be the current active sampling technology on your system after executing these instructions. You will need to follow these instructions for switching between using PTU and VTune analyzer for sampling.
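
If you want to automate steps 4 and 5 on Linux*, a minimal Python sketch along these lines may help. This is not part of PTU or sep; it assumes Python is available and that uncore.sh and bandwidth.txt are in the current directory, as described in the steps above.

    import subprocess

    # Run the bandwidth measurement script; it overwrites bandwidth.txt each time.
    subprocess.run(["sh", "./uncore.sh"], check=True)

    # Dump the raw results for a quick look before interpreting them (see below).
    with open("bandwidth.txt") as f:
        print(f.read())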

Interpreting Bandwidth

This method measures bandwidth from each processor’s uncore memory controller to memory. It will include memory reads, memory writes, I/O, and writebacks from L3 to memory. It does not include traffic from cache-to-cache transfers between sockets.

Using this method, your bandwidth.txt results file will contain results in this format:

Version Info: Sampling Enabling Product version: 2.9.devbuild (private) built on Mar 18 2009 02:53:25 P:Intel® Xeon® Processor 5500 series M:10 S:4

UNC_IMC_WRITES.FULL.ANY 14,650,441,461 50,459 50,458 50,481 50,458 15,737 15,741 15,741 15,740
UNC_IMC_NORMAL_READS.ANY 14,650,441,476 196,626 196,618 196,679 196,515 36,071 36,071 36,072 36,072
----------

5.00s real 0.468s user 39.531s system 38.796s idle

Bandwidth from reads and writes is measured separately, and each processor socket is measured separately. In the output file above, the first line of values measures writes to memory and the second line measures reads from memory. Each line of output shows a series of event values separated by spaces. The first value after the event name is a timestamp (14,650,441,461 and 14,650,441,476 in this example). The values that follow are the counts of 64-byte transfers on the memory bus, one value per hardware thread.
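
As an illustration, a small Python sketch like the following could split one event line of bandwidth.txt into the event name, the timestamp, and the per-thread counts. The helper name parse_event_line is hypothetical, not something shipped with sep or PTU.

    def parse_event_line(line):
        parts = line.split()
        name = parts[0]                             # e.g. UNC_IMC_WRITES.FULL.ANY
        timestamp = int(parts[1].replace(",", ""))  # first value after the event name
        counts = [int(p.replace(",", "")) for p in parts[2:]]  # 64-byte transfers per hardware thread
        return name, timestamp, counts

    # Using the write line from the sample output above:
    line = "UNC_IMC_WRITES.FULL.ANY 14,650,441,461 50,459 50,458 50,481 50,458 15,737 15,741 15,741 15,740"
    print(parse_event_line(line))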

It is important to realize that for current Core i7 or Xeon 5500 series processors, there are 4 cores on each socket, all sharing the same uncore. So, you will see 4 values for each socket, but really these are all measuring the same uncore bandwidth. For example, for UNC_IMC_WRITES.FULL.ANY in the example above, the first 4 values after the timestamp are all close to 50,460. They are really all measuring the same bandwidth from socket 0 to its memory, and so should be averaged, not summed! The output above was measured on a dual-socket Xeon 5500 series platform with Intel® Hyper-Threading Technology disabled. There are 8 values for each event – 4 for one socket, and 4 for the other. If Hyper-Threading Technology had been enabled, there would be 8 values per socket, and those 8 should be averaged to get one bandwidth number for each socket.
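
To make the averaging concrete, here is a small Python sketch using the write counts copied from the sample output above (4 values per socket because Hyper-Threading Technology is disabled):

    # Per-socket write counts from the example output (UNC_IMC_WRITES.FULL.ANY).
    socket0_writes = [50459, 50458, 50481, 50458]
    socket1_writes = [15737, 15741, 15741, 15740]

    # Average, do not sum: every value on a socket reflects the same uncore bandwidth.
    avg_socket0 = sum(socket0_writes) / len(socket0_writes)  # about 50,464
    avg_socket1 = sum(socket1_writes) / len(socket1_writes)  # about 15,740
    print(avg_socket0, avg_socket1)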

The number of values you see corresponds to the number of hardware threads on your system. The order in which the values appear may differ between Windows* and Linux*. On Windows*, all the values for one processor socket usually appear together. For example, on a dual-socket Windows* platform with Intel® Hyper-Threading Technology enabled, the values may be in the order <Socket 0, Core 0, Hyperthread 0>, <S0, C1, H0>, <S0, C2, H0>, <S0, C3, H0>, <S0, C0, H1>, <S0, C1, H1>, <S0, C2, H1>, <S0, C3, H1>, <S1, C0, H0>, etc., giving you 8 values for each physical socket. On Linux*, the way threads and cores are enumerated varies by distribution. You can refer to /proc/cpuinfo on your platform to see how logical processors map to physical sockets: for each processor entry in /proc/cpuinfo, the “physical id” field indicates the socket number. Because the values in bandwidth.txt appear in the same order as the processor entries in /proc/cpuinfo, this tells you which values correspond to which socket. In all cases, remember that for a particular bandwidth event you should see roughly the same quantity from all cores and hardware threads on the same socket. On a dual-socket platform with Hyper-Threading Technology enabled, half of the values for each bandwidth event belong to one socket and half to the other, so roughly half will cluster around one quantity and the other half around a different quantity. On a single-socket platform, simply average all the values.
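
On Linux*, a hypothetical Python sketch like this can read /proc/cpuinfo and group logical processor indices by their “physical id”, which tells you which columns of bandwidth.txt belong to which socket (assuming, as noted above, that the values appear in the same order as the processor entries):

    from collections import defaultdict

    def sockets_from_cpuinfo(path="/proc/cpuinfo"):
        socket_of = []  # socket_of[i] = socket number of logical processor i
        with open(path) as f:
            for raw in f:
                if raw.startswith("physical id"):
                    socket_of.append(int(raw.split(":")[1]))
        groups = defaultdict(list)
        for index, socket in enumerate(socket_of):
            groups[socket].append(index)
        return dict(groups)  # e.g. {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}

    print(sockets_from_cpuinfo())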

Finally, near the bottom of each result file you will see the time spent sampling – 5 seconds in the example above.

To compute total system bandwidth, use this formula:

Bandwidth (GB/s) = ((sum of the per-socket averages of UNC_IMC_WRITES.FULL.ANY + sum of the per-socket averages of UNC_IMC_NORMAL_READS.ANY) * 64 * 1.0e-9) / seconds measured

For the example above, bandwidth is ((50,464 (writes on socket 0) + 15,740 (writes on socket 1) + 196,610 (reads on socket 0) + 36,072 (reads on socket 1)) * 64 * 1.0e-9) / 5 ≈ 0.004 GB/s. This bandwidth was measured on an idle system.
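
To make the arithmetic explicit, here is a small Python sketch (a hypothetical helper, not part of the tools) that applies the formula to the per-socket averages from the example:

    def total_bandwidth_gb_per_s(write_avgs, read_avgs, seconds):
        # Each count is one 64-byte transfer; convert bytes to GB and divide by time.
        total_transfers = sum(write_avgs) + sum(read_avgs)
        return (total_transfers * 64 * 1.0e-9) / seconds

    # Per-socket averages from the example output, measured over 5 seconds.
    print(total_bandwidth_gb_per_s([50464, 15740], [196610, 36072], 5.0))  # about 0.004 GB/s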

Final Notes

This method can be used to measure total system bandwidth on Core i7 and Xeon 5500 series processor-based platforms. It will not work with any other processors. We also do not recommend using sep for any other sampling – VTune analyzer and PTU have much more friendly user interfaces for collecting and interpreting data. At this time, these events (needed for bandwidth measurement) are the only uncore events we are making available.

For more information about compiler optimizations, see the Optimization Notice.