Measuring Memory Bandwidth on the Intel® Xeon Phi™ Coprocessor

The memory bandwidth of an application is an important metric to have at your fingertips when optimizing it. You can measure the memory bandwidth of an application running on the Intel Xeon Phi coprocessor in one of two ways: by using the core hardware events or by using the uncore hardware events.

You can find more details on how to measure the memory bandwidth of your application on the coprocessor using the core events on this page (http://software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-2-understanding). The main advantage of the core event method is that it lets you see all the components that contribute to your bandwidth. On the other hand, it requires a custom collection and does not pre-compute the metrics for you.
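To make "does not pre-compute the metrics" concrete, here is a minimal sketch of the arithmetic you would do yourself with raw counts: add up the cache-line transfers, multiply by the line size, and divide by the elapsed time. The component names and totals below are hypothetical placeholders, not real event names; which core events to sum is described in the linked article and is not reproduced here. The sketch assumes 64-byte cache lines.

```python
CACHE_LINE_BYTES = 64  # assumed cache line size in bytes


def bandwidth_gb_per_s(line_transfer_counts, elapsed_seconds,
                       line_bytes=CACHE_LINE_BYTES):
    """Convert cache-line transfer counts (one entry per component) to GB/s."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    total_lines = sum(line_transfer_counts.values())
    return total_lines * line_bytes / elapsed_seconds / 1e9


def component_breakdown(line_transfer_counts, elapsed_seconds):
    """Bandwidth contributed by each component on its own, in GB/s."""
    return {
        name: bandwidth_gb_per_s({name: count}, elapsed_seconds)
        for name, count in line_transfer_counts.items()
    }


# Hypothetical totals from a custom collection (placeholder names).
counts = {"read_fills": 1_500_000_000, "write_fills": 500_000_000}
elapsed = 2.0  # seconds of wall-clock time covered by the collection

print(bandwidth_gb_per_s(counts, elapsed))
print(component_breakdown(counts, elapsed))
```

The per-component breakdown is what the core event method gives you and the pre-computed uncore metric hides.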

The second method uses uncore events, through the memory bandwidth analysis provided by Intel® VTune™ Amplifier XE 2013 Update 3 or higher. This method calculates the bandwidth for you automatically, with no added effort on your part. However, its simplicity obscures the individual component events that contribute to the calculation of memory bandwidth.

Figure: Setting the CPU mask in the Intel VTune Amplifier project properties.

Uncore hardware events are counted by a performance monitoring unit (PMU) outside the core itself, hence the name “uncore events”. The Intel VTune Amplifier driver samples uncore events by periodically interrupting the processor after a certain number of clock ticks and reading the counts accumulated over the sampling interval. Modern processors have multiple cores, and the value of an uncore event is the same irrespective of which core samples it. By default, whenever an uncore event is ready to be sampled, all the cores are interrupted and each core samples the event. This generates a large number of duplicate samples: one copy per core. The duplicates do not affect the accuracy of the data, but the volume of data they generate may slow down the post-processing that occurs when Intel VTune Amplifier displays the results. To speed up the processing, you can set the CPU mask to ‘1’ in the Intel VTune Amplifier project properties so that the samples are collected only by ‘CPU 1’, which is generally awake. Keep in mind that setting the CPU mask is suitable only if the workload is fairly evenly distributed across all the CPUs. Otherwise, samples may be missed when CPU 1 is inactive.
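The effect of the CPU mask can be illustrated with a toy model. This is only a sketch under simplifying assumptions: the tuple layout, function names, and counts are hypothetical and do not reflect the data format Intel VTune Amplifier actually uses. Each sampling interrupt is a "tick", and every active CPU records the same uncore count for that tick.

```python
def make_samples(num_cpus, uncore_counts, idle=frozenset()):
    """Build (cpu, tick, count) records.

    Every CPU records the same uncore count at each tick, unless the
    (cpu, tick) pair is listed in `idle`, in which case that CPU misses it.
    """
    samples = []
    for tick, count in enumerate(uncore_counts):
        for cpu in range(num_cpus):
            if (cpu, tick) not in idle:
                samples.append((cpu, tick, count))
    return samples


def apply_cpu_mask(samples, cpu):
    """Keep only the samples collected by the given CPU."""
    return [s for s in samples if s[0] == cpu]


def total_events(samples):
    """Sum the counts once per tick, ignoring duplicate copies."""
    per_tick = {}
    for _cpu, tick, count in samples:
        per_tick[tick] = count
    return sum(per_tick.values())


counts = [100, 250, 175]  # hypothetical uncore counts, one per tick

all_samples = make_samples(num_cpus=4, uncore_counts=counts)
masked = apply_cpu_mask(all_samples, cpu=1)
print(len(all_samples), len(masked))                    # records to post-process
print(total_events(all_samples), total_events(masked))  # same total either way

# If CPU 1 is inactive during the last tick, the masked run misses that sample.
uneven = make_samples(num_cpus=4, uncore_counts=counts, idle={(1, 2)})
print(total_events(uneven), total_events(apply_cpu_mask(uneven, cpu=1)))
```

In the evenly loaded case the mask cuts the number of records by a factor equal to the CPU count without changing the total. In the uneven case the masked total falls short, which is the caveat described above.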

Lastly, if the memory bandwidth reported by the second method is not what you expect, it is better to use the first method. It lets you gain a better understanding of the measurement by scrutinizing the individual components of the memory bandwidth metric.

* VTune is a trademark of Intel Corporation in the U.S. and/or other countries.

