FLOP/s is a popular metric for estimating performance. This document shares our experience with using Intel VTune Amplifier XE to estimate FLOP/s.
One of the key features of the Intel Xeon Phi coprocessor is its 512-bit Vector Processing Unit (VPU). All floating-point operations on the coprocessor, whether vector or scalar, are executed by the vector unit. Hence, we can track the number of floating-point operations executed in a workload by monitoring the VPU. In this study, we use the VPU_ELEMENTS_ACTIVE hardware event to estimate the number of floating-point operations in the workload. The VPU_ELEMENTS_ACTIVE hardware event counter counts the number of vector elements operated on by the vector instructions executed. For example, if a floating-point vector multiply is performed on 8 doubles fully packed into one register, VPU_ELEMENTS_ACTIVE is incremented by 8 for that instruction. Hence, in theory, you can calculate the FLOP/s as:
FLOP/s = (VPU_ELEMENTS_ACTIVE)/Time
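To make the counting example above concrete, here is a minimal Python sketch of how many elements fit in one 512-bit register and how much the counter would be expected to grow for a run of fully packed instructions. The function names are hypothetical, and the sketch assumes every instruction is fully packed and unmasked, which real code rarely achieves.

```python
VPU_WIDTH_BITS = 512  # vector register width stated in the text above


def elements_per_register(element_bits: int) -> int:
    """Number of elements of the given size that fit in one vector register."""
    if element_bits <= 0 or VPU_WIDTH_BITS % element_bits != 0:
        raise ValueError("element size must evenly divide the register width")
    return VPU_WIDTH_BITS // element_bits


def expected_count(num_vector_instructions: int, element_bits: int) -> int:
    """Expected VPU_ELEMENTS_ACTIVE increment, ASSUMING every instruction
    operates on a fully packed, unmasked register (illustrative only)."""
    return num_vector_instructions * elements_per_register(element_bits)


doubles_per_register = elements_per_register(64)  # 64-bit doubles
count_for_1000_multiplies = expected_count(1000, 64)
print(doubles_per_register)       # 8
print(count_for_1000_multiplies)  # 8000
```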
The other piece of information needed to estimate FLOP/s is the amount of time the application spends performing the floating-point operations. You can collect this information directly by using gettimeofday or something similar. Alternatively, you can use the CPU_CLK_UNHALTED hardware event to calculate the time spent performing the floating-point operations. The actual time can be calculated from hardware event counts as follows:
Generally, it is safer to use wall-clock time instead of the CPU_CLK_UNHALTED hardware event, because CPU_CLK_UNHALTED is affected by a number of factors, as explained later in this article.
$ /opt/intel/vtune_amplifier_xe/bin64/amplxe-cl -collect-with runsa-knc -knob event-config=CPU_CLK_UNHALTED,VPU_ELEMENTS_ACTIVE -- ssh mic0 "<Set up the environment>; ~/a.out"
$ /opt/intel/vtune_amplifier_xe/bin64/amplxe-cl -collect-with runsa-knc -knob event-config=CPU_CLK_UNHALTED,VPU_ELEMENTS_ACTIVE -knob target-cards=0 -- ./a.out
Once the profiling run is complete, the results can be viewed using the Intel VTune Amplifier XE GUI. The summary view of the results provides the aggregate number of hardware event counts for the two hardware events for the entire run of the workload. You can use the filtering capabilities of Intel VTune Amplifier XE GUI to view the event counts for specific hot loops or functions. At this point, you can simply plug in the values to calculate the FLOP/s for your workload.
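As a minimal sketch of that final step, the following Python snippet plugs an aggregate VPU_ELEMENTS_ACTIVE count and a wall-clock time into the formula given earlier. The function name and the sample numbers are hypothetical and chosen only for illustration; they are not measurements from a real run.

```python
def estimate_flops_per_second(vpu_elements_active: int, elapsed_seconds: float) -> float:
    """Estimate FLOP/s as VPU_ELEMENTS_ACTIVE / Time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return vpu_elements_active / elapsed_seconds


# Hypothetical inputs: an event count read from the summary view and a
# wall-clock time measured around the same region (e.g. with gettimeofday).
vpu_elements_active = 2_000_000_000_000  # illustrative count, not a real result
elapsed_seconds = 4.0                    # illustrative wall-clock time

gflops = estimate_flops_per_second(vpu_elements_active, elapsed_seconds) / 1e9
print(f"Estimated rate: {gflops:.1f} GFLOP/s")  # Estimated rate: 500.0 GFLOP/s
```

Keep in mind that the event count and the time must cover the same region of the run; mixing a filtered event count with the total runtime would skew the estimate.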
Ensure that the workload runs long enough: The Intel VTune Amplifier XE results can be statistically invalid if your application runs for a very short duration. In this case, you can improve the statistical validity of the results either by increasing the runtime of your workload or by decreasing the sample-after value of the hardware event counters. Please note that a sample-after value of less than 10,000 is not recommended and can produce unexpected results. You can read more about this issue at http://software.intel.com/en-us/blogs/2013/05/29/sanity-check-statistical-data-validity-of-intel-vtune-amplifier-xe-results
VPU_ELEMENTS_ACTIVE counts events other than floating-point operations: The VPU_ELEMENTS_ACTIVE hardware event counts all vector operations executed, not just the floating-point operations. The numbers it reports therefore include vector memory operations such as gather-scatters, vector masking operations, and other special-purpose vector operations in addition to the vector floating-point operations. As a result, the performance counters tend to over-estimate FLOP/s. In particular, a large number of gather-scatter operations in the code can cause VPU_ELEMENTS_ACTIVE to significantly over-estimate the number of FLOPs.
Fused-Multiply-Add (FMA) instructions are counted as one floating-point operation: The hardware event counters treat each FMA instruction as a single floating-point operation instead of two operations (a multiply and an add). If your workload contains a large number of FMA operations, the counters will under-count those floating-point operations, and you should take this into account.
CPU_CLK_UNHALTED may vary from elapsed time: Some hardware event counters, including CPU_CLK_UNHALTED, count only while the core is not halted. It is therefore important to ensure that your application is running at full power and using all the threads during the entire run. If your core halts, CPU_CLK_UNHALTED will not equal elapsed time. The power states (P-states and C-states) can also affect the elapsed time measured by CPU_CLK_UNHALTED, so it may be helpful to disable them. In general, it is safer to use wall-clock time to measure time.
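The FMA caveat above can be turned into a rough correction. The sketch below assumes you know, or can estimate (for example, by inspecting the generated assembly), what fraction of the counted vector elements comes from FMA instructions; each of those elements represents two floating-point operations rather than one. The function name, the fma_fraction parameter, and the sample numbers are hypothetical. The sketch does not correct for the non-floating-point operations that VPU_ELEMENTS_ACTIVE also counts.

```python
def fma_adjusted_flops(vpu_elements_active: int, fma_fraction: float) -> float:
    """Adjust a raw element count for FMA instructions.

    fma_fraction is the ASSUMED share (0.0 to 1.0) of counted elements that
    belong to FMA instructions; each such element counts as two operations.
    """
    if not 0.0 <= fma_fraction <= 1.0:
        raise ValueError("fma_fraction must be between 0.0 and 1.0")
    return vpu_elements_active * (1.0 + fma_fraction)


counted = 1_000_000_000  # illustrative raw count, not a real result
adjusted = fma_adjusted_flops(counted, fma_fraction=0.5)
print(f"{adjusted:.0f}")  # 1500000000
```

With no FMA instructions the count is unchanged, and with only FMA instructions it doubles, which matches the two-operations-per-FMA reasoning.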