Best Known Method: Estimating FLOP/s for workloads running on the Intel® Xeon Phi™ coprocessor using Intel® VTune™ Amplifier XE

FLOP/s is one of the most frequently used metrics for estimating performance. This document shares our experience with using Intel VTune Amplifier XE to estimate FLOP/s.


One of the key features of the Intel Xeon Phi coprocessor is its 512-bit Vector Processing Unit (VPU). All floating-point operations on the coprocessor, whether vector or scalar, are executed by the vector unit. Hence, we can track the number of floating-point operations executed in a workload by monitoring the VPU. For this study, we use the VPU_ELEMENTS_ACTIVE hardware event to estimate the number of floating-point operations in the workload. The VPU_ELEMENTS_ACTIVE hardware event counter counts the number of vector elements that are active in each vector operation executed. For example, if a floating-point vector multiply is performed on 8 doubles fully packed into one register, then VPU_ELEMENTS_ACTIVE is incremented by 8 for that instruction. Hence, in theory, you can calculate the FLOP/s as:

FLOP/s = (VPU_ELEMENTS_ACTIVE)/Time;

The other piece of information needed to estimate FLOP/s is the amount of time the application spends performing the floating-point operations. You can collect this information directly by using gettimeofday or something similar. Alternatively, you can use the CPU_CLK_UNHALTED hardware event to calculate the time spent performing the floating-point operations. The time can be calculated from hardware event counts as follows:

Time= (CPU_CLK_UNHALTED)/((#threads)*Frequency); 

Generally, it is safer to use wall-clock time instead of the CPU_CLK_UNHALTED hardware event, because CPU_CLK_UNHALTED is affected by a number of factors, as explained later in this article.
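The arithmetic can be sketched in a few lines of Python, taking FLOP/s as the VPU_ELEMENTS_ACTIVE count divided by the elapsed time. The function names and all numeric values below are illustrative assumptions, not measured data or part of any Intel tool.

```python
def elapsed_seconds_from_clocks(cpu_clk_unhalted, num_threads, frequency_hz):
    """Time = CPU_CLK_UNHALTED / (#threads * Frequency)."""
    return cpu_clk_unhalted / (num_threads * frequency_hz)


def flops_per_second(vpu_elements_active, elapsed_seconds):
    """FLOP/s estimate = VPU_ELEMENTS_ACTIVE / Time."""
    return vpu_elements_active / elapsed_seconds


# Illustrative values only (assumptions, not measurements).
clocks = 2.44e12    # aggregate CPU_CLK_UNHALTED over all threads
threads = 244       # number of threads used by the workload
freq_hz = 1.0e9     # assumed 1 GHz clock frequency
elements = 5.0e11   # aggregate VPU_ELEMENTS_ACTIVE

seconds = elapsed_seconds_from_clocks(clocks, threads, freq_hz)
gflops = flops_per_second(elements, seconds) / 1e9
print(f"time = {seconds:.2f} s, estimate = {gflops:.1f} GFLOP/s")
```

If you measure wall-clock time instead, pass that value directly to flops_per_second and skip the clock-based conversion.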


To collect the two hardware events, VPU_ELEMENTS_ACTIVE and CPU_CLK_UNHALTED, you can use either the Intel VTune Amplifier XE GUI or the command line.
If you choose to use the GUI, you will need to create a custom analysis for the coprocessor to collect these two events. Please refer to the Intel VTune Amplifier XE tutorials on (under Trainings > Tutorials) for more information. 
To run the collection using the command line, please execute the following command on the host: 
Native Workload: 
$ /opt/intel/vtune_amplifier_xe/bin64/amplxe-cl -collect-with runsa-knc -knob event-config=CPU_CLK_UNHALTED,VPU_ELEMENTS_ACTIVE -- ssh mic0 "<Set up the environment>; ~/a.out"
The above command line assumes that you have already transferred the binary and any other necessary files (e.g. libraries) over to the coprocessor.
Offload Workload: 
$ /opt/intel/vtune_amplifier_xe/bin64/amplxe-cl -collect-with runsa-knc -knob event-config=CPU_CLK_UNHALTED,VPU_ELEMENTS_ACTIVE -knob target-cards=0 -- ./a.out
The above command line assumes that you have already set up the environment variables on the host before running this command. 

Once the profiling run is complete, the results can be viewed using the Intel VTune Amplifier XE GUI. The summary view provides the aggregate counts of the two hardware events for the entire run of the workload. You can use the filtering capabilities of the Intel VTune Amplifier XE GUI to view the event counts for specific hot loops or functions. At this point, you can plug in the values to calculate the FLOP/s for your workload.


Ensure that the workload runs long enough: The Intel VTune Amplifier XE results can be statistically invalid if your application runs for a very short time. In this case, you can improve the statistical validity of the results either by increasing the runtime of your workload or by decreasing the sample-after value of the hardware event counters. Please note that a sample-after value of less than 10,000 is not recommended and can produce unexpected results. You can read more about this issue at
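As a rough sketch of why runtime and the sample-after value both matter: event-based sampling takes one sample each time the counter reaches the sample-after value, so the number of samples is approximately the event count divided by that value. The helper below is hypothetical, and the numbers are illustrative assumptions.

```python
MIN_SAMPLE_AFTER = 10_000  # values below this are not recommended


def expected_samples(event_count, sample_after_value):
    """Approximate number of samples: one per sample_after_value events."""
    if sample_after_value < MIN_SAMPLE_AFTER:
        raise ValueError("sample-after value below 10,000 is not recommended")
    return event_count // sample_after_value


# Illustrative values only: 5e9 events with a sample-after value of 2,000,000.
samples = expected_samples(5_000_000_000, 2_000_000)
print(f"approximately {samples} samples")
```

A short run produces a small event count and therefore few samples; a longer run or a smaller sample-after value raises the sample count.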

VPU_ELEMENTS_ACTIVE counts events other than floating-point operations: The VPU_ELEMENTS_ACTIVE hardware event counts all vector operations executed, not just floating-point operations. The numbers it reports therefore include vector memory operations such as gather-scatters, vector masking operations, and other special-purpose vector operations in addition to the vector floating-point operations. As a result, the performance counters tend to over-estimate FLOP/s. In particular, a large number of gather-scatter operations in the code can cause VPU_ELEMENTS_ACTIVE to significantly over-estimate the number of FLOPs.

Fused-Multiply-Add (FMA) instructions are counted as one floating-point operation: The hardware event counters treat an FMA instruction as a single floating-point operation instead of two, which under-counts the FLOPs in such code. Take this into consideration if your workload contains a large number of FMA operations.
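One possible way to account for this is sketched below. It assumes you can estimate, for example from the generated assembly, what fraction of the counted elements come from FMA instructions; that fraction and the function name are assumptions for illustration only.

```python
def fma_adjusted_flops(vpu_elements_active, fma_fraction):
    """Scale the count so that each FMA element contributes two FLOPs.

    fma_fraction is the assumed share of counted elements that belong
    to FMA instructions (0.0 = no FMAs, 1.0 = only FMAs).
    """
    if not 0.0 <= fma_fraction <= 1.0:
        raise ValueError("fma_fraction must be between 0 and 1")
    # count = non_fma + fma, true FLOPs = non_fma + 2 * fma
    return vpu_elements_active * (1.0 + fma_fraction)


# Illustrative values only: 1e9 counted elements, half of them from FMAs.
adjusted = fma_adjusted_flops(1.0e9, 0.5)
print(f"adjusted FLOP count = {adjusted:.3e}")
```

This correction addresses only the FMA under-count; it does not remove the non-floating-point vector operations that VPU_ELEMENTS_ACTIVE also includes.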

CPU_CLK_UNHALTED may vary from elapsed time: Some hardware event counters, including CPU_CLK_UNHALTED, count only when the core is not asleep. Hence it is important to ensure that your application is running at full power and using all the threads during the entire run. If a core halts, CPU_CLK_UNHALTED will not equal elapsed time. The power states (P-states and C-states) can also affect the time derived from CPU_CLK_UNHALTED, so it may be helpful to disable them. In general, it is safer to use wall-clock time to measure time.

Compiled code could be very different from source: Current compilers are sophisticated programs that try to optimize code at every opportunity. This can result in code that is significantly different from the source. As a result, the number of floating-point operations the programmer counts from the source code can differ from the number actually present in the compiled code. If you doubt the FLOP/s value measured by events for your workload, you can hand-count the number of FLOPs by viewing the generated assembly code. Intel VTune Amplifier XE can also be used as a source/assembly viewer by double-clicking on hot functions or loops.

Lastly, a hard-learned lesson: It is always a good idea to confirm the validity of your metric by writing a small bit of test code whose result you know.
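A minimal sketch of such a check follows. It assumes a hypothetical test kernel that performs a known number of floating-point operations per element (for example, one FMA per element, which is two FLOPs but is counted once), and compares a measured VPU_ELEMENTS_ACTIVE value with the expected count. All names and numbers are illustrative assumptions.

```python
def expected_counts(n_elements, repeats, flops_per_element, fmas_per_element):
    """Return (true FLOPs, elements the counter should attribute to FP work).

    Each FMA is two FLOPs but is counted as one operation, so the
    expected counted value is lower than the true FLOP count.
    """
    iterations = n_elements * repeats
    true_flops = iterations * flops_per_element
    counted = true_flops - iterations * fmas_per_element
    return true_flops, counted


def overcount_ratio(measured_elements, expected_counted):
    """Ratio above 1.0 suggests non-floating-point vector work was counted."""
    return measured_elements / expected_counted


# Hypothetical kernel: a[i] = a[i] * b[i] + c, one FMA per element.
true_flops, counted = expected_counts(1000, 10, 2, 1)
ratio = overcount_ratio(12000, counted)  # 12000 is a made-up measurement
print(f"true FLOPs = {true_flops}, expected count = {counted}, ratio = {ratio:.2f}")
```

A ratio far from 1.0 on a kernel this simple is a hint that one of the caveats above, such as vector memory operations or compiler transformations, is affecting the measurement.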