I'm trying to compare two implementations of a particular function for their performance, in terms of CPU time and floating-point instructions retired. I'd prefer not to use any kind of stochastic sampling. I just want to know how many cycles and how many flops elapsed between point A and point B in my code, where this fragment will be executed many times in a single program run.
Unless I'm misreading everything, VTune's sampling is stochastic, either time-based or event-based. Is there a way to make VTune's sampling _exhaustive_, so that I get the total number of instructions/flops in a function?
I am including VTuneApi calls at the beginning and end of the function to resume and pause data collection.
Really I'm looking for something very much like PAPI (http://icl.cs.utk.edu/papi/), which doesn't support Windows/P4 machines. I'm hoping VTune can deliver this functionality.
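To make "exhaustive" concrete, here is a minimal sketch of the bracketing pattern I have in mind: read a counter at point A, read it again at point B, and accumulate the difference over every execution. All names here (`read_ticks`, `RegionCounter`, `ScopedRegion`, `work`, `measured_work`) are hypothetical and only for illustration. The sketch reads the x86 time-stamp counter where it is available, which gives ticks that may not equal core cycles on every CPU, and it cannot count flops at all, which is exactly the part I'm hoping VTune can fill in.

```cpp
#include <chrono>
#include <cstdint>

#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>
#define HAVE_RDTSC 1
#elif defined(__i386__) || defined(__x86_64__)
#include <x86intrin.h>
#define HAVE_RDTSC 1
#endif

// Read an increasing tick counter: the time-stamp counter on x86,
// otherwise a steady clock as a portable fallback (an assumption of
// this sketch, not something VTune or PAPI does).
static inline std::uint64_t read_ticks() {
#ifdef HAVE_RDTSC
    return __rdtsc();
#else
    return static_cast<std::uint64_t>(
        std::chrono::steady_clock::now().time_since_epoch().count());
#endif
}

// Running totals for one instrumented region (hypothetical type).
struct RegionCounter {
    std::uint64_t total_ticks = 0;
    std::uint64_t calls = 0;
};

// Point A is the constructor, point B is the destructor. The resume
// and pause calls from the VTune API would sit at the same two points.
class ScopedRegion {
public:
    explicit ScopedRegion(RegionCounter& c)
        : counter_(c), start_(read_ticks()) {}
    ~ScopedRegion() {
        counter_.total_ticks += read_ticks() - start_;
        ++counter_.calls;
    }
    ScopedRegion(const ScopedRegion&) = delete;
    ScopedRegion& operator=(const ScopedRegion&) = delete;

private:
    RegionCounter& counter_;
    std::uint64_t start_;
};

// Stand-in for one of the two implementations being compared.
double work(int n) {
    double acc = 0.0;
    for (int i = 0; i < n; ++i) acc += i * 0.5;
    return acc;
}

// Every call is counted; nothing is sampled.
double measured_work(RegionCounter& c, int n) {
    ScopedRegion region(c);  // point A
    return work(n);          // point B is reached when region is destroyed
}
```

After the run, `total_ticks / calls` would give an average per execution for each implementation. What I'm missing is the equivalent exact total for floating-point instructions retired.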