I am trying to use Intel Parallel Amplifier (IPA) to compare the performance of two different ways of building a particular application. In real-world use, the proposed new build is faster on many workloads but slower on a few, and I am trying to analyze one of the slower cases. The tool is not helping. I ran Hotspots Analysis, using the VTune start/stop API to restrict collection to the actions of interest. In the results, IPA shows a single function as taking 50-80% of the bottom-up time. The function shown does not seem like it could possibly be that slow, and a different function is reported each time I rerun the workload, on either of the two builds. What could be going wrong?
(I tried VTune first, but for some reason it would not generate any call stack data for the second executable, although it worked fine for the first one. The main difference between the two executables is that the second one links against msvcrt instead of libcmt.)
In case it matters, I am running IPA on Win7 Enterprise x64 on 64-bit executables; the VTune run was on a different machine with XP x64.
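For context, here is a minimal sketch of how a region of interest can be bracketed so that collection runs only around it, with an independent wall-clock measurement as a sanity check on what the profiler reports. This is not my actual code: `profiler_resume`, `profiler_pause`, `ProfiledRegion`, `action_of_interest`, and `run_once` are hypothetical names, and the two profiler functions are stubs standing in for the real start/stop calls so that the sketch compiles on its own.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical stand-ins for the profiler's real start/stop calls.
// They only count invocations so the sketch builds without the profiler SDK.
static int g_resume_calls = 0;
static int g_pause_calls = 0;
void profiler_resume() { ++g_resume_calls; }
void profiler_pause()  { ++g_pause_calls; }

// RAII guard: resumes collection on construction, pauses it on destruction,
// and records the elapsed wall-clock time of the bracketed region.
class ProfiledRegion {
public:
    explicit ProfiledRegion(double* elapsed_seconds)
        : elapsed_(elapsed_seconds),
          start_(std::chrono::steady_clock::now()) {
        profiler_resume();
    }
    ~ProfiledRegion() {
        profiler_pause();
        std::chrono::duration<double> d =
            std::chrono::steady_clock::now() - start_;
        if (elapsed_) *elapsed_ = d.count();
    }
    ProfiledRegion(const ProfiledRegion&) = delete;
    ProfiledRegion& operator=(const ProfiledRegion&) = delete;

private:
    double* elapsed_;
    std::chrono::steady_clock::time_point start_;
};

// Placeholder for the actions of interest: sums i*i for i in [0, n).
long long action_of_interest(int n) {
    long long sum = 0;
    for (int i = 0; i < n; ++i) sum += static_cast<long long>(i) * i;
    return sum;
}

// Runs the workload once inside a profiled region and returns the
// wall-clock seconds spent in it; the workload result goes to *result.
double run_once(int n, long long* result) {
    assert(result != nullptr);
    double seconds = 0.0;
    {
        ProfiledRegion region(&seconds);
        *result = action_of_interest(n);
    }
    return seconds;
}
```

Comparing the wall-clock time from `run_once` for both builds gives a number that does not depend on the profiler's symbol attribution, which may help tell a real slowdown apart from misattributed samples.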