I'm seeing very inaccurate results from Parallel Amplifier on a large program. As a reference, I'm using VTune in sampling mode on the CPU_CLK_UNHALTED.CORE counter. I also double-checked with internal timers inside the program to make sure the VTune numbers are in the right ballpark.
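To make the internal-timer cross-check concrete, here is a minimal sketch of that kind of instrumentation. It is not the actual code from the program: `RegionTimers`, `ScopedTimer`, and the region names are hypothetical, and it measures wall-clock time, which only approximates CPU time when the timed region is CPU-bound and runs on a single thread.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <string>
#include <utility>

// Hypothetical helper: accumulates elapsed seconds per named region
// (for example, one region per DLL entry point being cross-checked).
class RegionTimers {
public:
    void add(const std::string& region, double seconds) {
        assert(seconds >= 0.0);
        totals_[region] += seconds;
    }

    // Share of total_seconds spent in the region, in percent.
    // Returns 0 for an unknown region or a non-positive total.
    double percent(const std::string& region, double total_seconds) const {
        auto it = totals_.find(region);
        if (it == totals_.end() || total_seconds <= 0.0) {
            return 0.0;
        }
        return 100.0 * it->second / total_seconds;
    }

private:
    std::map<std::string, double> totals_;
};

// RAII guard: measures wall-clock time from construction to destruction
// and adds it to the named region.
class ScopedTimer {
public:
    ScopedTimer(RegionTimers& timers, std::string region)
        : timers_(timers),
          region_(std::move(region)),
          start_(std::chrono::steady_clock::now()) {}

    ~ScopedTimer() {
        std::chrono::duration<double> elapsed =
            std::chrono::steady_clock::now() - start_;
        timers_.add(region_, elapsed.count());
    }

    ScopedTimer(const ScopedTimer&) = delete;
    ScopedTimer& operator=(const ScopedTimer&) = delete;

private:
    RegionTimers& timers_;
    std::string region_;
    std::chrono::steady_clock::time_point start_;
};
```

Placing a `ScopedTimer` at the top of the calls into a given DLL and dividing the accumulated time by the total run time gives a rough percentage to compare against the profiler's per-DLL figure.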
The program runs for about 30 minutes, both on its own and under the Amplifier. It is a 64-bit executable that consists of over 200 DLLs, compiled from a few million lines of code. The early parts of the program exercise one set of DLLs, the middle parts exercise another, and the end exercises yet another set. The hardware is a quad-core Xeon with 8 GB of RAM. The peak virtual memory (including all the code, not just data) is just under 5 GB.
With default Amplifier settings, I got completely inaccurate data. The top two DLLs marked by the Amplifier actually take less than 2% of the program's time. The size of the Amplifier data dir was 19 GB.
When I increased the maximum size of raw collector data to 1000 MB (from the default of 10 MB), the results came closer. The Amplifier reported 23% for the top DLL, while VTune reports 19%, which is close enough. However, another DLL where almost 25% of the time is spent (again, according to VTune) did not even show up on the Amplifier's list. The size of the Amplifier data dir went up to 20 GB.
The next thing I did was enable "Enable accurate CPU Time detection" while keeping the 1000 MB raw data limit. The numbers got better and more DLLs showed up. However, the DLL that is the top consumer according to VTune got only 2.5% in the Amplifier. The Amplifier data dir size was now 49 GB.
I should mention that VTune's data dir for sampling of two counters (without calibration) is 241 MB.
1. Having the maximum limit set too low seems to cut off data collection. How do we know when this happens? I couldn't find any indicator that the limit was reached.
2. Is there anything else I can do to improve the accuracy of the Amplifier?