Inaccurate results from Parallel Amplifier on a large program

Hello,

I'm seeing very inaccurate results from Parallel Amplifier on a large program. As the reference ("golden") results, I'm using VTune in sampling mode with the CPU_CLK_UNHALTED.CORE counter. I also double-checked with internal timers inside the program to make sure the VTune numbers are in the right ballpark.

SETUP
The program runs for about 30 minutes, both on its own and under the Amplifier. It is a 64-bit executable that consists of over 200 DLLs, compiled from a few million lines of code. The early parts of the program exercise one set of DLLs, the middle exercises another, and the end exercises yet another. The hardware is a quad-core Xeon with 8 GB of RAM. The peak virtual memory (including all the code, not just data) is just under 5 GB.

EXPERIMENT 1
With default Amplifier settings, I got completely inaccurate data. The top two DLLs marked by the Amplifier actually take less than 2% of the program's time. The size of the Amplifier data dir was 19 GB.

EXPERIMENT 2
When I increased the maximum size of raw collector data to 1000 MB (from the default of 10 MB), the results came closer. The top DLL was reported at 23%, while in VTune it takes 19%, so close enough. However, another DLL where almost 25% of the time is spent (again, according to VTune) did not even show up on the Amplifier's list. The size of the Amplifier dir went up to 20 GB.

EXPERIMENT 3
Next, I checked "Enable accurate CPU Time detection" and kept the 1000 MB raw data limit. The numbers got better and more DLLs showed up. However, the top consumer DLL according to VTune got only 2.5% in the Amplifier. The Amplifier dir size was now 49 GB.

I should mention that VTune's data dir for sampling of two counters (without calibration) is 241 MB.

QUESTIONS
1. Having the maximum limit set too low seems to cut off data collection. How do we know when this happens? I couldn't find any indicator that this limit was reached.

2. Is there anything else I can do to improve the accuracy of the Amplifier?
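
The cross-check used in the experiments above (comparing per-DLL time shares from the two tools) can be sketched roughly as follows. This is only an illustration: the module names are placeholders, the 5-point tolerance is an arbitrary assumption, and the percentages are the approximate ones quoted for Experiment 2.

```python
def compare_profiles(reference, candidate, tolerance=5.0):
    """List modules whose time share disagrees between two profiles.

    reference, candidate: dicts mapping module name -> percent of total time.
    tolerance: allowed absolute difference in percentage points (assumed value).
    """
    issues = []
    # Walk the reference profile from the hottest module down.
    for module, ref_pct in sorted(reference.items(), key=lambda kv: -kv[1]):
        cand_pct = candidate.get(module)
        if cand_pct is None:
            issues.append((module, ref_pct, None, "missing from candidate"))
        elif abs(ref_pct - cand_pct) > tolerance:
            issues.append((module, ref_pct, cand_pct, "differs by more than tolerance"))
    return issues


# Approximate shares from Experiment 2; "top.dll" and "hidden.dll" are placeholder names.
vtune = {"top.dll": 19.0, "hidden.dll": 25.0}
amplifier = {"top.dll": 23.0}

for row in compare_profiles(vtune, amplifier):
    print(row)
```

With these inputs only the module that is absent from the Amplifier list is reported, since 23% versus 19% falls inside the assumed tolerance.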

Peter Wang (Intel)'s picture

Hi Dmitry,

It seems that you ran a huge application under Parallel Amplifier. The inaccurate result of Experiment 1 makes sense: with a 10 MB limit on raw collector data per result directory, data collection is SUSPENDED once the 10 MB limit is reached.

Increasing the raw data limit to 1000 MB was a good change, so the data collector was suspended only a few times in Experiment 2.

Enabling "Enable accurate CPU time detection", which includes post-processing time, was also the right thing to do.

Have you used "Remove raw collector data after result finalization"? It removes the raw data and makes the result files smaller.

A1. The user doesn't know when the data collector is suspended because memory use reached the configured limit.

A2. You can use "Start data collection paused" and the resume button in the GUI, or use "Resume collection after n sec", to collect data only in the code region you are interested in. If the code region is a small piece, you can keep the 10 MB limit for raw data.

Regards, Peter

Thanks Peter.

Regarding A1: The Amplifier should somehow tell the user when the maximum raw data size is reached. Otherwise, the user will try to optimize the wrong thing, wasting his/her time.

Regarding A2: Does that mean there is no way to correctly profile a huge application in full? Often, the first step in optimizing is high-level profiling (like I tried to do here), and only then profiling the slowest code in detail. In this case, by hiding the most critical DLL, the Amplifier would lead me to optimize the wrong thing.

I am quite impressed with the low overhead and ease of use of the Parallel Amplifier. However, I need to know when the Amplifier reaches its limits and starts producing bogus data. Otherwise, how do I know when to trust the tool?

If you think something can be improved in the Amplifier, would you be interested in getting your hands on my program and giving it a try yourself? It's not open source so we'll have to establish an NDA but it shouldn't be too much work. I really like the idea of the Amplifier and want to help to make it better.

Peter Wang (Intel)'s picture

Hi Dmitry,

Yes. Most of the time the user doesn't know when she/he should resume or pause, since she/he doesn't know where the code is running...

I appreciate your opinions. It is actually hard for the user to select the maximum size of memory for raw collector data, because the user doesn't know which of 10 MB, 100 MB, or 1000 MB is proper for her/his program.

It would be helpful to have a mechanism like VTune Analyzer's "Calibration" to help the user set the maximum size of memory for raw collector data. In that case, Parallel Amplifier would run twice, the first time without collecting performance data...

Let's hear inputs from others.

Thanks again for comments.

Regards, Peter

vladimir-tsymbal (Intel)'s picture

Quoting - Dmitry

QUESTIONS
1. Having the maximum limit set too low seems to cut off data collection. How do we know when this happens? I couldn't find any indicator that this limit was reached.

2. Is there anything else I can do to improve the accuracy of the Amplifier?

1. Once the raw data size limit is reached, the Amplifier prints a message in the output window signaling that the collection was stopped. You can figure out the current size of the data by looking at the size of the data.0 directory in the results folder. The size of the whole result folder doesn't say anything.

2. Checking the "Enable the accurate CPU Time detection" switch is enough to achieve decent accuracy. But you have to be aware of the difference in data collection between VTune and the Amplifier. The Amplifier collects CPU time only, whereas VTune attributes waiting time (thread blocked on synchronization, IO call, etc.) to the module as well. Switching on the "Assign system function time to caller user function..." button mitigates the problem with IO, but probably not the others.
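
The distinction in point 2 between CPU time and waiting time can be illustrated with a small sketch. It only shows the general difference between elapsed time and CPU time for a blocked versus a busy piece of code; it makes no claim about how either tool measures internally, and the 0.2 second durations are arbitrary.

```python
import time


def measure(work):
    """Return (wall_seconds, cpu_seconds) spent in work()."""
    wall_start = time.perf_counter()   # elapsed (wall-clock) time
    cpu_start = time.process_time()    # CPU time of this process only
    work()
    return time.perf_counter() - wall_start, time.process_time() - cpu_start


def blocked():
    # Stands in for a thread waiting on synchronization or IO.
    time.sleep(0.2)


def busy():
    # Stands in for code that keeps the CPU occupied.
    deadline = time.perf_counter() + 0.2
    while time.perf_counter() < deadline:
        pass


for name, work in (("blocked", blocked), ("busy", busy)):
    wall, cpu = measure(work)
    print(f"{name}: wall={wall:.3f}s cpu={cpu:.3f}s")
```

A profile based on CPU time would attribute almost nothing to the blocked case, while one that includes waiting time would attribute the full elapsed time to it.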

-Vladimir
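
As a rough aid for point 1, the size of the data.0 directory could be checked with a script along these lines. This is a sketch under assumptions: the result directory path and the limit are supplied by the user, and how closely the directory size tracks the configured limit when collection stops is not confirmed anywhere in this thread.

```python
import os


def directory_size_bytes(path):
    """Sum the sizes of all files below path (0 if path does not exist)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass  # file vanished or is unreadable; skip it
    return total


def raw_data_usage(result_dir, limit_mb):
    """Return (used_mb, reached) for the data.0 subdirectory of result_dir.

    Only data.0 is measured, since the size of the whole result folder
    is said above not to be meaningful.
    """
    raw_dir = os.path.join(result_dir, "data.0")
    used_mb = directory_size_bytes(raw_dir) / (1024 * 1024)
    return used_mb, used_mb >= limit_mb
```

For example, `raw_data_usage(r"path_to\result_dir", limit_mb=1000)` would report how much raw data a run with the 1000 MB setting actually produced.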

1. Looking at the output window doesn't really work for me. My program is a command-line utility that spits out thousands of lines of output. A few lines from the Amplifier would easily get lost. Displaying a message when viewing the results would be best.

2. My program is not IO intensive and only a part of it is parallelized. Most of the time is spent hammering CPU and RAM. I already had "Assign system function time to caller..." turned on.

I did another experiment: I turned off "Assign system function time to caller..." and re-ran the Amplifier. Just by chance, I was able to catch this part in the output window:

repositorytpsstpsssrctpssruntoolwindowscswitch_collector.cpp:955 tpss::processEvent: Assertion 'tIt != g_thrd_map->end()' failed.

My program kept going afterwards but no results were presented by the Amplifier.

vladimir-tsymbal (Intel)'s picture

Quoting - Dmitry

1. Looking at the output window doesn't really work for me. My program is a command-line utility that spits out thousands of lines of output. A few lines from the Amplifier would easily get lost. Displaying a message when viewing the results would be best.

Amplifier's collector and viewer are different applications. Collection data size limitation is a function of the collector. So, it's what we have to deal with. You may just search for the message: "The data collection stopped since the data size limit of (N Mb) is reached. The application is running but no data is collected." in the output.
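
If the output has been saved to a text file, the search for that message can be automated. The sketch below assumes the output was captured to a file (how it gets captured is not covered here) and matches only the fixed beginning of the message quoted above, since the number in parentheses varies.

```python
# Fixed prefix of the message quoted above; the "(N Mb)" part varies per run.
STOP_MARKER = "The data collection stopped since the data size limit of"


def find_limit_messages(lines):
    """Return (line_number, text) for every line containing the stop message."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if STOP_MARKER in line
    ]


def scan_log(path):
    """Scan a captured output file; undecodable bytes are replaced, not fatal."""
    with open(path, encoding="utf-8", errors="replace") as log:
        return find_limit_messages(log)
```

An empty result from `scan_log("captured_output.txt")` (a hypothetical file name) would mean the message did not appear in the captured text.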

Quoting - Dmitry

2. My program is not IO intensive and only a part of it is parallelized. Most of the time is spent hammering CPU and RAM. I already had "Assign system function time to caller..." turned on.

I did another experiment: I turned off "Assign system function time to caller..." and re-ran the Amplifier. Just by chance, I was able to catch this part in the output window:

repositorytpsstpsssrctpssruntoolwindowscswitch_collector.cpp:955 tpss::processEvent: Assertion 'tIt != g_thrd_map->end()' failed.

My program kept going afterwards but no results were presented by the Amplifier.

Actually, you do not need to re-run the collection. The button is applied to the results, not to collection.

Please, let me know the Update and build number of your Amplifier installation (Menu->Help->About Microsoft Visual Studio->Intel Parallel Amplifier).

-Vladimir

I just noticed that on the run that saw the Amplifier crash, I had enabled extra debug messages in my application. These messages are quite numerous and there is a chance their generation is not thread-safe. It's a wild guess, but it might explain the Amplifier crash. The crash went away when I disabled the debug messages.

The Amplifier version I have is Update 1, build 67513.

vladimir-tsymbal (Intel)'s picture

OK, thanks. I hope this is fixed in Update 2, which is coming very soon.

-Vladimir

With Update 2, the inaccuracy is gone! Now VTune and Amplifier results (time allocation to each DLL) match to within 1%. Good job!

The crash with debug messages is still there but in a different form (and possibly different place). Now I got the option to submit a fancy crash report, so I did. I also put a note on what I saw in Update 1.

vladimir-tsymbal (Intel)'s picture

Thank you, Dmitry, for the report. We have received it and will investigate.
Do you happen to know the memory footprint for your application and VS during the collection?

-Vladimir

Without debug msgs, the peak VM of the app was 5,000 MB under the Amplifier (peak VM is calculated by the app by querying GlobalMemoryStatusEx() once in a while). It shouldn't be much different with debug messages.
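
For readers unfamiliar with the approach, periodic peak tracking via GlobalMemoryStatusEx() could look roughly like the sketch below. It is not the code from my application: the choice of fields (total minus available user-mode virtual address space) and the `PeakTracker` helper are assumptions made for illustration, and the call itself only works on Windows.

```python
import ctypes
import sys


class MEMORYSTATUSEX(ctypes.Structure):
    """Layout of the Win32 MEMORYSTATUSEX structure."""
    _fields_ = [
        ("dwLength", ctypes.c_uint32),
        ("dwMemoryLoad", ctypes.c_uint32),
        ("ullTotalPhys", ctypes.c_uint64),
        ("ullAvailPhys", ctypes.c_uint64),
        ("ullTotalPageFile", ctypes.c_uint64),
        ("ullAvailPageFile", ctypes.c_uint64),
        ("ullTotalVirtual", ctypes.c_uint64),
        ("ullAvailVirtual", ctypes.c_uint64),
        ("ullAvailExtendedVirtual", ctypes.c_uint64),
    ]


def used_virtual_bytes():
    """Virtual address space in use by the calling process (Windows only)."""
    if sys.platform != "win32":
        raise OSError("GlobalMemoryStatusEx is only available on Windows")
    status = MEMORYSTATUSEX()
    status.dwLength = ctypes.sizeof(MEMORYSTATUSEX)  # required before the call
    if not ctypes.windll.kernel32.GlobalMemoryStatusEx(ctypes.byref(status)):
        raise ctypes.WinError()
    return status.ullTotalVirtual - status.ullAvailVirtual


class PeakTracker:
    """Remembers the largest value seen across periodic samples."""

    def __init__(self, sampler=used_virtual_bytes):
        self._sampler = sampler
        self.peak = 0

    def sample(self):
        self.peak = max(self.peak, self._sampler())
        return self.peak
```

Because sampling happens only "once in a while", a short-lived spike between two samples would be missed, so the reported peak is a lower bound.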

As for devenv.exe, I don't know. If you really need it, I can find out by re-running the experiment and watching Task Manager.

vladimir-tsymbal (Intel)'s picture

Thanks Dmitry. I'd be more interested in the exact stage of the analysis at which the crash occurred with Update 2: data collection (while the application is running), trace-file reading, finalizing results, etc. (all those phases are indicated in the main window). Can the crash be reproduced consistently? Were there any messages in the output window?

-Vladimir

The problem is reproducible, even after I disabled all parallelism in my program (so it's a single process, single thread app). The problem occurs during data collection stage.

I can't tell if there were any messages in the output window. As soon as the crash happens, VS window disappears and the crash report window appears. If there is a way to redirect Amplifier output to a file (and make sure it's flushed all the time), I can try it again.

vladimir-tsymbal (Intel)'s picture

Well, this confuses me a bit. The Amplifier doesn't generate a crash report if the collector fails. You mentioned that the crash occurs during data collection (while your application is still running), right? Actually, there is nothing to crash at this stage except the collector itself. Unless you misidentified the exact stage, which would not be your fault, simply because you're naturally not familiar with the tool.

I have two things to propose.
1. If you can collect more crash reports and send them to us, that would be nice to have.
2. You can try to collect data from the command line and then try to open it from Visual Studio.
To run a command-line collection, just use the command:
path_to/ampl-runss.exe -r result_dir -- path_to/your_app.exe

It will save the results in result_dir.
Then open the *.ampl file from the result dir with Visual Studio.

-Vladimir
vladimir-tsymbal (Intel)'s picture

Quoting - Vladimir Tsymbal (Intel)
In order to run command line collection, just put the command:
path_to/ampl-runss.exe -r result_dir -- path_to/your_app.exe

To use the command-line collector and be able to open the results in the GUI, you need to use ampl-cl.exe (not ampl-runss.exe).
The command line would be something like this:

>path_to/ampl-cl.exe -collect hotspots --no-auto-finalize -r result_dir path_to/your_app.exe

For help on the command-line options, see:
>path_to/ampl-cl.exe -help

Sorry for the confusion.

-Vladimir
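
For a command-line run like the one above, the output can also be written to a file and flushed after every line, so that nothing is lost if something crashes. The following is a sketch only: the argument order simply mirrors the example command above, the paths are placeholders, and it assumes the collector writes its messages to stdout or stderr, which has not been confirmed here.

```python
import subprocess


def build_collect_command(ampl_cl, result_dir, app, app_args=()):
    """Assemble the argument list in the same order as the example command."""
    return [ampl_cl, "-collect", "hotspots", "--no-auto-finalize",
            "-r", result_dir, app, *app_args]


def run_and_log(command, log_path):
    """Run command, copying stdout and stderr line by line into log_path.

    The log is flushed after each line so it stays current even if the
    run ends abruptly. Returns the exit code of the command.
    """
    with open(log_path, "w", encoding="utf-8") as log, subprocess.Popen(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into the same stream
        text=True,
        errors="replace",
    ) as proc:
        for line in proc.stdout:
            log.write(line)
            log.flush()
    return proc.returncode
```

A call such as `run_and_log(build_collect_command("path_to/ampl-cl.exe", "result_dir", "path_to/your_app.exe"), "collector.log")` would then leave the combined output in collector.log (a hypothetical file name).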

When I tried the collection from the command line (using ampl-cl), my application ran to the end and the collector succeeded. I then successfully opened the results in Visual Studio. So the crash is specific to how the Amplifier interfaces with Visual Studio.

When I changed the bottom-up view to show functions grouped by module, Visual Studio went a bit crazy. It used all four CPU cores for almost 3 hours, kept a very large VM size the whole time (and it's a 32-bit exe, so such a large memory footprint is a bit dangerous) and went through lots and lots of data. The Task Manager snapshot taken when I killed devenv.exe showed 335 GB of I/O reads, while the total size of the results directory is only 82 MB. It looks like the Amplifier has a quadratic (or worse!) loop somewhere.

When I re-ran the collector from Visual Studio, I got the crash while my program was running. I am quite sure of it now. The crash came very soon, whereas the collection from the command line took almost an hour. When run from within Visual Studio, I did NOT see an ampl-cl process, only ampl-runss.

I forgot to mention: after each of the crashes, the monitored process keeps going. I always have to track it down with Task Manager and kill it. That's a bit annoying.
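
To put the I/O figure in perspective, here is the back-of-the-envelope arithmetic as a small sketch. It assumes that all of the 335 GB of reads went to the 82 MB results directory and that 1 GB is 1024 MB; neither assumption has been verified.

```python
def reread_factor(io_read_gb, result_size_mb, mb_per_gb=1024):
    """How many times over the result data would have been read, on average."""
    return io_read_gb * mb_per_gb / result_size_mb


factor = reread_factor(335, 82)
print(f"Result data read about {factor:,.0f} times over, on average")
```

Under those assumptions the results were read roughly four thousand times over, which is what prompted the guess about a quadratic loop.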

vladimir-tsymbal (Intel)'s picture

Hi Dmitry,

This is quite new behavior of the tool that we have never observed before. Thanks a lot for your investigation. We would appreciate it if you continued sending crash reports to us.

-Vladimir

Would you be interested in my program? This way you can debug the crash and the scalability problem directly.

If you do, contact me privately at "ddenisen at altera dot com". We'll need to establish an NDA but that should not be a big problem.
