Code Coverage Performance in Intel® Parallel Composer

Hi All, first time blogging on ISN.  I want to get right into this, so I'll save formal introductions for another time, but there should be a short bio on my profile if you're interested.

I was recently working with a customer who was playing around with Intel® Parallel Composer's code coverage feature. I can't publish his code here for obvious reasons, but it was a fairly simple test case designed to identify prime numbers within a sequential range. When he built this code with /Qcov-gen enabled to create an application instrumented for code coverage profiling, the application runtime went from around 7 seconds to around 65 seconds. He naturally wondered whether this was a problem or whether he was doing something wrong. It turned out to be a really interesting question, one that highlights some of the differences between code coverage and profile-guided optimization, and also shows how a quick run of Intel® Parallel Amplifier can reveal the source of a problem.

First is the question of what performance should be expected of an application instrumented for code coverage or for generating Profile-Guided Optimization data (PGO - available with the /Qprof-gen option in the Intel® C++ Compiler Professional Edition). We don't have explicit internal performance goals here, and until this customer we hadn't received any feedback suggesting it was an issue. Performance varies widely - I've seen cases with almost no overhead and others with up to a 5x increase in runtime. Some increase is to be expected as a rule, for two reasons:

1. Instrumenting the application to output profiling data naturally adds time: the application performs extra I/O as it writes its runtime data to a file while it runs.

2. The compiler disables many optimizations when creating the instrumented application. Inlining, for example, must be completely disabled if you want an accurate picture of which functions are being called, whether for code coverage or for profile-guided optimizations.

Now, the runtime of these instrumented applications is usually far less important than the runtime of the optimized application, since the instrumented build is used in one-off situations for programmer/build reasons (either once to collect code coverage data, or periodically to generate profiling data for PGO). Even so, there is likely some threshold at which the runtime of the instrumented application becomes inconvenient and eventually untenable. If anyone has specific feedback here, comments would be welcome.

Since this customer was interested in this, I decided to dig deeper. I ran Intel® Parallel Amplifier on the instrumented application, and immediately saw the following:

Screenshot of Intel® Parallel Amplifier showing _PGOPTI_Prof_Div_VP as a hotspot

As you can see, most of the time was being spent in a function called _PGOPTI_Prof_Div_VP. It turns out that this particular routine comes from our profiling runtime library, libipgo.lib, and handles emitting value-profiling information on divides. Value profiling is where the compiler tries to identify variables whose values appear to be constant, or confined to a certain range, across most runs. When PGO is enabled, the compiler can use that information to generate better code involving those variables. For code coverage, however, this information is unnecessary. Here's where I get into a little Intel compiler history lesson.

A long time ago the Intel compiler offered PGO as a feature. This was enabled with /Qprof_gen (for generating the instrumented application) and then /Qprof_use (for optimizing with the profiling data). Soon after, /Qprof_genx was introduced to enable code coverage: it was /Qprof_gen plus extra functionality to create some of the static data used for code coverage (in .spi files). This remained the status quo for several compiler versions. In the Intel® C++ Compiler 11.0 Professional Edition, /Qprof_genx was deprecated and a new implementation of /Qprof-gen was created to add some different ways of optimizing with PGO data. /Qprof-gen:srcpos became the option for code coverage; however, it still performs all of the PGO-related work that /Qprof-gen does.

With the release of Intel® Parallel Composer, a new option was introduced: /Qcov-gen. This was necessary because Intel Parallel Composer was not going to offer PGO, but we still wanted to offer code coverage functionality. Again, under the hood, /Qcov-gen is really just /Qprof-gen:srcpos. However, now that code coverage has its own option, independent of any expectation that the profiling data will be used for PGO, we have the opportunity to remove any functionality along those lines that doesn't affect code coverage. So I have filed a feature request to disable value profiling under /Qcov-gen, which should dramatically reduce the runtime of some instrumented applications - good news for code coverage users who don't use PGO.

For more complete information about compiler optimizations, see our Optimization Notice.