Performance figures differ massively between loaded vs unloaded machine

Hi

I have been evaluating the latest Intel C++ compiler with a view to improving our COM business objects.

After having successfully built all the necessary DLLs I proceeded to performance test them in comparison
to the Microsoft compiled versions. The tests involved my own benchmarking code as well as running the application
through an evaluation of Intel VTune Analyzer.

Both benchmark tests on my development machine (which is loaded down with the usual corporate AV / big brother apps / Outlook / etc.) showed between a 15% and 25% improvement in performance.

However, when I then did the same test on a vanilla machine (i.e. a clean Windows install, of the same chipset / processor flavour) I saw no improvement whatsoever.

Looking at the VTune results, I noticed on my dev machine the Clockticks % was around 50% (i.e. a lot of other apps were contending for time slices), whereas on the plain vanilla machine the Clockticks % was always >=95%.

My only conclusion was that on a heavily loaded machine the Intel-generated machine code is more efficient (thus quicker). One more strange issue: the VTune Instructions retired numbers for the Microsoft and the Intel compiled code were the SAME on the vanilla machine; I was under the impression that the two compilers would generate different enough instructions / opcodes to show a marked difference in the instructions retired figures.

Help me resolve all this confusion :-)

Stuart

Quoting - stuart.gillibrandaah.co.uk

Stuart,

My guess is you are building one set of DLLs but testing with a different set of DLLs (hence the same number of instructions retired).

Jim Dempsey

www.quickthreadprogramming.com

Jim,

I've double checked everything and I'm definitely using the correct DLLs. Would I be better off adjusting the sampling settings - i.e. using time-based instead of event-based sampling?

Generally speaking though, is it not feasible to assume that Intel-generated code may show more of a performance improvement on a heavily loaded machine? For example, say a 20% increase on a busy machine versus a 2% increase on a machine that's generally doing nothing?

Stuart


Quoting - stuart.gillibrandaah.co.uk

Generally speaking though, is it not feasible to assume that Intel-generated code may show more of a performance improvement on a heavily loaded machine? For example, say a 20% increase on a busy machine versus a 2% increase on a machine that's generally doing nothing?

It's possible you got lucky with a compiler optimization reducing the memory or cache demand of your application, such that it competes better with other applications. This would come down to details for which no evidence has been presented. No doubt the software salesmen would be quick to take credit.

That (reduced memory/cache demand) sounds reasonable.

I decided to increase the sampling frequency in VTune and got some more definitive results. I compared MS built DLLs (with /O2 and Full Program Optimisations) with Intel built (with /O3 SSE3 NO Parallel).

I performed each run twice and found a 0.2-0.5% variance in my results (between the same compiler flavour) so I'm confident the results are accurate enough to use for comparison.

There are 4 DLLs in total and the MS DLLs returned:-

#1
[20019120 tick events]
[12484800 inst. retired events]
1.6 CPI

#2
[53839300 tick events]
[34593300 inst. retired events]
1.56 CPI

#3
[129365980 tick events]
[77813250 inst. retired events]
1.66 CPI

#4
[153252430 tick events]
[84619200 inst. retired events]
1.81 CPI

---

Intel:-

#1
[39052450 tick events]
[27440550 inst. retired events]
1.42 CPI

#2
[94332520 tick events]
[66412200 inst. retired events]
1.42 CPI

#3
[202769420 tick events]
[137983050 inst. retired events]
1.47 CPI

#4
[325234870 tick events]
[209337150 inst. retired events]
1.55 CPI

---

As can be seen, the Intel code retires roughly twice as many instructions (although it executes them more efficiently as far as clock cycles go). I'm now wondering why there is such a big difference in instruction count. FYI the machine is a Core 2 E6300. Is this a multi-core effect?


The next question: can you characterize the additional events generated by the Intel compiler build? For example, if there were more cache misses, but fewer (or no more) cache line misses retired, it might indicate that the Intel build is successful in initiating more cache line fills in advance of when they are consumed, but using the cache lines more effectively once they are fetched.

When using timer based sampling on loaded system

Use a system level view of the timer ticks to get the ratio of time spent in your application vs non-application.
Then pro-rate the run-to-run figures for your application.

If your test run is sufficiently long (several seconds) then the total tick counts for (various routines in) your application should remain consistent from run to run, but the percentages may vary.

Jim Dempsey

www.quickthreadprogramming.com

Thanks; I've added L2_LINES_IN and MEM_LOAD_RETIRED.L2_LINE_MISS to my activity run.

I'll post the results up later.


Thanks Jim, I'm using event based sampling at the moment and when I said "increased the frequency" I was referring to lowering the SAV (sample-after value).

If I start using time based sampling to further diagnose my issue I'll make sure I pro-rata the results to align any discrepancies.

Stuart


Stuart,

You should generally make your profile runs on an un-loaded system as other app activity will contaminate your cache.

This said, you should also make additional runs with a loaded system to simulate what to expect under normal operating conditions.

Of particular interest is catching:

When program runs well on un-loaded system while running disproportionately poorly on loaded system.
(e.g. 50% load by other app(s) cause more than 2x slowdown in your app).

The typical culprit is the case where a loop is divided up evenly amongst the application threads, but due to loading on the system by other applications one or more threads of the application get pre-empted (have to share time with the other app), and thus the remaining threads encounter stall time.

This can be mitigated to some extent by using chunk sizes and/or non-static scheduling. But these methods induce additional overhead when running un-loaded. Win some, lose some.

Another cause arises when your app under development is multi-threaded and the other applications inducing the load are also multi-threaded, but neither programmer is willing to be nice to the other programs on the system, and each sets a relatively long block time (the time a thread is permitted to spin while waiting for the next thing to do). When multiple applications do this on a system you can get excessive wasted time. E.g. in an 8-threaded loop performing static scheduling, one thread getting delayed 10ms will cause the seven other threads to be delayed 10ms each (wasting 70ms of processing capacity). When the compute time of the loop is relatively short, and both apps use a 200ms block time, it is conceivable that in the worst case each app could begin the loop only every 400ms (less the time to compute the loop). So while a long block time may help a single multi-threaded app run faster, it may severely degrade performance when multiple multi-threaded apps use block times.

This should be justification for making the setting of the block time a dynamic feature (instead of once only at startup).

Jim Dempsey

www.quickthreadprogramming.com

Here are the results; I've concentrated on just one DLL this time (#4 from the previous example):-

Intel DLL:

clocktick events = 320002600
instructions retired = 208296750
cpi = 1.5
MEM_LOAD_RETIRED.L2_LINE_MISS events = 133062
L2_LINES_IN.SELF.ANY events = 1263426
L2 Cache Miss Rate = 0.006

MS DLL:

clocktick events = 156513120
instructions retired = 81974850
cpi = 1.9
MEM_LOAD_RETIRED.L2_LINE_MISS events = 94335
L2_LINES_IN.SELF.ANY events = 938690
L2 Cache Miss Rate = 0.011

---

I'm still learning how to fully interpret this data but based upon your (Tim's) statement:

"it might indicate that the Intel build is successful in initiating more cache line fills in advance of when they are consumed, but using the cache lines more effectively once they are fetched"

I'd say the Intel DLL is indeed filling more cache lines, but it doesn't look as though it's using them more effectively - unless of course I'm monitoring the wrong events.

Thanks once again for your insight.

Stuart


Intel clock tick events 320002600
MS clock tick events 156513120
Ratio ~2.045:1
Intel L2 miss 133062 / 2.045 = 65067
MS L2 miss 94335

(this is reflected in the L2 cache miss rate figures)

What are the wall clock times for the same test app linked with the respective DLLs?
Are the test parameters the same for each?

Jim

www.quickthreadprogramming.com

Quoting - stuart.gillibrandaah.co.uk

I'm a little surprised by this interesting result. I thought you indicated that the Intel build retired more instructions in a shorter time, which is not supported by these figures.
I have struggled myself with many cases where the compiler used too many instructions to accomplish the job, resulting in a low CPI value in spite of increased clocks, such as you show with the latest figures. It's usually evident when comparing asm views what makes the difference.
I didn't say it explicitly, but I'm wondering if the Intel build is generating more total cache miss events (more misses per cache line), and those may account for additional instructions.

Tim / Jim,

The wall time is practically the same (within 1%) for each run; I could pro-rata the figures but it would make a marginal difference and obviously wouldn't affect the ratios.

FYI the actual image (DLL) sizes are MUCH larger for the Intel build, probably proportional to the increase in instruction count, so as you say Tim maybe I need to look at the generated ASM. I initially put that down to aggressive inlining rather than anything else, but maybe not.

To clarify, I've been comparing the two builds on the same unloaded machine recently; as such the actual "user experience" of performing the test runs shows no time difference (and they BOTH have the same input / usage parameters). However, as stated earlier, running the same scenario on a much more loaded machine shows the Intel images do on average have a 20% improvement (which is what you may be referring to, Tim).

I am going to try a time based sample run later today and pro-rata the figures based on system wide time to see what the sample numbers say, although I'm expecting the same results.

Are there any other events or ratios that I can monitor that might help me further with this?

Stuart

Stuart,

I ran into an interesting problem (from my viewpoint). I have a static .LIB of four source files (one is stdafx.cpp). The header files contain templates. When I compiled and linked for a Debug build the resultant .lib file was ~500KB. When compiled for a Release build the resultant .lib file was 7.5MB. A rather large explosion in footprint.

At some point in time, for reasons unknown, I was watching the Output from Build window (VS 2005) and noticed the compiler would produce its summary output, then Link would start, then the compiler ran again!

As it turned out, what was happening was Inter-Procedural Optimizations (IPO) were causing

compile -> link -> recompile -> relink ->...

When turning off IPO (in the linker, and I turned it off in C++ too for kicks), the compile occurs once and I now produce a ~500KB release version.

Also, this bypassed the generation of bad code with IPO enabled (previously corrected with
#pragma intel optimization_level 0)

Try disabling IPO in both the C++ and Link property sheets. If that does wonders, then try re-enabling it on C++ alone.

Jim Dempsey

www.quickthreadprogramming.com
