C++ Compiler performance

I have been benchmarking my core applications with successive generations of icc: 5.0, 6.0, 7.0, 7.1, 8.0. I found 5.0 was very fast in comparison to the gcc compilers of the time. However, while the performance of gcc has improved with each generation (more or less), the performance of the Intel compiler has degraded with every generation. At first the drop was small but noticeable: about 10% going from 5 to 6 and another 10% from 6 to 7. But with 8 the performance drop is much larger, so code compiled with 8 is barely half as fast as code compiled with 5 and is much slower than gcc (v3.2).

I must admit I haven't played around an awful lot with compiler flags. Typically I would use -O3 -tpp7 -xW. All my experience with icc indicates the flags don't make much difference on my codes, so -O is about as good. These are numerically intensive but quite straightforward fluid dynamics codes, without much fancy programming, so I haven't found the IPO optimizations help much. I make comparisons using the same flags, kernel version and so on. The differences are so large that I wonder if anyone else is seeing the same things, or has any idea why the compiler's performance should be so much worse today than it was three years ago.

You've given part of the answer yourself. icc -O3 is not intended for rapid compilation, and it's not comparable with gcc -O3. gcc -O3 performs only limited in-lining, in the forward direction only, and no vectorization. If you wanted run-time performance with gcc, you would at least turn on -funroll-loops, and probably also -ffast-math. gcc -Os -ffast-math -funroll-loops -march=pentium4 -mfpmath=sse might reasonably be compared with icc -O1 -xW; according to the documentation, they aim for similar levels of optimization.

Sorry I was not more specific: I meant run-time performance. Some of our applications run for hundreds of hours on multiple CPUs, so compile-time performance is completely irrelevant to us.

Let me add a few details. I have played around a bit with compiler options. Here are some run-time benchmarks on a 3.0 GHz P4 / 800 MHz FSB for different compilers, using the best flag sets I have found in each case.

gcc 3.2    14.8 secs
icc 5.0    14.6 secs
icc 6.0    15.2 secs
icc 7.1    15.6 secs
icc 8.0    25.8 secs

For reference, the flags were:

gcc 3.2 -O -ffast-math -funroll-loops -fomit-frame-pointer

icc default (No flags)

-Ox, -tpp7, -xW, etc. either made no difference or slowed the code down.

Message Edited by tjl on 01-11-2004 11:49 AM

I am experiencing the same problems here. I recently acquired the Intel C compiler 8 for Windows to compile a very floating-point-intensive application.

The application is a real-time physics engine. Most of my math operations are long dot products of two vectors. These vectors are aligned to 16-byte boundaries and contiguous in memory.
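For illustration, the kind of loop I mean looks roughly like this (a simplified sketch, not my actual code; the names are made up, and it assumes the vector length is a multiple of 4):

#include <xmmintrin.h>

// dot product of two 16-byte aligned, contiguous float vectors using SSE intrinsics
float DotProduct (const float* a, const float* b, int count)
{
    __m128 acc = _mm_setzero_ps ();
    for (int i = 0; i < count; i += 4) {
        // aligned loads, multiply, accumulate four partial sums
        acc = _mm_add_ps (acc, _mm_mul_ps (_mm_load_ps (a + i), _mm_load_ps (b + i)));
    }
    // add the four partial sums together
    float tmp[4];
    _mm_storeu_ps (tmp, acc);
    return tmp[0] + tmp[1] + tmp[2] + tmp[3];
}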

When I compile my application with Intel 8.041 and compare the result to Visual Studio .NET, I always get a net loss in performance. I have tried all kinds of optimization flags for Intel C, and all I get is a larger but slower executable compared to VS.NET.

My only hope now is that I can use the multithreading feature of the Intel C compiler, but as it stands now I have wasted my investment.

Julio Jerez

Thanks

Message Edited by JulioJerez on 01-16-2004 05:35 PM

Sorry to hear you are having similar experiences. In the Linux world the solution is now simple: the latest version of gcc (3.2) is about as fast on my code as any Intel compiler and much faster than 8.0. Other codes in our department are reportedly very fast with 8.0, so I guess it's hard to figure out in advance.

Have you tried dumping the assembly for each version to see what the offending file might be?

One of the compiler flags is "ipo", which does not produce an asm output.

I have compared the output of the Intel compiler to the Microsoft one. This is what I get:

- Intel will significantly slow down code that uses SSE intrinsics. This is because it tries to vectorize code that is already vectorized and therefore ends up adding an extra small loop; this results in a loss because of the extra loop for alignment.

- For code that does not have SSE intrinsics, the Intel compiler fails many times to recognize that the loop can be vectorized, and the result is code very similar to MS.

- Intel replaces conditional compares with some specialized SSE tests (roughly the branchless pattern sketched below), but this does not translate into a measurable benefit in the overall performance.
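For illustration, by "specialized SSE tests" I mean output that looks roughly like a branchless select; this is my own simplified reconstruction, not the compiler's actual output:

#include <xmmintrin.h>

// scalar form:  r = (a > 0.0f) ? x : y;  done for four floats at a time
__m128 BranchlessSelect (__m128 a, __m128 x, __m128 y)
{
    __m128 mask = _mm_cmpgt_ps (a, _mm_setzero_ps ());   // all-ones lanes where a > 0
    return _mm_or_ps (_mm_and_ps (mask, x),              // take x where the mask is set
                      _mm_andnot_ps (mask, y));          // take y elsewhere
}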

Message Edited by JulioJerez on 02-02-2004 05:14 PM

Hi,

If you are using -ipo, you can use the -ipo_S option to generate a multi-file assembly file (ipo_out.s).

If you are seeing a performance degradation with the compiler, I'd recommend submitting a test case (preferably a small test program that reproduces the issue, if possible; small test cases make it much easier to isolate and root-cause the issue) at http://premier.intel.com

John O'Neill

Hello,

We have been doing performance analysis on various open-source applications (MySQL, Xalan/Xerces) and have found that the performance of icc has always been better than gcc, or at least equal.

We were surprised to hear that the performance has degraded while going from version 7 to version 8. We would definitely like to work with you and identify these performance bottlenecks.

Could you try out profiling tools like VTune to identify hotspots?

regards,
Hrishi

Not sure if this is of any use, but Intel has a Quick Reference Guide to Optimization with Intel compilers on their web site: A Step-by-Step Approach to Application Tuning with Intel Compilers. See http://www.intel.com/software/products/compilers/docs/qr_guide.htm for details.

Christian - Intel Corp.

Unfortunately I am seeing the same issue with Compiler 8.0. Our application is a sci/eng application with heavy use of C++. Compiled with /fast and /QaxN, the Intel-compiled version runs most of our test cases about 20% slower than the VC++6 version on a 2.0 GHz P4. At best, it will equal VC++6 in runtime performance.

I'm trying to deal with what I assume is a related issue.

The program I have been trying to performance-tune ran 10 times slower with 8.0 than with 7.1. Despite some very incorrect information from VTune, I tracked the majority of the problem to something happening in the exp() function.

In 8.0 exp() dispatches to one of several different methods of computing the result. One of those methods is far slower than the others. Having stepped through the asm code, I can't tell exactly which input ranges go to which of the methods, but I used breakpoints to look at the result value of the slow method many times and all those results were less than 8e-275.

In my application, accuracy in that range of results from exp() is not necessary. I put a test around the most significant call to exp(), such that for small inputs I just use 0 instead of calling exp(). The result ran roughly as fast as the 7.1 version had.
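Roughly what the workaround looks like (a simplified sketch; the function name and the cutoff value here are just illustrative, chosen so that anything below the cutoff would give a result far smaller than 8e-275, which we treat as 0):

#include <cmath>

// simplified sketch of the workaround: skip the library exp() for very
// negative inputs, where the true result is tiny and 0 is good enough for us
inline double exp_or_zero (double x)
{
    // -650 is an illustrative cutoff: exp(-650) is well below 8e-275
    return (x < -650.0) ? 0.0 : std::exp (x);
}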

In a separate (accidental) test, I ran the exe compiled with 7.1 using the 8.0 version of libmmd.dll. The results were correct, but took slightly more time than the 8.0 exe (more than 10 times as long as the 7.1 exe with the 7.1 libmmd.dll).

I haven't yet done all the detailed testing to see what's really happening here. (I'm posting in case someone else has already done it and can save me the trouble.)
1) The rough results indicate there are more performance problems in the 8.0 libmmd.dll than the ones I sidestepped by the conditional around the exp() calls. I haven't identified those, nor checked the 8.0 exe with the 7.1 libmmd.dll (I have checked using the statically linked 8.0 exp() instead of the DLL, and it doesn't help).
2) I assume the slow code was added to exp() for a reason (such as fixing an accuracy problem in the faster version). I haven't stepped through the 7.1 version, nor checked whether it has an accuracy problem in the range of inputs for which 8.0 has the severe performance problem.

To whom it may concern,

I read several postings on this topic. My experiences with versions 7 and 8 of the Intel C/C++ Compiler for Linux regarding performance are quite different: icc is faster than gcc (I compared gcc 3.3.3 to icc 8.0.055).
The main problem is that you have to check your optimization switches. It is not always best just to use the maximum optimization switch -O3 or to target the app at a specific platform, e.g. -x[c]. Furthermore, IPO needs some special treatment in the Makefile(s), and do not forget about PGO (profile-guided optimization).
In my view, the main benefit of the Intel compiler is the PGO feature. After debugging the app and before packaging a production version, it is worthwhile to think about PGO instrumentation. If you 'help' the compiler to generate fast code, it will do so.
My hint is: give PGO a try. Perform the three (or more) step compilation scheme proposed by the Intel documentation (sketched below), i.e. compile with PGO instrumentation, run the app with different input params, and perform a feedback compilation using the dynamic profile data collected by the instrumented code.
Furthermore, you can use the dynamic profile data with the 'codecov' tool to create beautiful HTML pages, which can give you valuable hints on where to optimize your code at the source level.
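For reference, the three-step scheme looks roughly like this (a sketch from memory of the icc docs; 'myapp' is just a placeholder, and you should check your compiler version for the exact switch spellings):

icc -prof_gen myapp.c -o myapp       (build with PGO instrumentation)
./myapp input1                       (run on representative inputs; this writes .dyn profile files)
./myapp input2
icc -prof_use -O3 myapp.c -o myapp   (feedback build using the collected profile data)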

With kind regards.

I have only had time to try the 8.0 compiled exe with the 7.1 math library. It runs a tiny bit faster than the 7.1 exe with the 7.1 library, which is MUCH faster than the 8.0 exe with the 8.0 library.

So for my application, at least, the entire reason 8.0 is so much slower than 7.1 is in the math library (libmmd.dll).

I haven't had time to compare numeric results nor step through the code to understand the difference between the two math libraries. I'm still hoping someone reading this knows that answer and will tell me.

As for the above suggestion about PGO, that would require major changes to the way my application is built. I hope to try it some time, but that isn't an easy experiment.

Also, PGO couldn't possibly solve the problem that version 8.0 is slower than 7.1. Just the increase in time spent inside exp() in 8.0 vs. 7.1 is much larger than the entire execution time of the application (including exp(), of course) in 7.1. No matter how much better code the 8.0 compiler might generate with PGO (something I doubt anyway, but do want to try) it can't compensate for even a small fraction of the amount that exp() is slower.
