icl vs. msvc9 frustration

pvonkaenel
July 10, 2009 7:37 AM PDT
#2 Reply to #1
Quoting - tim18
You conceal one of the most important pieces of information from us, your typical loop lengths.  If you are concealing that also from the compiler, it can easily make bad decisions.  The standard compiler assumption, where no clear information is present in source code, is of a loop length 100 (with code suitable also for several times that length).  Many of the ICL loop optimizations aren't suitable for shorter loop lengths, while VC9 doesn't bother optimizing for longer loop lengths.  How's that for a generalization almost as flagrant as yours?
_intel_new_memset() and _intel_new_memcpy() contain branches to optimize several different cases of CPUs, alignments, and loop lengths.  If you were to write all those cases into your source code, you would likely lose instruction cache locality, and lose time with all the selections if your loops are never long enough to require long loop optimizations.
It's dead simple to write artificial cases where these special functions will beat VC9 by a big margin, but those cases may be nothing like your application. 
ICL should avoid the automatic memset and memcpy substitutions when you place multiple array moves in a loop.
It's easily possible for VC9 to match ICL performance when there is no benefit from vectorization.  You have available ICL flags to match your VC9 flags; ICL /O1 /fp:source might come close to matching CL /O2 /fast.
The occasional better optimization of VC9 for loops with opportunities for loop carried scalar replacement may sometimes be matched by writing scalar replacements into your source code.
VC9 observes parentheses faithfully, while ICL treats them K&R fashion unless you set options such as /fp:source.  If you don't rely on the compiler performing algebraic simplification across parentheses, in violation of language standards, the VC9 treatment is superior.
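
The suggestion above about writing scalar replacements into the source can be sketched roughly as follows. This is only an illustration: the function names and the prefix-sum loop are hypothetical and do not come from the application under discussion. The first version reloads a[i - 1] from memory on every iteration; the second carries that value across iterations in a local variable, which is what a compiler's loop-carried scalar replacement would do automatically.

```c
#include <stddef.h>

/* Baseline: every iteration reloads a[i - 1], the value the
   previous iteration has just stored. */
void prefix_sum_reload(int *a, const int *b, size_t n)
{
    size_t i;
    for (i = 1; i < n; ++i)
        a[i] = a[i - 1] + b[i];
}

/* Hand-written scalar replacement: the loop-carried value lives in
   a local variable, so the loop only loads b[i] and stores a[i]. */
void prefix_sum_scalar(int *a, const int *b, size_t n)
{
    size_t i;
    int carried;

    if (n == 0)
        return;
    carried = a[0];
    for (i = 1; i < n; ++i) {
        carried += b[i];
        a[i] = carried;
    }
}
```

Both functions produce the same array contents; whether the second form is actually faster depends on the compiler and flags, so it is worth timing both.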

Hi Tim and thanks for your input,

I was not trying to conceal loop lengths (I actually did not know they were that important).  There are lots of short loops in the code, and a few large ones.  Also, I do not think your statement about VC9 not bothering to optimize long loops is flagrantly general: I find it quite helpful.  I think these statements of yours may be the missing pieces I was asking about, and they will start me on a new round of testing.  If I can get back to the original performance with ICL, I would be inclined to use it due to all the available compiler options I can play with.  I will try /O1, but since there is very little floating point, I think I will skip /fp:source on the first go-around.  I still have a few questions if you do not mind:

1) If I use /O1, will the compiler still vectorize?
2) Is there a flag to disable the use of _intel_new_memset() and _intel_new_memcpy()?  I think their use is skewing my results and making it more difficult for me to compare timings.
3) If /O1 does disable vectorization, can I re-enable it on a per loop basis using the pragma?
4) My main hotspots have a lot of short loops in them, but they are called many times.  Why will vectorizing not help on these short loops (if in fact that is what you meant by "Many of the ICL loop optimizations aren't suitable for shorter loop lengths")?
5) What type of speedups do you tend to see with ICL over VC9?
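
To make question 3 concrete, this is the kind of per-loop control I have in mind, as a minimal sketch.  The function names are made up for illustration.  It uses the Intel compiler's "vector always" and "novector" pragmas, guarded so that other compilers simply skip them.  Whether "vector always" still takes effect under /O1 is exactly what I am asking, so the sketch does not assume an answer.

```c
#include <stddef.h>

/* Asks the Intel compiler to vectorize this loop even if its own
   heuristics judge that vectorization is not worthwhile. */
void add_vec_hint(short *dst, const short *src, size_t n)
{
    size_t i;
#ifdef __INTEL_COMPILER
#pragma vector always
#endif
    for (i = 0; i < n; ++i)
        dst[i] = (short)(dst[i] + src[i]);
}

/* Asks the Intel compiler to leave this loop scalar, for example
   when the trip count is known to be very small. */
void add_no_vec(short *dst, const short *src, size_t n)
{
    size_t i;
#ifdef __INTEL_COMPILER
#pragma novector
#endif
    for (i = 0; i < n; ++i)
        dst[i] = (short)(dst[i] + src[i]);
}
```

Both functions compute the same result; only the hint given to the compiler differs, so timing them against each other on short and long arrays should show where vectorizing starts to pay off.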

Thanks again,
Peter
