pvonkaenel
July 10, 2009 5:14 AM PDT
icl vs. msvc9 frustration

Hi all,

Let's start with a little history.  Many years ago I needed to speed up a computer vision application which was being compiled with VC6.  I downloaded the Intel C++ eval, and after playing with compiler options for a couple of hours was able to speed up the application by about 30%.  I bought the tool, moved a lot of the code into a library, and used it for years.

About 2 years ago I tried recompiling that same library with VC9 and found the VC9 version was slightly faster than the ICL version.  I played with the settings for about a day, but in the end switched the library to VC9.

Since then, on completely different code bases, I have tried converting my VC9 projects to ICL, but after seeing either no gain or a 1%-2% performance drop, I switched back to VC9 each time.

Well, over the last few days I decided to try again on yet another new code base and really spend some time on it.  I've followed the steps in the ICL manual for how to go about optimizing an application, and have also followed suggestions I've read in "The Software Optimization Cookbook".  I've done my own timings, and used VTune to check the performance.

According to VTune, some of the most expensive routines are slightly faster, but the routine at the top of the VTune list is slightly slower.  I had tried to optimize this routine with SSE3 intrinsics without luck myself, so I checked the vectorization report and found two of the main loops were being vectorized, so I put "#pragma novector" in front of them, and that routine went back to roughly the same time as when compiled with VC9.  OK, the ICL version is still slightly slower than the VC9 version unless I turn on IPO.  With IPO enabled, the ICL version is about 1% faster (about a 1.5% gain), but at the expense of a several minute link time instead of a couple of seconds.
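A minimal sketch of the per-loop opt-out described above. `#pragma novector` is the Intel compiler pragma named in the post; the function `clamp_row` and its data are hypothetical stand-ins, not code from the project under discussion. The pragma is guarded so that other compilers simply skip it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical short image loop: clamp every pixel in a row to a ceiling.
   The name and signature are illustrative only. */
void clamp_row(unsigned char *row, size_t n, unsigned char ceiling)
{
    size_t i;

    assert(row != NULL || n == 0);

    /* Ask the Intel compiler to leave this one loop scalar.
       The guard keeps other compilers from seeing an unknown pragma. */
#ifdef __INTEL_COMPILER
#pragma novector
#endif
    for (i = 0; i < n; ++i) {
        if (row[i] > ceiling)
            row[i] = ceiling;
    }
}
```

The pragma applies only to the loop that immediately follows it, so every loop that should stay scalar needs its own copy.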

One thing I noticed in the VTune output is that _intel_new_memset is now the hottest routine, and _intel_new_memcpy is not far behind.  There are several places in the code where memset and memcpy are used, but I'm finding it difficult to compare performance per routine with these two large hotspots.  I have the following questions:

1) What am I doing wrong?  I must be missing something to be having this much trouble getting ICL with all its optimizations to speed up this application.  In fact, unless I turn on IPO, all ICL options I have tried end up being about 1% slower than VC9.  I think this is mostly due to all the time being spent in _intel_new_memset.

2) Is there some way I can disable the use of _intel_new_memset and _intel_new_memcpy so that I can get a better idea how the VC9 and ICL versions of the routines compare?

The project I'm currently working on, and have spent several days trying to optimize, is very large and was written by others.  I would love to rewrite the internal image flow and optimize at the algorithm level, but I don't have the time or resources for that right now.  Also, due to the project size I cannot upload an example.  Does anyone have a counterexample that shows ICL outperforming VC9?

Any pointers would be greatly appreciated.

Thanks,
Peter
tim18
July 10, 2009 7:08 AM PDT
#1
You conceal one of the most important pieces of information from us, your typical loop lengths.  If you are concealing that also from the compiler, it can easily make bad decisions.  The standard compiler assumption, where no clear information is present in source code, is of a loop length 100 (with code suitable also for several times that length).  Many of the ICL loop optimizations aren't suitable for shorter loop lengths, while VC9 doesn't bother optimizing for longer loop lengths.  How's that for a generalization almost as flagrant as yours?
_intel_new_memset() and _intel_new_memcpy() contain branches to optimize several different cases of CPUs, alignments, and loop lengths.  If you were to write all those cases into your source code, you would likely lose instruction cache locality, and lose time with all the selections if your loops are never long enough to require long loop optimizations.
It's dead simple to write artificial cases where these special functions will beat VC9 by a big margin, but those cases may be nothing like your application. 
ICL should avoid the automatic memset and memcpy substitutions when you place multiple array moves in a loop.
It's easily possible for VC9 to match ICL performance when there is no benefit from vectorization.  You have available ICL flags to match your VC9 flags; ICL /O1 /fp:source might come close to matching CL /O2 /fast.
VC9 occasionally optimizes better than ICL in loops that offer opportunities for loop-carried scalar replacement; this can sometimes be matched by writing the scalar replacements into your source code.
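As a rough illustration of hand-written scalar replacement, here is a sketch using a running sum. Both functions and their names are hypothetical; the second carries the loop-carried value in a local variable instead of re-reading `out[i - 1]` from memory on each iteration.

```c
#include <assert.h>
#include <stddef.h>

/* Straightforward form: each iteration reads back out[i - 1]. */
void prefix_sum_plain(const int *in, int *out, size_t n)
{
    size_t i;

    assert(n == 0 || (in != NULL && out != NULL));
    if (n == 0)
        return;
    out[0] = in[0];
    for (i = 1; i < n; ++i)
        out[i] = out[i - 1] + in[i];
}

/* Scalar replacement written by hand: the value carried from one
   iteration to the next lives in `running`, which the compiler can
   keep in a register. */
void prefix_sum_scalar(const int *in, int *out, size_t n)
{
    size_t i;
    int running = 0;

    assert(n == 0 || (in != NULL && out != NULL));
    for (i = 0; i < n; ++i) {
        running += in[i];
        out[i] = running;
    }
}
```

Whether the second form is actually faster depends on the compiler and the loop, so it is worth timing both.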
VC9 observes parentheses faithfully, while ICL treats them K&R fashion unless you set options such as /fp:source.  If you don't rely on the compiler performing algebraic simplification across parentheses, in violation of language standards, the VC9 treatment is superior.
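To see why honoring parentheses matters, here is a small sketch in IEEE double precision. The function names are hypothetical, and the `volatile` temporaries are only there to force each intermediate result to be rounded to a double before it is used again.

```c
#include <assert.h>

/* Evaluates (big + small) - big exactly as parenthesized. */
double sum_then_subtract(double big, double small)
{
    volatile double sum;

    assert(big == big && small == small); /* NaN inputs make no sense here */
    sum = big + small;
    return sum - big;
}

/* The algebraically equivalent order a reassociating compiler
   might choose instead: (big - big) + small. */
double subtract_then_add(double big, double small)
{
    volatile double diff = big - big;
    return diff + small;
}
```

With `big = 1e16` and `small = 1.0`, the first function returns 0.0 because `1e16 + 1.0` rounds back to `1e16`, while the second returns 1.0. The two orderings are equal on paper but not in floating point, which is why a value-safe floating-point mode matters for code that depends on evaluation order.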


pvonkaenel
July 10, 2009 7:37 AM PDT
#2 Reply to #1

Hi Tim and thanks for your input,

I was not trying to conceal loop lengths (I actually did not know that they were that important).  There are lots of short loops in the code, and a few large ones.  Also, I do not think your statement about VC9 not bothering to optimize long loops is flagrantly general: I find it quite helpful.  I think these statements of yours may be the missing pieces I was asking about, and will start me on a new round of testing.  If I can get back to the original performance with ICL, I would be inclined to use it due to all the available compiler options I can play with.  I will try /O1, but since there is very little floating point, I think I will skip /fp:source in the first go-around.  I still have a few questions if you do not mind:

1) If I use /O1 will the compiler still vectorize?
2) Is there a flag to disable use of _intel_new_memset and memcpy?  I think their use is skewing my results and is making it more difficult for me to compare timings.
3) If /O1 does disable vectorization, can I re-enable it on a per loop basis using the pragma?
4) My main hotspots have a lot of short loops in them, but they are called many times.  Why will vectorizing not help on these short loops (if in fact that is what you meant by "Many of the ICL loop optimizations aren't suitable for shorter loop lengths")?
5) What type of speedups do you tend to see with ICL over VC9?

Thanks again,
Peter

tim18
July 10, 2009 8:40 AM PDT
#3 Reply to #2

/O1 disables vectorization, since ICL 10.0; I mentioned that in case your loops are too short for vectorization to be useful.  In version 9.1, /O1 vectorized, but without extra unrolling, thus giving vector performance on shorter loops than /O2 did.
ICL vectorization typically takes loop iterations in groups of 8, with adjustments for 16-byte alignment before and after.  It doesn't often pay off for loops of length less than 16 plus the adjustments, and you will see performance peaking for loop lengths at intervals of 8.
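A rough model of that split, for illustration only. It is not the compiler's actual algorithm: it takes the group size of 8 and the 16-byte alignment from the description above, assumes 4-byte `float` elements, and uses hypothetical names (`loop_split`, `split_float_loop`).

```c
#include <assert.h>
#include <stddef.h>

/* How a loop over n floats might be divided (illustrative model). */
typedef struct {
    size_t peel;      /* scalar iterations before the first 16-byte boundary */
    size_t body;      /* iterations handled in groups of 8 */
    size_t remainder; /* scalar iterations left over at the end */
} loop_split;

/* start_byte_offset: address of the first element, modulo alignment,
   expressed in bytes. Assumes sizeof(float) == 4. */
loop_split split_float_loop(size_t start_byte_offset, size_t n)
{
    loop_split s;
    size_t misalign;

    assert(start_byte_offset % sizeof(float) == 0);

    misalign = start_byte_offset % 16;
    s.peel = (misalign == 0) ? 0 : (16 - misalign) / sizeof(float);
    if (s.peel > n)
        s.peel = n;
    s.body = ((n - s.peel) / 8) * 8;
    s.remainder = n - s.peel - s.body;
    return s;
}
```

For example, 12 floats starting 4 bytes past a 16-byte boundary split into 3 peeled iterations, one group of 8, and 1 leftover iteration, so a third of the work stays scalar. That is the kind of overhead that makes very short loops poor candidates.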
In typical C or C++ code, unless arrays are declared with fixed size local to the function, it's nearly impossible for the compiler to pick up information to change the default assumption that optimization should be for loop length 100.
If you know that no alignment adjustment is required at the beginning of the loop to make all data 16-byte aligned, but it's not visible to the compiler,
#pragma vector aligned
should speed up the loop, but it will break if your assertion is wrong. This pragma also overrides the compiler's cost/benefit analysis, in which it decides whether vectorization should gain.
#pragma novector
would prevent vectorization of a loop.
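A minimal sketch of how the alignment promise might be used. The function `scale_aligned` is hypothetical; the `assert` is there because the pragma is a promise the compiler does not check, and the pragma is guarded so other compilers ignore it.

```c
#include <assert.h>
#include <stddef.h>

/* Multiply n floats by gain. Both pointers must be 16-byte aligned. */
void scale_aligned(float *dst, const float *src, size_t n, float gain)
{
    size_t i;

    /* Verify the promise in debug builds; a wrong promise can fault
       at run time once aligned vector loads and stores are emitted. */
    assert(((size_t)dst & 15) == 0);
    assert(((size_t)src & 15) == 0);

#ifdef __INTEL_COMPILER
#pragma vector aligned
#endif
    for (i = 0; i < n; ++i)
        dst[i] = src[i] * gain;
}
```

Callers have to guarantee the alignment themselves, for example by allocating the buffers with an aligned allocator.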
Vectorization of loops of length 60 to 3000 should more than double the performance.  When combined with OpenMP or similar parallelization, the combined gain is better on the current Core i7 or Xeon 5500 CPUs than on the earlier ones.  Still, it is common to find a loop of length 1000 where either vectorization or parallelization gives good speedup, but there is no use in combining the optimizations, unless the parallelization can take place at a higher level.
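One way to read "parallelization at a higher level" is sketched below: threads split the outer loop over rows, and the inner loop over columns is left as a vectorization candidate. The function `scale_image` and the row-major layout are assumptions for illustration; the OpenMP pragma is guarded so the code also builds without OpenMP.

```c
#include <assert.h>
#include <stddef.h>

/* Scale every pixel of a rows x cols image stored row-major. */
void scale_image(float *img, int rows, int cols, float gain)
{
    int r;

    assert(rows >= 0 && cols >= 0);

    /* Outer loop: one chunk of rows per thread when OpenMP is enabled. */
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (r = 0; r < rows; ++r) {
        float *row = img + (size_t)r * (size_t)cols;
        int c;

        /* Inner loop: contiguous and independent per element, so the
           compiler is free to vectorize it. */
        for (c = 0; c < cols; ++c)
            row[c] *= gain;
    }
}
```

Each thread still runs the whole inner loop for its rows, so the vectorized trip count is not shortened by the threading.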

pvonkaenel
July 10, 2009 8:58 AM PDT
#4 Reply to #3

I'm using ICL 11.1.  I ran a test with /O1 and with vec-report1 I am still getting loop vectorization notification.  I'm also getting notification about vectorized blocks.  My worst hotspot is slowed by vectorization so I added the "#pragma novector" before the loops, but I'm still getting block vectorization within the loop and poor performance.  Is there a way to disable block vectorization within a function?

tim18
July 10, 2009 11:00 AM PDT
#5 Reply to #4
Block vectorized normally ought to be a good optimization, and it might be worth a problem report, or an example posted here, if it's not.
/Oi- may stop the use of the _intel_fast_mem functions; I'm more familiar with the similar Linux compiler option.  If it prevents the library call, but produces vectorization, you may still have to consider whether vectorization is good there.


pvonkaenel
July 10, 2009 11:42 AM PDT
#6 Reply to #5

You are right: using /Oi- does prevent the use of the _intel_new_mem functions; however, the run time explodes because all the other inline intrinsic expansions are lost as well.  Unfortunately the Windows version of ICL does not allow for disabling specific inlines the way the Linux version does.

As for the block vectorization, I'm not sure whether that's what is slowing down my hotspot; I was just assuming it was, since my own efforts to optimize the routine with intrinsics failed.

At this point I have tried all the optimizations in various combinations available through the VS2008 GUI and the VC9 build always wins.  I guess I can take a look at the other options that are not exposed through the GUI, but I'm starting to run out of steam on this optimization technique.


