pvonkaenel
July 10, 2009 5:14 AM PDT
icl vs. msvc9 frustration
Hi all,

Let's start with a little history. Many years ago I needed to speed up a computer vision application which was being compiled with VC6. I downloaded the Intel C++ eval, and after playing with compiler options for a couple of hours was able to speed up the application by about 30%. I bought the tool, moved a lot of the code into a library, and used it for years. About 2 years ago I tried recompiling that same library with VC9 and found the VC9 version was slightly faster than the ICL version. I played with settings for about a day, but in the end switched the library to VC9. Since then, on completely different code bases, I have tried converting my VC9 projects to ICL, but after seeing either no gain or a 1%-2% performance drop, I switched back to VC9.

Well, over the last few days I decided to try again on yet another new code base and really spend some time on it. I've followed the steps in the ICL manual for how to go about optimizing an application, and have also followed suggestions I've read in "The Software Optimization Cookbook". I've done my own timings, and used VTune to check the performance. According to VTune, some of the most expensive routines are slightly faster, but the routine at the top of the VTune list is slightly slower. I had tried to optimize this routine with SSE3 intrinsics myself without luck, so I checked the vectorization report and found two of the main loops were being vectorized. I put "#pragma novector" in front of them, and that routine went back to roughly the same time as when compiled with VC9.

OK, the ICL version is still slightly slower than the VC9 version unless I turn on IPO. With IPO enabled, the ICL version is about 1% faster (about a 1.5% gain), but at the expense of a several-minute link time instead of a couple of seconds. One thing I noticed in the VTune output is that _intel_new_memset is now the hottest routine, and _intel_new_memcpy is not far behind. There are several places in the code where memset and memcpy are used, but I'm finding it difficult to compare performance per routine with these two large hotspots.

I have the following questions:

1) What am I doing wrong? I must be missing something to be having this much trouble getting ICL with all its optimizations to speed up this application. In fact, unless I turn on IPO, all ICL options I have tried end up being about 1% slower than VC9. I think this is mostly due to all the time being spent in _intel_new_memset.

2) Is there some way I can disable the use of _intel_new_memset and _intel_new_memcpy so that I can get a better idea of how the VC9 and ICL versions of the routines compare?

The project I'm currently working on, and have spent several days trying to optimize, is very large and was written by others. I would love to rewrite the internal image flow and optimize at the algorithm level, but I don't have the time or resources for that right now. Also, due to the project size I cannot upload an example. Does anyone have a counterexample that shows ICL outperforming VC9?

Any pointers would be greatly appreciated.

Thanks,
Peter
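
For readers unfamiliar with the "#pragma novector" step mentioned above, here is a minimal sketch of where the pragma goes. The routine scale_pixels, its parameters, and the fixed-point gain are hypothetical stand-ins invented for illustration; they are not taken from the project discussed in this post. The pragma is guarded so that compilers other than ICL simply skip it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical image loop used only to illustrate pragma placement.
 * gain_q8 is an assumed 8.8 fixed-point gain (256 means 1.0). */
static void scale_pixels(const unsigned char *src, unsigned char *dst,
                         size_t n, unsigned gain_q8)
{
    size_t i;

    assert(src != NULL && dst != NULL);

    /* The pragma applies to the loop that immediately follows it. */
#if defined(__INTEL_COMPILER)
#pragma novector
#endif
    for (i = 0; i < n; ++i) {
        unsigned v = (src[i] * gain_q8) >> 8;           /* scale the pixel */
        dst[i] = (unsigned char)(v > 255u ? 255u : v);  /* clamp to 8 bits */
    }
}
```

Building the same file once with and once without the pragma, and timing only this routine, is one way to check whether vectorization helps or hurts a particular loop.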