What is wrong with Intel Compiler 11.0

This is not a question but a list of issues.

1. Download is huge (707 MB). Downloading new builds results in downloading redundant copies of IPP, TBB and MKL whose versions do not necessarily change each time with the compiler. Can you spell wasted bandwidth?

2. The folder hierarchy is poorly thought out and it keeps changing -- with every new release I have to edit all include/executable/library paths for both the 32-bit and 64-bit versions of all tools (ICC, IPP, MKL, TBB) in Visual Studio. That is really a major inconvenience considering that the latest compiler builds are unstable and/or produce slower code, so I often have to revert to the previous version.

3. Serious regressions aren't addressed quickly enough (#524001 comes to mind).

4. Important behavior changes aren't properly documented in the Release Notes (applies to all compiler releases so far, not just 11.0).

5. Both 10.1 and 11.0 produce slightly slower code than 9.1, at least for me. That isn't exactly progress in my book.

Those are my top five. There is more but I won't bother you any further. Thanks for your attention.

Regards,
Igor Levicki

Igor,

I had to submit a couple of issues against 11.0.066 today, and I agree with at least your #1 and #2 (the others I haven't tested yet, to be fair). Previously, we asked the compiler team to split the IA32, x64, and IA64 versions into three smaller packages, and they did. I expect this 700MB package to be completely unacceptable to some industry-leading ISVs. (As a manager, would you allow your team of hundreds of developers to each spend the time to download and install a 700MB package every time a needed update is released? It may be a difficult value proposition even at <100MB.)

Now I wish I had found some free time to participate in the 11.0 beta program. FWIW, I plan to start knocking hard on the AVX-enabled 11.x compiler ASAP, hopefully shaking out the most severe issues before they are found outside of Intel.

- Eric

Quoting - Eric Palmer (Intel)

Hello Eric. I am glad that someone else agrees at least partially because it proves I am not insane.

I must admit I do not like the direction in which Intel developer tools are heading. It is all starting to look like a big mess.

Download size is not important just for the customers; it should also be important to Intel, because you pay for hosting bandwidth, and making downloads ridiculously big and redundant is a waste of corporate money which could be put to better use.

My idea would be to have a package like this one, but also to have all the individual packages available for download. Another way to fix it is to use better compression -- for example, I have just packed 1,818 MB of ICC, IPP and TBB bin, include and library files into a 148.6 MB 7zip file. Since 7zip is cross-platform and open-source, it could be used to package Intel developer tools, because the setup you use is already somewhat proprietary in nature.

As for the issue #2, folder hierarchy really needs fixing ASAP. It is too much hassle as it stands now. I am suggesting something like this (I hope the formatting will not be changed by the forum software):

C:\Program Files\Intel\CPP
   |
   +-- bin
   |    |
   |    +-- x64
   |
   +-- include
   |    |
   |    +-- ipp
   |    |
   |    +-- mkl
   |    |
   |    +-- tbb
   |
   +-- lib
        |
        +-- x64

That means no version numbers in the path, so we don't have to edit all environment variables and Visual Studio paths each time one of the components gets updated. For more details regarding issues #1 and #2, as well as some rationale, please take a look at my feature request here -- https://premier.intel.com/premier/IssueDetail.aspx?IssueID=526761.

As for issue #4, take a look at the discussion in this thread -- http://software.intel.com/en-us/forums/showthread.php?t=61803. Intel could have handled this in a more sensible manner than Microsoft, and above all the change should have been documented in big red capital letters in the compiler Release Notes instead of letting people discover it by building and shipping executables with unresolved dependencies.

Regarding #5, I will just say that both the 10.1 and 11.0 compilers produce faster code than 9.1 for some functions, and slower code than 9.1 for other functions, so on average the total execution time is the same. Someone might say that is not a valid reason to complain since overall the speed is not affected or the difference is minimal, but I still consider those code slowdowns a regression for an optimizing compiler.

Regards,
Igor Levicki

The most common case of slowdown I have observed, which was introduced in 10.0 and not fixed in 11.0, is with C++ transform() which can be optimized only with #pragma ivdep, where in 9.1 it was sufficient (as with g++) to make use of the restrict extension. Without #ivdep, both MSVC9 and gcc out-perform icpc for those cases. I didn't feel I had a strong argument there, as I don't advocate transform(), which can't be optimized anyway without non-standard extensions. There still are a few cases where transform() along with compiler dependent additions can produce the best code generation.
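
To illustrate the kind of loop in question (a sketch of my own, not code from this thread; the function names and signatures are made up), here is a std::transform call next to the equivalent hand-written loop that the compiler can vectorize once __restrict and #pragma ivdep assert that the pointers do not alias:

#include <algorithm>

// C++03-style functor for std::transform, matching the compilers of this era.
struct Scale {
    float k;
    explicit Scale(float k_) : k(k_) {}
    float operator()(float x) const { return x * k; }
};

// std::transform version: without extra hints the compiler cannot easily prove
// that src and dst do not overlap, so it may refuse to vectorize the loop.
void scale_transform(const float* src, float* dst, int n, float k)
{
    std::transform(src, src + n, dst, Scale(k));
}

// Hand-written loop: __restrict plus #pragma ivdep assert that there is no
// aliasing, which is what allows vectorization here.
void scale_loop(const float* __restrict src, float* __restrict dst, int n, float k)
{
#pragma ivdep
    for (int i = 0; i < n; ++i)
        dst[i] = src[i] * k;
}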

Another performance issue, which is not an icpc version issue, but a g++ version issue, is that builtins which were introduced in g++ 4.3 aren't optimized by icpc unless #pragma ivdep is set.

For example, the STL copy() function has become unsatisfactory for icpc. Among the arguments I've heard here is that a literal reading of the C++ standard indicates that copy() shouldn't be implemented as memmove(), as MSVC and g++ do. The conclusion drawn from that is that one shouldn't use copy(), but instead make the appropriate choice between memmove() and memcpy(), or a for() loop. Complicating the situation are the abysmal versions of memcpy() and memmove() provided by default in glibc for x86_64, prior to version 2.8, and the difficulty of supporting any brand of CPU adequately in that situation. Also, gcc 32-bit by default optimizes memcpy() as an inline instruction, which is good for a few cases, but very bad for others, while gcc 64-bit doesn't do that. Both CPU manufacturers have introduced recent optimizations for the built-in string instructions, so they aren't so bad on the latest CPUs.
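
As a rough sketch of that choice (my own illustration, not code from this thread): memcpy() is only legal when the source and destination cannot overlap, memmove() is always safe, and std::copy() leaves the decision to the library:

#include <algorithm>
#include <cstring>
#include <cstddef>

// Copying n floats from src to dst -- the three options discussed above.
void copy_variants(const float* src, float* dst, std::size_t n)
{
    // std::copy: the standard only forbids dst from lying inside [src, src + n);
    // whether it compiles down to memmove() is up to the library and compiler.
    std::copy(src, src + n, dst);

    // memcpy: in principle the fastest, but only legal when the two ranges
    // are guaranteed not to overlap.
    std::memcpy(dst, src, n * sizeof(float));

    // memmove: always safe, even with overlapping ranges, at the cost of a
    // little extra work inside the library implementation.
    std::memmove(dst, src, n * sizeof(float));
}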

Current SuSE versions of glibc have back-ported the good functions, but icc doesn't specifically support many common SuSE versions. icc still replaces memcpy() with its own version, unless you take specific precautions to stop that.

So, there are several specific points on which you might wish to submit support issues on premier.intel.com, if you wish to influence the decisions. In the absence of informed customer input, some decisions may seem sub-optimum.

If you don't want libiomp, don't ask for it. Don't set -parallel or -openmp, don't use the threaded versions of performance libraries. It does seem unlikely that you want these for a 4x4 matrix multiplication. You give the impression of using a lot of options without studying them, or even remembering to mention it.

Both libguide and libiomp have been available as alternates since 10.1. libiomp is the default now, in preparation for the future removal of libguide. libiomp supports both the libguide and (in the Linux version) the libgomp function calls.

Where I have seen a growth in code size with the latest compilers, it is due to more automatic "distribution" (splitting) of large loops which are at least partly vectorizable. I can only speculate; this may sometimes improve hyperthreading performance. Also, 10.1 sometimes failed to split loops automatically where it was needed; unnecessary splitting is less damaging. Directives may be used to prevent individual loops from being distributed. The latest compilers are cutting back on automatic unrolling, possibly to contain this growth trend. I have seen as much as a 3x increase in run time with removal of unrolling; I think that is enough to submit specific problem reports. If none of this is relevant to you, at least I am trying to make the point that comments are useless without specifics.
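
For illustration only (a sketch of my own; the loop bodies are placeholders), these are the kinds of directives being referred to: #pragma unroll(n) requests unrolling where the newer compilers have become conservative, and #pragma distribute point placed inside a loop chooses the split point rather than leaving distribution entirely to the compiler:

void tune_loops(float* a, const float* b, const float* c, int n)
{
    // Request unrolling explicitly where automatic unrolling was cut back.
#pragma unroll(4)
    for (int i = 0; i < n; ++i)
        a[i] = b[i] + c[i];

    // Pick the distribution (splitting) point of a large loop yourself.
    for (int i = 0; i < n; ++i) {
        a[i] = b[i] * c[i];
#pragma distribute point
        a[i] += c[i];
    }
}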

Quoting - tim18

Tim.

You can probably look at the code enclosed in the earlier post: http://software.intel.com/en-us/forums/showthread.php?t=62202

The code is designed in such a way that in the first phase it does a simple matrix multiplication and in the second it uses SSE intrinsic functions. My objective is to learn both auto-parallelization and vectorization (using compiler directives and SIMD SSE) with this code and to play with it. I am new to these areas, so sometimes my reasoning may not be relevant; I am still learning from this forum.

I did use the command "icc -parallel -par-report3 matrix.c", and I am still working to understand what is right and wrong through the documents and the Intel forum.

E.g.: in the same code, where no pragma has been added, I am getting "LOOPS not parallelized" for the sequential code below, to start with --

for( i = 0; i < 4; i++ )
{
    a[i]  = (double) (rand() - RAND_MAX/2) / (double) (RAND_MAX);
    b[i]  = (double) (rand() - RAND_MAX/2) / (double) (RAND_MAX);
    c1[i] = c2[i] = 0.0;
}

My belief here would probably be --

The compiler may not know the values of rand() and RAND_MAX, so it can assume that iterations may depend on each other. Also, aliasing can happen with "c1[i] = c2[i] = 0.0;", so I am getting "LOOPS not parallelized". I am thinking of trying "#pragma parallel" before the for loop.

Please let me know if you have any insights.

Parallel random number generation is far too complicated to handle here. The simplest answer is to check whether one of the performance libraries (e.g. MKL) has a function such as you want. If you were to have each thread use its own copy of the same generator, using at least private seed and result values, the series used by each thread would differ only to the extent that you seeded them differently. What you have written isn't parallelizable, as it implies that a single random number generator is used. With a fairly simple loop of length 4, the compiler knows anyway that parallelization will slow it down.
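
A minimal sketch of what Tim describes (my own illustration, assuming a POSIX-style rand_r(), so it is not portable to Windows as written): each OpenMP thread keeps a private seed, so there is no shared generator state and the loop can be parallelized.

#include <omp.h>
#include <cstdlib>

// Fill a[] with pseudo-random values; each thread owns a private seed,
// so there is nothing to serialize on inside the parallel loop.
void fill_random(double* a, int n)
{
#pragma omp parallel
    {
        unsigned int seed = 1234u + omp_get_thread_num();  // per-thread seed
#pragma omp for
        for (int i = 0; i < n; ++i)
            a[i] = (double)(rand_r(&seed) - RAND_MAX / 2) / (double)RAND_MAX;
    }
}

As noted above, the series produced by the threads differ only to the extent that the seeds differ.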

This is a bit off the thread topic, but calls to rand() do depend on previous calls to rand(), and as Tim said, you need a seed for each thread. You could implement this with OpenMP, but I doubt auto-parallelization is possible.
See this article for an example SIMD RNG: http://software.intel.com/en-us/articles/fast-random-number-generator-on-the-intel-pentiumr-4-processor. I've used this in image-processing algorithms, and it's a huge performance gain.
- Eric

Quoting - srimks

I know you are programming in C++, but maybe you can follow the lead from Fortran and use a function that generates a list (array) of random numbers (or in this case two lists). Then perform the scaling of the returned random numbers. This would give you repeatability and provide you with a means to vectorize the loops for your learning experience.
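
A hedged sketch of that two-phase idea (my own code, not Jim's): generate the raw random numbers into plain arrays first, then do the scaling and zero-initialization in a separate loop with no function calls, which the compiler can vectorize:

#include <cstdlib>

// Phase 1: serial generation -- rand() carries hidden state, so each call
// depends on the previous one. Phase 2: a clean, vectorizable scaling loop.
void init_arrays(double* a, double* b, double* c1, double* c2, int n)
{
    int* ra = new int[n];
    int* rb = new int[n];

    for (int i = 0; i < n; ++i) {      // stays serial
        ra[i] = rand();
        rb[i] = rand();
    }

    for (int i = 0; i < n; ++i) {      // no calls, no carried dependence
        a[i]  = (double)(ra[i] - RAND_MAX / 2) / (double)RAND_MAX;
        b[i]  = (double)(rb[i] - RAND_MAX / 2) / (double)RAND_MAX;
        c1[i] = 0.0;
        c2[i] = 0.0;
    }

    delete[] ra;
    delete[] rb;
}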

Jim Dempsey

www.quickthreadprogramming.com

I would like to kindly ask the moderators/admins to remove all off-topic comments from this thread, starting from the one I reported as abusive and including this one, after they clean it up -- or, if the board software allows, to split the content into another thread.

This is a "What is wrong with Intel Compiler 11.0" in general (as a product) thread, not a place for someone's specific code issues. That is what the Start New Discussion button is for.

Regards,
Igor Levicki

4. Important behavior changes aren't properly documented in Release Notes (applies to all compiler releases so far, not just 11.0).

I think this is a huge issue. A while ago I switched from VS2003 to VS2008 and a bunch of programs that had worked for a long time no longer compiled. After submitting a bug report, it turned out that Intel had deliberately introduced major bugs which get activated when you use VS2008 (I don't know about VS2005) in the name of "Microsoft compatibility". There is no way to turn them off, so I asked Intel for documentation on them -- it's pretty hard to write code when you don't know the rules of the language (which is no longer C++, since Intel is deliberately violating the C++ Standard). That was 7 months ago; still no answer.

- Ron

With 14.0, we have a new online-installer. So it should solve the huge download issue. Well, it is just taking longer than expected to implement ....

Jennifer

>>...it is just taking longer than expected to implement...

Thanks for the update, Jennifer. So, it is better late than never...

And there is an option to fall back to the old installer in case you have only a normal quality on-line connection.

I agree with Igor regarding:

3. Serious regressions aren't addressed quickly enough (#524001 comes to mind).

4. Important behavior changes aren't properly documented in Release Notes (applies to all compiler releases so far, not just 11.0).

In the case of point 3, in 4 updates I found errors with 2 of them, so we had to uninstall and return to the previous version. So today we are afraid of any update; after a huge download and a long installation process, finding that the update breaks your projects is not fair.

In the case of point 4, I am very happy to know that other people share my impression. I came from other compilers and found the information regarding changes in each Intel version of the compiler very confusing and poor.

Unfortunately, I do not have a version 9.1 to compare with, but now I am curious about point 5 in Igor's comment.

Thanks to Igor for pointing out these problems.

As to points 4 and 5, if you tuned your source code for 9.1 and didn't adopt any new CPUs or compiler options, it's likely that you won't see much performance gain. It's no secret that a major driver for the introduction of new compiler versions is to deliver value for new architectures.

With the increasing reliance on directives and occasional major changes between compiler versions, it's necessary to re-learn their use; this does in my view detract from the advertised advantage of them.

Current compiler versions have more need of #pragma nofusion to stop fusion of relatively misaligned loops, or of vectorizable with non-vectorizable ones; prior to XE 2011, #pragma distribute point was used in that sense as well as others. This change was slipped in silently; one might think that release notes explaining such changes would help, or that issue submissions relative to the changes could have drawn out this information.
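
For illustration (my own sketch, with made-up loop bodies): #pragma nofusion placed before a loop asks the compiler not to fuse it with the adjacent loop, which is the situation described above where a vectorizable loop would otherwise be fused with a non-vectorizable one:

void keep_separate(float* a, const float* b, int* hist, const int* idx, int n)
{
#pragma nofusion
    for (int i = 0; i < n; ++i)        // simple, vectorizable
        a[i] = 2.0f * b[i];

    for (int i = 0; i < n; ++i)        // indirect store, not vectorizable
        hist[idx[i]] += 1;
}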

Also in current compiler versions, OpenMP 4 style directives are gradually replacing earlier ones, but the small individual steps haven't been documented.  If needed documentation is too difficult or controversial a task, this has diminished value until the transition is complete, which apparently won't happen this year.
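
As an example of that transition (my own sketch; the function is just a placeholder), the older Intel-specific vectorization directive and its OpenMP 4 style counterpart look like this:

void saxpy_old(float* __restrict y, const float* __restrict x, float a, int n)
{
#pragma simd                            // Intel-specific, pre-OpenMP 4 style
    for (int i = 0; i < n; ++i)
        y[i] += a * x[i];
}

void saxpy_new(float* __restrict y, const float* __restrict x, float a, int n)
{
#pragma omp simd                        // OpenMP 4 replacement
    for (int i = 0; i < n; ++i)
        y[i] += a * x[i];
}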

>>...we are afraid of any update; after a huge download and long installation process, finding that the update breaks
>>your projects is not fair...

You could try any update of any software on a dedicated test computer or in a VMware environment.

>>Unfortunately, I do not have a version 9.1 to compare...

It is still available for download and I still use Intel C++ compiler version 8.1 ( Update 038 / Released in 2006 ).

Where can I find version 9.1 of the Intel compiler? Thanks.

The procedure for downloading older compilers is the same for Windows, Linux, Fortran, or C++:

http://software.intel.com/en-us/articles/older-version-product

>>With 14.0, we have a new online-installer. So it should solve the huge download issue. Well, it is just taking longer than expected to implement ....
>>
>>Jennifer

If that new online installer looks and works like the Android SDK Manager (which I have used for a long time with no issues or problems), I'll be very impressed. This is what the Android SDK Manager looks like:

I wonder if any Beta testing with real customers was done?

Attachments:
androidsdkmanager.jpg (118.04 KB)

>>Where can I find version 9.1 of the Intel compiler ? Thanks.

Intel Software Registration Center
Web-link: registrationcenter.intel.com/regcenter/register.aspx

Please install the latest Update for Intel C++ compiler version 9.1 since it will have all fixes (!).

By the way, I found one bug in Intel C++ compiler version 8.1 Update 038; let me know if you're interested in more technical details. Note: the bug is fixed in versions 12.x and 13.x, but I'm not sure if it is fixed in version 9.1 (a workaround is very simple).

Thanks for the info. The Intel Software Registration Center is an excellent resource.

Chief Executive Officer

Hi, I finally had the time to compare Intel Compiler 11 and Intel Compiler 9.1. It is depressing! For most of my test codes 9.1 outperforms the new compiler by a large margin. In all cases I compiled for SSE2. One snapshot shows results for 11.0 (left) and 9.1 (right). The other snapshot shows the vectorization and auto-parallelization report for both compilers.

In the case of compiler 11 (update 4) I also tried SSE3  and SSE4.2,  without any additional gain.

Attachments: two snapshots (timing results for 11.0 and 9.1, and the vectorization/auto-parallelization reports)

NOTE: the compilers I used were 13.1.0.149 and 9.1.040.

>>... I finally had the time to compare Intel Compiler 11 and Intel Compiler 9.1. It is depressive !..

Armando, this is not new, so let's stay positive. Why don't you post the source code of your test cases for analysis? Thanks in advance.
