Compiled XE 2013.0 Update 1 generates slower code than 12.1

After switching from 12.1 to Composer XE 2013 (Update 1, Windows 64-bit) I am seeing a consistent 10-15% slowdown across the board (the code is built and benchmarked on a quad-core Xeon). The C++ code is compiled with /O3, no auto-parallelization.

Is this a known issue to be fixed in an update?

I have run Amplifier XE 2013 and profiled the code. It appears XE 2013 is NOT inlining a simple function that 12.1 inlined. The compiler option is /Ob2:

      template <class T>
      inline T Matrix::operator()(int i, int j) const
      {
          return data()[i*rowstep+j*colstep];
      }

Let me rephrase this: compiler 13.0 is not inlining the function when operator()(int, int) is used many times in a complex expression, while 12.1 would inline it. In the following expression (this is auto-generated code), 13.0 seems to be generating function calls for operator()(int, int) rather than inline code, even at /O3 /Ob2:

        result(0,0)=(-(Z(2-1,4-1)*Z(3-1,3-1)*Z(4-1,2-1)) + Z(2-1,3-1)*Z(3-1,4-1)*Z(4-1,2-1) + Z(2-1,4-1)*Z(3-1,2-1)*Z(4-1,3-1) - Z(2-1,2-1)*Z(3-1,4-1)*Z(4-1,3-1) - Z(2-1,3-1)*Z(3-1,2-1)*Z(4-1,4-1) + 
        Z(2-1,2-1)*Z(3-1,3-1)*Z(4-1,4-1))*tmp;

>>...After switching from 12.1 to Composer XE 2013 ( Update 1, Windows 64-bit) I am seeing a consistent 10-15% slowdown...

My first question to myself was: how is it possible to have such a significant slowdown? However, after seeing your expression I'm really concerned, since I also use lots of templates and different C++ operators (declared as inline) to simplify processing.

So, my question: Is it a feature or a bug in the latest version of the Intel C++ compiler?

Hello,

When comparing major compiler versions it's no surprise that you see differences in optimization. We have a conservative release model that some people like and others don't: only major version updates (12.x -> 13.x) are subject to bigger changes. The update releases in between should not show much performance variation, but they also offer fewer features.

I'm not surprised you see a change moving from 12.1 to 13.0. You tweaked your compiler option set for 12.1, either intentionally or indirectly. So it works best with 12.1 and may not with any other major version.

In-lining is a complicated topic, and the heuristics used are, by definition, never optimal for all scenarios. The following can help get the function above in-lined again:

- Use IPO (try both, single-file & multi-file)
- Try /Qinline-forceinline so "inline" keywords force in-lining (as long as in-lining limits are not exceeded)
- Use #pragma forceinline (see the sketch after this list)
- Slightly(!) increase the inlining factor (/Qinline-factor) or the other in-lining limits
- Also give /O2 a chance - /O3 is not always better
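
To illustrate the pragma option, here is a minimal sketch (the call site and the name "m" are hypothetical, not taken from your code) of how #pragma forceinline is placed on the statement whose calls should be in-lined:

    // Sketch only: "m" is a hypothetical Matrix object. The pragma applies
    // to the statement immediately following it and requests in-lining of
    // the calls in that statement (subject to the in-lining limits).
    #pragma forceinline recursive
    result(0,0) = m(0,0)*m(1,1) - m(0,1)*m(1,0);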

Best regards,

Georg Zitzlsberger

Quote:

vasci_ wrote:

Let me rephrase this: compiler 13.0 is not inlining the function when operator()(int, int) is used many times in a complex expression, while 12.1 would inline it. In the following expression (this is auto-generated code), 13.0 seems to be generating function calls for operator()(int, int) rather than inline code, even at /O3 /Ob2:

        result(0,0)=(-(Z(2-1,4-1)*Z(3-1,3-1)*Z(4-1,2-1)) + Z(2-1,3-1)*Z(3-1,4-1)*Z(4-1,2-1) + Z(2-1,4-1)*Z(3-1,2-1)*Z(4-1,3-1) - Z(2-1,2-1)*Z(3-1,4-1)*Z(4-1,3-1) - Z(2-1,3-1)*Z(3-1,2-1)*Z(4-1,4-1) +         Z(2-1,2-1)*Z(3-1,3-1)*Z(4-1,4-1))*tmp;

Is it possible to send a testcase? It's better to find out why.

Also, can you check this report: "/Qopt-report-phase:ipi /Qopt-report-routine:the_func_name". Does it say why it is not inlined?
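
For example, a full command line could look like this (a sketch only; test.cpp stands in for your source file):

    icl /c /O3 /Ob2 /Qopt-report /Qopt-report-phase:ipi /Qopt-report-routine:the_func_name test.cpp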

Jennifer

>>...Also, can you check this report: "/Qopt-report-phase:ipi /Qopt-report-routine:the_func_name". Does it say why it is not inlined?

Thanks, Jennifer. Could you try a different test-case?

...
#include <stdio.h>

const float Pi = 3.141592653589793f;

inline float area( const float r )
{
    return ( Pi * r * r );
}

int main( void )
{
    printf( "The area is: %f\n", area( area( area( area( area( 2.0f ) ) ) ) ) );
    return 0;
}
...

I'm looking into producing a report of why it is not inlined. When using /Qopt-report-routine:the_func_name, how do you specify a C++ template operator () as "the_func_name"?

Hello,

Please use the mangled names you already find in the /Qopt-report output.

Edit:
To give an example, let's assume a template function "foo", defined like this:


template <class T>
T foo(T x) { ... }

The mangled name for foo<int> would look like ??$foo@H@@YAHH@Z, for foo<float> like ??$foo@M@@YAMM@Z, etc.

There's a nice on-line C++ demangler; just look for c++filtjs.
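
For instance, the mangled name can then be passed to the routine filter like this (a sketch, with a hypothetical source file test.cpp):

    icl /c /O3 /Qopt-report /Qopt-report-routine:"??$foo@H@@YAHH@Z" test.cpp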

Best regards,

Georg Zitzlsberger

Using /Qopt-report there is a significant difference between 12.1 and 13.0 when inlining this function.

Just to make sure there is no confusion: this seems to be a very specific issue under very specific circumstances. Once this routine was "fixed", the performance of our benchmarks using 13.0 vs 12.1 was similar, if not better.

What happens when you use

...
inline T Matrix::operator()(const int i, const int j) const
...

Also, several months ago I had an issue where inline would not inline; however, replacing it with forceinline did work.
Then later, inline would work again. I never figured out what triggered the behavior.
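
For reference, a minimal sketch of that substitution applied to the operator from this thread (__forceinline is the keyword spelling on Windows; treat this as an illustration, not a tested fix):

    // Sketch: the "inline" hint replaced with __forceinline, which makes
    // the compiler in-line the call unless it is unable to.
    template <class T>
    __forceinline T Matrix::operator()(int i, int j) const
    {
        return data()[i*rowstep+j*colstep];
    }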

Jim Dempsey

www.quickthreadprogramming.com

I am looking at the inlining of the expressions that use Z(i,j). Class T is a "DComplex" (double-precision complex):

      template <class T>
      inline T Matrix::operator()(int i, int j) const
      {
          return data()[i*rowstep+j*colstep];
      }

FYI, undecoration of functions....

Undecoration of :- "??R?$RWGenMat@VDComplex@@@@QEBA?AVDComplex@@HH@Z"
is :- "public: class DComplex __cdecl RWGenMat<class DComplex>::operator()(int,int)const __ptr64"
Undecoration of :- "?data@?$RWGenMat@VDComplex@@@@QEBAPEBVDComplex@@XZ"
is :- "public: class DComplex const * __ptr64 __cdecl RWGenMat<class DComplex>::data(void)const __ptr64"
Undecoration of :- "??0DComplex@@QEAA@AEBV0@@Z"
is :- "public: __cdecl DComplex::DComplex(class DComplex const & __ptr64) __ptr64"

12.1 appears to inline all three of these functions in a call to Z(i,j):

-> INLINE (MANUAL): ??R?$RWGenMat@VDComplex@@@@QEBA?AVDComplex@@HH@Z(751) (isz = 12) (sz = 25 (5+20))
1>      -> INLINE (MANUAL): ?data@?$RWGenMat@VDComplex@@@@QEBAPEBVDComplex@@XZ(753) (isz = 0) (sz = 6 (2+4))
1>      -> INLINE (MANUAL): ??0DComplex@@QEAA@AEBV0@@Z(752) (isz = 3) (sz = 12 (4+8))

In particular, 12.1 reports 378 inlines of DComplex const * __ptr64 __cdecl RWGenMat<class DComplex>::data(void)const __ptr64

13.0 does not report ANY inlines of this function. This seems to match what I am seeing (a performance drop due to missing inlining).

13.0 reports something a bit "odd" that is not seen in the 12.1 report. Is this a clue?
1>  IPO DEAD STATIC FUNCTION ELIMINATION;?data@?$RWGenMat@VDComplex@@@@QEBAPEBVDComplex@@XZ;0>
1>  DEAD STATIC FUNCTION ELIMINATION:
1>    (?data@?$RWGenMat@VDComplex@@@@QEBAPEBVDComplex@@XZ)
1>    Routine is dead extern
1>

Attachments: inline-12-1.log (231.8 KB), inline-13-0.log (149.25 KB)

Hi everybody, here are a couple of notes:

Note 1:

vasci_, did you check the report for the Debug or the Release configuration?

Note 2:

I see that a call to another indexing C++ operator, data()[ offset ], is used in:

      template <class T>
      inline T Matrix::operator()(int i, int j) const
      {
          return data()[i*rowstep+j*colstep];
      }
I would use a pointer to a data set directly without calling the additional indexing C++ operator.

- This is the release configuration.

- Not sure what you are referring to when you say "I would use a pointer to a data set directly without calling the additional indexing C++ operator."

data() returns a "raw" pointer. The line of code data()[i*rowstep+j*colstep] is a "C" array operation, that is, simple pointer arithmetic. I am sure the compiler can deal with that.
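
In other words, after in-lining, a call like Z(i,j) should boil down to something like this sketch (p, v, and the index variables are hypothetical locals):

    // One address computation and one load; there is no extra
    // indirection left to remove.
    const DComplex *p = z.data();
    DComplex v = p[i*rowstep + j*colstep];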

Anyway, the bottom line is 12.1 vs 13.0:

  • Identical code
  • Identical compiler options
  • Different inlining results for a complex expression that negatively affect the performance of my code

I will try and bundle this up in an acceptable way for Premier Support.

A duplicate post was removed. Sorry about that; I'm not sure why it happens from time to time.

Thanks for the feedback.

>>...- not sure what you are referring to when you say "I would use a pointer to a data set directly without calling the additional
>>indexing C++ operator.".

Here is an example:
...
template < class T, ... > class TDataSet
{
public:
    ...
    inline T * operator[]( RTint iIndex )
    {
        ...
        return ( T * )m_ptData2D[ iIndex ];
    }
    ...
private:
    ...
    T **m_ptData2D;
    ...
};

Hello,

I've finally reproduced the problem and forwarded it to engineering (DPD200242982).
As soon as I learn more I'll let you know.

Thank you & best regards,

Georg Zitzlsberger

Great! Thanks for the effort to track this down!

Hello,

Engineering just added an improvement for Intel(R) Composer XE 2013 SP1, which is currently in BETA. The initial release will be available at the end of this year. 13.x won't get this improvement because we only do stability fixes there right now.

There's also a "workaround": if you use /Qip you should get it in-lined with the current 13.x compilers as well.
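
For example (a sketch; the file name is hypothetical):

    icl /O3 /Ob2 /Qip matrix_bench.cpp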

Best regards,

Georg Zitzlsberger

Hi Georg,

I wonder if Intel software engineers do regular performance evaluations of the Intel C++ compilers against other C++ compilers (for different platforms), for at least the most important /O-like options, like /O2. In a set of my recent tests the Intel C++ compiler did not perform well when compared to some older C++ compilers. Unfortunately, I can't provide the sources of these tests (more than 2,500 lines of code in total), but the final ranking is as follows:

1. MinGW version 3.4.2 ( winner / released in 2004 / outperformed the others by ~5% )
2. Microsoft C++ compiler versions of VS 2005 & VS 2008
3. Intel C++ compiler versions v12.x & v13.x
4. Borland C++ compiler version 5.x

Note: The optimize-for-speed option /O2 was used in all test cases, for a recursive matrix multiplication algorithm with single-precision floating-point data types.

I understand that my information is too generic and doesn't provide enough technical details, but believe me that the code efficiency of the MinGW C++ compiler is really good (!).
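
Since I can't post the sources, here is only a generic sketch (an illustration, not the benchmarked code) of the kind of recursive single-precision matrix multiply being described:

    // Illustration only: recursive C += A*B for square power-of-two
    // matrices stored row-major with a common leading dimension "ld".
    // The caller must zero-initialize C.
    static void matmul_rec(const float *A, const float *B, float *C,
                           int n, int ld)
    {
        if (n <= 64) {                          // leaf: plain triple loop
            for (int i = 0; i < n; ++i)
                for (int k = 0; k < n; ++k)
                    for (int j = 0; j < n; ++j)
                        C[i*ld + j] += A[i*ld + k] * B[k*ld + j];
            return;
        }
        const int h = n / 2;                    // split into four quadrants
        const float *A11 = A, *A12 = A + h, *A21 = A + h*ld, *A22 = A + h*ld + h;
        const float *B11 = B, *B12 = B + h, *B21 = B + h*ld, *B22 = B + h*ld + h;
        float *C11 = C, *C12 = C + h, *C21 = C + h*ld, *C22 = C + h*ld + h;
        matmul_rec(A11, B11, C11, h, ld);  matmul_rec(A12, B21, C11, h, ld);
        matmul_rec(A11, B12, C12, h, ld);  matmul_rec(A12, B22, C12, h, ld);
        matmul_rec(A21, B11, C21, h, ld);  matmul_rec(A22, B21, C21, h, ld);
        matmul_rec(A21, B12, C22, h, ld);  matmul_rec(A22, B22, C22, h, ld);
    }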

Hi, reading this topic I decided to re-compile some of my own test codes with Intel 12.1.4.325, Intel 13.1.0.149, and MSVC 2008. Here are some results, in seconds:

                   13.1.0.149    12.1.4.325    MSVC 2008
NNS brute             43.87         52.55        70.40
NNS NearPT             0.363         0.411        0.2893
NNS kdtree             1.138         1.152        1.074
Mixed code             3.732         3.649        4.937

All compilations used equivalent optimization options, generating 32-bit code, running under Windows 7 on an i7 920, with the target processor set to SSE2 only. NNS is "Nearest Neighbor Search", in three flavors. "Mixed code" uses 64-bit integers and floating-point operations, including some transcendental functions.

Hi Sergey. I was impressed by the times you reported with MinGW, so I tested the latest version available. But my results are very bad for the same test cases as in my previous post. I am not familiar with MinGW; my switches were -O2 -msse3. Results:

NNS  brute       107.06

NNS NearPt     1.989

NNS kdtree      5.951

Mixed code     7.336  (very good for 64-bit ints, very bad for transcendental functions and Taylor series)

Could you please recommend better switches for maximum performance with SSE3 in MinGW?

Thanks.  

The original issue I reported was a very specific one, related to inlining inside a 'bottleneck' function. Once this specific issue was worked around, I found no other performance issues with compiler 13.1.

[ Armando wrote ]

>>... I am not familiar with MinGW, my switches were -O2 -msse3

These options look right.

[ Vasci_ wrote ]

>>...The original issue I reported was a very specific one, related to inlining inside a 'bottleneck' function. Once this specific
>>issue was worked around, I found no other performance issues with compiler 13.1...

Everybody wants to get the best return from the Intel C++ compiler, and that is why these performance evaluations are done. Personally, I want to be confident in the Intel C++ compiler and, of course, I understand that it is simply impossible for Intel software developers / engineers to test the compiler with tens of thousands of different algorithms. The C/C++ codes I've tested are very portable, and I'm concerned that there are no performance gains in some cases. I'm less concerned when some "cool" C++11 feature is partially supported or not supported at all. I simply need to do processing as fast as possible.

Here is a short follow up...

I've done additional verification and I see that the situation is more complicated. The MinGW C++ compiler ( v3.4.2 ) outperformed the Intel C++ compiler ( v12.x ) by ~19 percent, and this is a significant difference.

[ MinGW C++ compiler - -O2 ]
...
Matrix Size : 1024 x 1024
...
Strassen HBC - Pass 1 - Completed: 4.32800 secs
Strassen HBC - Pass 2 - Completed: 2.57800 secs
Strassen HBC - Pass 3 - Completed: 2.57800 secs
Strassen HBC - Pass 4 - Completed: 2.56300 secs Note: Best time ( BT1 ) for MinGW - ~19 percent faster than BT2
Strassen HBC - Pass 5 - Completed: 2.57800 secs
...

[ Intel C++ compiler - /O2 ]
...
Matrix Size : 1024 x 1024
...
Strassen HBC - Pass 1 - Completed: 4.92200 secs
Strassen HBC - Pass 2 - Completed: 3.20300 secs
Strassen HBC - Pass 3 - Completed: 3.18700 secs Note: Best time ( BT2 ) for ICC
Strassen HBC - Pass 4 - Completed: 3.20400 secs
Strassen HBC - Pass 5 - Completed: 3.18700 secs
...

Do you want me to compare MinGW C++ compiler ( v3.4.2 ) vs. Intel C++ compiler ( v13.x ) on a:

Dell Precision Mobile M4700
Intel Core i7-3840QM ( Ivy Bridge / 4 cores / 8 logical CPUs / ark.intel.com/compare/70846 )
32GB RAM
320GB HDD
NVIDIA Quadro K1000M ( 192 CUDA cores / 2GB memory )
Windows 7 Professional 64-bit

system?

Hi Sergey, your results are astonishing to me! My source code is not C++, it is C99, and this could make a great difference for the compiler optimizer (only my guess). In our case performance is very important because we develop medical physics systems for radiation therapy planning and image-guided neurosurgery; in some regions a solution takes several seconds or minutes. I frequently use a battery of test cases in C99 that resembles real problems. I will try some other codes with MinGW, but my first impressions of it were frustrating. For the Intel compiler I mostly use:

/O3 /Ob2 /Oi /Ot /Oy /Qip /GA /GF /MT /GS- /arch:SSE2 /fp:fast=2 /Qfp-speculation:fast /fp:double /Qparallel /Qstd=c99

The use of a conservative SSE2 target is to avoid problems with users keeping old hardware, and some with AMD CPUs. Some of these switches have no MinGW equivalent that I know of, for example /Qparallel.

When I have new results I will post them here.

Armando

>>...The use of a conservative SSE2 is to avoid problems with users keeping old hardware and some with AMD CPUs...

This is absolutely the right approach, and I cannot assume that everybody has a computer with a latest-generation Intel CPU supporting the AVX or AVX2 instruction sets. I also don't think that the use of pure C, C99, C++, or C++11 should affect a compiler's optimization. For example, Borland C++ is 15+ year old technology, yet it outperforms almost all modern C++ compilers when no optimization switches are used at all when compiling sources. That is a clear indication of how efficient its default code generation is.

I've been trying to bring this matter to the attention of Intel software engineers for some time, and I'm not sure that resolute steps have been taken to reduce the overhead of the Intel C++ compiler's default code generation.

Thanks for the information about your software and it is very interesting to know.

Quote:

Armando Lazaro Alaminos Bouza wrote:

My source code is not C++, it is C99, and this could make a great difference for the compiler optimizer (only my guess). In our case performance is very important because we develop medical physics systems for radiation therapy planning and image-guided neurosurgery; in some regions a solution takes several seconds or minutes. I frequently use a battery of test cases in C99 that resembles real problems. I will try some other codes with MinGW, but my first impressions of it were frustrating. For the Intel compiler I mostly use:

/O3 /Ob2 /Oi /Ot /Oy /Qip /GA  /GF /MT  /GS- /arch:SSE2 /fp:fast=2  /Qfp-speculation:fast /fp:double /Qparallel /Qstd=c99 

The use of a conservative SSE2 target is to avoid problems with users keeping old hardware, and some with AMD CPUs. Some of these switches have no MinGW equivalent that I know of, for example /Qparallel.

C99 vs C++ makes no difference to the optimizer, with the compilers mentioned here, unless your style changes (as it well might).

g++ accepts the C99 restrict qualifier if it is spelled __restrict (and Intel C++ for Linux accepts that spelling as well). ICL accepts restrict in C++ code with the option /Qrestrict. Both the Intel and gnu compilers make good use of restrict pointers ( * __restrict ptr).

You have a point in that some of the C99 features for optimization will not work with all C++ compilers.
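
A minimal sketch of the kind of loop that benefits (the function and names are hypothetical):

    // With __restrict the compiler may assume dst and src never alias,
    // removing the dependence that would otherwise block vectorization.
    void scale(float * __restrict dst, const float * __restrict src,
               float k, int n)
    {
        for (int i = 0; i < n; ++i)
            dst[i] = k * src[i];
    }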

Intel 14.0 beta shows better optimization of certain STL as well as plain C99 features enabled by __restrict where current Intel released compilers require #pragma ivdep.

I find such long option strings confusing. I will mention that /fp:double could prevent some optimizations on the float data type, due to the requirement to promote so many operations to double. It also prevents optimizations on sum reductions, where it would be better to write the double casts and reduction variables into your source code so as to control how the promotion is applied.
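
For example, a sketch of writing the promotion into the source instead of relying on /fp:double (names hypothetical):

    // The accumulator is declared double in the source, so the reduction
    // keeps double precision while the float elements stay float.
    double sum_floats(const float *x, int n)
    {
        double sum = 0.0;               // explicit double reduction variable
        for (int i = 0; i < n; ++i)
            sum += (double)x[i];        // explicit widening, per element
        return sum;
    }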

If you use only double data types, /fp:double I think would have the same effect as /fp:source.

Thanks for the clarifications! I stopped using an explicit restrict switch because the C99 option ( /Qstd=c99 ) should include that. About "source" vs "double" floating-point evaluation: I use "source" in code where single precision is acceptable and "double" where I need to preserve all the bits of the representation. Of course, mixing has a performance penalty.

>>...Of course, mixing has a performance penalty...

It is up to you to decide what is more important, that is, precision or performance. I'm confident that some balance between these two almost opposite goals can always be found. Take a look at:

Forum topic: Mixing of Floating-Point Types ( MFPT ) when performing calculations. Does it improve accuracy?
Web-link: software.intel.com/en-us/forums/topic/361134

as soon as you have time. Thanks.

Hi Sergey, good point and topic! ( source vs double : precision vs performance )

I learned about that three years ago when I migrated my projects from Watcom C (OpenWatcom at the time) to Intel. In Watcom every floating-point operation is processed with the traditional FPU (80 bits), and the compiler uses an FPU register as the accumulator. So my first .exe generated with Intel gave different and worse results. Using the switch /fp:double in Intel was enough to reach the same results, as far as they are physically relevant.

By the way, if you like to try good old C/C++ compilers, take a look at Watcom. I think it has no support for modern flavors of C++. But, for example, integer processing, bit operations, etc. are great. In floating-point processing it is weak, because it makes no use of SSEx. Another drawback (for me) is the lack of support for OpenMP. In the past (some 15 years ago) Watcom was the performance winner in most contests, ahead of Borland, MS, etc.

>>...In the past (some 15 years ago) Watcom was the performance winner in most contest, in front of Borland, MS, etc...

I know, because I used the Watcom C++ compiler for developing NetWare Loadable Modules ( NLMs ) between 1994 and 1998.
