Bad code generation on indirect loops


Posted by Bert Jonson

Simple test-case:

#include <stdio.h>
int main() {
    for(int i = 100; i--; )
        puts("hello worldn");
}

ICC 13 update 1 with /O3 generates this:

loc_401037:
push offset aHelloWorld ; "hello world\n"
call sub_401060
add esp, 4
dec esi
cmp esi, 0FFFFFFFFh
 jnz short loc_401037

The cmp does absolutely nothing. GCC and MSVC generate better code without the cmp.

BTW: when will there be a new (maybe beta) release of ICC with delegating constructors and other new C++ features?

Posted by bernaske

#include <stdio.h>
int main() {
for(int i = 100; i--; )
puts("hello worldn");
}

1.) semicolon after i-- is wrong
2.) puts(" Hello world \n"); is the correct form
3.) variable i is not defined
for( int i .... ) is allowed in the newer C / C++ standards (C99 and so on)

#include <stdio.h>
int main()
{
int i;

for( i = 0; i < 101; i-- )
puts("hello world \n");
}

This test case works with option -O3 and Parallel Studio XE 2013 under Linux without problems.

Posted by Sergey Kostrov

I'd like to ask you a couple of questions...

How many times do you want to display the 'hello world' phrase when the for loop is implemented as follows?
...
int i;
for( i = 0; i < 101; i-- )
puts( "hello world" );
...

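[ Note ] In the above case the phrase will be displayed about 2^31 times with a 32-bit int, because i only stops being less than 101 once it wraps around (which is formally undefined behavior for a signed int). Is that what you wanted?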

I understood that you simply wanted to verify the quality of code generation of the Intel C++ compiler. Is that correct?

Best regards,
Sergey

PS: The following code will display the phrase 100 times:
...
int i;
for( i = 100; i > 0; i-- )
puts( "hello world" );
...

Posted by Tim Prince

The originally quoted syntax of the for should be OK under C++ (as originally implied) or C99. Without fixing the worldn typo, you might overflow a buffer. As you imply Windows, the name of the compiler would be ICL, and its decision to use C++ normally would be based on the file name.
Like the others, I don't see how you can prove whether one instruction sequence or another is faster for controlling a loop which involves a function call and i/o. I assume that /O3 has little effect when the loop is executing a non-inline function call.
ICL does have a bias against downward counting for loops, although I can't see it making a difference in this case.
gcc has an automatic transformation to implement upward counting loops with downward count in certain situations (not where vectorization is a possibility). So it's hard to make a case that writing your loop with a downward count will optimize it.

Posted by Sergey Kostrov

>>...gcc has an automatic transformation to implement upward counting loops with downward count in certain situations...

Tim, where did you read about it? That sounds very interesting, and I think it could change the logic of some algorithms when there is a break statement inside a for loop. It means that a different number of iterations would be needed to hit the break condition. Is that important? Yes, because if a developer implemented a for loop with a downward count, something forced the developer to do it.

Best regards,
Sergey

Posted by Bert Jonson

No, my code is correct and puts will be executed 100 times. We can trace it:

for(int i = 2; i--; ) {}

1: i == 2, so the loop condition holds; i is decremented to 1 and the loop body executes
2: i == 1, so the condition holds again; i is decremented to 0 and the loop body executes
3: i == 0, so the loop exits. The condition still decrements i on the way out, so i == -1 after the loop, but that is not a problem because we don't use i after the loop.

So the loop body is executed 2 times, as expected.

MSVC generates the following code for this loop:

loc_401006:
push offset aHelloWorld ; "hello world\n"
call sub_40101A
add esp, 4
dec esi
jnz short loc_401006

And GCC:
loc_4077B4:
mov [esp+14h+var_14], offset aHelloWorld ; "hello world\n"
call puts
sub ebx, 1
jnz short loc_4077B4

I'm not talking only about this code. It seems that ICC doesn't know that any cmp with 0 after inc/dec/sub/add is useless because dec/inc/sub/add already set the Z flag.
It simply wastes CPU ticks.

Posted by Georg Zitzlsberger (Intel)

Hello,

That's a good finding! I forwarded it to compiler engineering (DPD200239520) and will let you know about the progress.

A pre-decrement could be used as a workaround, as it does not show the superfluous "cmp":


#include <stdio.h>

int main() {
    for(int i = 101; --i; )
        puts("hello worldn");
}


..B1.2:
        movl      $.L_2__STRING.0, %edi
        call      puts
..B1.3:
        decl      %r12d
        jne       ..B1.2

%r12d starts with 100 here and does not need the "cmp" for underflow checking. In your original example the initial value was 99, thus requiring the inefficient handling of underflow.

I'd recommend pre-decrement/increment operators in general as they also have other advantages when it comes to OOP.

Best regards,

Georg Zitzlsberger

Posted by jimdempseyatthecove

The code generated is correct, and not superfluous, though it could be coded differently.

for(int i = 100; i--; )

is postfix, meaning the for loop body executes when i prior to -- is non-zero, and thus will execute body with i==0. The test for -1 is correct, however you could potentially use jge immediately following the dec (without the cmp).

*** note though, the code change (removal of cmp and use of jge) would not necessarily be faster. You can test for this with use of in-line assembly. If you do, be sure you align the code such that cache line issues do not skew the results to favor one technique over the other.

Jim Dempsey

www.quickthreadprogramming.com
Posted by Georg Zitzlsberger (Intel)

Hello Jim,

"superfluous" in terms of performance, not in terms of semantics.
Yes, the generated code is correct and makes sense once the induction variable "i" is used. But here it is not, and shifting the bounds lets us get rid of the "cmp". The initial request is still valid -- admittedly for such rare cases only.

Best regards,

Georg Zitzlsberger

Posted by jimdempseyatthecove

Georg,

>>"superfluous" in terms of performance, not in terms of semantics.

I have not run a proper test to verify if "dec; jge" is faster/slower/same as "dec; cmp; jg".
Compiler optimizations are not about "elegant semantics", but rather about "ultimate performance". Fewer instructions are often faster, but not always; I am pointing this out for the readers.

Besides, in Bert's second post:

>>I'm not talking only about this code. It seems that ICC doesn't know that any cmp with 0 after inc/dec/sub/add is useless because dec/inc/sub/add already set the Z flag...It simply wastes CPU ticks.<<

Testing for Z would have been in error (jnz). The code's use of esi (to represent the value of i) would require testing the sign of the result (jge).
GCC correctly uses jnz because the loop control variable was recognized as not being used, and thus the compiler substituted an iteration count as opposed to using a loop control variable.

As to which is better (faster)... run a proper test and measure the results (CPU architecture will cause variance).

Jim Dempsey

www.quickthreadprogramming.com
