Slow code execution

Hi,

I am running the following code, which consists of fn1() and fn2(), on my Intel Penryn ULV 1.4 Core 2 Duo:

http://paste.org/70232

fn1() is visibly slower than fn2(). Inspecting the .s assembly output from gcc -S, I noticed that fn1() basically loops a decl instruction ~64 times, while fn2() seems to consist of ~23 instructions, including 2 mul instructions, which need to be repeated 10 times in this example. Despite this, fn1() takes ~3 times longer to execute. (I compile without -O; otherwise gcc applies optimizations that alter the nature of fn1().)

Would someone be so kind as to explain what causes fn1() to execute more slowly?

 

thanks,

 

M

iliyapolak's picture

Looking at the source code, it seems that fn2() should be slower because of the modulo and division-assignment operations. On closer inspection, however, the variable i in fn2() is not used, so an optimizing compiler can exclude that line of code from the compilation. The first function has 64 decrement operations and backward conditional jumps.

I suppose that when both functions are executed in a loop inside main(), the compiler could optimize fn2() further once it realizes that fn2() performs the same operation on every loop iteration.

iliyapolak's picture

@mlf.c

Can you post disassembled code?

Are you trying to verify past research about Penryn partial flag stalls?

Do you remember how Intel worked to get compilers changed to use addl -1 in place of decl, and the world refused to use special options to handle this?

Are you tied to some specific combination of gcc version and -mtune options?

iliyapolak's picture

@Tim

Do you mean partial flag merge stalls?

Hi Iliya & Tim,

I haven't heard about the partial flag stalls, but I tried it on a Core i7 and the results aren't much different, other than both functions executing faster; there is still a noticeable difference between the execution times of fn1() and fn2().

http://pastie.org/8694561

I have included the relevant parts. gcc seems to optimize /10 and %10 and uses mul instead of div.
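For readers unfamiliar with the mul-instead-of-div transformation mentioned above, here is a minimal sketch in C of one common form of it for unsigned 32-bit values. The function names div10, mod10 and check_div10 are made up for this illustration, and the exact constants and shifts gcc emitted for the posted code are in the linked paste and may differ from what is shown here.

```c
#include <stddef.h>
#include <stdint.h>

/* Quotient x / 10 for unsigned 32-bit x, computed without a divide:
   multiply by ceil(2^35 / 10) = 0xCCCCCCCD in 64 bits, then shift right by 35. */
static uint32_t div10(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
}

/* Remainder x % 10 recovered from the quotient with a multiply and a subtract. */
static uint32_t mod10(uint32_t x)
{
    return x - div10(x) * 10u;
}

/* Compare against the plain / and % operators for a few sample values.
   Returns 1 if every sample agrees, 0 otherwise. */
static int check_div10(void)
{
    const uint32_t samples[] = { 0u, 9u, 10u, 99u, 12345u, 4294967295u };
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        if (div10(samples[i]) != samples[i] / 10u) return 0;
        if (mod10(samples[i]) != samples[i] % 10u) return 0;
    }
    return 1;
}
```

This accounts for two multiplies per iteration when both /10 and %10 are computed: one for the quotient and one to rebuild the remainder from it.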

Patrick Fay (Intel)'s picture

Hello mlf,

Are these 2 sections of code important to a real application or are you just curious?

When I build with optimization turned on in VC12, both routines get optimized away, since they don't return a value and don't change any non-local variable.

Assuming this is not just idle curiosity or a homework assignment: You don't really have any timer info around the routines so it is hard to say how many instructions/clocktick are getting executed by each function.

Pat

Hi Pat,

 

I removed all the parts of the code that didn't seem to affect the speed of execution in order to pinpoint the problem, and ended up with this simple piece of code. Using clock() does show that fn1() is much slower, although it's not very precise. From looking at the assembly code posted above, I assume movl, addl, subl, sall, shrl, cmp and the jumps are still one-clock instructions (haven't been coding for a while :) so there are 22 instructions + 2 mull instructions repeated 10 times, as opposed to the slower subl, cmp, jns sequence repeated 65 times.
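As a rough sketch of the kind of clock()-based timing discussed here, the C code below times many repeated calls to a function. The functions fn1_standin and fn2_standin are hypothetical stand-ins written from the descriptions in this thread (a count-down loop, and a ten-iteration %10 and /=10 loop); they are not the code from the paste links. The names sink and time_calls are also invented for this example.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-ins, NOT the code from the paste links. */
static volatile unsigned sink;  /* volatile store keeps each call from being discarded */

static void fn1_standin(void)
{
    int i = 64;
    while (i >= 0)              /* count-down loop, roughly as described for fn1() */
        i--;
    sink = (unsigned)i;
}

static void fn2_standin(void)
{
    uint32_t n = 4294967295u;   /* ten decimal digits, so ten iterations */
    uint32_t digits = 0;
    while (n != 0) {
        digits += n % 10u;
        n /= 10u;
    }
    sink = digits;
}

/* Processor time in seconds used by reps back-to-back calls of fn. */
static double time_calls(void (*fn)(void), long reps)
{
    clock_t start = clock();
    for (long r = 0; r < reps; r++)
        fn();
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling time_calls(fn1_standin, reps) and time_calls(fn2_standin, reps) from main() with a large reps and dividing by reps gives a per-call estimate. The volatile store only stops the compiler from dropping the whole call; with optimization enabled it may still collapse the loops themselves, which is consistent with compiling without -O as in the original post. clock() has coarse resolution, so many repetitions are needed for a usable number.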

iliyapolak's picture

@mlf.c

Maybe the presence of the shrl instruction causes the aforementioned flag merge stalls?

Can you run VTune analysis on your code?

iliyapolak's picture

Actually, the cmp/jmp branch instruction can be executed in parallel with the variable decrement instruction, although the dec instruction's uop probably must wait for the result of the branch instruction.

 
