I have run VTune on several sections of code generated by the C and FORTRAN compilers and have noticed the following pattern in critical loops. I think I have the compiler and VTune options set properly to show me a mixed source and assembler listing with the actual number of clock cycles required for each instruction. The problem I've seen is that the same instruction may take vastly different times to execute. This is especially true of the if()then branch at the top of a tight loop, which may take 100 times as long to execute as it should. Is this possible, or am I misinterpreting the output of VTune?
I think it's true, because I'm very familiar with assembler and I've confirmed it by moving the code around, adding instructions, and timing the results with the high precision timer. What seems to be happening is that the preprocessor does a really bad job of predicting what's coming, so that the processor is stumbling all over itself.
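For anyone who wants to try the same kind of timing check, here is a minimal sketch of it in C. It is not the code from my application: the function names (`sum_above`, `time_passes`, `fill_random`, `cmp_int`) are made up for illustration, and it uses the standard `clock()` as a portable stand-in for the high precision timer, which is an assumption on my part. The idea is to time the same loop over the same data twice, once in random order and once sorted, so that the only thing that changes is how predictable the if() at the top of the loop is.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical test loop: sum the elements that are >= threshold.
 * The if() at the top of the loop body is the branch under test. */
int64_t sum_above(const int *data, size_t n, int threshold)
{
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (data[i] >= threshold)
            sum += data[i];
    }
    return sum;
}

/* Comparison callback for qsort(): ascending order of int. */
int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a;
    int y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Fill the array with pseudo-random values in the range 0..255. */
void fill_random(int *data, size_t n, unsigned seed)
{
    srand(seed);
    for (size_t i = 0; i < n; i++)
        data[i] = rand() % 256;
}

/* Run `reps` passes of sum_above() over the data and return the elapsed
 * CPU time in seconds. clock() is an assumption standing in for a
 * platform-specific high precision timer. The accumulated sum is written
 * to *total_out so the compiler cannot discard the loop. */
double time_passes(const int *data, size_t n, int threshold,
                   int reps, int64_t *total_out)
{
    assert(n > 0 && reps > 0 && total_out != NULL);

    int64_t total = 0;
    clock_t start = clock();
    for (int r = 0; r < reps; r++)
        total += sum_above(data, n, threshold);
    clock_t stop = clock();

    *total_out = total;
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

To use it, fill a large array with `fill_random()`, call `time_passes()` with a threshold of 128, sort the array with `qsort(data, n, sizeof data[0], cmp_int)`, and call `time_passes()` again. The two totals must be identical since the data is the same. If the sorted pass comes out much faster, the difference is down to how well the branch is being predicted. One caveat: at higher optimization levels the compiler may replace the branch with a conditional move or vectorize the loop, in which case the two timings can come out nearly the same, so it is worth checking the generated assembler.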
The compilers are generating efficient code. I couldn't write more efficient code if I wrote the entire loop in assembler, but it's always going to be slower than a herd of snails, because the one instruction near the top of the loop takes longer than the following dozen instructions!
It looks like the $#@% preprocessor is the bottleneck in my application! If the preprocessor does such a %$#@! poor job of predicting what's coming, and messes things up so badly that it takes this long to clean up the carpet before the processor can proceed, then the time needed to execute any given instruction is unknown and totally out of control.
Then of what use is VTune? Or of what use is an optimizing compiler? Wouldn't I be better off without a preprocessor? This is so %$#@! irritating! I can see an instruction taking a few extra clock cycles, but an extra 1500? An extra 3000? Something is really wrong here!
Message Edited by email@example.com on 06-14-2006 07:13 AM