I ran into a strange situation while optimizing some MIC code.
The new, optimized code is faster when measured with __rdtsc(), but its actual run time is slower than the old code's! The code, by the way, is not a loop, and my co-worker has found that a loop sometimes runs faster than the unrolled loop. This led me to speculate that the icache may be starved by too many vector operations, so I added _mm_delay_32(n) to let it recover.
These are the results I got:
no delay added -- run time 6.65
delay(4) -- run time 6.63
delay(8) -- run time 6.60
delay(12) -- run time 6.57
So can someone verify whether my speculation has any basis in fact?
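
The kind of measurement described above could be sketched roughly as follows. This is only an illustration, not the actual code: read_ticks, insert_delay, best_ticks and example_kernel are hypothetical names, the kernel is a trivial placeholder, and the delay is assumed to be issued once after each kernel call. _mm_delay_32 is only used when compiling for the coprocessor (__MIC__ defined); elsewhere the delay is a no-op, and on non-x86 targets a monotonic clock stands in for __rdtsc().

```c
#define _POSIX_C_SOURCE 199309L  /* clock_gettime() for the non-x86 fallback */
#include <assert.h>
#include <stdint.h>
#include <time.h>

#if defined(__x86_64__) || defined(__i386__)
#  if defined(__INTEL_COMPILER)
#    include <immintrin.h>   /* __rdtsc(), and _mm_delay_32() when targeting MIC */
#  else
#    include <x86intrin.h>   /* __rdtsc() on GCC/Clang */
#  endif
#  define HAVE_TSC 1
#endif

/* Read a tick counter: the TSC on x86, a monotonic clock (ns) elsewhere. */
static uint64_t read_ticks(void) {
#ifdef HAVE_TSC
    return (uint64_t)__rdtsc();
#else
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
#endif
}

/* Hypothetical wrapper: stall for n cycles on MIC, do nothing elsewhere. */
static void insert_delay(unsigned n) {
#if defined(__MIC__)
    _mm_delay_32(n);
#else
    (void)n;
#endif
}

/* Placeholder for the code being optimized; it only counts its calls. */
static volatile unsigned long g_calls = 0;
static void example_kernel(void) { g_calls++; }

/* Run kernel + delay `reps` times and return the smallest tick count seen. */
static uint64_t best_ticks(void (*kernel)(void), unsigned delay, int reps) {
    assert(reps > 0);
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < reps; ++i) {
        uint64_t t0 = read_ticks();
        kernel();
        insert_delay(delay);
        uint64_t t1 = read_ticks();
        if (t1 - t0 < best) best = t1 - t0;
    }
    return best;
}
```

Calling best_ticks(example_kernel, n, reps) for n = 0, 4, 8, 12 and comparing the results against the overall wall-clock time would show whether the timed region and the total run time really move in opposite directions.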