Hello,
I've been working on a vision processing system written in C++. It works great, but I wasn't pleased with the CPU usage, so I began to run analysis tools on the program to see why my program was taking up 20% of the CPU time. I found one segment of code that the compiler refused to optimize (and I understand why -- it would have to make an assumption that would be unsafe to make, though I know to be true). So I decided to write the offending function in assembly and optimize it myself with SSE (I'm talking about processing a ton of pixels, so SSE offers the capability I'm looking for).
It worked great, and brought my CPU usage from 20% down to about 4% idle, 10% active, which I consider acceptable for my application, but I'd like to go more. So I analyzed it again. I was originally concerned with the "MOVDQU" instruction that I was using because I know I could take a severe penalty for that instruction, so I expected that to show up. However, my results suggested otherwise. Here is a block of the code:
Two things seem odd to me:


