I've been working on a vision processing system written in C++. It works great, but I wasn't pleased with the CPU usage, so I began running analysis tools to see why the program was taking up 20% of the CPU time. I found one segment of code that the compiler refused to optimize (and I understand why: it would have to make an assumption that is unsafe in general, though I know it to be true here). So I decided to write the offending function in assembly and optimize it myself with SSE (I'm processing a ton of pixels, so SSE offers the capability I'm looking for).
It worked great and brought my CPU usage from 20% down to about 4% idle, 10% active, which I consider acceptable for my application, but I'd like to go further. So I profiled it again. I was originally concerned about the "MOVDQU" instruction, because I know an unaligned load can carry a severe penalty, so I expected it to show up. However, my results suggested otherwise. Here is a block of the code:
                                                                                ; Hits (%)
    PXOR    xmm3, xmm3                  ; xmm3 = 0                              0 (0%)
    PXOR    xmm2, xmm2                  ; xmm2 = 0                              4 (0.01%)
    MOVDQU  xmm0, XMMWORD PTR[eax]      ; Get next 5 pixels                     1 (0%)
    PCMPGTB xmm3, xmm0                  ; 0 > xmm0? (is val byte negative)      2752 (6.74%)
    PCMPGTB xmm2, xmm1                  ; 0 > xmm1? (is thresh byte negative)   103 (0.25%)
    MOVDQA  xmm4, xmm1                  ; Copy thresholds                       1 (0%)
    PCMPGTB xmm4, xmm0                  ; Test it! Did we exceed threshold?     2 (0%)
    PXOR    xmm4, xmm7                  ;                                       437 (1.07%)
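To make the listing easier to follow, here is a rough scalar C++ model of what a single byte lane goes through. It is only a sketch: the names `pcmpgtb_lane`, `LaneResult`, and `model_lane` are made up for illustration, and the contents of xmm7 are not shown in the listing, so they are passed in as an assumed parameter. The point it illustrates is that PCMPGTB compares bytes as signed values, which is presumably why the "is negative" masks are computed separately.

```cpp
#include <cassert>
#include <cstdint>

// One byte lane of "PCMPGTB dst, src": dst = (dst > src) ? 0xFF : 0x00,
// with both operands compared as SIGNED 8-bit values.
constexpr std::uint8_t pcmpgtb_lane(std::int8_t dst, std::int8_t src) {
    return dst > src ? 0xFF : 0x00;
}

// Per-lane results, named after the registers in the listing.
struct LaneResult {
    std::uint8_t val_negative;     // xmm3 lane: 0xFF if the pixel byte is negative
    std::uint8_t thresh_negative;  // xmm2 lane: 0xFF if the threshold byte is negative
    std::uint8_t mask;             // xmm4 lane: (thresh > val) XOR xmm7 lane
};

// Hypothetical helper modeling one byte lane of the block above.
// xmm7_lane is an ASSUMPTION: the listing does not show what xmm7 holds.
inline LaneResult model_lane(std::uint8_t val, std::uint8_t thresh,
                             std::uint8_t xmm7_lane) {
    const auto v = static_cast<std::int8_t>(val);     // reinterpret as signed
    const auto t = static_cast<std::int8_t>(thresh);  // reinterpret as signed
    return LaneResult{
        pcmpgtb_lane(0, v),                                        // PCMPGTB xmm3, xmm0
        pcmpgtb_lane(0, t),                                        // PCMPGTB xmm2, xmm1
        static_cast<std::uint8_t>(pcmpgtb_lane(t, v) ^ xmm7_lane)  // PCMPGTB + PXOR xmm4
    };
}

// Small self-check: a pixel byte of 200 reads as -56 when treated as signed,
// so the signed compare reports thresh (100) > val even though 200 > 100 unsigned.
inline void check_model() {
    const LaneResult r = model_lane(200, 100, 0x00);
    assert(r.val_negative == 0xFF);
    assert(r.thresh_negative == 0x00);
    assert(r.mask == 0xFF);
}
```

If xmm7 happens to be all ones, the final PXOR simply inverts the comparison mask, but that is a guess and not something the listing confirms.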
Two things seem odd to me: