I'm teachable and willing to do my homework, but it has been a while since I looked closely at CPU performance.
I'm looking at a little bit of rendering code that is stepping down a scanline and
seeing if it is done with a pixel,
then getting the next texture value - dq[p].
After the texture fetch it looks up the color in a 4096-entry RGBA LUT and
tests whether the color is zero - the test on fzsl - and this test takes 28% of the total time.
I'm sure there are many ways this could be improved, but my task right now is just to understand it.
I've done several runs and the numbers at each statement are pretty stable.
So, can the test of fzsl be where the cache misses catch up? That is, 28% of the total time is spent there,
only 6% where the fetch is initiated, and another 8% where the result is used to index the LUT.
Or are these numbers really just a statistical-neighborhood, Heisenberg-type artifact, not specifically tied to the individual statements?
if ( ss==-1 ) continue; 0.18914
short sz = (ss>> 0) & 0xffff; 0.11956
short ez = (ss>>16) & 0xffff; 0.00992
if ( (sz>z) || (ez<z) ) continue;
int dat = (dq[ p ]) & 0xffff;        0.585804    6%  <<<<<<< texture fetch
int msk_val = dat & 0x0000f000; 0.00993
__m128 m_lut = _mm_load_ps( &h_lut_buf[ dat<<2 ] ); 0.75983988 8% <<<< LUT lookup
int fzsl = _mm_ucomilt_ss( m_lut, m_zsl );
if ( fzsl )                          2.36337947  28%  <<<<<<<<<<< test fzsl
Thank you for any breadcrumbs!