I'm working on a stereo algorithm to compute a disparity map, so I need to calculate a lot of SAD (sum of absolute differences) values.
To improve performance I want to use SSE4, in particular the "_mm_mpsadbw_epu8" intrinsic.
I came across this Intel document. Section F, "Intel® SSE4 – Optimized Function for 16x16 Blocks", contains an example SAD calculation for a 16x16 block. I used this snippet in my code and the performance improved a lot, but it is still not enough. Is it possible to boost performance further by using all 16 SSE registers instead of 8, or is there some kind of constraint?
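
For reference, this is roughly the quantity computed per block, written as a plain scalar sketch. It is not the Intel snippet; the function name `sad16x16_ref` and the `stride` parameter (bytes between consecutive rows) are my own assumptions for illustration.

```c
#include <stdint.h>
#include <stdlib.h>

/* Scalar reference: sum of absolute differences over one 16x16 block.
 * a, b   : pointers to the top-left pixel of each 8-bit block
 * stride : bytes between consecutive rows (assumed equal for both images)
 */
uint32_t sad16x16_ref(const uint8_t *a, const uint8_t *b, size_t stride)
{
    uint32_t sum = 0;
    for (int y = 0; y < 16; ++y) {
        for (int x = 0; x < 16; ++x) {
            /* Widen to int before subtracting so the difference cannot wrap. */
            int diff = (int)a[y * stride + x] - (int)b[y * stride + x];
            sum += (uint32_t)abs(diff);
        }
    }
    return sum;
}
```

A scalar version like this is mainly useful as a baseline for checking that a vectorized version returns the same values.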