If you look at **smmintrin.h**

...

* Principal header file for Intel(R) Core(TM) 2 Duo processor

* SSE4.1 intrinsics

...

you will see the following declaration for the **_mm_mpsadbw_epu8** intrinsic function:

...

/*

* Sum absolute 8-bit integer difference of adjacent groups of 4 byte

* integers in operands. Starting offsets within operands are

* determined by mask

*/

extern __m128i __ICL_INTRINCC **_mm_mpsadbw_epu8**( **__m128i**, **__m128i**, const int );

...

So, with **SSE4.1**, the **__m128i** returned by **_mm_mpsadbw_epu8** holds **8** unsigned 16-bit sums. Take a look at **immintrin.h**:

...

extern __m256i __ICL_INTRINCC **_mm256_mpsadbw_epu8**( **__m256i**, **__m256i**, const int );

...

In that case, the **__m256i** returned by the 256-bit version (which needs **AVX2**, not just AVX) holds **16** unsigned 16-bit sums, and so on. Overall, a hardware upgrade could be worth considering.
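To make the header comment a bit more concrete, here is a minimal plain-C sketch of what the 128-bit operation computes, as I understand it from Intel's documentation. The function name `mpsadbw_model` is made up for illustration and is not part of any header; it is only a scalar model, not how the hardware or the compiler implements the intrinsic.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical scalar model of _mm_mpsadbw_epu8(a, b, imm8).
 * a and b stand for the 16 bytes of each __m128i operand,
 * out receives the 8 unsigned 16-bit sums of the result.
 */
void mpsadbw_model(const uint8_t a[16], const uint8_t b[16],
                   int imm8, uint16_t out[8])
{
    /* Only the low 3 bits of the mask are modelled here. */
    assert(imm8 >= 0 && imm8 <= 7);

    /* Bit 2 selects the starting byte in a: 0 or 4. */
    int a_off = ((imm8 >> 2) & 1) * 4;
    /* Bits 1:0 select one 4-byte group in b: 0, 4, 8 or 12. */
    int b_off = (imm8 & 3) * 4;

    for (int i = 0; i < 8; ++i) {
        unsigned sum = 0;
        /* SAD of the fixed 4-byte group of b against a sliding window of a. */
        for (int k = 0; k < 4; ++k)
            sum += (unsigned)abs((int)a[a_off + i + k] - (int)b[b_off + k]);
        out[i] = (uint16_t)sum;
    }
}
```

For example, with a = 0, 1, ..., 15 and b all zero, a mask of 0 gives 6 as the first sum (0+1+2+3) and 34 as the last one (7+8+9+10). So one call yields 8 sums, each over 4 bytes, which is where the 8 in the 128-bit case comes from.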

## SSE4 Register-Handling

I'm working on a stereo algorithm to compute a disparity map, so I need to calculate a lot of SAD values.

To improve the performance I want to use SSE4, especially the "_mm_mpsadbw_epu8" intrinsic.

I stumbled over this Intel document. Section F, "Intel® SSE4 – Optimized Function for 16x16 Blocks", contains a SAD calculation example for a 16x16 block. I used this snippet in my code and the performance improved a lot, but it is still not enough. Is it possible to boost the performance by using all 16 SSE registers instead of 8, or is there some kind of constraint?

Best regards,

Jambalaja