SSE4 Register-Handling

I'm working on a stereo algorithm to compute a disparity map, which requires calculating a lot of SAD (sum of absolute differences) values.

To improve the performance I want to use SSE4, especially the "_mm_mpsadbw_epu8" instruction.

I stumbled over this Intel document. Section F, "Intel® SSE4 – Optimized Function for 16x16 Blocks", contains an example SAD calculation for a 16x16 block. I used this snippet in my code and the performance improved a lot, but it is not enough. Is it possible to boost the performance further by using all 16 SSE registers instead of 8, or is there some kind of constraint?
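For context, the scalar computation such SSE code replaces is just a double loop over the block; a minimal plain-C sketch (names like `sad_16x16` and `stride` are mine, not from the Intel document):

```c
#include <stdint.h>
#include <stdlib.h>

/* Scalar SAD of one 16x16 block pair.
 * left/right point at the top-left pixel of each block;
 * stride is the row pitch of the images in bytes. */
static unsigned sad_16x16(const uint8_t *left, const uint8_t *right, int stride)
{
    unsigned sad = 0;
    for (int y = 0; y < 16; y++)
        for (int x = 0; x < 16; x++)
            sad += abs(left[y * stride + x] - right[y * stride + x]);
    return sad;
}
```

A disparity search evaluates this for many candidate offsets per pixel, which is why the vector version pays off.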

Best Regards

Jambalaja


If you look at smmintrin.h

...
* Principal header file for Intel(R) Core(TM) 2 Duo processor
* SSE4.1 intrinsics
...

you will see the following declaration for the _mm_mpsadbw_epu8 intrinsic function:

...
/*
* Sum absolute 8-bit integer difference of adjacent groups of 4 byte
* integers in operands. Starting offsets within operands are
* determined by mask
*/
extern __m128i __ICL_INTRINCC _mm_mpsadbw_epu8( __m128i, __m128i, const int );
...

So, with SSE4.1 the __m128i returned by _mm_mpsadbw_epu8 holds 8 unsigned 16-bit integers. Take a look at immintrin.h:

...
extern __m256i __ICL_INTRINCC _mm256_mpsadbw_epu8( __m256i, __m256i, const int );
...

In that case the __m256i returned by AVX2's _mm256_mpsadbw_epu8 holds 16 unsigned 16-bit integers, and so on. Overall, a hardware upgrade could be considered.
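To make the layout concrete: per the SSE4.1 description, the immediate of _mm_mpsadbw_epu8( a, b, imm ) selects a fixed 4-byte group in the second operand ( bits [1:0], offset imm[1:0]*4 ) and a window base in the first operand ( bit [2], offset imm[2]*4 ), and the result is eight overlapping 4-byte SADs as unsigned 16-bit values. A plain-C model of that documented behavior ( a sketch, not the intrinsic itself ):

```c
#include <stdint.h>
#include <stdlib.h>

/* Scalar model of _mm_mpsadbw_epu8(a, b, imm):
 * imm bits [1:0] pick a 4-byte group in b (offset imm[1:0]*4),
 * imm bit  [2]   picks the window base in a (offset imm[2]*4).
 * Result: eight 16-bit SADs of b's 4 bytes against a's overlapping
 * 4-byte windows starting at a_off+0 .. a_off+7. */
static void mpsadbw_model(const uint8_t a[16], const uint8_t b[16],
                          int imm, uint16_t r[8])
{
    int a_off = ((imm >> 2) & 1) * 4;
    int b_off = (imm & 3) * 4;
    for (int i = 0; i < 8; i++) {
        uint16_t s = 0;
        for (int j = 0; j < 4; j++)
            s += (uint16_t)abs(a[a_off + i + j] - b[b_off + j]);
        r[i] = s;
    }
}
```

This is why the intrinsic is attractive for block matching: one instruction evaluates eight neighboring candidate positions at once.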

Hi Sergey, 

thanks for your reply. So if I use AVX instructions (which operate on the same registers as SSE, just 128 bits wider) I can probably reach twice the performance. But can I also reach twice the performance if I use all 16 registers instead of 8? Or are there any dependencies, like "you cannot compute on register 1 and register 9 in parallel" or something like that?

best regards

>>...But can i also reach twice the performance if i use all 16 registers instead of 8?..

With SSE4.1, in order to increase performance by a couple of percent, you could try to boost the priority of the process / thread to High or even Realtime.

With AVX or AVX2 the processing should be faster, but take into account that the CPU will also be faster ( a kind of double improvement: wider registers and a faster CPU ).

However, the throughput of the instruction is the same, and if it is 2 then two instructions will be executed in parallel on any system ( this is what you need ). Take a look at the Intel Software Developer's Manual for more technical details ( there is a chapter on instruction latencies and throughputs ).

>>...However, the throughput of the instruction is the same, and if it is 2 then two instructions will be executed in parallel on
>>any system ( this is what you need )...

I've known about that article and example since 2008. So, I've looked at the sources and this is how the processing looks:

...
row1 = _mm_sad_epu8( row1, row2 );
row3 = _mm_sad_epu8( row3, row4 );
sum = _mm_add_epi16(sum, row1);
sum2 = _mm_add_epi16(sum2, row3);
...

> But can i also reach twice the performance if i use all 16 registers instead of 8?

Operating on more registers doesn't boost performance by itself. But it does help if your data-processing context does not fit in 8 registers, and when your algorithm is friendly to instruction-level parallelism (i.e. when multiple instructions within a single thread can be issued in parallel by the CPU), given that you're not saturating the execution units already. In my experience, most algorithms are memory- or computation-bound.

> Or are there any dependencies, like "you cannot compute register 1 and register 9 parallel" or sth. like that?

No, at least not to my knowledge. All xmm/ymm registers are independent and interchangeable. However, there are dependency stalls, which basically happen when the CPU cannot execute subsequent instructions before the (supposed) input data for those instructions is ready (this is what counters instruction-level parallelism).
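A minimal scalar illustration of such a dependency chain, and of breaking it with independent accumulators ( plain C standing in for the xmm registers; the function names are mine ):

```c
#include <stddef.h>
#include <stdint.h>

/* One accumulator: every add depends on the previous one (a serial chain). */
static uint64_t sum_one_chain(const uint32_t *v, size_t n)
{
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Two independent accumulators: adjacent adds have no data dependency,
 * so an out-of-order core can issue them in parallel. Same result. */
static uint64_t sum_two_chains(const uint32_t *v, size_t n)
{
    uint64_t s0 = 0, s1 = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += v[i];
        s1 += v[i + 1];
    }
    if (i < n)            /* odd tail element */
        s0 += v[i];
    return s0 + s1;
}
```

Both return the same sum; the second form is what the 16x16 SAD example does with its sum / sum2 pair.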


Adrian,

I think it would be productive for you to post a complete function indicating where the choke point is located. Trying to give advice in generalities is often counterproductive. What is usually effective is to interleave the memory loads and stores with computations, as opposed to packing registers all together with loads, then performing all register-to-register calculations, then dumping the result registers of interest back to memory.

Jim Dempsey

www.quickthreadprogramming.com

>>...I think it would be productive for you to post a complete function indicating where the choke point is located...

Attached is a zip-archive with sources.

Attachment: w-motion-est.17686.zip (6.6 KB)

My code is posted below.

As you can see, I'm not using the registers in a smart way; I don't even have to use s8-s14.

Maybe I should load all registers first, then compute, load, compute, ... like Jim said. I'll try it.

left_ptr = left->ptr(x);
right_ptr = right->ptr(y);
right_shift_ptr = right->ptr(z);

// row1
s0 = _mm_loadu_si128((__m128i*)left_ptr);
s1 = _mm_loadu_si128((__m128i*)right_ptr);
s2 = _mm_loadu_si128((__m128i*)right_shift_ptr);
s3 = _mm_adds_epu16(s3, _mm_mpsadbw_epu8(s1, s0, 0));
s4 = _mm_adds_epu16(s4, _mm_mpsadbw_epu8(s1, s0, 5));
s5 = _mm_adds_epu16(s5, _mm_mpsadbw_epu8(s2, s0, 2));
s6 = _mm_adds_epu16(s6, _mm_mpsadbw_epu8(s2, s0, 7));
left_ptr+=const;
right_ptr+=const;
right_shift_ptr+=const;

// row2
s8 = _mm_loadu_si128((__m128i*)left_ptr);
s9 = _mm_loadu_si128((__m128i*)right_ptr);
s10 = _mm_loadu_si128((__m128i*)right_shift_ptr);
s11 = _mm_adds_epu16(s11, _mm_mpsadbw_epu8(s9, s8, 0));
s12 = _mm_adds_epu16(s12, _mm_mpsadbw_epu8(s9, s8, 5));
s13 = _mm_adds_epu16(s13, _mm_mpsadbw_epu8(s10, s8, 2));
s14 = _mm_adds_epu16(s14, _mm_mpsadbw_epu8(s10, s8, 7));
left_ptr+=const;
right_ptr+=const;
right_shift_ptr+=const;

// row3
s0 = _mm_loadu_si128((__m128i*)left_ptr);
s1 = _mm_loadu_si128((__m128i*)right_ptr);
s2 = _mm_loadu_si128((__m128i*)right_shift_ptr);
s3 = _mm_adds_epu16(s3, _mm_mpsadbw_epu8(s1, s0, 0));
s4 = _mm_adds_epu16(s4, _mm_mpsadbw_epu8(s1, s0, 5));
s5 = _mm_adds_epu16(s5, _mm_mpsadbw_epu8(s2, s0, 2));
s6 = _mm_adds_epu16(s6, _mm_mpsadbw_epu8(s2, s0, 7));
left_ptr+=const;
right_ptr+=const;
right_shift_ptr+=const;

//row4
s8 = _mm_loadu_si128((__m128i*)left_ptr);
s9 = _mm_loadu_si128((__m128i*)right_ptr);
s10 = _mm_loadu_si128((__m128i*)right_shift_ptr);
s11 = _mm_adds_epu16(s11, _mm_mpsadbw_epu8(s9, s8, 0));
s12 = _mm_adds_epu16(s12, _mm_mpsadbw_epu8(s9, s8, 5));
s13 = _mm_adds_epu16(s13, _mm_mpsadbw_epu8(s10, s8, 2));
s14 = _mm_adds_epu16(s14, _mm_mpsadbw_epu8(s10, s8, 7));
left_ptr+=const;
right_ptr+=const;
right_shift_ptr+=const;


Adrian,

You misread my statement; read it again. You have it backwards. You want to interleave the memory fetches and stores to some extent with register-to-register instructions, and you want to defer using the destination register of a memory fetch for some time after the fetch instruction. Using Sergey's uploaded code as an example:

Original code (edited slightly to reformat for this forum message):

for(iX = 0; iX <= (iRight - iLeft); iX++){
  sum = _mm_xor_si128(sum, sum);      // Clear accumulator
  sum2 = _mm_xor_si128(sum2, sum2);  // Clear accumulator
  row2 = _mm_loadu_si128((__m128i *)pucR);
  row4 = _mm_loadu_si128((__m128i *)(pucR + iWidth));
  row6 = _mm_loadu_si128((__m128i *)(pucR + 2*iWidth));
  row8 = _mm_loadu_si128((__m128i *)(pucR + 3*iWidth));
  row1 = _mm_load_si128((__m128i *) pucC);
  row3 = _mm_load_si128((__m128i *) (pucC + iWidth));
  row5 = _mm_load_si128((__m128i *) (pucC + 2*iWidth));
  row7 = _mm_load_si128((__m128i *) (pucC + 3*iWidth));
  row1 = _mm_sad_epu8(row1, row2);
  row3 = _mm_sad_epu8(row3, row4);
  sum = _mm_add_epi16(sum, row1);
  sum2 = _mm_add_epi16(sum2, row3);
  row5 = _mm_sad_epu8(row5, row6);
  row7 = _mm_sad_epu8(row7, row8);
  sum = _mm_add_epi16(sum, row5);
  sum2 = _mm_add_epi16(sum2, row7);
  row2 = _mm_loadu_si128((__m128i *)(pucR + 4*iWidth));
  row4 = _mm_loadu_si128((__m128i *)(pucR + 5*iWidth));
  row6 = _mm_loadu_si128((__m128i *)(pucR + 6*iWidth));
  row8 = _mm_loadu_si128((__m128i *)(pucR + 7*iWidth));
  row1 = _mm_load_si128((__m128i *) (pucC + 4*iWidth));
  row3 = _mm_load_si128((__m128i *) (pucC + 5*iWidth));
  row5 = _mm_load_si128((__m128i *) (pucC + 6*iWidth));
  row7 = _mm_load_si128((__m128i *) (pucC + 7*iWidth));
  row1 = _mm_sad_epu8(row1, row2);
  row3 = _mm_sad_epu8(row3, row4);
  sum = _mm_add_epi16(sum, row1);
  sum2 = _mm_add_epi16(sum2, row3);
  row5 = _mm_sad_epu8(row5, row6);
  row7 = _mm_sad_epu8(row7, row8);
  sum = _mm_add_epi16(sum, row5);
  sum2 = _mm_add_epi16(sum2, row7);

Notice how all the loads are congregated together. A re-arrangement of the statements will produce more efficient code. 

for(iX = 0; iX <= (iRight - iLeft); iX++){
  // Get SAD for block pair
  row1 = _mm_load_si128((__m128i *) pucC);             // first memory fetch
  row2 = _mm_loadu_si128((__m128i *)pucR);             // second memory fetch (first fetch in-flight)
  sum = _mm_xor_si128(sum, sum);                       // Clear accumulator (first and second fetch in-flight)
  sum2 = _mm_xor_si128(sum2, sum2);                    // Clear accumulator (first and second fetch in-flight)
  row3 = _mm_load_si128((__m128i *) (pucC + iWidth));  // third memory fetch  (first and second fetch in-flight)
  row4 = _mm_loadu_si128((__m128i *)(pucR + iWidth));  // fourth memory fetch (first, second, and third fetch in-flight)
  row1 = _mm_sad_epu8(row1, row2);                     // operation dependent on first and second in-flight memory fetches (hopefully complete by now)
  row5 = _mm_load_si128((__m128i *) (pucC + 2*iWidth));
  row6 = _mm_loadu_si128((__m128i *)(pucR + 2*iWidth));
  row3 = _mm_sad_epu8(row3, row4);                     // operation dependent on third and fourth in-flight memory fetches (hopefully complete by now)
  row7 = _mm_load_si128((__m128i *) (pucC + 3*iWidth));
  row8 = _mm_loadu_si128((__m128i *)(pucR + 3*iWidth));
  sum = _mm_add_epi16(sum, row1);
  sum2 = _mm_add_epi16(sum2, row3);
  row5 = _mm_sad_epu8(row5, row6);
  row7 = _mm_sad_epu8(row7, row8);
  row1 = _mm_load_si128((__m128i *) (pucC + 4*iWidth));
  sum = _mm_add_epi16(sum, row5);
  row2 = _mm_loadu_si128((__m128i *)(pucR + 4*iWidth));
  sum2 = _mm_add_epi16(sum2, row7);
  row3 = _mm_load_si128((__m128i *) (pucC + 5*iWidth));
  row4 = _mm_loadu_si128((__m128i *)(pucR + 5*iWidth));
  row1 = _mm_sad_epu8(row1, row2);
  row5 = _mm_load_si128((__m128i *) (pucC + 6*iWidth));
  row6 = _mm_loadu_si128((__m128i *)(pucR + 6*iWidth));
  row3 = _mm_sad_epu8(row3, row4);
  row7 = _mm_load_si128((__m128i *) (pucC + 7*iWidth));
  row8 = _mm_loadu_si128((__m128i *)(pucR + 7*iWidth));
  sum = _mm_add_epi16(sum, row1);
  sum2 = _mm_add_epi16(sum2, row3);
  row5 = _mm_sad_epu8(row5, row6);
  row7 = _mm_sad_epu8(row7, row8);
  sum = _mm_add_epi16(sum, row5);
  sum2 = _mm_add_epi16(sum2, row7);
  ...

The above is a (partial) rewrite of Sergey's sample program. The important point is that the processor is capable of having multiple memory fetches in flight at a time, and as long as you do not use the result of a fetch while it is in flight, the processor won't stall. IOW, you want to fill the fetch latencies with other useful (non-fetch-dependent) instructions.

Jim Dempsey

www.quickthreadprogramming.com

Hi Jim and Sergey,

thank you very much for your code examples. I will think about it, write it down and check the performance.

best regards 

adrian

Changed my code; looks like I got a slight performance boost. I'm trying to avoid calculations that have to "wait" for data, and to fill the "holes" with something else (like other calculations or pointer increments).

// load row1
s0 = _mm_loadu_si128((__m128i*)left_ptr);
s1 = _mm_loadu_si128((__m128i*)right_ptr);
s2 = _mm_loadu_si128((__m128i*)right_shift_ptr);
s3 = _mm_setzero_si128();
s4 = _mm_setzero_si128();
s5 = _mm_setzero_si128();
s6 = _mm_setzero_si128();

// 1-2
left_ptr+=const;
right_ptr+=const;
right_shift_ptr+=const;
// load row2
s8 = _mm_loadu_si128((__m128i*)left_ptr);
s9 = _mm_loadu_si128((__m128i*)right_ptr);
s10 = _mm_loadu_si128((__m128i*)right_shift_ptr);
s11 = _mm_setzero_si128();
s12 = _mm_setzero_si128();
s13 = _mm_setzero_si128();
s14 = _mm_setzero_si128();

//calc row1
s3 = _mm_adds_epu16(s3, _mm_mpsadbw_epu8(s1, s0, 0));
s4 = _mm_adds_epu16(s4, _mm_mpsadbw_epu8(s1, s0, 5));
s5 = _mm_adds_epu16(s5, _mm_mpsadbw_epu8(s2, s0, 2));
s6 = _mm_adds_epu16(s6, _mm_mpsadbw_epu8(s2, s0, 7));

//2-3
left_ptr+=const;
right_ptr+=const;
right_shift_ptr+=const;

// load row3
s0 = _mm_loadu_si128((__m128i*)left_ptr);
s1 = _mm_loadu_si128((__m128i*)right_ptr);
s2 = _mm_loadu_si128((__m128i*)right_shift_ptr);
//calc row2
s11 = _mm_adds_epu16(s11, _mm_mpsadbw_epu8(s9, s8, 0));
s12 = _mm_adds_epu16(s12, _mm_mpsadbw_epu8(s9, s8, 5));
s13 = _mm_adds_epu16(s13, _mm_mpsadbw_epu8(s10, s8, 2));
s14 = _mm_adds_epu16(s14, _mm_mpsadbw_epu8(s10, s8, 7));
//3-4
left_ptr+=const;
right_ptr+=const;
right_shift_ptr+=const;

// load row4
s8 = _mm_loadu_si128((__m128i*)left_ptr);
s9 = _mm_loadu_si128((__m128i*)right_ptr);
s10 = _mm_loadu_si128((__m128i*)right_shift_ptr);
//calc row3
s3 = _mm_adds_epu16(s3, _mm_mpsadbw_epu8(s1, s0, 0));
s4 = _mm_adds_epu16(s4, _mm_mpsadbw_epu8(s1, s0, 5));
s5 = _mm_adds_epu16(s5, _mm_mpsadbw_epu8(s2, s0, 2));
s6 = _mm_adds_epu16(s6, _mm_mpsadbw_epu8(s2, s0, 7));
//4-5
left_ptr+=const;
right_ptr+=const;
right_shift_ptr+=const;

//load row5
s0 = _mm_loadu_si128((__m128i*)left_ptr);
s1 = _mm_loadu_si128((__m128i*)right_ptr);
s2 = _mm_loadu_si128((__m128i*)right_shift_ptr);
// calc row4
s11 = _mm_adds_epu16(s11, _mm_mpsadbw_epu8(s9, s8, 0));
s12 = _mm_adds_epu16(s12, _mm_mpsadbw_epu8(s9, s8, 5));
s13 = _mm_adds_epu16(s13, _mm_mpsadbw_epu8(s10, s8, 2));
s14 = _mm_adds_epu16(s14, _mm_mpsadbw_epu8(s10, s8, 7));
//5-6
left_ptr+=const;
right_ptr+=const;
right_shift_ptr+=const;

//load row6
s8 = _mm_loadu_si128((__m128i*)left_ptr);
s9 = _mm_loadu_si128((__m128i*)right_ptr);
s10 = _mm_loadu_si128((__m128i*)right_shift_ptr);
//calc row5
s3 = _mm_adds_epu16(s3, _mm_mpsadbw_epu8(s1, s0, 0));
s4 = _mm_adds_epu16(s4, _mm_mpsadbw_epu8(s1, s0, 5));
s5 = _mm_adds_epu16(s5, _mm_mpsadbw_epu8(s2, s0, 2));
s6 = _mm_adds_epu16(s6, _mm_mpsadbw_epu8(s2, s0, 7));
//6-7
left_ptr+=const;
right_ptr+=const;
right_shift_ptr+=const;

// load row7
s0 = _mm_loadu_si128((__m128i*)left_ptr);
s1 = _mm_loadu_si128((__m128i*)right_ptr);
s2 = _mm_loadu_si128((__m128i*)right_shift_ptr);
//calc row 6
s11 = _mm_adds_epu16(s11, _mm_mpsadbw_epu8(s9, s8, 0));
s12 = _mm_adds_epu16(s12, _mm_mpsadbw_epu8(s9, s8, 5));
s13 = _mm_adds_epu16(s13, _mm_mpsadbw_epu8(s10, s8, 2));
s14 = _mm_adds_epu16(s14, _mm_mpsadbw_epu8(s10, s8, 7));
//7-8
left_ptr+=const;
right_ptr+=const;
right_shift_ptr+=const;

// load row8
s8 = _mm_loadu_si128((__m128i*)left_ptr);
s9 = _mm_loadu_si128((__m128i*)right_ptr);
s10 = _mm_loadu_si128((__m128i*)right_shift_ptr);
// calc row7
s3 = _mm_adds_epu16(s3, _mm_mpsadbw_epu8(s1, s0, 0));
s4 = _mm_adds_epu16(s4, _mm_mpsadbw_epu8(s1, s0, 5));
s5 = _mm_adds_epu16(s5, _mm_mpsadbw_epu8(s2, s0, 2));
s6 = _mm_adds_epu16(s6, _mm_mpsadbw_epu8(s2, s0, 7));
//calc row8
s11 = _mm_adds_epu16(s11, _mm_mpsadbw_epu8(s9, s8, 0));
s12 = _mm_adds_epu16(s12, _mm_mpsadbw_epu8(s9, s8, 5));
s13 = _mm_adds_epu16(s13, _mm_mpsadbw_epu8(s10, s8, 2));
s14 = _mm_adds_epu16(s14, _mm_mpsadbw_epu8(s10, s8, 7));

thanks again and best regards

adrian


Adrian,

If you are running a 32-bit build, reduce the number of "registerized" variables (s0 to s14).

If you are running a 64-bit build, examine the disassembly of the Release build by inserting a breakpoint and then opening the disassembly window. ICC does not let you produce a .ASM listing file in a Release build.

Examine the code to ensure that all your s0-s14 variables are indeed in registers.
Note that although the x64 CPU has 16 XMM registers, your intrinsics will generally require some scratch (unused) registers.

s14 = _mm_adds_epu16(s14, _mm_mpsadbw_epu8(s10, s8, 7));

The mpsadbw will require a temporary register.

Even in the event that xmm15 is available

s13 = _mm_adds_epu16(s13, _mm_mpsadbw_epu8(s10, s8, 2));
s14 = _mm_adds_epu16(s14, _mm_mpsadbw_epu8(s10, s8, 7));

would likely stall waiting for the adds to finish.

You may need to free up some variables so you have more unused SSE registers (so the compiler can do its magic).

Jim Dempsey

www.quickthreadprogramming.com

You should take into account that the compiler usually does a great job of reordering instructions and allocating registers in the most efficient way. I mean, it doesn't hurt if you arrange your code that way, but don't expect that if you don't, the resulting asm will carry out expressions the way you wrote them in the high-level language. For example, the compiler is free to map s0-s2 and s8-s10 to the same 3 xmm registers, since their values' lifetimes don't intersect. It can also reorder _mm_adds_epu16 and _mm_mpsadbw_epu8 to reduce dependency stalls. OTOH, the compiler may need to insert some other instructions, e.g. to perform operations in a non-destructive manner.

My point is that you should generally inspect the code generated by your compiler to see the actual instructions and find bottlenecks. Then, if you find any, you should try to help the compiler by adjusting your code. You may fail to do so if the compiler is stubborn enough to ignore your attempts; in those cases good old asm is the answer.

Quote:

adrian s. wrote:

Changed my code; looks like I got a slight performance boost. I'm trying to avoid calculations that have to "wait" for data, and to fill the "holes" with something else (like other calculations or pointer increments).

This is generally not an effective optimization method (a lot of effort for very low or no speedup), since the OoO execution engine in modern cores already does a very good job at "filling the holes", as you say (not to mention that with 2 running threads per core this kind of reasoning doesn't hold anymore). Also, the compiler may decide on a completely different scheduling than the one in your source code; you can often see that the ASM dumps don't change even if you reorder the source code, since the compiler still uses the same (optimal per its heuristics) register allocation and scheduling order.


If you use the Intel compiler, I encourage you to try the #pragma unroll(N) notation. It will allow you to rewrite this as a simple loop (8x smaller), and I'm quite sure you will get the same level of performance without all this duplicated code. It will be faster to write and a lot easier to maintain; maintenance may include new optimizations that you can't afford to try with the copy&paste version. The end result is a win-win: faster code overall and far less code to maintain.

I will also suggest replacing the 3 incremented pointers with a single induction variable, as in:

  s8 = _mm_loadu_si128((__m128i*)left_ptr + i);
  s9 = _mm_loadu_si128((__m128i*)right_ptr + i);
  s10 = _mm_loadu_si128((__m128i*)right_shift_ptr + i);
  i += increment; // I wonder, how can you use "const" as a variable name?
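The shape of that rewrite, sketched in plain C ( the inner scalar loop stands in for the four _mm_mpsadbw_epu8 / _mm_adds_epu16 statements, and `row_bytes` is my name for the per-row increment; #pragma unroll is icc notation, other compilers ignore it or spell it differently ):

```c
#include <stdint.h>
#include <stdlib.h>

/* Eight hand-unrolled row bodies collapsed into one loop driven by a
 * single induction variable instead of three incremented pointers. */
static unsigned sad_8_rows(const uint8_t *left, const uint8_t *right,
                           size_t row_bytes)
{
    unsigned acc = 0;
    size_t off = 0;                       /* the single induction variable */
#pragma unroll(4)                         /* icc-specific unrolling hint */
    for (int row = 0; row < 8; row++) {
        for (size_t x = 0; x < 16; x++)   /* stand-in for the SSE row work */
            acc += abs(left[off + x] - right[off + x]);
        off += row_bytes;
    }
    return acc;
}
```

The compiler can then pick the register allocation and unroll factor itself, which is easy to re-tune later by changing N.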
 

As bronxv indicated, compilers often generate better unrolled code if you use their options for that purpose. This should make it easier to find an optimum. Besides the icc #pragma unroll(), you have options such as gcc -funroll-loops --param max-unroll-times=4 (unrolling by 4 may be enough, unless you have a very early Core CPU which may not have fully effective Loop Stream Detection). The micro-op cache on newer CPUs also smooths out performance artifacts associated with the degree of unrolling. If you are benchmarking on an out-of-production CPU, you should weigh the benefit of small performance increments which won't apply to a current one. This forum is about optimization on more recent CPUs.

The unroll factor as used by both the Intel and GNU compilers is on top of the parallelism implied by the SIMD register width.

In 32-bit mode it's particularly important to minimize the number of loop-carried pointers, and it could help compiler optimization if you explicitly show that all memory accesses can be done with invariant offsets from a single pointer.

As to using more than the minimum number of registers: the hardware will take care of that by register renaming, possibly more efficiently than you can in source code, as long as you don't have implicit partial-register dependences.

>>...If you use the Intel compiler I encourage you to try the #pragma unroll(N) notation...
>>
>>...compilers often generate better unrolled code if you use their options for that purpose...

Just a couple of days ago I had some issues with unrolling, and I strongly recommend doing as much testing as possible. In my case there was a conflict between a piece of code with manual unrolling ( N-in-1 ), a #pragma unroll( N ) directive, and the command-line option /Qunroll:N ( all three were used! ). So use just one method and don't mix them, since the negative performance impact could be greater than ~5%.

Quote:

Sergey Kostrov wrote:
Just a couple of days ago I had some issues with unrolling, and I strongly recommend doing as much testing as possible. In my case there was a conflict between a piece of code with manual unrolling ( N-in-1 ), a #pragma unroll( N ) directive, and the command-line option /Qunroll:N ( all three were used! ). So use just one method and don't mix them, since the negative performance impact could be greater than ~5%.

Good point; that's why I advise resorting as little as possible to manual unrolling. Once you have a manually unrolled loop it's much more difficult to experiment with new optimizations, including extra compiler unrolling. Manual unrolling is pretty much like freezing the source code: in case of refactoring you must first get rid of the manual unrolling and rewrite it as a vanilla loop, and then, if you have the time, redo the manual unrolling. That's not manageable for a big project with hundreds of loops.

As you say, testing is key for top performance. The value of N in #pragma unroll(N) can have a big impact (from a 10% slowdown to more than a 20% speedup in some cases I have experimented with), particularly on short loops. Typical optimal values in my cases are N=4 and N=8, but sometimes N=3. It's easy to test with the assistance of the compiler, but it would be a nightmare to experiment with manual unrolling. So manual unrolling in practice is not only much more painful, it will generally not use the optimal unrolling factor: a lose-lose situation.

Note that AFAIK a local #pragma unroll(N) always takes precedence over the global /Qunroll setting.

>>...20% speedup in some cases I have experimented with), particularly on short loops, typical optimal values
>>in my cases are N=4...

I agree that 4-in-1 is the best option, and that is based on lots of testing. However, different compilers have different optimal unrolling values, and I'll post an example.

>>...different compilers have different optimal unrolling values...

#if ( defined ( _WIN32_MSC ) )
// #define MatMulCoreProcessing MatMulCoreProcessingRolled
// #define MatMulCoreProcessing MatMulCoreProcessingUnRolled2in1
#define MatMulCoreProcessing MatMulCoreProcessingUnRolled4in1
// #define MatMulCoreProcessing MatMulCoreProcessingUnRolled8in1
#endif
#if ( defined ( _WIN32CE_MSC ) )
#define MatMulCoreProcessing MatMulCoreProcessingRolled
// #define MatMulCoreProcessing MatMulCoreProcessingUnRolled2in1
// #define MatMulCoreProcessing MatMulCoreProcessingUnRolled4in1
// #define MatMulCoreProcessing MatMulCoreProcessingUnRolled8in1
#endif
#if ( defined ( _WIN32_MGW ) )
// #define MatMulCoreProcessing MatMulCoreProcessingRolled
// #define MatMulCoreProcessing MatMulCoreProcessingUnRolled2in1
#define MatMulCoreProcessing MatMulCoreProcessingUnRolled4in1
// #define MatMulCoreProcessing MatMulCoreProcessingUnRolled8in1
#endif
#if ( defined ( _WIN32_BCC ) )
// #define MatMulCoreProcessing MatMulCoreProcessingRolled
// #define MatMulCoreProcessing MatMulCoreProcessingUnRolled2in1
#define MatMulCoreProcessing MatMulCoreProcessingUnRolled4in1
// #define MatMulCoreProcessing MatMulCoreProcessingUnRolled8in1
#endif
#if ( defined ( _WIN32_ICC ) )
#define MatMulCoreProcessing MatMulCoreProcessingRolled
// #define MatMulCoreProcessing MatMulCoreProcessingUnRolled2in1
// #define MatMulCoreProcessing MatMulCoreProcessingUnRolled4in1
// #define MatMulCoreProcessing MatMulCoreProcessingUnRolled8in1
#endif
#if ( defined ( _COS16_TCC ) )
#define MatMulCoreProcessing MatMulCoreProcessingRolled
// #define MatMulCoreProcessing MatMulCoreProcessingUnRolled2in1
// #define MatMulCoreProcessing MatMulCoreProcessingUnRolled4in1
// #define MatMulCoreProcessing MatMulCoreProcessingUnRolled8in1
#endif

In all cases that select MatMulCoreProcessingRolled, compiler-based unrolling needs to be used where supported. For example, the Turbo C++ compiler ( _COS16_TCC ) doesn't support it.

Here are the results of some tests of Rolled vs. Unrolled processing:

[ Rolled - 1-in-1 ]
...
...Pass 01 - Completed: 14.14100 secs
...Pass 02 - Completed: 13.18700 secs
...Pass 03 - Completed: 13.18800 secs
...Pass 04 - Completed: 13.17200 secs
...Pass 05 - Completed: 13.18700 secs
...

[ UnRolled - 2-in-1 ]
...
...Pass 01 - Completed: 7.71900 secs
...Pass 02 - Completed: 6.75000 secs
...Pass 03 - Completed: 6.76500 secs
...Pass 04 - Completed: 6.75000 secs
...Pass 05 - Completed: 6.75000 secs
...

[ UnRolled - 4-in-1 ]
...
...Pass 01 - Completed: 4.67100 secs
...Pass 02 - Completed: 3.71900 secs
...Pass 03 - Completed: 3.70300 secs
...Pass 04 - Completed: 3.70300 secs
...Pass 05 - Completed: 3.70300 secs
...

[ UnRolled - 8-in-1 ] Note: Best performance
...
...Pass 01 - Completed: 4.26500 secs
...Pass 02 - Completed: 3.32800 secs
...Pass 03 - Completed: 3.32800 secs
...Pass 04 - Completed: 3.31300 secs
...Pass 05 - Completed: 3.32800 secs
...

By the way, [ UnRolled - 8-in-1 ] reduces the execution time by ~75% compared to [ Rolled - 1-in-1 ] ( ~13.19 secs down to ~3.33 secs, i.e. roughly a 4x speedup ).
