Optimize C code by SSE2

I have the code snippet below:

BYTE *pD; BYTE *pY; BYTE *pU; BYTE *pV;
int n;
for (n=0; n<size; n++) {
  *pD++ = *pY++;
  *pD++ = *pU++; pU++;
  *pD++ = *pY++; 
  *pD++ = *pV++; pV++;
}

I tried to rewrite it with SSE2, but I do not know how to combine bytes with SSE. Any help?

BTW, I ran a simple test with the following SSE code and found that CPU usage in my program got worse, going from 80% to 95%. I had thought that using SSE would reduce CPU usage. The speed improved only a little. What is going wrong? I am running Windows 8 on a Core i7 ultrabook.

int n;
__m128i tmp;
for (n=0; n<size; n+=16) {
  tmp = _mm_load_si128((__m128i *)pY);
  _mm_store_si128((__m128i *)pD, tmp);
  pD+=32; pY+=16; pU+=8; pV+=8;
}
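
One possible way to combine the bytes with plain SSE2 is the pair of unpack intrinsics _mm_unpacklo_epi8 and _mm_unpackhi_epi8. The following is only a minimal sketch, not the code used in the application: the function names pack_scalar and pack_sse2 are made up for illustration, uint8_t stands in for BYTE, unaligned loads and stores are used so no alignment is assumed, and the U and V buffers are assumed to hold at least 2*size bytes, as the pointer increments in the C loop suggest.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Scalar reference: same access pattern as the C loop in the question. */
static void pack_scalar(uint8_t *pD, const uint8_t *pY,
                        const uint8_t *pU, const uint8_t *pV, size_t size)
{
    for (size_t n = 0; n < size; n++) {
        *pD++ = *pY++;
        *pD++ = *pU; pU += 2;   /* keep every second U byte */
        *pD++ = *pY++;
        *pD++ = *pV; pV += 2;   /* keep every second V byte */
    }
}

/* SSE2 sketch: 8 scalar iterations (32 output bytes) per pass.
   Assumes the U and V buffers hold at least 2*size bytes. */
static void pack_sse2(uint8_t *pD, const uint8_t *pY,
                      const uint8_t *pU, const uint8_t *pV, size_t size)
{
    assert(pD != NULL && pY != NULL && pU != NULL && pV != NULL);
    const __m128i low_byte = _mm_set1_epi16(0x00FF);
    size_t n = 0;
    for (; n + 8 <= size; n += 8) {
        __m128i y = _mm_loadu_si128((const __m128i *)pY);
        __m128i u = _mm_loadu_si128((const __m128i *)pU);
        __m128i v = _mm_loadu_si128((const __m128i *)pV);
        /* Each 16-bit lane becomes (even U byte) | (even V byte << 8),
           which in memory is the byte sequence U0 V0 U2 V2 ... */
        __m128i uv = _mm_or_si128(_mm_and_si128(u, low_byte),
                                  _mm_slli_epi16(v, 8));
        /* Interleave Y bytes with the UV bytes: Y0 U0 Y1 V0 Y2 U2 ... */
        _mm_storeu_si128((__m128i *)pD, _mm_unpacklo_epi8(y, uv));
        _mm_storeu_si128((__m128i *)(pD + 16), _mm_unpackhi_epi8(y, uv));
        pY += 16; pU += 16; pV += 16; pD += 32;
    }
    pack_scalar(pD, pY, pU, pV, size - n);  /* leftover iterations */
}
```

Calling pack_sse2(pD, pY, pU, pV, size) should produce the same bytes as the scalar loop. If all four pointers are known to be 16-byte aligned, the aligned load and store intrinsics can be substituted.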


It seems to me that your code just copies from one memory location to another.

There is no calculation or anything else. I think I heard that the speed of copy actions can't be improved by SSE very much.

What task do you want to achieve? Is it only copying?

It copies from 3 memory sources to 1 destination. Actually it is part of a YUV space to RGB space conversion. The SSE code is not a translation of the C code right now; it is just for testing. I hope SSE or AVX can greatly improve the copy performance.

Depending on the context, you should consider adding the restrict qualifier (BYTE *restrict ...) and aligning the destination. This might be done more efficiently with SSE4.1 or AVX, but might still require #pragma vector always. If the loop is long enough, or is executed by enough threads, to take advantage of nontemporal stores, you would need to specify that, e.g. #pragma vector aligned nontemporal

>>...
>>The speed just improve a little bit. Where are things going wrong?...
>>
>>int n;
>>__m128i tmp;
>>for ( n=0; n<size; n+=16 )
>>{
>> tmp = _mm_load_si128( ( __m128i * )pY );
>> _mm_store_si128( ( __m128i * )pD, tmp );
>> pD+=32;
>>pY+=16;
>>pU+=8;
>>pV+=8;
>>}

Here are a couple of notes:

- If your data set is larger than 256KB, data prefetching, if properly applied (!), could improve performance ( please see Intel Software Optimization Manual )
- Verify in a Debugger that no C++ conversion operator is invoked for the ( __m128i * ) cast when passing pointers to data for intrinsic functions
- Usage of another pair of intrinsic functions _mm_stream_ps and _mm_load_ps could outperform your current SSE implementation

>>...I heard that the speed of copy actions can't be improved by SSE very much...

I'll give you a number and it is ~9%. Is it significant or not? I think that for real-time applications even a 0.5% improvement could be considered as a very good thing ( sorry for a small deviation... ).

Please take a look at some recent results which demonstrate how a correct application of _mm_prefetch improves performance of copy operations: http://software.intel.com/en-us/forums/topic/352880

>>>I'll give you a number and it is ~9%. Is it significant or not? I think that for real-time applications even a 0.5% improvement could be considered as a very good thing ( sorry for a small deviation... ).>>>

Yes, for example in video rendering.

Quote:

Sergey Kostrov wrote:

>>...I heard that the speed of copy actions can't be improved by SSE very much...

I'll give you a number and it is ~9%. Is it significant or not? I think that for real-time applications even a 0.5% improvement could be considered as a very good thing ( sorry for a small deviation... ).

Please take a look at some recent results which demonstrate how a correct application of _mm_prefetch improves performance of copy operations: http://software.intel.com/en-us/forums/topic/352880

Ah, this is quite interesting. I did not know about the prefetch operations and that they improve performance.

With "not very much" I was referring to an SSE speedup of 2 or 4 (double or float), which most of the time can only be reached if computations are involved.

Hi Christian,

>>...Ah, this is quite interesting. I did not know about the prefetch operations and that they improve performance...

Please take a look at Intel Software Optimization Manual if interested to learn more about it ( there are a couple of really good examples with the prefetch instruction ).

Best regards,
Sergey

To TimP (Intel) 

"Depending on the context, you should consider adding the restrict qualifier (BYTE *restrict ...) and aligning the destination. This might be done more efficiently with SSE4.1 or AVX, but might still require #pragma vector always. If the loop is long enough, or is executed by enough threads, to take advantage of nontemporal stores, you would need to specify that, e.g. #pragma vector aligned nontemporal"

Your suggestion to use restrict is good and I have applied it. I also tested the alignment and found that one source was not aligned. I then found a way to get rid of the unaligned pointer, and that did give an improvement. "#pragma vector aligned nontemporal" is not supported by my compiler.
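
The two checks mentioned above can be sketched as follows. This is only an illustration under assumptions: the helper names is_aligned16 and copy_bytes are hypothetical, uint8_t stands in for BYTE, and the restrict keyword is the C99 spelling (C++ compilers usually offer __restrict instead).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 when p sits on a 16-byte boundary, which the aligned
   SSE load/store intrinsics (_mm_load_si128, _mm_store_si128) require. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* restrict promises the compiler that dst and src never overlap,
   so it may vectorize the loop without runtime overlap checks. */
static void copy_bytes(uint8_t *restrict dst, const uint8_t *restrict src,
                       size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}
```

A call such as is_aligned16(pY) before the loop shows whether the aligned intrinsics are safe for that pointer or whether _mm_loadu_si128 is needed.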

To Sergey Kostrov:

"- If your data set is greater than 256KB a data prefetching, if properly applied (!), could improve performance ( please see Intel Software Optimization Manual )
- Verify in a Debugger that a C++ operator ( __m128i * ) is not used when passing pointers to data for intrinsic functions
- Usage of another pair of intrinsic functions _mm_stream_ps and _mm_load_ps could outperform your current SSE implementation"

I tried prefetch but saw no obvious improvement. I checked the generated asm code: there are only two statements, a load and a store, so no problem there.
By replacing _mm_store_si128 with _mm_stream_si128 I saw a 0.5% improvement.

In summary, I have tried AVX code but the result is similar to SSE2. When I removed all the copy code and did no copying at all, performance improved by 25%, so this kernel is critical. Unfortunately SSE2/AVX cannot do better than the original C++ code.
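
For reference, a copy loop built on _mm_stream_si128 might look like the sketch below. It is not the code from the application: the function name stream_copy is hypothetical, and it assumes the destination is 16-byte aligned and the length is a multiple of 16.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_stream_si128, _mm_loadu_si128 */
#include <stddef.h>
#include <stdint.h>

/* Copies n bytes with non-temporal stores.
   Assumes dst is 16-byte aligned and n is a multiple of 16;
   src may be unaligned. */
static void stream_copy(uint8_t *dst, const uint8_t *src, size_t n)
{
    assert(((uintptr_t)dst & 15u) == 0 && n % 16 == 0);
    for (size_t i = 0; i < n; i += 16) {
        __m128i x = _mm_loadu_si128((const __m128i *)(src + i));
        _mm_stream_si128((__m128i *)(dst + i), x);  /* store bypasses the cache */
    }
    _mm_sfence();  /* order the streamed stores before later stores */
}
```

Non-temporal stores tend to help only when the destination is not read again soon, which may explain why the gain here was small.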

>>tried prefetch but no obvious improvement. Checked the asm code generated there are only two statements of
>>load and store. no problem. By replacing _mm_store_si128 to _mm_stream_si128 I watched 0.5% improvement.

Thanks for these details. A 0.5% improvement actually matches my old numbers from when I didn't use prefetch properly. Here are a couple more tips:

- Do you have April 2012 edition of Intel 64 and IA-32 Architectures Optimization Reference Manual? I wonder if you looked at Chapter 7 'Optimizing Cache Usage'?
- Warm your data before processing ( it is a very simple procedure )

A small example that demonstrates how _mm_prefetch could be used ( take into account sizes of L1, L2 and L3 cache lines of your CPU ) is as follows:
...
// Some initializations
...
RTint iPageSize = 4096;
RTint iCacheLineSize = 32;
...
for( RTint i = 0; i < iNumOfBytes; i += iPageSize )
{
    RTint j;

    for( j = i + iCacheLineSize; j < ( i + iPageSize ); j += iCacheLineSize )
    {
        _mm_prefetch( ( RTchar * )pvSrc + j, _MM_HINT_NTA );
    }
    ...
    // Processing of some data
    ...
}
...

Note: A 9% improvement was achieved when copying a data set of ~1MB in size
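
A self-contained variant of the fragment above, for anyone who wants to experiment, could look like this. It is a sketch under assumptions: the name prefetch_copy and the constants are hypothetical, a 64-byte cache line is assumed instead of 32, the prefetch loop starts at the first line of each block, and a plain memcpy stands in for the real processing.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_NTA */

/* Assumed page and cache-line sizes; adjust for the target CPU. */
enum { BLOCK_BYTES = 4096, LINE_BYTES = 64 };

static void prefetch_copy(char *dst, const char *src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));
    for (size_t i = 0; i < n; i += BLOCK_BYTES) {
        size_t len = (n - i < BLOCK_BYTES) ? n - i : BLOCK_BYTES;
        /* Request every cache line of the current block first ... */
        for (size_t j = 0; j < len; j += LINE_BYTES)
            _mm_prefetch(src + i + j, _MM_HINT_NTA);
        /* ... then process the block; a copy stands in for the real work. */
        memcpy(dst + i, src + i, len);
    }
}
```

Whether this helps depends heavily on the data set size and on the hardware prefetcher, so it should be measured rather than assumed.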

Sergey Kostrov:

I tried the _mm_prefetch example after reading the Intel 64 and IA-32 Architectures Optimization Reference Manual.
But the result was not satisfactory.

Like memcpy, my code processes non-temporal data, so the stream processing can be pipelined. I made a new attempt and gained a 10% overall performance increase; the kernel itself may have been sped up by as much as 5x. Here are some key points:

1. Fill a cache line in one block.

This requires MOVNTDQA, which was introduced with SSE4.1. I am not sure of the cache line size, but 4 MOVNTDQA loads (64 bytes) can fill a cache line, so no prefetch is needed. The difference is that an L1 access may take only 3 cycles versus perhaps 18 cycles for a DRAM access, so that is a 6x speedup.

2. Use asm instead of C

I read the asm code generated by the C compiler; the problem is that the compiler does not optimize it well. For my code it loads the whole sequence of data through the xmm0 register only. The compiler is poor at register allocation. By using asm I can use xmm0 to xmm15, so I can load a bulk of data and then store a bulk of data. That gives a perfect streaming speedup.

3. Avoid using software cache control

Cache control is a very complicated problem, and even more so in a multi-threaded, multi-core environment. Applying software control may introduce uncertainty, which is not suitable for stream processing.

4. AVX may be more helpful

Since AVX doubles the data width, moving non-temporal stream data can be accelerated further.

5. Instruction Cycles

It would be better if Intel gave developers the cycle count of each instruction on a given processor under different conditions, so that software developers could estimate which code snippets can be optimized. For example, the cycle count of MOVNTDQA when the data is in L1 cache, L2 cache, L3 cache, DDR3, etc.
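
Points 1 and 2 above (load one 64-byte cache line into several registers, then store it in one go) can be approximated with intrinsics rather than asm. The sketch below is an assumption-laden illustration, not the code described in this post: the name copy_by_cache_line is hypothetical, plain aligned SSE2 loads are swapped in for MOVNTDQA so that no SSE4.1 support is needed, both pointers are assumed 16-byte aligned, and the length is assumed to be a multiple of 64.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Assumes dst and src are 16-byte aligned and n is a multiple of 64. */
static void copy_by_cache_line(uint8_t *dst, const uint8_t *src, size_t n)
{
    assert(((uintptr_t)dst & 15u) == 0 && ((uintptr_t)src & 15u) == 0);
    assert(n % 64 == 0);
    for (size_t i = 0; i < n; i += 64) {
        const __m128i *s = (const __m128i *)(src + i);
        __m128i *d = (__m128i *)(dst + i);
        /* Four independent loads into four separate registers first ... */
        __m128i r0 = _mm_load_si128(s + 0);
        __m128i r1 = _mm_load_si128(s + 1);
        __m128i r2 = _mm_load_si128(s + 2);
        __m128i r3 = _mm_load_si128(s + 3);
        /* ... then four non-temporal stores covering one 64-byte line. */
        _mm_stream_si128(d + 0, r0);
        _mm_stream_si128(d + 1, r1);
        _mm_stream_si128(d + 2, r2);
        _mm_stream_si128(d + 3, r3);
    }
    _mm_sfence();  /* order the streamed stores before later stores */
}
```

Using four distinct variables gives the compiler the chance to keep the loads in separate xmm registers, although it remains free to schedule the instructions differently.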

 

On Sandy Bridge or Ivy Bridge, AVX nontemporal doesn't necessarily accelerate an application beyond 128-bit nontemporal.  Cache lines remain at 64 bytes.  If you have optimized code which depends on store bandwidth for earlier architecture, it may be sufficient for Sandy Bridge.

>>...Intel may be better to present developers the cycle number for each instruction of a processor in different conditions.
>>So software developers can estimated what code clips can be optimized. For example, the cycle number of
>>MOVNTDQA in L1 cache, L2 cache, L3 cache, in DDR3, .etc.

Absolutely agree with that point of view and Intel published clock cycle numbers for many instructions. However, some instructions are not on the list. It looks like MOVNTDQA is not on the list ( please correct me if it is not true ).

Quote:

Sergey Kostrov wrote:

>>...Intel may be better to present developers the cycle number for each instruction of a processor in different conditions.
>>So software developers can estimated what code clips can be optimized. For example, the cycle number of
>>MOVNTDQA in L1 cache, L2 cache, L3 cache, in DDR3, .etc.

Absolutely agree with that point of view and Intel published clock cycle numbers for many instructions. However, some instructions are not on the list. It looks like MOVNTDQA is not on the list ( please correct me if it is not true ).

Agner's CPU documentation also has plenty of information about the CPI of instructions.

I finally got the kernel code optimized to use all the xmm registers. The C++ compiler needs the __m128i variables to be defined and used in the right way before it allocates all the registers. It is amazing that a short 8-line version runs slower than a 60-line version. The longer code may need to issue 3 to 4 SSE4 register instructions per cycle to run faster. The basic optimization is

movzx ecx, BYTE PTR [esi]
mov BYTE PTR [eax+1], cl

Another advantage of using SSE4 is the stability of the stream. With fewer cache misses, the stream flows more smoothly.

In my case AVX does not bring much benefit because it lacks 256-bit shift operations. AVX2 may be useful.

By the way, the Intel IPP library has a set of highly optimized functions for conversion of an RGB image to and from a YUV format image. Take a look at the ippiYUVToRGB_xx_xxx functions if interested.

I solved the problem by studying the SSE4/AVX2 instructions. SSE4/AVX2 provide a rich set of instructions and I found the one I needed. Now the code is compact and fast. If you think of SSE4 as a 128-bit processor and AVX2 as a 256-bit processor, Intel has nearly completed the definition of its instruction sets. At present SSE4/AVX2 are hard for programmers to understand.

The C language cannot deliver true optimization in some cases. When I unrolled the loop 4 times, there were not enough SSE registers in x86. In 32-bit mode there are 8 xmm registers, xmm0 to xmm7; xmm8 to xmm15 are only available in 64-bit mode. Each loop body needed 2 to 3 xmm registers on average, so 4 unrolled iterations need more than 8 registers. When it runs out of registers, the compiler simply spills values to memory, which makes the optimization meaningless. With AVX2, because the registers are 256 bits wide, I only need to unroll 2 times.

>>>C language can not bring true optimization in some cases>>>

Yes, that's true. I think that true optimization can be achieved with the help of assembly language or inline assembly. C still lacks some features which can easily be achieved with machine code.

>>...I solved the problem...

Thanks for the update. I see that when it comes to performance, several techniques / tricks can be combined, like code optimization(s) ( unrolling ), application of new sets of SIMD instructions and correct usage of L1, L2 and L3 cache lines of a given CPU ( ideally a 3-layered Loop Blocking optimization technique has to be used but it gets really complicated in a multi-threaded environment ).
