_mm_storeu_si128 issue

I was really surprised to see some very disturbing results with the Intel C++ compiler for the following code.

#include <stdio.h>
#include <emmintrin.h>

int main(int argc, char* argv[])
{
#pragma pack(push,16)
    unsigned char bfr [528];
    __m128i value;
    value = _mm_setzero_si128();
    __m128i * bfr128 = (__m128i *)bfr;
    __int64 StartClock;
    _asm {
        rdtsc   // missing from the posted listing; restored because the moves below read EDX:EAX
        mov DWORD PTR StartClock, eax
        mov DWORD PTR StartClock+4, edx
    }
    for (int i = 0; i < sizeof(bfr)/sizeof(value); i++) {
        // loop body was lost from the post; reconstructed from the thread title
        _mm_storeu_si128(bfr128 + i, value);
    }
    __int64 EndClock;
    _asm {
        rdtsc   // restored, as above
        mov DWORD PTR EndClock, eax
        mov DWORD PTR EndClock+4, edx
    }
    printf ("\nElapsed CPU clocks: %I64d, %I64d, %I64d\n\n", EndClock, StartClock, EndClock-StartClock);
    //for (int i =0; i < sizeof(Iu8vec16); i++)
    //value[i] = 0;
#pragma pack(pop)
    return 0;
}

MSR's are as follows:
MSR registers
CPU - Pentium M (Banias), 0.13u, family 6, model 9, stepping 5, revision B1.
FSB speed 99.7 MHz

Compiled with the Intel compiler, the intrinsics version takes a whopping 247 clocks to execute the _mm_storeu_si128(..) loop.

The same code compiled with MSVC instead of the Intel compiler, using memset instead of the intrinsics, takes 47 clocks.

Any suggestions as to what is happening? We are planning to use some of the intrinsics to optimize some large data movements in our code.
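
A side note on the measurement itself: a single RDTSC pair around 33 stores can also pick up cold-cache and other one-time effects, so it may help to repeat the run and keep the smallest reading. The following is only a minimal sketch of that idea, assuming GCC or Clang on x86 (where `__rdtsc()` comes from `<x86intrin.h>`; MSVC declares it in `<intrin.h>`). The names `fill_storeu`, `fill_memset`, `min_clocks`, `BFR_SIZE` and `RUNS` are hypothetical and not from the original post.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2: _mm_setzero_si128, _mm_storeu_si128 */
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <x86intrin.h>   /* __rdtsc on GCC/Clang (assumption: not MSVC) */

enum { BFR_SIZE = 528, RUNS = 100 };   /* 528 matches the posted buffer; RUNS is arbitrary */

/* Zero the buffer with unaligned 16-byte stores, as in the posted loop. */
static void fill_storeu(unsigned char *bfr)
{
    const __m128i zero = _mm_setzero_si128();
    for (size_t i = 0; i < BFR_SIZE / sizeof(__m128i); i++)
        _mm_storeu_si128((__m128i *)bfr + i, zero);
}

/* Zero the buffer with the C library's memset. */
static void fill_memset(unsigned char *bfr)
{
    memset(bfr, 0, BFR_SIZE);
}

/* Run `fill` RUNS times and return the smallest RDTSC difference seen.
   RDTSC is not a serializing instruction, so treat the result as approximate. */
uint64_t min_clocks(void (*fill)(unsigned char *), unsigned char *bfr)
{
    assert(fill != NULL && bfr != NULL);
    uint64_t best = UINT64_MAX;
    for (int run = 0; run < RUNS; run++) {
        uint64_t start = __rdtsc();
        fill(bfr);
        uint64_t stop = __rdtsc();
        if (stop - start < best)
            best = stop - start;
    }
    return best;
}
```

Calling `min_clocks(fill_storeu, bfr)` and `min_clocks(fill_memset, bfr)` on the same buffer gives two numbers that are easier to compare than a single cold run, though they are still only rough indications.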

If you are trying to do unaligned moves and get the occasional advantage of a fast parallel store when there is alignment, why not use memcpy()?

Even so, how does that explain the huge difference between the VC-compiled memset and the Intel-compiled memset?

Also, are you suggesting that 'memcpy' would be faster than 'memset'?

Would it improve performance if I wrote it in inline assembly using rep stosb, etc.?

Why is the unaligned access making such a difference, especially if the write is going to the cache? It will occupy at most about 9 cache lines, correct?
Given that, only the last line would cause misalignment issues, right? Or am I missing something?
Any suggestions would be invaluable.
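
The cache-line question can be checked with plain address arithmetic. The sketch below is only an illustration, assuming 64-byte cache lines and back-to-back 16-byte stores; `count_line_splits` is a hypothetical helper, not part of any compiler or library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count how many `width`-byte stores, issued back to back over `n` bytes
   starting at address `start`, straddle a `line`-byte boundary. */
size_t count_line_splits(uintptr_t start, size_t n, size_t width, size_t line)
{
    assert(width > 0 && line >= width);
    size_t splits = 0;
    for (size_t off = 0; off + width <= n; off += width) {
        uintptr_t first = (start + off) / line;              /* line holding the first byte */
        uintptr_t last  = (start + off + width - 1) / line;  /* line holding the last byte */
        if (first != last)
            splits++;
    }
    return splits;
}
```

For a 528-byte buffer starting 8 bytes past a 64-byte boundary, `count_line_splits(8, 528, 16, 64)` counts 8 splitting stores, one at every interior line boundary, while any 16-byte-aligned start address gives 0. So under these assumptions it is not just the last line that is affected.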

I'm sorry, you did say that you were attempting a memset() kind of operation. In the library which comes with the Intel compiler, the library team has made an effort to optimize performance for the primary supported architectures. You would expect the library to examine the size and alignment of the operand and make appropriate choices. I suppose that building with icc /QxB should make choices appropriate to Banias.

Performance of unaligned operations is extremely poor when there is a cache line split. If you are writing to parts of 9 cache lines with misaligned 16-byte stores, you are guaranteed 8 such splits. If the library uses _mm_store operations, I would expect it to adjust sizes so that it uses only aligned operations. It likely would store bytes up to the first 16-byte aligned boundary, use aligned 16-byte stores up to the last 16-byte boundary, and finish up with byte operations.
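
The head/body/tail strategy described above could look roughly like the following. This is only a sketch of the general technique, not the Intel library's actual memset; `fill_aligned16` is a hypothetical name, and SSE2 support is assumed.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2: _mm_set1_epi8, _mm_store_si128 */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical memset-like helper: byte stores up to the first 16-byte
   boundary, aligned 16-byte stores in the middle, byte stores at the end. */
void *fill_aligned16(void *dst, int c, size_t n)
{
    unsigned char *p = (unsigned char *)dst;
    unsigned char *end = p + n;
    const unsigned char byte = (unsigned char)c;

    /* Head: byte stores until p is 16-byte aligned (or the range ends). */
    while (p < end && ((uintptr_t)p & 15u) != 0)
        *p++ = byte;

    /* Body: aligned 16-byte stores; an aligned store never splits a cache line. */
    const __m128i v = _mm_set1_epi8((char)byte);
    while ((size_t)(end - p) >= 16) {
        assert(((uintptr_t)p & 15u) == 0);
        _mm_store_si128((__m128i *)p, v);
        p += 16;
    }

    /* Tail: the remaining 0 to 15 bytes. */
    while (p < end)
        *p++ = byte;

    return dst;
}
```

With the posted buffer, `fill_aligned16(bfr, 0, sizeof(bfr))` would zero all 528 bytes whatever the alignment of `bfr`. A production memset would likely use wider head and tail stores and size-dependent code paths, which this sketch leaves out.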

I suppose that Banias would do a better job with rep stosb than would a P4, but I have many colleagues more knowledgeable about this than I. That subject gets into the performance issues with particular processor steppings. This might well come out better than operations involving cache line splits, but is unlikely to match an expertly programmed memset() for large regions such as you have.
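
For anyone who wants to try the rep stosb variant, here is a minimal sketch. It assumes GCC or Clang extended asm on x86 or x86-64, which differs from the MSVC-style `_asm` block in the original post; `fill_rep_stosb` is a hypothetical name. Whether it is faster on a given stepping would have to be measured.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical memset-like helper built on rep stosb.
   GCC/Clang extended asm, x86 or x86-64 only. */
void *fill_rep_stosb(void *dst, int c, size_t n)
{
    assert(dst != NULL || n == 0);
    void *d = dst;
    size_t count = n;
    /* rep stosb stores AL to the address in (E/R)DI, (E/R)CX times,
       advancing the destination pointer after each byte.
       Assumes the direction flag is clear, as the common ABIs require. */
    __asm__ volatile("rep stosb"
                     : "+D"(d), "+c"(count)
                     : "a"(c)
                     : "memory");
    return dst;
}
```

Usage mirrors memset: `fill_rep_stosb(bfr, 0, sizeof(bfr))`.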

The library should be your friend here, enabling you to achieve both portability and performance.
