SSE optimized memcpy not faster

I'm trying to optimize the standard memcpy() to use SSE2. However, my tests show that there is little/no difference between the system memcpy(), my proprietary memcpy, and my optimized SSE2 memcpy.

When running the release code the result is as follows:

memcpy() took:
634.193 ms.
MemCpy() took:
627.133 ms.
MemCpySse2() took:
605.482 ms.

Project compiler flags:

/GS- /GA /Qrestrict /W3 /Zc:wchar_t /Zi /O3 /Fd"Release\vc110.pdb" /fp:fast /Quse-intel-optimized-headers /D "_WINDOWS" /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /Qipo /Zc:forScope /arch:SSE2 /Gd /Oi /MT /Fa"Release\" /EHsc /nologo /Qparallel /Fo"Release\" /Ot /Fp"Release\ConsoleApplication1.pch" 

Code snippet from the project:

// SSE2 optimized memcpy()
void *MemCpySse2(void *__restrict b, const void *__restrict a, size_t n)
{
    char *s1 = (char*)b;
    const char *s2 = (const char*)a;
    for (; 0 < n; --n) *s1++ = *s2++;
    return b;
}

// General memcpy
void *MemCpy(void *dest, const void *source, size_t count)
{
    size_t blockIdx;
    size_t blocks = count >> 3;
    size_t bytesLeft = count - (blocks << 3);

    // Copy 64-bit blocks first
    _UINT64 *sourcePtr8 = (_UINT64*)source;
    _UINT64 *destPtr8 = (_UINT64*)dest;
    for (blockIdx = 0; blockIdx < blocks; blockIdx++) destPtr8[blockIdx] = sourcePtr8[blockIdx];

    if (!bytesLeft) return dest;

    blocks = bytesLeft >> 2;
    bytesLeft = bytesLeft - (blocks << 2);

    // Copy 32-bit blocks
    _UINT32 *sourcePtr4 = (_UINT32*)&sourcePtr8[blockIdx];
    _UINT32 *destPtr4 = (_UINT32*)&destPtr8[blockIdx];
    for (blockIdx = 0; blockIdx < blocks; blockIdx++) destPtr4[blockIdx] = sourcePtr4[blockIdx];

    if (!bytesLeft) return dest;

    blocks = bytesLeft >> 1;
    bytesLeft = bytesLeft - (blocks << 1);

    // Copy 16-bit blocks
    _UINT16 *sourcePtr2 = (_UINT16*)&sourcePtr4[blockIdx];
    _UINT16 *destPtr2 = (_UINT16*)&destPtr4[blockIdx];
    for (blockIdx = 0; blockIdx < blocks; blockIdx++) destPtr2[blockIdx] = sourcePtr2[blockIdx];

    if (!bytesLeft) return dest;

    // Copy byte blocks
    _UINT8 *sourcePtr1 = (_UINT8*)&sourcePtr2[blockIdx];
    _UINT8 *destPtr1 = (_UINT8*)&destPtr2[blockIdx];
    for (blockIdx = 0; blockIdx < bytesLeft; blockIdx++) destPtr1[blockIdx] = sourcePtr1[blockIdx];
    return dest;
}

Full Visual Studio 2012 project is attached.

Download: win32sse2memcpytest.zip (application/zip, 8.16 KB)

ICL normally replaces memcpy(), or equivalent for loops, with its own library functions, which attempt to choose an optimum code path at run time.

If you look at the pre-processed code or at /Qopt-report, you should be able to see when this happens. According to your options you have requested the IPP headers, but your code appears non-standard, as you call memcpy() without the corresponding header file. VS2012 itself probably has a good SSE2/AVX memcpy(); on my laptop it seems to perform the same as what ICL provides.

So your observation is probably no surprise, unless it surprises you.

If you wish to suppress the library function call substitution, so that ICL generates in-line code, you should be able to do so with #pragma simd or #pragma omp simd. Then, if you wish to obtain streaming stores for a smaller string move than the size at which the library function makes the switch, you can set #pragma vector nontemporal.

You don't say much about what you expect or what your goals are.
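A minimal sketch of what that suggestion might look like. The function name copy_inline is hypothetical, and whether the compiler really keeps the loop in-line should be checked with /Qopt-report. The ICL-only #pragma vector nontemporal is mentioned in a comment only, since other compilers do not recognise it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: a plain copy loop that the compiler is asked to
   vectorize in-line. Compilers built without OpenMP SIMD support simply
   ignore the pragma and compile an ordinary loop. */
void copy_inline(char *restrict dst, const char *restrict src, size_t n)
{
    assert(dst != NULL && src != NULL);
    /* Assumption: with ICL, "#pragma vector nontemporal" could be placed
       here as well to request streaming stores. */
#pragma omp simd
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i];
}
```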

What header must I include after /Quse-intel-optimized-headers to make use of the ICL memcpy()? I can't find any documentation on it anywhere.

I'm guessing you may be getting the ICL intel_fast_memcpy even without #include <string.h> and intel-optimized-headers but you have nothing to lose by correcting the usage.

You don't use any prefetches in your optimized version.

>>When running the release code the result is as follows:
>>memcpy() took:
>>634.193 ms.
>>MemCpy() took:
>>627.133 ms.
>>MemCpySse2() took:
>>605.482 ms.

SSE-based memcpy functions with prefetches are very fast, especially on blocks larger than 64K. Let me know if you need test results.

It is also not clear how deterministic your tests are. Did you switch the process priority to High or Realtime before taking measurements?

The Intel Optimization Reference Manual has a good example of a memory copy routine (SSE inline assembly) with software prefetches and loop unrolling. Adding prefetch instructions should improve performance because the pointer arithmetic is linear, that is, the indices are not randomized.
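This is not the routine from the manual, just a sketch of the same idea written with SSE2 intrinsics instead of inline assembly. The name memcpy_sse2_prefetch and the 256-byte prefetch distance are assumptions; the distance would need tuning on the target CPU.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 loads and stores */
#include <xmmintrin.h>   /* _mm_prefetch */
#include <stddef.h>

#define PREFETCH_AHEAD 256   /* assumed prefetch distance in bytes */

/* Hypothetical copy routine: 64 bytes per iteration (4x unrolled),
   with a software prefetch of the source ahead of the current position. */
void *memcpy_sse2_prefetch(void *dest, const void *source, size_t n)
{
    assert(dest != NULL && source != NULL);
    char *d = (char *)dest;
    const char *s = (const char *)source;
    size_t i = 0;

    for (; i + 64 <= n; i += 64) {
        if (i + PREFETCH_AHEAD < n)   /* stay inside the source buffer */
            _mm_prefetch(s + i + PREFETCH_AHEAD, _MM_HINT_T0);
        /* Unaligned loads/stores, so no alignment is required. */
        __m128i x0 = _mm_loadu_si128((const __m128i *)(s + i));
        __m128i x1 = _mm_loadu_si128((const __m128i *)(s + i + 16));
        __m128i x2 = _mm_loadu_si128((const __m128i *)(s + i + 32));
        __m128i x3 = _mm_loadu_si128((const __m128i *)(s + i + 48));
        _mm_storeu_si128((__m128i *)(d + i), x0);
        _mm_storeu_si128((__m128i *)(d + i + 16), x1);
        _mm_storeu_si128((__m128i *)(d + i + 32), x2);
        _mm_storeu_si128((__m128i *)(d + i + 48), x3);
    }
    for (; i < n; ++i)   /* remaining 0..63 bytes */
        d[i] = s[i];
    return dest;
}
```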

Inge, Please take a look at:

Here is a short summary: prefetching, non-temporal and fencing techniques need to be used.
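To make those three techniques concrete, here is a hedged sketch using SSE2 intrinsics. The function name memcpy_sse2_stream and the prefetch distance are my own assumptions, not taken from the linked material. Non-temporal stores need a 16-byte aligned destination, so the head is copied byte by byte first.

```c
#include <assert.h>
#include <emmintrin.h>   /* _mm_stream_si128, _mm_loadu_si128 */
#include <xmmintrin.h>   /* _mm_prefetch, _mm_sfence */
#include <stddef.h>
#include <stdint.h>

#define STREAM_PREFETCH_AHEAD 256   /* assumed distance in bytes */

/* Hypothetical copy routine combining prefetch, non-temporal stores
   and a store fence. */
void *memcpy_sse2_stream(void *dest, const void *source, size_t n)
{
    assert(dest != NULL && source != NULL);
    char *d = (char *)dest;
    const char *s = (const char *)source;

    /* Head: copy bytes until the destination is 16-byte aligned. */
    while (n > 0 && ((uintptr_t)d & 15u) != 0) { *d++ = *s++; --n; }

    while (n >= 64) {
        if (n > STREAM_PREFETCH_AHEAD)   /* stay inside the source buffer */
            _mm_prefetch(s + STREAM_PREFETCH_AHEAD, _MM_HINT_NTA);
        __m128i x0 = _mm_loadu_si128((const __m128i *)s);
        __m128i x1 = _mm_loadu_si128((const __m128i *)(s + 16));
        __m128i x2 = _mm_loadu_si128((const __m128i *)(s + 32));
        __m128i x3 = _mm_loadu_si128((const __m128i *)(s + 48));
        /* Non-temporal stores bypass the cache. */
        _mm_stream_si128((__m128i *)d, x0);
        _mm_stream_si128((__m128i *)(d + 16), x1);
        _mm_stream_si128((__m128i *)(d + 32), x2);
        _mm_stream_si128((__m128i *)(d + 48), x3);
        d += 64; s += 64; n -= 64;
    }
    _mm_sfence();   /* make the streamed stores globally visible */

    while (n > 0) { *d++ = *s++; --n; }   /* tail: 0..63 bytes */
    return dest;
}
```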

Here are a couple of more things to consider:

In the case of very large data sets, greater than 512MB or so, an OpenMP-based memcpy-like function will be more effective than a single-threaded function. The most interesting thing is that with OpenMP it takes only minutes to implement. It is also not too difficult to use the auto-parallelization of modern C++ compilers for extreme cases.
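Something along these lines, as a sketch only. The name memcpy_omp and the 1 MiB chunk size are assumptions. It has to be built with OpenMP enabled to run in parallel; without it the pragma is ignored and the loop runs serially, which is still correct.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK_BYTES ((size_t)1 << 20)   /* assumed chunk size: 1 MiB */

/* Hypothetical parallel copy: splits the buffer into fixed-size chunks
   and lets each OpenMP thread memcpy its own chunks. */
void *memcpy_omp(void *dest, const void *source, size_t n)
{
    assert(dest != NULL && source != NULL);
    char *d = (char *)dest;
    const char *s = (const char *)source;
    long long chunks = (long long)((n + CHUNK_BYTES - 1) / CHUNK_BYTES);

#pragma omp parallel for schedule(static)
    for (long long c = 0; c < chunks; ++c) {
        size_t off = (size_t)c * CHUNK_BYTES;
        /* The last chunk may be shorter than CHUNK_BYTES. */
        size_t len = (n - off < CHUNK_BYTES) ? (n - off) : CHUNK_BYTES;
        memcpy(d + off, s + off, len);
    }
    return dest;
}
```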

For example, if I need to copy an 8K block of memory I will use what I have by default, that is, CRT-based memcpy because I know that up to 64K there won't be any performance improvements.
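In code, that rule of thumb could be a small dispatcher like the one below. The name copy_by_size and the callback parameter are hypothetical, and the 64K threshold is simply the figure quoted above, not something measured here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LARGE_COPY_THRESHOLD ((size_t)64 * 1024)   /* 64K, as quoted above */

/* Any memcpy-compatible routine, e.g. an SSE2 or OpenMP variant. */
typedef void *(*large_copy_fn)(void *, const void *, size_t);

/* Hypothetical dispatcher: small blocks go to the CRT memcpy, larger
   blocks to whatever specialised routine the caller supplies. */
void *copy_by_size(void *dest, const void *source, size_t n,
                   large_copy_fn large_copy)
{
    assert(large_copy != NULL);
    if (n <= LARGE_COPY_THRESHOLD)
        return memcpy(dest, source, n);
    return large_copy(dest, source, n);
}
```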

>>...What header must i include after /Quse-intel-optimized-headers to make use of the ICL memcpy() ? Can't find any
>>documentation on it anywhere.

You need to enable the "Enable Intrinsic Functions" option (/Oi) in the project settings or on the command line.

If you invoke memcpy explicitly and don't get a link failure, it means you are using a memcpy from the compiler support library (aside from a few cases where a compiler may view that a pair of in-line instructions performs it better). You would be able to see from /Qopt-report or by using dumpbin whether it was a substitution of intel_fast_memcpy.  Again, chances are you got the same substitution even though your source code isn't technically correct.
