memcpy performance

Abstract

Normal usage of memcpy() with gcc presents serious performance issues.  The implications, and some possible avoidance measures, are discussed for both 32- and 64-bit builds.


Background

gcc on Linux carries on the Unix tradition from the early days of C, when the run-time libraries weren't considered part of the compiler.  Thus, there is little coordination between compiler and library versions.  A normal compiler update will not update library functions such as memcpy().

New compiler installations, for example those from Intel, inherit most of their run-time support from what is provided (at least optionally) in the Linux installation.  This contrasts with the situation on Windows, where Intel compilers require a Microsoft compiler installation to provide the libraries, while gcc requires companion run-time support such as cygwin or mingw.

Much of the OS support, as well as the C run-time support, on Linux is incorporated in the glibc library.  Although this is a Free Software Foundation project, and the bug report database 'glibc bugzilla' is open to public inspection, the developer communications don't appear on a public list.  There is no visible public help forum or mailing list.  It is not generally feasible to update glibc except as supported by the Linux distribution.

Intel compilers modify the situation; for example, they pre-empt a few of the standard C functions, including memcpy().  As combined gcc/icc usage is supported, this leads to confusion about which run-time version is in effect.


Characteristics of memcpy()

memcpy(), as defined in the C standard, copies a contiguous region of memory, with the size measured in characters.  The source and destination regions are assumed not to overlap, but no checking is performed.  The declaration is required in <string.h>.  A common C prototype is:

void *memcpy (void *restrict __dest, const void *restrict __src, size_t __n);

The return value is defined to be __dest.

glibc 2.6 <string.h> is consistent with this, but it uses gcc variants of const and restrict (spelled __const and __restrict) which are accepted by both gcc and g++.

Compiler optimization is facilitated by the const and restrict qualifiers, which assure the compiler that no hidden agent can modify the qualified objects.  If there is such a hidden influence, as when the two regions overlap, the behavior is undefined, so it does not matter what the compiler does.
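The non-overlap contract can be illustrated with a minimal sketch.  The helper names shift_right and copy_returns_dest are made up for this example and are not part of any library.  Where the regions overlap, memmove() is the function with defined behavior.  Where they are disjoint, memcpy() applies and returns its destination argument.

```c
#include <string.h>

/* Hypothetical helper: shift the first len bytes of buf right by one.
   Source and destination overlap, so memmove() is required; calling
   memcpy() here would have undefined behavior. */
static void shift_right(char *buf, size_t len)
{
    memmove(buf + 1, buf, len);
}

/* Hypothetical helper: copy between disjoint buffers and report whether
   memcpy() returned the destination pointer, as the standard requires. */
static int copy_returns_dest(char *dst, const char *src, size_t len)
{
    return memcpy(dst, src, len) == dst;
}
```

The caller remains responsible for knowing whether the regions overlap; neither function checks.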


memcpy() usage

Styles vary as to whether memcpy() usage is restricted to character strings of reasonable length, or may be used as a generic data movement for large chunks of memory.  The void (as opposed to char) data types clearly are meant to facilitate usage with a variety of data types.

On a cache architecture machine, one might argue that memcpy() is not necessarily appropriate for data movements so large that cache bypass should be invoked. 

For example, with icc, if contiguous data regions are moved by a for() loop, #pragma loop count may be used to inform the compiler of the sizes for which the code should be optimized.  If memcpy() were used on an operand half or more the size of the cache, the data pointed to by __dest in the prototype above would already have been evicted, owing to limited cache capacity, by the time the copy completes.  Thus, it would be better to use a loop with a compiler-visible #pragma loop count.  That would suggest the use of "non-temporal" parallel stores, so as to avoid wasting time on filling and evicting destination cache lines.
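As a rough illustration of non-temporal stores, the sketch below copies a large block with the SSE2 streaming-store intrinsic.  It assumes an x86 target with SSE2 and a destination that is 16-byte aligned.  The function name copy_nontemporal is hypothetical, and no claim is made about the size at which this becomes faster than an ordinary copy.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Hypothetical helper: copy n bytes using cache-bypassing stores.
   Assumption: dst is 16-byte aligned; src may have any alignment. */
static void copy_nontemporal(void *dst, const void *src, size_t n)
{
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    size_t chunks = n / 16;

    for (size_t i = 0; i < chunks; ++i) {
        __m128i v = _mm_loadu_si128(s + i);  /* unaligned load (movdqu) */
        _mm_stream_si128(d + i, v);          /* non-temporal store (movntdq) */
    }
    _mm_sfence();  /* order the streamed stores before later stores */

    /* Copy the tail of fewer than 16 bytes one byte at a time. */
    unsigned char *dt = (unsigned char *)dst + chunks * 16;
    const unsigned char *st = (const unsigned char *)src + chunks * 16;
    for (size_t i = 0; i < n % 16; ++i)
        dt[i] = st[i];
}
```

Because the stores bypass the cache, this style suits a destination that will not be read again soon; for small operands an ordinary copy is likely preferable.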

libgfortran exemplifies the use of memcpy() for wider data types.  Current libgfortran eoshift, cshift, pack, unpack, and reshape intrinsics rely on memcpy.  If, in the future, these memcpy instances are replaced by specific data type moves, the source code will ex .


Generic optimization of memcpy()

Common optimized implementations of memcpy() provide for algorithms which vary with data alignment and size.  The code should look for opportunities to perform as much of the work as possible with the widest data type supported by the hardware.  From the beginning of the i386, at least 32-bit data types have been useful for this purpose.  All current implementations of i386 have at least partial internal support for 128-bit moves.  If they do not actually support 128 bits in one chunk, they are capable of pipelining operations on the two halves of suitably aligned 128-bit data.  So, the typical memcpy would move bytes until a suitable 16-byte aligned boundary is reached, then move chunks of 16 bytes at a time, following up by moving any remaining bytes.  SSE2 movdqu/movdqa instructions were introduced specifically for this purpose.  movdqa is suitable for 16-byte aligned operands.  movdqu is suitable for fetching byte-aligned groups of 16 bytes from memory, but not useful for storing them.
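The three phases described above (leading bytes, aligned 16-byte chunks, trailing bytes) might be sketched with SSE2 intrinsics as follows.  This is an illustration under the assumption of an x86 target with SSE2, not the glibc or icc implementation; the name copy_wide is hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch of a three-phase copy; not a tuned implementation. */
static void *copy_wide(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Phase 1: byte moves until the destination is 16-byte aligned. */
    while (n > 0 && ((uintptr_t)d & 15) != 0) {
        *d++ = *s++;
        --n;
    }
    /* Phase 2: 16 bytes at a time; unaligned load, aligned store. */
    while (n >= 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)s);  /* movdqu */
        _mm_store_si128((__m128i *)d, v);                 /* movdqa */
        d += 16;
        s += 16;
        n -= 16;
    }
    /* Phase 3: remaining bytes. */
    while (n > 0) {
        *d++ = *s++;
        --n;
    }
    return dst;
}
```

Aligning on the destination rather than the source follows the remark that movdqu is useful for loads but not for stores.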

The Barcelona architecture prefers movaps for stores.  movaps, movdqa, and movapd are functionally equivalent, with movaps having shorter encoding.


memcpy() compiled with vectorizing compilers

All current compilers for Linux should support SSE2 auto-vectorization of the following source:

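#include <string.h>

/* Parentheses around the name suppress any macro definition of memcpy.
   With newer gcc, -fno-tree-loop-distribute-patterns may be needed so the
   loop is not itself turned back into a memcpy() call. */
void *(memcpy)(void *restrict b, const void *restrict a, size_t n)
{
    char *s1 = b;
    const char *s2 = a;

    for (; 0 < n; --n)   /* byte loop; the vectorizer widens it */
        *s1++ = *s2++;
    return b;
}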

This accomplishes the generic optimization goal of shifting between byte-wise and wide moves, without the evident problems of the glibc asm implementation.

The most significant difference between gcc and icc is that icc generates both movdqu/movdqa and movdqa/movdqa code, with the choice made at run time.  The latter should be faster for the case where the source and destination are relatively aligned.  This would happen frequently when operating on 64-bit data types.

gcc also supports aggressive unrolling with the option -funroll-loops.  That is likely to show an advantage only where there are no significant cache misses, yet the operand size is 128 bytes or more.

If the restrict qualifiers are omitted, a compiler may report loop vectorization, but the run-time version selection may choose the vectorized version only under relatively unusual circumstances.  This puts one no further ahead than with the mediocre glibc versions.


32-bit gcc memcpy()

32-bit gcc is aggressive about in-lining memcpy() into optimized code (-O2/-O3).  Unless the compiler has sufficient information to know of a contrary preference, 'rep movsl' will appear in the .o or .s code output.   No doubt, this was a good decision for support of architectures prior to P4, which introduced movdqu/movdqa instructions.  It still is a good mode for small (but not too small) size moves.  Evidently, a move which could be done with a single pair of instructions, if known at compile time, should be so optimized.  gfortran library performance testing shows satisfactory performance on operands up to 40 bytes with the default rep movsl.
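The case of a move whose size is known at compile time can be sketched as below.  The struct pair type and the copy_pair function are hypothetical, chosen only to give a small constant size.  Whether the compiler actually expands the call in line depends on its version, the target, and the optimization flags.

```c
#include <string.h>

/* Hypothetical small record, chosen so that its size is a small
   compile-time constant. */
struct pair {
    int first;
    int second;
};

/* The size argument is a compile-time constant, so an optimizing compiler
   is free to expand this memcpy() in line instead of calling the library. */
static void copy_pair(struct pair *dst, const struct pair *src)
{
    memcpy(dst, src, sizeof *dst);
}
```

Inspecting the .s output (for example with gcc -O2 -S) is the way to confirm which code was generated for a given build.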

This aggressive in-lining behavior may be removed by the gcc option -fno-builtin-memcpy.  By itself, this option is inadvisable, as the separately compiled memcpy() usually provided in glibc is ineffective.  So, this option should be combined with inclusion of a vectorized memcpy().  The gfortran cshift intrinsic can be sped up by 30%, even on the architectures where only 64-bit data are actually supported in hardware.  Due to the time spent calling an external function, the change from builtin-memcpy doesn't show any consistent gains with other gfortran intrinsics which would compensate for the loss of performance on short moves.


x86-64 gcc memcpy()

As x86-64 supports no non-SSE2 platforms, builtin-memcpy is not present.  The glibc asm code apparently dates from the initial implementation, before vectorizing compilers were available.  Analysis by profilers such as Intel Performance Tuning Utility shows a number of cases where the glibc memcpy never shifts into wide move mode.  Among these are the gfortran eoshift, pack, and unpack intrinsics.  Linking in a user-compiled memcpy(), using the source code presented above, nearly always improves performance.  In the cases where glibc fails to find the needed wide moves, performance increases by a factor of 3.

This has been done already in the 64-bit Windows gcc/gfortran distribution.

As the time spent invoking a separately compiled function is relatively large when the operands are small, methods to reduce function call overhead would be interesting.  If the static qualifier is added in the source code presented above, it can be added to selected source files by #include, so that those sources see only the local memcpy().  This has been shown to improve performance by 20% for libgfortran intrinsics with 40-byte operands.  With this method, performance for large operands is unacceptable, even worse than with glibc.  In order to optimize static functions, gcc disallows use of xmm registers, so avoiding the time spent saving and restoring them.  This rules out vectorization.  As there is still significant overhead in the function call, it can never perform as well as the 32-bit builtin-memcpy.
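A minimal sketch of the per-file static approach follows.  The name local_memcpy and the idea of a header called local_memcpy.h are assumptions made here so that the definition does not clash with the <string.h> declaration of memcpy(); the original experiment may have been arranged differently.

```c
#include <stddef.h>

/* Contents of a hypothetical header, e.g. "local_memcpy.h", included by
   each source file that wants its own private copy routine. */
static void *local_memcpy(void *restrict b, const void *restrict a, size_t n)
{
    char *s1 = b;
    const char *s2 = a;

    for (; 0 < n; --n)
        *s1++ = *s2++;
    return b;
}

/* Hypothetical caller: a 40-byte move, the operand size discussed above. */
static void copy_five_doubles(double *dst, const double *src)
{
    local_memcpy(dst, src, 5 * sizeof(double));
}
```

Because the function has internal linkage, each including source file gets its own copy, which the compiler may inline or call cheaply.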

Supplying a vectorized memcpy() will always improve performance over glibc.  For short operands, significant performance improvement can be achieved only by partial inlining, such as a static memcpy for each source module, but that ruins performance for large operands.


ia64 result

memcpy() compiled with gcc is a small improvement over the glibc or icc memcpy() for the gfortran eoshift, pack, and unpack intrinsics, but shows a serious loss for large strings with the cshift intrinsic.  gcc doesn't shift automatically to wide moves.  icc accomplishes it by calling a library function, which appears to have the same difficulty as glibc in optimizing the eoshift, pack, and unpack cases.  A 25% gain appears possible for those cases by writing gcc code which shifts to int moves, with optimum performance for the cshift case as well.
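One way such code might shift to int moves is sketched below.  It is not the code behind the measurements quoted above, and the names copy_by_int and word_t are hypothetical.  The may_alias attribute is a gcc extension, used here so that int-wide accesses to arbitrary data do not violate the aliasing rules.

```c
#include <stddef.h>
#include <stdint.h>

/* may_alias is a gcc extension; it lets word_t access objects of any type. */
typedef unsigned int __attribute__((may_alias)) word_t;

/* Hypothetical sketch: byte moves until aligned, then int-wide moves
   when source and destination share the same alignment. */
static void *copy_by_int(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    if (((uintptr_t)d - (uintptr_t)s) % sizeof(word_t) == 0) {
        /* Head: advance to an int boundary (valid for both pointers). */
        while (n > 0 && (uintptr_t)d % sizeof(word_t) != 0) {
            *d++ = *s++;
            --n;
        }
        /* Body: one int per iteration. */
        while (n >= sizeof(word_t)) {
            *(word_t *)d = *(const word_t *)s;
            d += sizeof(word_t);
            s += sizeof(word_t);
            n -= sizeof(word_t);
        }
    }
    /* Tail, or the whole copy when the pointers are not co-aligned. */
    while (n > 0) {
        *d++ = *s++;
        --n;
    }
    return dst;
}
```

When the two pointers are not co-aligned the sketch falls back to byte moves throughout, which avoids unaligned int accesses at the cost of speed.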


Summary

Achieving good performance over the range of common usage of memcpy() presents insoluble dilemmas.  Providing a vectorized version, as Intel compilers do and as the user can easily do, is a good solution for large operands on x86-64, and cannot significantly hurt small-operand performance on x86-64.

As gcc 32-bit makes aggressive use of inlined string moves with builtin-memcpy, hardware implementations which automatically switch into wide move modes would be more attractive than software support.  Considering that this idea has been around for over 30 years, we don't hold our breath for it to re-appear.

 

