slow memcpy

Hello,

I'm optimizing an application using ICC. VTune hotspot analysis shows that some wrapper functions for the C standard memcpy() function are among the functions with the highest CPU time. That's quite unexpected, as they only copy small (250-500 B) user-defined data structures. The assembly shows a call to _intel_fast_memcpy, and a red bar marks the function's performance as poor.

In some other cases the memcpy wrapper functions are replaced with inlined SSE instructions, showing that SSE4.1 support is enabled.

I was wondering if there are ways to make the poor performing memcpy wrappers more efficient.

I have tried to align the source and destination pointers to 16 bytes through __attribute__((aligned(16))), but that didn't help.

Thanks

Here is a screenshot of one of the memcpy wrappers for the user-defined data structure pb_intra_t, which contains some other user-defined data structures.

Attachments: screenshot.png (110.56 KB)

Hello,

The color indicators in Intel® VTune™ Amplifier XE can be changed. Red does not always mean bad; it depends on the individual context.
In your case the function "copy_pb_intra" consumes approx. 90 ms of your overall runtime. As I don't know the overall runtime, I cannot say what percentage this amounts to. If we are talking about multiple seconds of runtime, this might be negligible.
Nevertheless, memcpy is a rather compute-intensive function, and time spent there is not wasted: we provide our own optimized versions of it for different processor generations (we put quite some work into making them fast).
The function _intel_fast_memcpy (our alias for memcpy) is a wrapper that dispatches to the correct implementation for the underlying processor at runtime. We do not recommend bypassing this dispatch routine, as doing so makes applications less portable. Note also that _intel_fast_memcpy always requires a function call; if you copy only a small amount of data, the call overhead is noticeable. That can be addressed by design decisions in the application, e.g. chaining memory locations and copying them in one go (kilobytes or megabytes rather than just a few bytes).
Another approach is to inline the copy routines. If you compile with IPO (/Qipo on Windows*, -ipo on Linux*), the compiler might be able to inline memcpy and hence avoid the overhead of the call.

If memcpy really accounts for a big portion of the overall runtime, you might consider copying more data at once. The dispatching overhead is negligible (just a few instructions and a jump, not a call). The call to memcpy itself is more costly and can only be avoided if the compiler can inline it via IPO.

Best regards,

Georg Zitzlsberger

In case your strings aren't long enough to amortize the overhead of calling a library memcpy function, you have the option of using a simple for loop with #pragma simd or #pragma omp simd aligned(...) to suppress the function call and request inlined SIMD code. I hope Georg didn't have this in mind when he made the comment about portability. The omp simd form requires the -openmp-simd compile option if using the 13.1 compiler.
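A minimal sketch of this suggestion: a plain for loop marked with #pragma omp simd so the compiler emits inline SIMD copy code instead of calling the library memcpy. The function name copy_block is illustrative; the aligned clause promises 16-byte alignment of both pointers and is only honored when the OpenMP SIMD option is enabled (the pragma is otherwise harmlessly ignored).

```c
#include <stddef.h>

/* Illustrative inline-copy loop; the aligned clause asserts that both
   pointers are 16-byte aligned, so the caller must guarantee that. */
void copy_block(float *restrict dst, const float *restrict src, size_t n)
{
#pragma omp simd aligned(dst, src : 16)
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i];
}
```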

There are some new interprocedural optimizations for _intel_fast_memcpy in the 14.0 compiler, but I don't know in which situations they would help.

I'm getting a significant improvement using __assume_aligned(ptr, n) when n = 64, not 16 as I would expect. Any guess why that happens?

Hello,

I assume this is not directly related to the memcpy discussion above.
__assume_aligned(ptr, n) tells the compiler to assume that the pointer "ptr" only points to memory locations starting at an address that is a multiple of n.
Our compiler can generate different versions of code (multi-versioning) optimized for different alignments. If alignment cannot be determined at compile time (e.g. pointer values are defined at runtime by external allocation via malloc/realloc/new etc.), it has to add a runtime test for alignment and dispatch to the correct version. That test can introduce slight(!) overhead and is usually not noticeable.
However, in cases where (multi-versioned) functions with such tests are called very frequently and their actual work is very small, those tests can become noticeable.

Note that if you use __assume_aligned(...), the compiler relies on what you told it. If you declare a pointer's memory location aligned and it is not, your code will likely fail, sooner rather than later.

For good performance it is important to care about alignment. __assume_aligned(...) is one aspect; allocating aligned memory is another (e.g. posix_memalign on Linux*).
The same is true for "#pragma vector aligned" or its Fortran counterpart (directive).
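To illustrate the two aspects together, here is a hedged sketch: allocate 64-byte-aligned memory with posix_memalign, then tell the compiler about the alignment. ICC spells the hint __assume_aligned(p, 64); the GCC/Clang equivalent used below is __builtin_assume_aligned. The function names are illustrative, not from the thread.

```c
#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>

/* Allocate n floats at a 64-byte-aligned address, or return NULL. */
float *alloc_floats_aligned64(size_t n)
{
    void *p = NULL;
    return posix_memalign(&p, 64, n * sizeof(float)) == 0 ? (float *)p : NULL;
}

/* With the alignment hint, the compiler can vectorize this loop without
   emitting runtime alignment tests or a peeling prologue. */
void scale_aligned(float *dst, const float *src, size_t n, float factor)
{
    float *d = __builtin_assume_aligned(dst, 64);
    const float *s = __builtin_assume_aligned(src, 64);
    for (size_t i = 0; i < n; ++i)
        d[i] = s[i] * factor;
}
```

As Georg notes above, the hint is a promise: passing a pointer that is not actually 64-byte aligned is undefined behavior.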

Best regards,

Georg Zitzlsberger

Several recent Intel CPUs benefit significantly from 32-byte data alignment over 16-byte alignment, even though the latter is sufficient for aligned SSE instruction execution. Asserting 32- or 64-byte alignment helps the compiler when generating AVX-256 instructions, provided the data actually are at least 32-byte aligned.

You may wish to compare opt-report and asm file output to see what use the compiler may be making of your alignment assertions.

Quote:

Georg Zitzlsberger (Intel) wrote:

However, in cases where (multi-version) functions with such tests are called very frequently, and their actual work is very small, those tests can become noticeable.

I think that's the case of my application.

Thanks

I can only suppose that Intel's implementation of memcpy on the latest architectures (starting from Sandy Bridge) can exploit the two load ports.

Quote:

iliyapolak wrote:

I can only suppose that Intel's implementation of memcpy on the latest architectures (starting from Sandy Bridge) can exploit the two load ports.

memcpy doesn't normally benefit from the 2 load ports on Sandy or Ivy Bridge on account of the single store port.  Presumably, Haswell could change this.

Actually, on Haswell it could be two loads and one store per cycle.

I just wanted to say that creating a multi-threaded memcpy yields a significant performance increase if you're usually copying large memory blocks; it may be sufficient to use the "/Qparallel" compiler flag, though.

Quote:

Inge H. wrote:

I just wanted to say that creating a multi-threaded memcpy yields a significant performance increase if you're usually copying large memory blocks; it may be sufficient to use the "/Qparallel" compiler flag, though.

With OpenMP, in the 14.0 compiler, a parallel loop such as

#pragma omp parallel for simd
for (int i = 0; i < N; ++i) a[i] = b[i];

may be more effective than a memcpy.

You can also add vectorization and loop unrolling to copy large memory blocks.

>>...I was wondering if there are ways to make the poor performing memcpy wrappers more efficient...

I've spent a lot of time on R&D related to different memcpy functions, and I think only a solid reproducer from you could help us understand what exactly is wrong.

Also, you could try similar functions from the IPP library to verify your results. However, I have not seen big differences (just a couple of percent) between CRT memcpy, IPP functions, and a fully vectorized for loop copying data in memory.

>>>may be more effective than a memcpy.>>>

Probably due to vectorization and loop unrolling.

Quote:

iliyapolak wrote:

>>>may be more effective than a memcpy.>>>

Probably due to vectorization and loop unrolling.

In view of saving the time of a library call, and of not needing to check for cases which aren't aligned to the data type. The optional aligned clause (supposedly the same effect as __assume_aligned) would enable inline code generation that skips all alignment checks.

When forcing inline code by means such as this, the check internal to _intel_fast_memcpy for a size big enough to invoke nontemporal stores is replaced by a compile-time decision according to the streaming-stores option. As the OP mentioned strings up to 500 bytes, a run-time length check for nontemporal stores would simply waste time.

In favor of memcpy: on certain Intel CPUs, those substitutions improved code locality and reduced instruction TLB misses compared with many separate inline expansions of individual for loops.

>>...I was wondering if there are ways to make the poor performing memcpy wrappers more efficient...

It is not enough to use some set of SIMD instructions; additional techniques need to be used, like prefetching, non-temporal moves, and fencing:
...
void FastMemCopy( void *pDst, const void *pSrc, size_t nBytes )
{
    __asm
    {
        mov     esi, dword ptr pSrc             ; source pointer
        mov     edi, dword ptr pDst             ; destination pointer
        ...
    MainLoop:
        ...
        prefetch?? [ esi + some_offset ]        ; prefetch ahead of the copy
        ...
        movaps  xmm0, xmmword ptr [ esi ]       ; aligned 16-byte load
        ...
        movntps xmmword ptr [ edi ], xmm0       ; non-temporal ( streaming ) store
        ...
        jmp     MainLoop                        ; ( loop exit test elided )

    MainLoopEnd:
        sfence                                  ; order the streaming stores
    }
}
...
Note 1: This is a 32-bit version; a 64-bit version needs the R-prefixed registers (rsi, rdi, etc.).

Note 2: Igor Levicki recently posted the source code of his 'FastMemCopy'-like function; it should be available on IDZ.
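For readers who prefer intrinsics over inline assembly, here is a hedged C rendering of the same idea: aligned SSE loads, non-temporal (streaming) stores, and a closing sfence. It assumes both pointers are 16-byte aligned and nBytes is a multiple of 16; the function name is illustrative, and it omits the prefetching and tail handling a real implementation would have.

```c
#include <xmmintrin.h>  /* SSE: _mm_load_ps, _mm_stream_ps, _mm_sfence */
#include <stddef.h>

/* Streaming copy sketch; caller must guarantee 16-byte alignment of both
   pointers and a byte count that is a multiple of 16. */
void copy_streaming(void *pDst, const void *pSrc, size_t nBytes)
{
    float *d = (float *)pDst;
    const float *s = (const float *)pSrc;
    for (size_t i = 0; i < nBytes / 16; ++i) {
        __m128 v = _mm_load_ps(s + 4 * i);  /* movaps: aligned 16 B load  */
        _mm_stream_ps(d + 4 * i, v);        /* movntps: bypasses the cache */
    }
    _mm_sfence();  /* make the streaming stores visible in order */
}
```

As discussed above, streaming stores only pay off for copies much larger than the cache; for the OP's 250-500 B structures they would hurt.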

I'd like to share my experience evaluating IPP memcpy-like functions vs. fully vectorized memcpy-like for loops.

I have not seen differences in performance greater than a couple of percent, plus or minus, because my tests weren't very deterministic (a normal thread priority was used).

I understand that the subject is very interesting; however, the Intel, Microsoft, GCC, etc. teams have spent a lot of time on code optimization, vectorization, and so on, and I personally consider DIY functions not fully effective. It looks like reinventing the wheel.

As there is no faster way for memcpy to implement a copy where SSE moves work, you wouldn't expect memcpy to perform any magic beyond switching to nontemporal stores for very long moves. I don't know whether these memcpy functions would benefit from recognizing Haswell and switching to AVX-256.

In a case like the STREAM COPY benchmark, you may improve performance by using an omp parallel for simd loop (preventing memcpy substitution) and allowing -opt-streaming-stores to kick in.

MKL BLAS offers ?copy functions which could be used to avoid the overhead of checking for odd-byte alignment.  These automatically invoke OpenMP threading if you so choose by your combination of link options and run-time environment setting.

>>>movaps xmm0, xmmword ptr [ esi ]>>>

You can use more registers (at most two loads can issue per cycle) to utilize both load ports. For example:

movaps xmm1,xmmword ptr [esi+16]

movaps xmm2,xmmword ptr [esi+32]


and so on.

I think there is no need to use the ptr size directive when the pointer is passed in the esi register and the size (a 128-byte block) is passed in, for example, the ecx register.
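The same unrolling idea expressed with intrinsics, as a hedged sketch: two independent aligned loads per iteration so both load ports can be kept busy, followed by two stores through the single store port. It assumes 16-byte-aligned pointers and a byte count that is a multiple of 32; the function name is illustrative.

```c
#include <emmintrin.h>  /* SSE2: _mm_load_si128, _mm_store_si128 */
#include <stddef.h>

/* Copy in 32-byte chunks using two XMM registers per iteration, so the
   two consecutive loads are independent and can issue on both load ports. */
void copy_unrolled2(void *pDst, const void *pSrc, size_t nBytes)
{
    char *d = (char *)pDst;
    const char *s = (const char *)pSrc;
    for (size_t off = 0; off < nBytes; off += 32) {
        __m128i v0 = _mm_load_si128((const __m128i *)(s + off));       /* first load  */
        __m128i v1 = _mm_load_si128((const __m128i *)(s + off + 16));  /* second load */
        _mm_store_si128((__m128i *)(d + off), v0);       /* stores share one port */
        _mm_store_si128((__m128i *)(d + off + 16), v1);
    }
}
```

As noted below in the thread, the single store port on Sandy/Ivy Bridge is still the bottleneck for a pure copy, so the gain from the second load port is limited there.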

>>You can use more registers(at maximum two registers will be used per single cycle) to utilize two load Ports...

The example I've provided is a short version of a real function; it simply demonstrates the internals of the implementation. It is not the complete source code of the function.

>>>The example I've provided is a short version of a real function; it simply demonstrates the internals of the implementation. It is not the complete source code of the function.>>>

By the way, which memory copy routines did you test?
