I'm optimizing an application built with ICC. VTune hotspot analysis shows that some wrapper functions around the C standard memcpy() function are among the functions with the highest CPU time. That is quite unexpected, as they only copy small (250-500 B) user-defined data structures. The assembly shows a call to _intel_fast_memcpy, with a red bar marking the function as performing poorly.
In some other cases the memcpy wrapper functions are replaced inline with a few MMX instructions instead of a call, which I take as a sign that SIMD code generation (SSE4.1 support) is enabled.
Is there a way to make the poorly performing memcpy wrappers more efficient?
I have tried aligning the source and destination to 16 bytes with __attribute__((aligned(16))), but that didn't help.
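For illustration, the wrappers look roughly like the sketch below. The names record_t and copy_record are placeholders rather than the real code, and the 256-byte payload is just an example size in the range mentioned above. The assert is only a debugging aid I'm assuming here, to confirm that the pointers really are 16-byte aligned at run time: the attribute on the type covers objects the compiler lays out itself, but not necessarily every pointer passed in.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 256-byte record standing in for the real data structures. */
typedef struct {
    unsigned char payload[256];
} __attribute__((aligned(16))) record_t;

/* Placeholder for the kind of memcpy wrapper that shows up as a hotspot. */
static void copy_record(record_t *dst, const record_t *src)
{
    /* Debug-only check: both pointers should be 16-byte aligned. */
    assert(((uintptr_t)dst % 16) == 0);
    assert(((uintptr_t)src % 16) == 0);

    /* The size is a compile-time constant, so the compiler is free to
       inline the copy or to emit a call to its own memcpy routine. */
    memcpy(dst, src, sizeof *dst);
}
```

Calling copy_record(&b, &a) on two record_t variables copies all 256 bytes from a to b.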