Performance difference between 32bit and 64bit memcpy

We have Core2 machines (Dell T5400) with XP64.

We observe that when running 32-bit processes, the throughput of memcpy is on the order of 1.2GByte/s; however memcpy in a 64-bit process achieves about 2.2GByte/s (or in fact 2.4GByte/s with the Intel compiler CRT's memcpy).

While the initial reaction might be to just explain this away as due to the wider registers available in 64-bit code, we observe that our own memcpy-like SSE assembly code (which uses 128-bit wide loads and stores regardless of the 32/64-bitness of the process) demonstrates identical upper limits on the copy bandwidth it achieves.

I'm puzzled as to the origin of this difference... Do 32-bit processes have to jump through some extra WOW64 hoops to get at the RAM? Is it something to do with TLBs or prefetchers or... what?

Thanks for any insight.
Tim

You should expect results like those you obtain with your own code: identical upper bounds on the copy bandwidth.

I would guess that the different numbers you got with 32 and 64 bit processes are likely to be due to their executing different code. To verify that, one would have to look at the specific implementations. But that is the most likely cause: different code giving different performance.

For a piece of code with a very simple description, there are a very large number of different ways to write a memcpy(). You can get very different performance depending on how you structure your code and what sort of data you feed to it. But by and large, the same peak bandwidth should be obtainable in either 32 or 64 bit mode.

Hi Tim,

It looks like the 32-bit implementation of memcpy is far from optimal for the modern CPU you're using. It's probably still compatible with 386 CPUs even, and not taking advantage of MMX or SSE. The 64-bit version knows the CPU supports everything up to at least SSE2 so it can achieve better bandwidth.

In particular I believe the older 32-bit memcpy implementations make use of "rep movsd", which is microcoded on processors with out-of-order instruction execution. If I recall correctly it generates one micro-instruction per cycle, so unlike a regular copy loop it won't issue a load and a store in parallel. It might also impose limitations on prefetching and such.

Anyway, what used to be fast on a 386 is no longer optimal today. So if performance is critical I'd certainly advise using your own memcpy implementation that makes use of all the processor's capabilities.

Cheers,

Nicolas

Nicolas,

Thank you for checking the implementation of memcpy() on the 32 bit system. That was the most likely cause.

However, please let me update you on the REP MOVSD instruction. It is indeed microcoded, and exactly how it is implemented differs between Intel CPUs. However, it can (and does) issue multiple micro-instructions per cycle, and can issue loads and stores at the same time on most Intel CPU products. There have been substantial improvements in its performance over the years, with the newest products having the best implementations. It is possible to get better performance with your own home-grown code if you do everything right, but it is becoming more challenging to do that as REP MOVSD improves in performance. The latest optimization guide has more information about the performance and usage of this instruction (see http://www.intel.com/products/processor/manuals/, section 2.2.6).
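
For anyone who wants to time REP MOVS against a hand-written loop, here is a minimal sketch. It assumes GCC or Clang targeting x86 (AT&T syntax, where the dword form REP MOVSD is spelled `rep movsl`) and falls back to std::memcpy elsewhere. The name `rep_movs_copy` is hypothetical, and this is not what any particular CRT does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: copy `bytes` bytes with REP MOVS where available.
// Assumes non-overlapping buffers and a clear direction flag, which the
// usual x86 calling conventions guarantee on function entry.
void rep_movs_copy(void* dst, const void* src, std::size_t bytes)
{
    assert(dst != nullptr && src != nullptr);
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    std::size_t dwords = bytes / 4;  // bulk of the copy, 4 bytes per step
    std::size_t tail   = bytes % 4;  // 0..3 leftover bytes
    // "rep movsl" is the AT&T spelling of REP MOVSD (dword string move).
    asm volatile("rep movsl"
                 : "+D"(dst), "+S"(src), "+c"(dwords)
                 :
                 : "memory");
    // dst and src were advanced by the first instruction; finish the tail.
    asm volatile("rep movsb"
                 : "+D"(dst), "+S"(src), "+c"(tail)
                 :
                 : "memory");
#else
    // Assumption: on other compilers or targets, just defer to the library.
    std::memcpy(dst, src, bytes);
#endif
}
```

Calling `rep_movs_copy(dst, src, n)` in place of memcpy in a timing loop gives a rough comparison; how it fares will depend on the CPU generation, as described above.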

Thanks Seth, I wasn't aware of the micro-code optimizations.

Quoting - Seth Abraham (Intel)

I would guess that the different numbers you got with 32 and 64 bit processes are likely to be due to their executing different code. To verify that, one would have to look at the specific implementations. But that is the most likely cause: different code giving different performance.

Yes, I've just been digging into this in more depth and finally got to the bottom of it.

In what follows, dst and src are 512 MByte std::vectors.
I'm using the Intel 10.1.029 compiler and CRT on a Dell Precision T5400.

On 64bit both
memcpy(&dst[0],&src[0],dst.size())
and
memcpy(&dst[0],&src[0],N) (where N is previously declared const size_t N=512*(1<<20);)
call
__intel_fast_memcpy
the bulk of which consists of:
000000014004ED80 lea rcx,[rcx+40h]
000000014004ED84 lea rdx,[rdx+40h]
000000014004ED88 lea r8,[r8-40h]
000000014004ED8C prefetchnta [rdx+180h]
000000014004ED93 movdqu xmm0,xmmword ptr [rdx-40h]
000000014004ED98 movdqu xmm1,xmmword ptr [rdx-30h]
000000014004ED9D cmp r8,40h
000000014004EDA1 movntdq xmmword ptr [rcx-40h],xmm0
000000014004EDA6 movntdq xmmword ptr [rcx-30h],xmm1
000000014004EDAB movdqu xmm2,xmmword ptr [rdx-20h]
000000014004EDB0 movdqu xmm3,xmmword ptr [rdx-10h]
000000014004EDB5 movntdq xmmword ptr [rcx-20h],xmm2
000000014004EDBA movntdq xmmword ptr [rcx-10h],xmm3
000000014004EDBF jge 000000014004ED80
and runs at ~2200 MByte/s.

But on 32bit
memcpy(&dst[0],&src[0],dst.size())
calls
__intel_fast_memcpy
the bulk of which consists of
004447A0 sub ecx,80h
004447A6 movdqa xmm0,xmmword ptr [esi]
004447AA movdqa xmm1,xmmword ptr [esi+10h]
004447AF movdqa xmmword ptr [edx],xmm0
004447B3 movdqa xmmword ptr [edx+10h],xmm1
004447B8 movdqa xmm2,xmmword ptr [esi+20h]
004447BD movdqa xmm3,xmmword ptr [esi+30h]
004447C2 movdqa xmmword ptr [edx+20h],xmm2
004447C7 movdqa xmmword ptr [edx+30h],xmm3
004447CC movdqa xmm4,xmmword ptr [esi+40h]
004447D1 movdqa xmm5,xmmword ptr [esi+50h]
004447D6 movdqa xmmword ptr [edx+40h],xmm4
004447DB movdqa xmmword ptr [edx+50h],xmm5
004447E0 movdqa xmm6,xmmword ptr [esi+60h]
004447E5 movdqa xmm7,xmmword ptr [esi+70h]
004447EA add esi,80h
004447F0 movdqa xmmword ptr [edx+60h],xmm6
004447F5 movdqa xmmword ptr [edx+70h],xmm7
004447FA add edx,80h
00444800 cmp ecx,80h
00444806 jge 004447A0
and runs at only ~1350 MByte/s.

HOWEVER,
memcpy(&dst[0],&src[0],N) (where N is previously declared const size_t N=512*(1<<20);)
compiles (on 32bit) to a direct call to
__intel_VEC_memcpy
the bulk of which consists of
0043FF40 movdqa xmm0,xmmword ptr [esi]
0043FF44 movdqa xmm1,xmmword ptr [esi+10h]
0043FF49 movdqa xmm2,xmmword ptr [esi+20h]
0043FF4E movdqa xmm3,xmmword ptr [esi+30h]
0043FF53 movntdq xmmword ptr [edi],xmm0
0043FF57 movntdq xmmword ptr [edi+10h],xmm1
0043FF5C movntdq xmmword ptr [edi+20h],xmm2
0043FF61 movntdq xmmword ptr [edi+30h],xmm3
0043FF66 movdqa xmm4,xmmword ptr [esi+40h]
0043FF6B movdqa xmm5,xmmword ptr [esi+50h]
0043FF70 movdqa xmm6,xmmword ptr [esi+60h]
0043FF75 movdqa xmm7,xmmword ptr [esi+70h]
0043FF7A movntdq xmmword ptr [edi+40h],xmm4
0043FF7F movntdq xmmword ptr [edi+50h],xmm5
0043FF84 movntdq xmmword ptr [edi+60h],xmm6
0043FF89 movntdq xmmword ptr [edi+70h],xmm7
0043FF8E lea esi,[esi+80h]
0043FF94 lea edi,[edi+80h]
0043FF9A dec ecx
0043FF9B jne ___intel_VEC_memcpy+244h (43FF40h)
and runs at ~2100MByte/s.
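
Throughput figures like the ones above can be reproduced with a small timing harness. The following is only a sketch: the helper name `measure_memcpy_mbytes_per_s`, the use of `std::vector<char>`, and the best-of-N policy are assumptions, not the code actually used for the measurements in this thread.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: copy src into dst `repeats` times and return the best
// observed throughput in MByte/s (1 MByte = 2^20 bytes). Taking the best of
// several runs hides the page-fault cost of touching the buffers first time.
double measure_memcpy_mbytes_per_s(std::vector<char>& dst,
                                   const std::vector<char>& src,
                                   int repeats)
{
    assert(!src.empty() && dst.size() == src.size() && repeats > 0);
    const double mbytes = static_cast<double>(src.size()) / (1024.0 * 1024.0);
    double best = 0.0;
    for (int i = 0; i < repeats; ++i) {
        const auto t0 = std::chrono::steady_clock::now();
        std::memcpy(&dst[0], &src[0], src.size());
        const auto t1 = std::chrono::steady_clock::now();
        const double seconds = std::chrono::duration<double>(t1 - t0).count();
        if (seconds > 0.0 && mbytes / seconds > best) {
            best = mbytes / seconds;
        }
    }
    return best;
}
```

With two 512 MByte vectors and a handful of repeats this should give numbers comparable to the ones quoted here, though the exact values depend on the machine and the CRT in use.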

I withdraw the claim that my own memcpy-like SSE code suffers from a
similar ~1300 MByte/s bandwidth limit in 32bit builds; I now don't have
any problems getting >2GByte/s on 32 or 64bit. The trick (as the above
results hint) is to use non-temporal ("streaming") stores (e.g. the
_mm_stream_ps intrinsic).
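
A minimal sketch of such a streaming copy loop is given here. It uses the integer intrinsic `_mm_stream_si128` (which maps to the MOVNTDQ seen in the listings) rather than `_mm_stream_ps`, assumes 16-byte aligned buffers and a size that is a multiple of 64, and ends with `_mm_sfence` to order the non-temporal stores. The name `stream_copy` is hypothetical, and on targets without SSE2 the sketch simply falls back to std::memcpy.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

#if defined(__SSE2__) || defined(_M_X64) || \
    (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>  // SSE2: _mm_load_si128, _mm_stream_si128, _mm_sfence
#define SKETCH_HAVE_SSE2 1
#else
#define SKETCH_HAVE_SSE2 0
#endif

// Hypothetical helper. Preconditions (assumed, checked by the asserts):
// dst and src are 16-byte aligned, bytes is a multiple of 64, no overlap.
void stream_copy(void* dst, const void* src, std::size_t bytes)
{
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(src) % 16 == 0);
    assert(bytes % 64 == 0);
#if SKETCH_HAVE_SSE2
    __m128i* d = static_cast<__m128i*>(dst);
    const __m128i* s = static_cast<const __m128i*>(src);
    const std::size_t vectors = bytes / 16;
    for (std::size_t i = 0; i < vectors; i += 4) {
        // Four aligned 128-bit loads (MOVDQA) ...
        const __m128i a = _mm_load_si128(s + i);
        const __m128i b = _mm_load_si128(s + i + 1);
        const __m128i c = _mm_load_si128(s + i + 2);
        const __m128i e = _mm_load_si128(s + i + 3);
        // ... followed by four non-temporal stores (MOVNTDQ).
        _mm_stream_si128(d + i, a);
        _mm_stream_si128(d + i + 1, b);
        _mm_stream_si128(d + i + 2, c);
        _mm_stream_si128(d + i + 3, e);
    }
    _mm_sfence();  // make the streaming stores globally visible in order
#else
    std::memcpy(dst, src, bytes);  // fallback on non-SSE2 targets
#endif
}
```

Streaming stores bypass the cache, so this shape is only likely to pay off for copies much larger than the cache, such as the 512 MByte buffers used here.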

It seems a bit strange that the 32bit "dst.size()"-invoked memcpy
doesn't eventually call the faster "movnt" version (if you step
into memcpy there is the most incredible amount of CPUID checking
and heuristic logic, e.g. comparing the number of bytes to be copied with
the cache size, before it goes anywhere near your actual data), but
at least I understand the observed behaviour now (and it's
down to simple code differences, not SysWow64 or H/W related
as previously suspected). Arguably a bug in the Intel CRT, or maybe
there are good reasons for it being the way it is?

Tim,

Was the CPUID checking performed in intel_fast_memcpy on every call (rather than only once)?

Jim Dempsey

www.quickthreadprogramming.com

Quoting - jimdempseyatthecove
Was the CPUID checking performed in intel_fast_memcpy on every call (rather than only once)?

No, it's just the first memcpy call that contains a lot of that sort of stuff. I suspect if I'd called some other CRT function first I'd have seen it there instead.

Subsequent calls to memcpy get to the "meat" much quicker; about 20-30 instructions to hit the main loop, involving tests of stored values __intel_cpu_indicator, __intel_memcpy_mem_ops_method and __intel_memcpy_largest_cache_size (which were presumably all set up by the first call).
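
The detect-once-then-dispatch pattern described above can be sketched roughly as follows. Every name in the `sketch` namespace is a hypothetical stand-in (these are not the Intel CRT's symbols), the detected values are made up, both copy paths just call std::memcpy, and the lazy initialisation is not thread-safe.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

namespace sketch {

// Hypothetical cached state, filled in by the first call only.
int         cpu_indicator      = 0;  // 0 means "not detected yet"
std::size_t largest_cache_size = 0;  // bytes; set during detection
int         detect_calls       = 0;  // instrumentation for this sketch

// Expensive one-time setup. A real library would query CPUID here;
// the values below are assumptions for illustration.
void detect_cpu()
{
    ++detect_calls;
    cpu_indicator      = 1;                     // pretend: "SSE2 capable"
    largest_cache_size = std::size_t(6) << 20;  // pretend: 6 MByte cache
}

// Stand-ins for the two code paths (cached stores vs streaming stores).
void* copy_cached(void* d, const void* s, std::size_t n)
{
    return std::memcpy(d, s, n);
}
void* copy_streaming(void* d, const void* s, std::size_t n)
{
    return std::memcpy(d, s, n);
}

// Later calls only test the stored values before reaching the copy loop.
void* dispatching_memcpy(void* dst, const void* src, std::size_t n)
{
    if (cpu_indicator == 0) {
        detect_cpu();  // runs on the first call only
    }
    assert(largest_cache_size > 0);
    return (n > largest_cache_size) ? copy_streaming(dst, src, n)
                                    : copy_cached(dst, src, n);
}

}  // namespace sketch
```

The size test against the cache size is one plausible heuristic for picking the streaming path; the actual rules inside the CRT are not shown in this thread.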

What you have observed may have to do with data alignment and with the ability of the compiler to determine the copy size at compile time. I can see that your 64-bit code uses MOVDQU (unaligned move) while 32-bit uses MOVDQA.

If you are making such a large allocation (512MB) it would be wise to use the OS memory allocation API (VirtualAlloc() in particular), which returns a page-aligned pointer. If you have enough memory and your application is "alone" in the system you may as well use VirtualLock() to prevent paging of the source and destination buffers, but for that you will have to increase the process working set size. Compiling with /Qopt-prefetch or priming the TLB in advance may help performance as well.
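
As a rough sketch of the page-aligned allocation suggested above: on Windows this calls VirtualAlloc(), and elsewhere it falls back to posix_memalign() with the system page size so the same code can be tried on other platforms. The helper names `alloc_page_aligned` and `free_page_aligned` are hypothetical, and VirtualLock() and working set tuning are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#ifdef _WIN32
#include <windows.h>  // VirtualAlloc, VirtualFree
#else
#include <stdlib.h>   // posix_memalign, free
#include <unistd.h>   // sysconf
#endif

// Hypothetical helper: returns a page-aligned block of at least `bytes`
// bytes, or nullptr on failure.
void* alloc_page_aligned(std::size_t bytes)
{
    assert(bytes > 0);
#ifdef _WIN32
    // Reserve and commit in one step; the result is page-aligned.
    return VirtualAlloc(nullptr, bytes, MEM_RESERVE | MEM_COMMIT,
                        PAGE_READWRITE);
#else
    // Assumption: on non-Windows systems, align to the reported page size.
    const long page = sysconf(_SC_PAGESIZE);
    void* p = nullptr;
    if (page <= 0 ||
        posix_memalign(&p, static_cast<std::size_t>(page), bytes) != 0) {
        return nullptr;
    }
    return p;
#endif
}

// Releases a block obtained from alloc_page_aligned().
void free_page_aligned(void* p)
{
    if (p == nullptr) {
        return;
    }
#ifdef _WIN32
    VirtualFree(p, 0, MEM_RELEASE);
#else
    free(p);
#endif
}
```

Buffers obtained this way satisfy the 16-byte alignment that MOVDQA and MOVNTDQ require, so an aligned copy loop can be used on them without a fix-up prologue.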

Finally, don't forget to issue the SFENCE instruction after the copy if you are using non-temporal stores; at least, that is what the software developer's manual suggests.

Regards,
Igor Levicki

It's probably caused by cache alignment and by how many reads/writes can happen within a cache boundary.
32 bit processes do not have to jump through hoops; the addresses are padded to 64 bit addresses.