We have Core2 machines (Dell T5400) with XP64.
In 32-bit processes, memcpy throughput is on the order of 1.2 GByte/s; in a 64-bit process, however, memcpy achieves about 2.2 GByte/s (or 2.4 GByte/s with the Intel compiler CRT's memcpy).
The initial reaction might be to explain this away as due to the wider registers available in 64-bit code. However, our own memcpy-like SSE assembly code, which uses 128-bit wide loads and stores regardless of whether the process is 32-bit or 64-bit, demonstrates identical upper limits on the copy bandwidth it achieves.
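For anyone who wants to reproduce the numbers, here is a minimal sketch of the kind of measurement involved. It is not our actual test harness: the names `copy_fn` and `copy_gbytes_per_sec`, the 64 MiB buffer size, and the use of the standard `clock()` timer are all assumptions made for illustration. On Windows a higher-resolution timer such as `QueryPerformanceCounter` would probably be a better choice. The buffer should be much larger than the last-level cache so that the copy is limited by RAM rather than by cache.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical signature: anything memcpy-compatible, including a custom
   SSE routine, can be passed in and timed the same way. */
typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Written after every copy so the compiler cannot discard the copies. */
static volatile unsigned char g_sink;

/* Repeats copy(dst, src, bytes) until at least min_seconds have elapsed and
   returns the throughput in GByte/s (1e9 bytes), or -1.0 on any failure.
   clock() is called once per copy, which is negligible for large buffers. */
double copy_gbytes_per_sec(copy_fn copy, size_t bytes, double min_seconds)
{
    unsigned char *src, *dst;
    size_t copies = 0;
    clock_t t0, t1;
    double elapsed;
    int ok;

    if (bytes == 0) return -1.0;
    src = malloc(bytes);
    dst = malloc(bytes);
    if (!src || !dst) { free(src); free(dst); return -1.0; }

    /* Touch both buffers up front so page faults are not timed. */
    memset(src, 0xA5, bytes);
    memset(dst, 0, bytes);

    t0 = clock();
    do {
        src[0] = (unsigned char)copies;  /* vary the input slightly */
        copy(dst, src, bytes);
        g_sink = dst[bytes - 1];         /* consume the result */
        ++copies;
        t1 = clock();
    } while ((double)(t1 - t0) / CLOCKS_PER_SEC < min_seconds);

    elapsed = (double)(t1 - t0) / CLOCKS_PER_SEC;
    ok = memcmp(dst, src, bytes) == 0;   /* sanity check on the last copy */
    free(src);
    free(dst);
    if (!ok || elapsed <= 0.0) return -1.0;
    return (double)bytes * (double)copies / elapsed / 1e9;
}

#ifdef BENCH_STANDALONE
int main(void)
{
    /* 64 MiB buffer, about one second of copying (both values are guesses). */
    double r = copy_gbytes_per_sec(memcpy, (size_t)64 << 20, 1.0);
    printf("memcpy: %.2f GByte/s\n", r);
    return 0;
}
#endif
```

Compiling this with `BENCH_STANDALONE` defined, once as a 32-bit and once as a 64-bit executable, and running both on the same machine should show whether the gap appears outside our own code.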
I'm puzzled as to the origin of this difference. Do 32-bit processes have to jump through some extra WOW64 hoops to get at the RAM? Is it something to do with TLBs or prefetchers, or something else entirely?
Thanks for any insight.