I am trying to improve memory-copy performance using SSE on a Xeon 5310 (1.6 GHz) with DDR2-667 RAM.
Here is my code for testing RAM write bandwidth.
The problem is that when ecx runs from 0 to 31 (0 to 4 kB), the total cost is 1xxx clocks, but when ecx runs from 32 to
64 (4 kB to 8 kB), the cost rises to 6xxx clocks. It seems that every additional 4 kB block adds a jump of about 5xxx clocks.
I tried prefetching 4 kB ahead before the loop,
but each prefetch costs 5xxx clocks, so it doesn't help. I also tried using movntdq, but that made it worse.
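
For anyone who wants to reproduce the movntdq variant, here is a rough equivalent in C with SSE2 intrinsics rather than my assembly. It is only a sketch: the function name `fill_nontemporal` is made up for illustration, it assumes the destination is 16-byte aligned and the size is a multiple of 16, and it falls back to `memset` when SSE2 is not available.

```c
#include <stddef.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Fill `bytes` bytes at `dst` with `value` using non-temporal 16-byte stores.
   Assumption: dst is 16-byte aligned and bytes is a multiple of 16. */
void fill_nontemporal(void *dst, size_t bytes, unsigned char value)
{
#if defined(__SSE2__)
    __m128i v = _mm_set1_epi8((char)value);   /* 16 copies of the byte */
    __m128i *p = (__m128i *)dst;
    for (size_t i = 0; i < bytes / 16; i++)
        _mm_stream_si128(p + i, v);           /* movntdq: store that bypasses the cache */
    _mm_sfence();                             /* order the streamed stores before later stores */
#else
    memset(dst, value, bytes);                /* portable fallback, no streaming */
#endif
}
```

Calling it on an 8 kB aligned buffer and timing 4 kB at a time is the kind of measurement I mean.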
According to these results, the write bandwidth can't exceed about 1 GB/s. The RAM I have installed is DDR2-667, which I
think has a theoretical bandwidth of about 5 GB/s. Is this an OS issue or a CPU cache issue? BTW, the OS is Linux Kernel 2.6.9-
Any ideas will be appreciated.
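
The arithmetic behind those two figures can be sketched as follows. The assumptions are mine: a single 64-bit (8-byte) memory channel, a 1.6 GHz core clock, and 5000 clocks per 4 kB as a stand-in for the "5xxx" I measured. The function names are just for illustration.

```c
#include <stdio.h>

/* Peak transfer rate in GB/s (1 GB = 1e9 bytes) of a memory bus.
   mega_transfers: millions of transfers per second (667 for DDR2-667).
   bus_bytes: bytes moved per transfer (assumed 8, i.e. one 64-bit channel). */
double peak_gb_per_s(double mega_transfers, double bus_bytes)
{
    return mega_transfers * 1e6 * bus_bytes / 1e9;
}

/* Achieved rate in GB/s when `bytes` are written in `clocks` CPU cycles
   on a core running at `cpu_hz`. */
double measured_gb_per_s(double bytes, double clocks, double cpu_hz)
{
    double seconds = clocks / cpu_hz;
    return bytes / seconds / 1e9;
}

/* Print both estimates using the assumed figures described above. */
void print_bandwidth_estimate(void)
{
    printf("theoretical: %.2f GB/s\n", peak_gb_per_s(667.0, 8.0));
    printf("measured:    %.2f GB/s\n", measured_gb_per_s(4096.0, 5000.0, 1.6e9));
}
```

With those assumptions the peak comes out a little above 5 GB/s and the measured rate a little above 1 GB/s, which is the gap I am asking about.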