I originally asked this question on a separate Intel community forum (https://communities.intel.com/thread/50808), but it was suggested that I repost it here. A Stack Overflow question from another user, linked in that posting (http://stackoverflow.com/questions/22793669/poor-memcpy-performance-on-l...), provides more detail on a specific test platform.
To summarize the core question/observation: when benchmarking the memory performance of (pinned) single-threaded operations on large buffers (larger than the last-level cache), we observe substantially lower copy bandwidth on dual-socket E5-26XX and E5-26XX v2 Xeon systems than on the other systems tested, including older Westmere systems, i7 CPUs, etc. This result can be reproduced with CacheBench (http://icl.cs.utk.edu/projects/llcbench/cachebench.html), as shown in the Stack Overflow posting. I realize that the aggregate bandwidth can be increased substantially by running multiple threads pinned to cores on each socket, but I am currently interested primarily in the performance of a single thread. All of the test systems run either CentOS 6.5 or Fedora 19, and all of the dual-socket systems have Supermicro boards.
For some concrete numbers, running 'taskset -c 2 ./cachebench -p -x1 -m26 -d2 -e1' on several systems produces the following copy bandwidths for 64MiB buffers:
- Dual-socket Xeon E5-2650 v2: 6249 MB/s
- Dual-socket Xeon E5-2670: 5896 MB/s
- Dual-socket Xeon X5660: 9283 MB/s
- Core i7-3960X: 11525 MB/s
I can run the tests for longer (-d2 means 2 seconds per buffer size), but the trend is clear. Does anyone know why the results for the E5 Xeons lag so far behind the other systems?
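In case it helps anyone reproduce this without building CacheBench, here is a minimal standalone sketch of the kind of single-threaded copy measurement I mean. This is my own simplified loop, not CacheBench's exact methodology; the 64 MiB buffer size and iteration count are arbitrary choices on my part:

/* Minimal single-threaded copy-bandwidth sketch.
 * Build: gcc -O2 -o copybench copybench.c -lrt   (-lrt needed on CentOS 6)
 * Run pinned to one core, e.g.: taskset -c 2 ./copybench
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUF_SIZE (64UL * 1024 * 1024)  /* 64 MiB, well beyond the LLC */
#define ITERS    64

int main(void)
{
    char *src = malloc(BUF_SIZE);
    char *dst = malloc(BUF_SIZE);
    struct timespec t0, t1;
    double secs, mbytes;
    int i;

    if (!src || !dst)
        return 1;

    /* Touch both buffers so the pages are faulted in before timing. */
    memset(src, 1, BUF_SIZE);
    memset(dst, 0, BUF_SIZE);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < ITERS; i++)
        memcpy(dst, src, BUF_SIZE);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    secs   = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    mbytes = (double)BUF_SIZE * ITERS / 1e6;
    printf("copy bandwidth: %.0f MB/s\n", mbytes / secs);
    return 0;
}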
Thanks and regards,