Sandy Bridge performance degradation compared to Westmere

I created a simple memtest that allocates a large vector, generates random indices, and updates the vector data at those indices.
Pseudo code:

DataCell* dataCells = new DataCell[VECTOR_SIZE];
for (int cycles = 0; cycles < gCycles; cycles++) {
    u64 randVal = random();
    // pick a pseudo-random cell and update its fields
    DataCell* dataCell = &dataCells[randVal % VECTOR_SIZE];
    dataCell->m_count  = cycles;
    dataCell->m_random = randVal;
    dataCell->m_flag   = 1;
}
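
For reference, the test output below reports "vector size 240000456" for 10000019 elements, which works out to 24 bytes per element, so DataCell presumably looks something like the sketch below (field names are taken from the pseudo code above; the exact field types are my guess):

// Assumed layout, 24 bytes per element (240000456 / 10000019 = 24).
// Field names come from the pseudo code; the u64 types are an assumption.
typedef unsigned long long u64;
struct DataCell
{
    u64 m_count;   // cycle counter written on each visit
    u64 m_random;  // random value that selected this cell
    u64 m_flag;    // set to 1 once the cell has been written
};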

I'm using the perf utility to gather performance counter info.
The most interesting results are when the vector size is larger than the last-level cache size (tix8: 20MB, tix2: 12MB).

Hardware specification:

tix2 - CPU X5680 3.33GHz, motherboard Supermicro X8DTU, memory 64GB divided 32GB per bank at 1.33GHz

tix8 - CPU E5-2690 2.90GHz, motherboard Intel S2600GZ, memory 64GB divided 32GB per bank at 1.60GHz

compiled with gcc 4.6.1 -O3 -mtune=native -march=native

amk@tix2:~/amir/memtest$ perf stat -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e L1-dcache-stores:u -e L1-dcache-store-misses:u ./memtest -v 10000019 -c 100000000
total Time (rdtsc) 21800971556 nano time 6542908630 vector size 240000456

Performance counter stats for './memtest -v 10000019 -c 100000000':

21842742688 cycles # 0.000 M/sec
5869556879 instructions # 0.269 IPC
1700665337 L1-dcache-loads # 0.000 M/sec
221870903 L1-dcache-load-misses # 0.000 M/sec
1130278738 L1-dcache-stores # 0.000 M/sec
0 L1-dcache-store-misses # 0.000 M/sec

6.628680493 seconds time elapsed

amk@tix8:~/amir/memtest$ perf stat -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e L1-dcache-stores:u -e L1-dcache-store-misses:u ./memtest -v 10000019 -c 100000000
total Time (rdtsc) 24362574412 nano time 8424126698 vector size 240000456

Performance counter stats for './memtest -v 10000019 -c 100000000':

24409499958 cycles # 0.000 M/sec
5869656821 instructions # 0.240 IPC
1192635035 L1-dcache-loads # 0.000 M/sec
94702716 L1-dcache-load-misses # 0.000 M/sec
1373779283 L1-dcache-stores # 0.000 M/sec
306775598 L1-dcache-store-misses # 0.000 M/sec

8.525456817 seconds time elapsed

What am I missing? Is Sandy Bridge slower than Westmere?

Amir.

I am having similar problems with our own proprietary application. I haven't been able to isolate the problematic code to a nice tidy snippet like you have.

Just looking at your program run times, 8.5 seconds versus 6.6 seconds, that's about a 23% difference.

For the purpose of testing and eliminating variables, I suggest trying the following:
- configure the BIOS in both systems to disable any and all power-saving features (C-states, C1E, memory power saving, etc)
- enable turbo boost on both
- use the idle=poll kernel commandline param on your Sandy Bridge server (needed, as the BIOS settings alone won't keep the CPU from leaving C0 state)

In this setup, you can use the "i7z" program to see what speed all your cores are running at. At least on my systems, taking all the above steps results in all cores constantly running above their "advertised" clock speed, i.e. turbo boost is kicking in.

Yes, this will make the servers run hot and use lots of power. :)

These are tunings for a low-latency environment, but I think they might be appropriate for testing/experimenting in your case. At least, if you do these things, and see the difference between Westmere and Sandy Bridge narrow, then you can attribute it to one of these tweaks. At least in my low-latency world, the aggressive power-saving features are bad for performance. Just a random guess here, but: perhaps your application is such that, during execution, it allows the CPU to drop into some kind of a sleep state many times. There is a latency penalty for coming out of a sleep state. If you drop in and out of sleep states many times during execution, you might see a cumulative effect in increased overall runtime.

All power saving is disabled, hyper-threading is disabled, and i7z reports a CPU frequency of 3290.1 MHz, but the performance is even worse.

amk@tix8:~/amir/memtest$ perf stat -c -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e L1-dcache-stores:u -e L1-dcache-store-misses:u ./memtest -v 10000019 -c 100000000
total Time (rdtsc) 25724416756 nano time 21437013963 vector size 240000456

Performance counter stats for './memtest -v 10000019 -c 100000000':

29300656750 cycles # 0.000 M/sec
5869414958 instructions # 0.200 IPC
1190853811 L1-dcache-loads # 0.000 M/sec
94650151 L1-dcache-load-misses # 0.000 M/sec
1379446403 L1-dcache-stores # 0.000 M/sec
306750238 L1-dcache-store-misses # 0.000 M/sec

8.990783606 seconds time elapsed

The results below are without turbo boost!

amk@tix8:~/amir/memtest$ perf stat -c -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e L1-dcache-stores:u -e L1-dcache-store-misses:u ./memtest -v 10000019 -c 100000000
total Time (rdtsc) 24314509968 nano time 8404600749 vector size 240000456

Performance counter stats for './memtest -v 10000019 -c 100000000':

24360101110 cycles # 0.000 M/sec
5869421474 instructions # 0.241 IPC
1191790678 L1-dcache-loads # 0.000 M/sec
94483286 L1-dcache-load-misses # 0.000 M/sec
1374772009 L1-dcache-stores # 0.000 M/sec
306899965 L1-dcache-store-misses # 0.000 M/sec

8.506839690 seconds time elapsed

The web sites appear to confirm that those motherboards are the usual full featured ones, with 8 channels on Sandy Bridge and 6 on Westmere.
My E5-2670 has 1 stick in each channel. I do see lower performance than the 5680 on operations where performance is proportional to clock speed and doesn't need the superior memory system.
I suppose gcc 4.6 doesn't use nontemporal stores directly, and I guess you have excluded use of simd instructions.

I don't fully understand your answer.

First of all, we are using an E5-2690. Is the answer "E5-2670 has 1 stick in each channel" relevant to this CPU?

What I understand from your answer ("I do see lower performance than the 5680 on operations where performance...") is that Sandy Bridge (E5-2690) is slower than Westmere (X5680) on the pseudo code I wrote previously (I can supply the code for this test), and that there is nothing I can do to solve this issue (change compiler, change compile flags, change BIOS settings, ...).

Hello amk21,
For the pseudo-code where you pick an index into the array... I assume that random() returns something in the range of VECTOR_SIZE.

The test that you've generated is sort of a memory latency test.
I say 'sort of' because the usual latency test uses a linked list of dependent addresses (so that only one load is outstanding at a time).
Doing a random list can generate more than one load outstanding at a time.
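
(For illustration, here is a minimal C++ sketch of that dependent-load, pointer-chasing style of latency test - my own example, not lat_mem_rd and not the memtest above. Each load's address comes out of the previous load, so only one miss can be outstanding at a time.)

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

int main()
{
    const size_t N = 1ull << 24;             // 16M entries * 8 bytes = 128 MB working set
    std::vector<size_t> next(N);

    // Build a random cyclic permutation so the chain visits every slot exactly once.
    std::vector<size_t> order(N);
    for (size_t i = 0; i < N; ++i) order[i] = i;
    std::shuffle(order.begin() + 1, order.end(), std::mt19937_64(1));
    for (size_t i = 0; i + 1 < N; ++i) next[order[i]] = order[i + 1];
    next[order[N - 1]] = order[0];

    // Chase the chain: every iteration is serialized on the previous load's result.
    const size_t loads = 100000000ull;
    size_t idx = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (size_t i = 0; i < loads; ++i) idx = next[idx];
    auto t1 = std::chrono::steady_clock::now();

    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
    printf("avg dependent-load latency: %.1f ns (idx=%zu)\n", ns / loads, idx);
    return 0;
}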

Do you know if the prefetchers are disabled in the BIOS?
If one system has the prefetchers enabled and another system has them disabled, things can get confusing.

Do you have 2 processors on the system or just 1 chip?
If you have more than 1 chip, do you know if NUMA is enabled on both systems?

For latency tests, it is better to have the prefetchers disabled (just to make things simpler).

If both systems are configured optimally, I would expect the sandybridge-based system (tix8) to have lower latency than the westmere-based system (tix2). Optimally means 1 DIMM per slot and numa enabled (if there is more than 1 processor).

Are you running on Windows?
If so, the cpu-z folks have a memory latency tool that you could run to see if their tool gets similar results to what you are seeing.
Try running the latency.exe in http://www.cpuid.com/medias/files/softwares/misc/latency.zip
If you could, send the output.

On linux you can use lmbench to get latency... see http://sourceforge.net/projects/lmbench/
But I'm not too familiar with lmbench, so I can't help too much with instructions for running it.

Running these industry standard benchmarks will give us more information on the relative performance of the systems.
Pat

I have 2 processors on the system and NUMA is enabled.

I'll verify the tix2 BIOS settings and run lmbench, but I need the simple memtest because it simulates my application: I have a very large map that is actually a 2-dimensional vector, and I found out that finding the right line in the map is the most costly operation.

"For latency tests, it is better to have the prefetchers disabled" - what BIOS settings did you have in mind?
disabled data prefetcher
Numa optimized - Enabled
MLC streamer - Enabled
MLC spatial prefetcher - Enabled
DCU Data prefetcher - Disabled
DCU instruction prefetcher - Enabled

amk@tix8:~/amir/memtest$ perf stat -c -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e L1-dcache-stores:u -e L1-dcache-store-misses:u ./memtest -v 10000019 -c 100000000
total Time (rdtsc) 24457792172 nano time 8457051235 vector size 240000456

Performance counter stats for './memtest -v 10000019 -c 100000000':

24504834353 cycles # 0.000 M/sec
5869424898 instructions # 0.240 IPC
1193553992 L1-dcache-loads # 0.000 M/sec
94548506 L1-dcache-load-misses # 0.000 M/sec
1370182667 L1-dcache-stores # 0.000 M/sec
306627891 L1-dcache-store-misses # 0.000 M/sec

8.559050619 seconds time elapsed

disabled data prefetcher and numa optimized
Numa optimized - Disabled
MLC streamer - Enabled
MLC spatial prefetcher - Enabled
DCU Data prefetcher - Disabled
DCU instruction prefetcher - Enabled

amk@tix8:~/amir/memtest$ perf stat -c -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e L1-dcache-stores:u -e L1-dcache-store-misses:u ./memtest -v 10000019 -c 100000000
total Time (rdtsc) 33150418300 nano time 11462800242 vector size 240000456

Performance counter stats for './memtest -v 10000019 -c 100000000':

33191154216 cycles # 0.000 M/sec
5869420947 instructions # 0.177 IPC
1190593871 L1-dcache-loads # 0.000 M/sec
94498148 L1-dcache-load-misses # 0.000 M/sec
1382188152 L1-dcache-stores # 0.000 M/sec
306662218 L1-dcache-store-misses # 0.000 M/sec

11.568955857 seconds time elapsed

disabled numa optimized
Numa optimized - Disabled
MLC streamer - Enabled
MLC spatial prefetcher - Enabled
DCU Data prefetcher - Enabled
DCU instruction prefetcher - Enabled

amk@tix8:~/amir/memtest$ perf stat -c -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e L1-dcache-stores:u -e L1-dcache-store-misses:u ./memtest -v 10000019 -c 100000000
total Time (rdtsc) 33150283136 nano time 11462753504 vector size 240000456

Performance counter stats for './memtest -v 10000019 -c 100000000':

33190933585 cycles # 0.000 M/sec
5869420768 instructions # 0.177 IPC
1190685322 L1-dcache-loads # 0.000 M/sec
94769556 L1-dcache-load-misses # 0.000 M/sec
1382058359 L1-dcache-stores # 0.000 M/sec
306649458 L1-dcache-store-misses # 0.000 M/sec

11.569743183 seconds time elapsed

Hi,

regarding the latency measurement using lmbench.

Build the utility called "lat_mem_rd" in the package. Then:

numactl --cpunodebind=0 --membind=1 ./lat_mem_rd -t 1024
to measure the memory access latency between NUMA node 0 and 1. The latency test increases the
working set and converges towards the end to the memory latency.

numactl --cpunodebind=0 --membind=0 ./lat_mem_rd -t 1024
to measure the local memory latency on NUMA node 0.

--
Roman

results for: numactl --cpunodebind=0 --membind=1 ./lat_mem_rd -t 1024

tix8 bios setting
Numa optimized - Enabled
MLC streamer - Enabled
MLC spatial prefetcher - Enabled
DCU Data prefetcher - Enabled
DCU instruction prefetcher - Enabled

amk@tix8:~/amir/lmbench-3.0-a9/bin/x86_64-linux-gnu$ numactl --cpunodebind=0 --membind=1 ./lat_mem_rd -t 1024
"stride=64
0.00049 1.383
0.00098 1.383
0.00195 1.383
0.00293 1.383
0.00391 1.383
0.00586 1.383
0.00781 1.383
0.01172 1.383
0.01562 1.383
0.02344 1.383
0.03125 1.383
0.04688 4.149
0.06250 4.149
0.09375 4.149
0.12500 4.982
0.18750 5.461
0.25000 5.746
0.37500 15.573
0.50000 15.997
0.75000 16.331
1.00000 16.418
1.50000 18.140
2.00000 20.505
3.00000 23.942
4.00000 25.148
6.00000 26.235
8.00000 26.562
12.00000 28.049
16.00000 30.442
24.00000 101.998
32.00000 129.382
48.00000 139.500
64.00000 139.948
96.00000 141.216
128.00000 141.265
192.00000 140.899
256.00000 140.582
384.00000 140.045
512.00000 139.745
768.00000 139.379
1024.00000 139.220

amk@tix2:~/amir/lmbench-3.0-a9/bin/x86_64-linux-gnu$ numactl --cpunodebind=0 --membind=1 ./lat_mem_rd -t 1024
"stride=64
0.00049 1.200
0.00098 1.200
0.00195 1.200
0.00293 1.200
0.00391 1.200
0.00586 1.200
0.00781 1.200
0.01172 1.200
0.01562 1.200
0.02344 1.200
0.03125 1.200
0.04688 3.000
0.06250 3.000
0.09375 3.000
0.12500 3.005
0.18750 3.290
0.25000 4.042
0.37500 16.192
0.50000 16.536
0.75000 16.592
1.00000 16.844
1.50000 18.993
2.00000 20.285
3.00000 23.431
4.00000 24.892
6.00000 25.694
8.00000 26.324
12.00000 53.074
16.00000 108.794
24.00000 121.599
32.00000 124.198
48.00000 124.514
64.00000 125.408
96.00000 125.025
128.00000 124.773
192.00000 124.447
256.00000 124.205
384.00000 123.776
512.00000 123.546
768.00000 123.323
1024.00000 123.189

results for: numactl --cpunodebind=0 --membind=0 ./lat_mem_rd -t 1024

tix8 bios setting
Numa optimized - Enabled
MLC streamer - Enabled
MLC spatial prefetcher - Enabled
DCU Data prefetcher - Enabled
DCU instruction prefetcher - Enabled

amk@tix8:~/amir/lmbench-3.0-a9/bin/x86_64-linux-gnu$ numactl --cpunodebind=0 --membind=0 ./lat_mem_rd -t 1024
"stride=64
0.00049 1.383
0.00098 1.383
0.00195 1.383
0.00293 1.383
0.00391 1.383
0.00586 1.383
0.00781 1.383
0.01172 1.383
0.01562 1.383
0.02344 1.383
0.03125 1.383
0.04688 4.149
0.06250 4.149
0.09375 4.147
0.12500 4.149
0.18750 4.905
0.25000 5.253
0.37500 15.746
0.50000 15.817
0.75000 16.226
1.00000 16.954
1.50000 18.774
2.00000 20.563
3.00000 23.922
4.00000 25.201
6.00000 26.089
8.00000 26.732
12.00000 28.367
16.00000 30.853
24.00000 75.662
32.00000 89.364
48.00000 94.962
64.00000 96.098
96.00000 96.829
128.00000 96.941
192.00000 96.801
256.00000 96.716
384.00000 96.468
512.00000 96.297
768.00000 96.103
1024.00000 95.989

amk@tix2:~/amir/lmbench-3.0-a9/bin/x86_64-linux-gnu$ numactl --cpunodebind=0 --membind=0 ./lat_mem_rd -t 1024
"stride=64
0.00049 1.200
0.00098 1.200
0.00195 1.200
0.00293 1.200
0.00391 1.200
0.00586 1.200
0.00781 1.200
0.01172 1.200
0.01562 1.200
0.02344 1.200
0.03125 1.200
0.04688 3.000
0.06250 3.000
0.09375 3.000
0.12500 3.001
0.18750 3.726
0.25000 4.169
0.37500 16.219
0.50000 16.600
0.75000 16.765
1.00000 16.668
1.50000 18.637
2.00000 20.386
3.00000 23.847
4.00000 25.480
6.00000 29.075
8.00000 31.644
12.00000 53.601
16.00000 74.186
24.00000 83.683
32.00000 85.331
48.00000 86.139
64.00000 86.394
96.00000 87.177
128.00000 87.167
192.00000 87.413
256.00000 87.230
384.00000 87.255
512.00000 86.998
768.00000 87.018
1024.00000 86.786

Adding a graph of the lat_mem_rd results.

Attachment: tix2-tix8-lat-mem-rd.png (51.41 KB)

Hello amk21,
A coworker made a suggestion...
Sandybridge-EP power management is probably putting the 2nd processor into a low power state.
In this low power state the snoops will take longer since the 2nd processor is running at (probably) a low frequency.
Can you try pinning and running this 'spin loop' program on the 2nd processor when you run the latency program on the 1st processor?
The spin.c program... you'll have to kill it with control-c.

#include <stdio.h>

int main(int argc, char **argv)
{
    int i = 0;
    printf("begin spin loop\n");
    while (1) { i++; }   /* spins forever; kill it with ctrl-c */
    printf("i= %d\n", i);
    return 0;
}

In order for us to compare your latency numbers with our numbers, you'll need to disable all the prefetchers and enable numa.
But I'd still like to see the impact of the spinner on your 'prefetchers on, numa on' latency.
Pat

Hello Pat,

"disable all the prefetchers " - what setting in the bios are you referring to ?

these are the setting i found in the bios and there state

Numa optimized - Enabled
MLC streamer - Enabled
MLC spatial prefetcher - Enabled
DCU Data prefetcher - Enabled
DCU instruction prefetcher - Enabled

Regarding the power saving features (C1, C3 and C6): all of them are disabled, including turbo boost (see the results with turbo boost above).

Can you share the reference latency numbers?

We tried replacing the memory with other modules.
Currently we are using 8GB x 8 (part number ACT8GHR72Q4H1600S, CL-11).
We tried replacing it with the memory tix2 is using, 4GB x 8 (part number 25L3205, CL-9).

What is the lowest-latency memory type and memory setup we can use, assuming we need at least 48GB of memory?

amir

lat_mem_rd results with spin loop running on second cpu

bios settings

Numa optimized - Enabled
MLC streamer - Enabled
MLC spatial prefetcher - Enabled
DCU Data prefetcher - Enabled
DCU instruction prefetcher - Enabled

turbo boost - disabled

numactl --cpunodebind=0 --membind=0 ./lat_mem_rd -t 1024

"stride=64
0.00049 1.383
0.00098 1.383
0.00195 1.383
0.00293 1.383
0.00391 1.383
0.00586 1.383
0.00781 1.383
0.01172 1.383
0.01562 1.383
0.02344 1.383
0.03125 1.383
0.04688 4.149
0.06250 4.149
0.09375 4.149
0.12500 4.149
0.18750 4.703
0.25000 5.405
0.37500 15.608
0.50000 15.790
0.75000 16.268
1.00000 17.013
1.50000 18.484
2.00000 20.319
3.00000 23.514
4.00000 25.144
6.00000 26.056
8.00000 26.881
12.00000 28.537
16.00000 32.171
24.00000 75.093
32.00000 89.267
48.00000 94.837
64.00000 95.840
96.00000 96.386
128.00000 95.148
192.00000 95.833
256.00000 95.996
384.00000 95.885
512.00000 95.866
768.00000 95.681
1024.00000 95.692

numactl --cpunodebind=0 --membind=1 ./lat_mem_rd -t 1024

"stride=64
0.00049 1.383
0.00098 1.383
0.00195 1.383
0.00293 1.383
0.00391 1.383
0.00586 1.383
0.00781 1.383
0.01172 1.383
0.01562 1.383
0.02344 1.383
0.03125 1.383
0.04688 4.149
0.06250 4.149
0.09375 4.654
0.12500 4.528
0.18750 4.906
0.25000 5.292
0.37500 15.496
0.50000 15.884
0.75000 16.188
1.00000 16.672
1.50000 18.687
2.00000 20.402
3.00000 23.885
4.00000 25.180
6.00000 26.213
8.00000 26.657
12.00000 28.135
16.00000 30.406
24.00000 100.014
32.00000 129.697
48.00000 139.414
64.00000 140.246
96.00000 141.176
128.00000 141.207
192.00000 140.858
256.00000 140.624
384.00000 140.085
512.00000 139.800
768.00000 139.485
1024.00000 139.227

Thanks amk21,
When you say "with spin loop running on second cpu" do you mean that you are running the spin loop on one of the cpus on the 2nd processor?
That is, you are running the spin loop with something like "numactl --cpunodebind=1 --membind=1 ./spin" ?
I don't see any difference in the latency with the spin loop versus no spin loop so I'm wondering why.
Pat

the spin loop is running on the second package with "numactl --cpunodebind=1 --membind=1 ./spin"

Ok... I'm going to have to find a box and run on it myself.
I'll let you know what I find.
Pat

What is the best memory setup and type (CAS latency) I should use for low latency if I need at least 48GB?

I've run on a Sandybridge-EP box now.
I'm kind of confused by your results.
I used an array size of 1 GB, numa enabled, and a dependent load latency test, ran on cpu 3 of the 1st socket.
The cpu speed is 2.7 GHz, the memory is hynix 1600 MHz, HMT31GR7BFR4C-P.

Here is a table of my results.

prefetcher, spin, turbo, latency (nanosec)
off, off, off, 86.487
off, off, on, 79.664
off, on, off, 75.072
off, on, on, 66.470

on, off, off, 11.347
on, off, on, 9.531
on, on, off, 10.684
on, on, on, 8.771

where prefetcher off means
MLC streamer - disabled
MLC spatial prefetcher - disabled
DCU Data prefetcher - disabled
DCU instruction prefetcher - disabled
and prefetcher on means all of the above prefetchers enabled.

'spin on' means running the spin program on cpu 2 of the other socket. 'spin off' means not running the spin program.
'turbo on' means turbo enabled, off means turbo disabled.

So... after all the explanations...
I don't see how you are getting latencies of about 95 ns with the prefetchers enabled.
I get about 8.8-11.3 ns.
Your numbers look like prefetchers are disabled.
So I'm puzzled.
You are enabling/disabling prefetchers using the bios right?
Pat

Hi Pat,

I'll post results based on your instructions shortly.

But I have several questions regarding the latency test:

"I used an array size of 1 GB, numa enabled, and a dependent load latency test, ran on cpu 3 of the 1st socket.
The cpu speed is 2.7 GHz, the memory is hynix 1600 MHz, HMT31GR7BFR4C-P."

What CPU are you using? (CPU ID, how many cores)
What operating system are you using?
And most importantly, can you share the exact test command (are you using lat_mem_rd)?

Regards
Amir

The best results are with prefetcher on, spin on, turbo on - still slower than tix2 (Westmere).

The results are below:

spin loop is running with the following command:
numactl --cpunodebind=1 --membind=1 ./spin_loop

prefetcher on, spin - off, turbo - off
amk@tix8:~/amir/lmbench-3.0-a9/bin/x86_64-linux-gnu$ numactl --cpunodebind=0 --membind=0 ./lat_mem_rd -t 1024
"stride=64
0.00049 1.383
0.00098 1.383
0.00195 1.383
0.00293 1.383
0.00391 1.383
0.00586 1.383
0.00781 1.383
0.01172 1.383
0.01562 1.383
0.02344 1.383
0.03125 1.383
0.04688 4.149
0.06250 4.149
0.09375 4.754
0.12500 4.527
0.18750 4.654
0.25000 5.783
0.37500 15.656
0.50000 16.030
0.75000 16.210
1.00000 16.435
1.50000 18.233
2.00000 20.576
3.00000 23.694
4.00000 25.267
6.00000 26.345
8.00000 26.782
12.00000 28.119
16.00000 31.143
24.00000 74.321
32.00000 89.466
48.00000 94.858
64.00000 95.909
96.00000 96.681
128.00000 96.835
192.00000 96.770
256.00000 96.673
384.00000 96.380
512.00000 96.267
768.00000 96.091
1024.00000 95.831

prefetcher on, spin - on, turbo - off
amk@tix8:~/amir/lmbench-3.0-a9/bin/x86_64-linux-gnu$ numactl --cpunodebind=0 --membind=0 ./lat_mem_rd -t 1024
"stride=64
0.00049 1.383
0.00098 1.383
0.00195 1.383
0.00293 1.383
0.00391 1.383
0.00586 1.383
0.00781 1.383
0.01172 1.383
0.01562 1.383
0.02344 1.383
0.03125 1.383
0.04688 4.149
0.06250 4.149
0.09375 4.147
0.12500 4.149
0.18750 5.157
0.25000 8.717
0.37500 15.632
0.50000 16.089
0.75000 16.283
1.00000 16.971
1.50000 17.534
2.00000 20.677
3.00000 23.613
4.00000 25.086
6.00000 26.135
8.00000 26.648
12.00000 28.132
16.00000 34.167
24.00000 74.176
32.00000 89.376
48.00000 94.873
64.00000 95.977
96.00000 96.707
128.00000 96.886
192.00000 96.766
256.00000 96.649
384.00000 96.413
512.00000 96.282
768.00000 96.090
1024.00000 95.963

prefetcher on, spin - on, turbo - on

amk@tix8:~/amir/lmbench-3.0-a9/bin/x86_64-linux-gnu$ numactl --cpunodebind=0 --membind=0 ./lat_mem_rd -t 1024
"stride=64
0.00049 1.216
0.00098 1.216
0.00195 1.216
0.00293 1.216
0.00391 1.216
0.00586 1.216
0.00781 1.216
0.01172 1.216
0.01562 1.216
0.02344 1.216
0.03125 1.216
0.04688 3.647
0.06250 3.647
0.09375 3.646
0.12500 3.980
0.18750 4.359
0.25000 9.690
0.37500 13.562
0.50000 14.044
0.75000 14.167
1.00000 14.640
1.50000 16.854
2.00000 17.950
3.00000 20.417
4.00000 22.197
6.00000 23.251
8.00000 24.041
12.00000 25.474
16.00000 28.963
24.00000 69.301
32.00000 83.078
48.00000 88.359
64.00000 88.757
96.00000 90.156
128.00000 90.073
192.00000 90.063
256.00000 89.684
384.00000 89.604
512.00000 89.322
768.00000 89.100
1024.00000 88.997

prefetcher on, spin - off, turbo - on
amk@tix8:~/amir/lmbench-3.0-a9/bin/x86_64-linux-gnu$ numactl --cpunodebind=0 --membind=0 ./lat_mem_rd -t 1024
"stride=64
0.00049 1.216
0.00098 1.215
0.00195 1.215
0.00293 1.215
0.00391 1.216
0.00586 1.215
0.00781 1.215
0.01172 1.215
0.01562 1.215
0.02344 1.216
0.03125 1.216
0.04688 3.646
0.06250 3.647
0.09375 4.092
0.12500 3.981
0.18750 4.670
0.25000 10.002
0.37500 12.216
0.50000 14.128
0.75000 14.218
1.00000 14.486
1.50000 16.220
2.00000 17.906
3.00000 20.454
4.00000 22.157
6.00000 23.333
8.00000 24.089
12.00000 25.540
16.00000 29.205
24.00000 77.929
32.00000 93.621
48.00000 99.870
64.00000 100.751
96.00000 101.882
128.00000 102.161
192.00000 102.135
256.00000 102.118
384.00000 101.911
512.00000 101.754
768.00000 101.620
1024.00000 101.544

prefetcher off, spin - on, turbo - on
amk@tix8:~/amir/lmbench-3.0-a9/bin/x86_64-linux-gnu$ numactl --cpunodebind=0 --membind=0 ./lat_mem_rd -t 1024
"stride=64
0.00049 1.216
0.00098 1.216
0.00195 1.216
0.00293 1.216
0.00391 1.216
0.00586 1.216
0.00781 1.216
0.01172 1.216
0.01562 1.216
0.02344 1.216
0.03125 1.216
0.04688 3.647
0.06250 3.647
0.09375 3.647
0.12500 8.731
0.18750 8.086
0.25000 10.492
0.37500 12.626
0.50000 14.122
0.75000 14.221
1.00000 14.736
1.50000 16.544
2.00000 17.951
3.00000 20.269
4.00000 21.904
6.00000 23.859
8.00000 24.570
12.00000 25.762
16.00000 29.485
24.00000 69.711
32.00000 82.572
48.00000 88.484
64.00000 88.633
96.00000 90.292
128.00000 90.326
192.00000 90.139
256.00000 89.840
384.00000 89.481
512.00000 89.381
768.00000 89.130
1024.00000 89.065

prefetcher off, spin - off, turbo - on

amk@tix8:~/amir/lmbench-3.0-a9/bin/x86_64-linux-gnu$ numactl --cpunodebind=0 --membind=0 ./lat_mem_rd -t 1024
"stride=64
0.00049 1.216
0.00098 1.216
0.00195 1.216
0.00293 1.216
0.00391 1.216
0.00586 1.216
0.00781 1.216
0.01172 1.216
0.01562 1.216
0.02344 1.216
0.03125 1.216
0.04688 3.647
0.06250 3.647
0.09375 3.647
0.12500 4.112
0.18750 7.247
0.25000 8.032
0.37500 12.501
0.50000 14.032
0.75000 14.215
1.00000 14.862
1.50000 16.577
2.00000 17.771
3.00000 20.397
4.00000 21.801
6.00000 23.880
8.00000 24.416
12.00000 26.486
16.00000 34.013
24.00000 76.538
32.00000 93.146
48.00000 99.596
64.00000 100.588
96.00000 101.902
128.00000 102.158
192.00000 102.238
256.00000 102.145
384.00000 101.958
512.00000 101.832
768.00000 101.627
1024.00000 101.539

One more thing: I reran my own simple benchmark (random access to a large vector) and the results are still slow, above 8 seconds.

Westmere is 6.6 seconds!

The test and makefile are attached (just remove the .txt from the file names).

Attachments: makefile.txt (115 Bytes), main.cc.txt (3.38 KB)

Hello Amir,
You asked:
1) What CPU are you using? (CPU ID, how many cores)
I used a pre-production Sandybridge-EP chip, cpuid.1.eax= 0x206d5 (so ext_model= 0x2d, stepping 0x5). It has 8 cores/16 threads per socket.
2) What operating system are you using?
Microsoft Windows Server 2008 R2 Enterprise
3) And most importantly, can you share the exact test command (are you using lat_mem_rd)?
I'm using my own latency utility so the command line won't correspond to lat_mem_rd.
The latency results of my utility agree with the latency of the main public latency utilities (such as cpu-z latency utility).
I'll see if I can get someone to install linux on the box and run lat_mem_rd directly.

But I'm 95% sure that something is wrong.
Here is a short table of your results (using just the 1GB latency #)
row, prefetch, spin, turbo, latency(ns)
1, on, off, off, 95.831
2, on, on, off, 95.963
3, on, off, on, 101.544
4, off, on, on, 89.065
5, off, off, on, 101.539

The latency for the 'only difference is the state of the prefetcher' case (rows 3 and 5) shows 101.544 vs. 101.539.
So the prefetcher makes NO difference for a 64 byte stride?
This can't be right.
This IS the test I use to see whether prefetchers are enabled or disabled and, on this system, the prefetchers are disabled, always.
How are you enabling/disabling the prefetchers?
Using the BIOS settings right?
After you enable all 4 prefetchers in the BIOS and boot, and then reboot, does the BIOS still show the prefetchers enabled?

I'll ask someone to install linux on the box so I can run lat_mem_rd. This will take a while.
But I don't expect the results and conclusions to change much.
Pat

Hi Pat,

I'm updating the prefetchers using the BIOS and then rebooting the computer. The next time I enter the BIOS I can see that the settings are correct (as I set them before the reboot).

I ordered the exact memory you are using; hopefully I'll have it next week.

Any other ideas I should check?

Thanks
Amir

Maybe see if the system vendor has a more recent bios.
Is my cpuid model (0x2d) the same as yours?
Unfortunately, on pre-production chips, they don't put the name string (like E5-2670) in the cpuid info.
I'll check on getting linux on the box.
Pat

Regarding the BIOS, we already have the newest version.

I'm a bit confused about the cpuid - the model looks like 0xd, as you can see in the output of /proc/cpuinfo and cpuid below.

processor : 15
vendor_id : GenuineIntel
cpu family : 6
model : 45
model name : Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz
stepping : 7
cpu MHz : 1200.000
cache size : 20480 KB
physical id : 1
siblings : 8
core id : 7
cpu cores : 8
apicid : 46
initial apicid : 46
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 x2apic popcnt aes xsave avx lahf_lm ida arat tpr_shadow vnmi flexpriority ept vpid
bogomips : 5786.05
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual

-----------------------------------------------------------------------------------------------------

cpuid
eax in eax ebx ecx edx
00000000 0000000d 756e6547 6c65746e 49656e69
00000001 000206d7 00200800 1fbee3ff bfebfbff
00000002 76035a01 00f0b0ff 00000000 00ca0000
00000003 00000000 00000000 00000000 00000000
00000004 00000000 00000000 00000000 00000000
00000005 00000040 00000040 00000003 00021120
00000006 00000077 00000002 00000009 00000000
00000007 00000000 00000000 00000000 00000000
00000008 00000000 00000000 00000000 00000000
00000009 00000001 00000000 00000000 00000000
0000000a 07300803 00000000 00000000 00000603
0000000b 00000000 00000000 0000005f 00000000
0000000c 00000000 00000000 00000000 00000000
0000000d 00000000 00000000 00000000 00000000
80000000 80000008 00000000 00000000 00000000
80000001 00000000 00000000 00000001 2c100800
80000002 20202020 49202020 6c65746e 20295228
80000003 6e6f6558 20295228 20555043 322d3545
80000004 20303936 20402030 30392e32 007a4847
80000005 00000000 00000000 00000000 00000000
80000006 00000000 00000000 01006040 00000000
80000007 00000000 00000000 00000000 00000100
80000008 0000302e 00000000 00000000 00000000

Vendor ID: "GenuineIntel"; CPUID level 13

Intel-specific functions:
Version 000206d7:
Type 0 - Original OEM
Family 6 - Pentium Pro
Model 13 -
Stepping 7
Reserved 8

Extended brand string: " Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz"
CLFLUSH instruction cache line size: 8
Hyper threading siblings: 32

Feature flags bfebfbff:
FPU Floating Point Unit
VME Virtual 8086 Mode Enhancements
DE Debugging Extensions
PSE Page Size Extensions
TSC Time Stamp Counter
MSR Model Specific Registers
PAE Physical Address Extension
MCE Machine Check Exception
CX8 COMPXCHG8B Instruction
APIC On-chip Advanced Programmable Interrupt Controller present and enabled
SEP Fast System Call
MTRR Memory Type Range Registers
PGE PTE Global Flag
MCA Machine Check Architecture
CMOV Conditional Move and Compare Instructions
FGPAT Page Attribute Table
PSE-36 36-bit Page Size Extension
CLFSH CFLUSH instruction
DS Debug store
ACPI Thermal Monitor and Clock Ctrl
MMX MMX instruction set
FXSR Fast FP/MMX Streaming SIMD Extensions save/restore
SSE Streaming SIMD Extensions instruction set
SSE2 SSE2 extensions
SS Self Snoop
HT Hyper Threading
TM Thermal monitor
31 reserved

TLB and cache info:
5a: unknown TLB/cache descriptor
03: Data TLB: 4KB pages, 4-way set assoc, 64 entries
76: unknown TLB/cache descriptor
ff: unknown TLB/cache descriptor
b0: unknown TLB/cache descriptor
f0: unknown TLB/cache descriptor
ca: unknown TLB/cache descriptor
Processor serial: 0002-06D7-0000-0000-0000-0000

The cpuid signature is cpuid.1.eax (input value=1, output register eax).
In your data above, the signature is 000206d7.
The model is 0xd. The extended model is 0x2d. the family is 0x6. The stepping is 0x7.
So we are using the same chip but your chip is 2 steppings after my chip.
You can see the explanation of model, extended model, etc in Intel CPUID app note at http://www.intel.com/content/www/us/en/processors/processor-identificati...
Pat
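
(For illustration, a short C++ sketch of how those fields fall out of cpuid.1.eax - my own example; the bit positions follow the Intel CPUID application note referenced above.)

#include <cstdint>
#include <cstdio>

int main()
{
    uint32_t eax = 0x000206d7;                       // cpuid.1.eax as reported on tix8

    uint32_t stepping  =  eax        & 0xF;          // 0x7
    uint32_t model     = (eax >> 4)  & 0xF;          // 0xd
    uint32_t family    = (eax >> 8)  & 0xF;          // 0x6
    uint32_t ext_model = (eax >> 16) & 0xF;          // 0x2
    // (Bits 27:20 hold an extended family, but it is only added in
    //  when the base family is 0xF, which is not the case here.)

    // For family 0x6 the displayed model combines both model fields.
    uint32_t disp_model = (ext_model << 4) | model;  // 0x2d

    printf("family 0x%x, model 0x%x, stepping 0x%x\n", family, disp_model, stepping);
    return 0;
}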

Could this explain the performance issues?

I doubt it...

Can you try running the prefetcher enabled and prefetcher disabled on sandybridge EP again, using lat_mem_rd without the '-t' option.
The '-t' option says to 'thrash' memory, so it doesn't really do (near as I can tell) sequential 64 byte strides.
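
(For what it's worth, the distinction is roughly the following - my own illustration of the two access patterns, not lat_mem_rd's actual code. A fixed 64-byte stride is easy for the hardware prefetchers to predict, while a randomized visiting order - which is what thrashing amounts to - defeats them.)

// Two ways to walk an array of cache-line-sized elements:
//   sequential 64-byte stride -> prefetcher-friendly
//   random permutation        -> prefetcher-hostile ("thrashing")
struct Line { long v; char pad[56]; };          // one 64-byte element per cache line

long walk(const Line* a, const long* order, long n, bool sequential)
{
    long sum = 0;
    for (long i = 0; i < n; ++i) {
        long idx = sequential ? i : order[i];   // order[] holds a random permutation of 0..n-1
        sum += a[idx].v;
    }
    return sum;
}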
It would be good to run it on the westmere-based EP box too.
Sorry for all the email/forum thrashing.
Pat

prefetcher on, spin - off, turbo - on
amk@tix8:~/amir/lmbench-3.0-a9/bin/x86_64-linux-gnu$ numactl --cpunodebind=0 --membind=0 ./lat_mem_rd 1024
"stride=64
0.00049 1.215
0.00098 1.215
0.00195 1.215
0.00293 1.215
0.00391 1.215
0.00586 1.215
0.00781 1.215
0.01172 1.215
0.01562 1.215
0.02344 1.216
0.03125 1.216
0.04688 3.646
0.06250 3.647
0.09375 3.653
0.12500 3.663
0.18750 3.697
0.25000 3.700
0.37500 4.947
0.50000 4.944
0.75000 4.961
1.00000 4.952
1.50000 4.971
2.00000 4.967
3.00000 5.073
4.00000 5.079
6.00000 5.077
8.00000 5.073
12.00000 5.073
16.00000 5.083
24.00000 9.123
32.00000 9.333
48.00000 9.401
64.00000 9.399
96.00000 9.396
128.00000 9.398
192.00000 9.398
256.00000 9.396
384.00000 9.396
512.00000 9.398
768.00000 9.396
1024.00000 9.397

prefetcher on, spin - on, turbo - on
amk@tix8:~/amir/lmbench-3.0-a9/bin/x86_64-linux-gnu$ numactl --cpunodebind=0 --membind=0 ./lat_mem_rd 1024
"stride=64
0.00049 1.215
0.00098 1.215
0.00195 1.215
0.00293 1.215
0.00391 1.215
0.00586 1.215
0.00781 1.215
0.01172 1.215
0.01562 1.215
0.02344 1.215
0.03125 1.216
0.04688 3.647
0.06250 3.647
0.09375 3.649
0.12500 3.665
0.18750 3.695
0.25000 3.699
0.37500 4.981
0.50000 4.977
0.75000 4.976
1.00000 4.972
1.50000 4.971
2.00000 4.973
3.00000 5.091
4.00000 5.092
6.00000 5.092
8.00000 5.098
12.00000 5.095
16.00000 5.097
24.00000 8.525
32.00000 8.722
48.00000 8.791
64.00000 8.787
96.00000 8.787
128.00000 8.787
192.00000 8.790
256.00000 8.786
384.00000 8.788
512.00000 8.781
768.00000 8.795
1024.00000 8.784

prefetcher off, spin - off, turbo - on
amk@tix8:~/amir/lmbench-3.0-a9/bin/x86_64-linux-gnu$ numactl --cpunodebind=0 --membind=0 ./lat_mem_rd 1024
"stride=64
0.00049 1.215
0.00098 1.215
0.00195 1.215
0.00293 1.215
0.00391 1.215
0.00586 1.215
0.00781 1.215
0.01172 1.215
0.01562 1.215
0.02344 1.215
0.03125 1.216
0.04688 3.646
0.06250 3.646
0.09375 3.647
0.12500 3.649
0.18750 7.202
0.25000 8.710
0.37500 12.205
0.50000 12.205
0.75000 12.209
1.00000 12.208
1.50000 12.210
2.00000 12.210
3.00000 12.279
4.00000 12.279
6.00000 12.278
8.00000 12.278
12.00000 12.349
16.00000 18.538
24.00000 61.805
32.00000 77.941
48.00000 82.244
64.00000 82.315
96.00000 82.262
128.00000 82.311
192.00000 82.304
256.00000 82.307
384.00000 82.319
512.00000 82.325
768.00000 82.333
1024.00000 82.328

prefetcher off, spin - on, turbo - on
amk@tix8:~/amir/lmbench-3.0-a9/bin/x86_64-linux-gnu$ numactl --cpunodebind=0 --membind=0 ./lat_mem_rd 1024
"stride=64
0.00049 1.215
0.00098 1.215
0.00195 1.215
0.00293 1.215
0.00391 1.215
0.00586 1.215
0.00781 1.215
0.01172 1.215
0.01562 1.215
0.02344 1.215
0.03125 1.216
0.04688 3.646
0.06250 3.646
0.09375 3.646
0.12500 3.646
0.18750 7.204
0.25000 6.586
0.37500 12.211
0.50000 12.210
0.75000 12.210
1.00000 12.210
1.50000 12.209
2.00000 12.209
3.00000 12.277
4.00000 12.278
6.00000 12.277
8.00000 12.277
12.00000 12.287
16.00000 23.342
24.00000 54.411
32.00000 66.018
48.00000 69.587
64.00000 69.298
96.00000 69.551
128.00000 69.295
192.00000 69.205
256.00000 69.046
384.00000 69.004
512.00000 68.928
768.00000 68.905
1024.00000 68.874

Thanks Amir,
Below is a shorter version of your results from running lat_mem_rd without the -t option.
These numbers are about what I got on my SNB-EP system.

Can you run the same tests (lat_mem_rd without the -t option) on the westmere-based system please?
Then we'll have a pretty complete set of data to investigate.

SNB-EP
prefetch, spin, turbo, latency(ns)
on, on, on, 8.784
on, off, on, 9.397
off, on, on, 68.874
off, off, on, 82.328

Thanks,
Pat

prefetcher on, spin - off
amk@tix2:~/amir/lmbench-3.0-a9/bin/x86_64-linux-gnu$ numactl --cpunodebind=0 --membind=0 ./lat_mem_rd 1024
"stride=64
0.00049 1.200
0.00098 1.200
0.00195 1.200
0.00293 1.200
0.00391 1.200
0.00586 1.200
0.00781 1.200
0.01172 1.200
0.01562 1.200
0.02344 1.200
0.03125 1.200
0.04688 3.000
0.06250 3.000
0.09375 3.002
0.12500 3.014
0.18750 3.044
0.25000 3.066
0.37500 3.916
0.50000 3.916
0.75000 3.916
1.00000 3.916
1.50000 3.916
2.00000 3.917
3.00000 3.956
4.00000 3.957
6.00000 3.957
8.00000 3.957
12.00000 5.359
16.00000 7.706
24.00000 8.369
32.00000 8.544
48.00000 8.553
64.00000 8.459
96.00000 8.537
128.00000 8.518
192.00000 8.451
256.00000 8.555
384.00000 8.574
512.00000 8.512
768.00000 8.516
1024.00000 8.552

prefetcher on, spin - on
amk@tix2:~/amir/lmbench-3.0-a9/bin/x86_64-linux-gnu$ numactl --cpunodebind=0 --membind=0 ./lat_mem_rd 1024
"stride=64
0.00049 1.200
0.00098 1.200
0.00195 1.200
0.00293 1.200
0.00391 1.200
0.00586 1.200
0.00781 1.200
0.01172 1.200
0.01562 1.200
0.02344 1.200
0.03125 1.200
0.04688 3.000
0.06250 3.000
0.09375 3.000
0.12500 3.001
0.18750 3.055
0.25000 3.067
0.37500 3.916
0.50000 3.917
0.75000 3.916
1.00000 3.916
1.50000 3.916
2.00000 3.917
3.00000 3.957
4.00000 3.956
6.00000 3.957
8.00000 3.957
12.00000 5.311
16.00000 7.582
24.00000 8.327
32.00000 8.522
48.00000 8.535
64.00000 8.476
96.00000 8.526
128.00000 8.547
192.00000 8.494
256.00000 8.503
384.00000 8.521
512.00000 8.473
768.00000 8.472
1024.00000 8.482

prefetcher off, spin - off
amk@tix2:~/amir/lmbench-3.0-a9/bin/x86_64-linux-gnu$ numactl --cpunodebind=0 --membind=0 ./lat_mem_rd 1024
"stride=64
0.00049 1.200
0.00098 1.200
0.00195 1.200
0.00293 1.200
0.00391 1.200
0.00586 1.200
0.00781 1.200
0.01172 1.200
0.01562 1.200
0.02344 1.200
0.03125 1.200
0.04688 3.000
0.06250 3.000
0.09375 3.001
0.12500 3.000
0.18750 3.000
0.25000 3.004
0.37500 15.026
0.50000 15.026
0.75000 15.026
1.00000 15.026
1.50000 15.026
2.00000 15.027
3.00000 15.099
4.00000 15.100
6.00000 15.099
8.00000 15.573
12.00000 30.821
16.00000 59.533
24.00000 66.361
32.00000 67.914
48.00000 67.940
64.00000 67.909
96.00000 67.985
128.00000 67.943
192.00000 67.890
256.00000 67.845
384.00000 67.834
512.00000 67.831
768.00000 67.807
1024.00000 67.803

prefetcher off, spin - on
amk@tix2:~/amir/lmbench-3.0-a9/bin/x86_64-linux-gnu$ numactl --cpunodebind=0 --membind=0 ./lat_mem_rd 1024
"stride=64
0.00049 1.200
0.00098 1.200
0.00195 1.200
0.00293 1.200
0.00391 1.200
0.00586 1.200
0.00781 1.200
0.01172 1.200
0.01562 1.200
0.02344 1.200
0.03125 1.200
0.04688 3.001
0.06250 3.000
0.09375 3.000
0.12500 3.000
0.18750 3.000
0.25000 6.381
0.37500 15.026
0.50000 15.027
0.75000 15.027
1.00000 15.027
1.50000 15.027
2.00000 15.027
3.00000 15.100
4.00000 15.099
6.00000 15.099
8.00000 15.099
12.00000 34.131
16.00000 59.888
24.00000 66.616
32.00000 67.897
48.00000 67.942
64.00000 67.927
96.00000 67.987
128.00000 67.922
192.00000 67.890
256.00000 67.858
384.00000 67.828
512.00000 67.813
768.00000 67.811
1024.00000 67.799

prefetcher off, spin - off, turbo - on
amk@tix8:~/amir/memtest$ numactl --cpunodebind=0 --membind=0 perf stat -c -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e L1-dcache-stores:u -e L1-dcache-store-misses:u ./memtest -v 10000019 -c 100000000
total Time (rdtsc) 25692816948 nano time 21410680790 vector size 240000456

Performance counter stats for './memtest -v 10000019 -c 100000000':

29266867942 cycles # 0.000 M/sec
5869414932 instructions # 0.201 IPC
1190560945 L1-dcache-loads # 0.000 M/sec
94484644 L1-dcache-load-misses # 0.000 M/sec
1380084899 L1-dcache-stores # 0.000 M/sec
306792765 L1-dcache-store-misses # 0.000 M/sec

8.992366456 seconds time elapsed

prefetcher off, spin - on, turbo - on
amk@tix8:~/amir/memtest$ numactl --cpunodebind=0 --membind=0 perf stat -c -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e L1-dcache-stores:u -e L1-dcache-store-misses:u ./memtest -v 10000019 -c 100000000
total Time (rdtsc) 23101364825 nano time 19251137354 vector size 240000456

Performance counter stats for './memtest -v 10000019 -c 100000000':

26324320151 cycles # 0.000 M/sec
5869414840 instructions # 0.223 IPC
1191849861 L1-dcache-loads # 0.000 M/sec
94700405 L1-dcache-load-misses # 0.000 M/sec
1374421515 L1-dcache-stores # 0.000 M/sec
306818694 L1-dcache-store-misses # 0.000 M/sec

8.084655008 seconds time elapsed

prefetcher on, spin - off, turbo - on
amk@tix8:~/amir/memtest$ numactl --cpunodebind=0 --membind=0 perf stat -c -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e L1-dcache-stores:u -e L1-dcache-store-misses:u ./memtest -v 10000019 -c 100000000
total Time (rdtsc) 25695167924 nano time 21412639936 vector size 240000456

Performance counter stats for './memtest -v 10000019 -c 100000000':

29269690219 cycles # 0.000 M/sec
5869414932 instructions # 0.201 IPC
1190419590 L1-dcache-loads # 0.000 M/sec
94353778 L1-dcache-load-misses # 0.000 M/sec
1380351700 L1-dcache-stores # 0.000 M/sec
306799151 L1-dcache-store-misses # 0.000 M/sec

8.988294003 seconds time elapsed

prefetcher on, spin - on, turbo - on
amk@tix8:~/amir/memtest$ numactl --cpunodebind=0 --membind=0 perf stat -c -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e L1-dcache-stores:u -e L1-dcache-store-misses:u ./memtest -v 10000019 -c 100000000
total Time (rdtsc) 23141081595 nano time 19284234662 vector size 240000456

Performance counter stats for './memtest -v 10000019 -c 100000000':

26370388712 cycles # 0.000 M/sec
5869414841 instructions # 0.223 IPC
1192518843 L1-dcache-loads # 0.000 M/sec
94714263 L1-dcache-load-misses # 0.000 M/sec
1372891292 L1-dcache-stores # 0.000 M/sec
306802914 L1-dcache-store-misses # 0.000 M/sec

8.094156100 seconds time elapsed

prefetcher on, spin - off
amk@tix2:~/amir/memtest$ numactl --cpunodebind=0 --membind=0 perf stat -c -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e L1-dcache-stores:u -e L1-dcache-store-misses:u ./memtest -v 10000019 -c 100000000
total Time (rdtsc) 21767099764 nano time 6530783007 vector size 240000456

Performance counter stats for './memtest -v 10000019 -c 100000000':

21809907154 cycles # 0.000 M/sec
5869320950 instructions # 0.269 IPC
1700577108 L1-dcache-loads # 0.000 M/sec
222105053 L1-dcache-load-misses # 0.000 M/sec
1130245449 L1-dcache-stores # 0.000 M/sec
0 L1-dcache-store-misses # 0.000 M/sec

6.617275213 seconds time elapsed

prefetcher on, spin - on
amk@tix2:~/amir/memtest$ numactl --cpunodebind=0 --membind=0 perf stat -c -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e L1-dcache-stores:u -e L1-dcache-store-misses:u ./memtest -v 10000019 -c 100000000
total Time (rdtsc) 21759419860 nano time 6528478805 vector size 240000456

Performance counter stats for './memtest -v 10000019 -c 100000000':

21803939087 cycles # 0.000 M/sec
5869320950 instructions # 0.269 IPC
1700577108 L1-dcache-loads # 0.000 M/sec
221837111 L1-dcache-load-misses # 0.000 M/sec
1130245449 L1-dcache-stores # 0.000 M/sec
0 L1-dcache-store-misses # 0.000 M/sec

6.614343550 seconds time elapsed

Hello Amir.
Below I summarize our results so far.
Looking at table 1, it seems that the latency of your wsm (85.707ns) and snb (89.065ns) systems is about the same.
The frequency of your wsm system is about 1.148x higher than the snb box.
In your memtest main.cpp, it seems like the 2 main components of the time are a) the random number generation and b) the loading of the random memory location.
Given that your memory latencies look about equal, I wonder how much of the difference is due to the higher wsm frequency.
If you want to test this, there are 2 ways:
1) change the frequency of the cpus (see the attached how_to_change_frequency_on_linux_pub.txt file) or
2) move the 'generate the random numbers' out of the timing loop.
For 2), you can see my win_main.cpp, which is a version of your main.cpp modified for Windows. I put the random numbers into an array.
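
(A minimal sketch of option 2 - my own illustration, not win_main.cpp - assuming the DataCell layout guessed near the top of the thread: generate all the random indices up front, then time only the loop that touches the big array.)

#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <vector>

typedef unsigned long long u64;
struct DataCell { u64 m_count; u64 m_random; u64 m_flag; };   // guessed layout, see above

void run(DataCell* dataCells, size_t vectorSize, size_t cycles)
{
    // Untimed phase: pre-generate the random indices (uses POSIX random(), as in the memtest).
    std::vector<size_t> indices(cycles);
    for (size_t i = 0; i < cycles; ++i)
        indices[i] = (size_t)random() % vectorSize;

    // Timed phase: only the random accesses to the big array.
    auto t0 = std::chrono::steady_clock::now();
    for (size_t i = 0; i < cycles; ++i) {
        DataCell* cell = &dataCells[indices[i]];
        cell->m_count  = i;
        cell->m_random = indices[i];
        cell->m_flag   = 1;
    }
    auto t1 = std::chrono::steady_clock::now();

    printf("memory access loop: %.3f s\n",
           std::chrono::duration<double>(t1 - t0).count());
}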

I'm sorry that the 2 systems I used were not more similar and that they were not linux.
Pat

Amir's 2 systems:
wsm-ep tix2 - cpu X5680 3.33GHz, mother board - Supermicro X8DTU , memory - 64GB divided 32GB to each bank at 1.33GHz
snb-ep tix8 - cpu E5-2690 2.90GHz, mother board - Intel S2600GZ, memory - 64GB divided 32GB to each bank at 1.60GHz
Frequency ratio wsm/snb = 1.148x

Table 1 below
Amir running lmbench lat_mem_rd -t (random memory accesses)
system prefetch spin turbo random latency(ns) Best snb/wsm
snb off on on on 89.065 1.039179997x
wsm off ? ? on 85.707 via private msg

Table 2 below
Amir running his memtest microkernel
system prefetch spin turbo random time(secs) Best snb/wsm
snb off on on on 8.084655008
snb off off on on 8.992366456
wsm on off ? on 6.617275213 1.221749851x

Pat's systems:
wsm-ep - cpu L5640 @ 2.27GHz, mother board - Intel S5500WB, memory - 12GB total divided 2GB per channel, 3 DIMMs per node at 1.33GHz
snb-ep - cpu @ 2.70GHz, cpuid signature 0x206d5, mother board - ASUSTek Z9PP-D24, memory - 64GB total divided 8GB per channel, 4 DIMMs per node at 1.60GHz
Frequency ratio wsm/snb = 1.189x

Table 3 below
Pat running a modified version of Amir's memtest
modified memtest now generates random numbers outside of timing loop
system prefetch spin turbo random time(secs) Best snb/wsm
snb off on on on 6.41873
wsm off on on on 7.02422 1.094331745x

Table 4 below.
Pat running a memory latency test with a random memory access
system prefetch spin turbo random latency(ns) Best wsm/snb
snb off off on on 96.714
snb off on on on 87.844
wsm off off on on 99.976 1.138108465x

Hi Pat,
I added the following data in an attached file because the forum misbehaves (it deletes spaces).

I think that you found the problem...

I made some more tests, based on your instruction, to separate the random call from the memory access.
test 0 - the original test: one loop with random and a memory access based on the random value
test 1 - separate the random from the memory access by running 2 loops: one calling random and storing the values in a vector, and another loop reading the random number from the vector and accessing the large vector
test 2 - only the random call
test 3 - same as 1, plus calling random inside the second loop and storing its value

results

test 0

amk@tix8:~/amir/memtest$ numactl --cpunodebind=0 --membind=0 perf stat -c -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e L1-dcache-stores:u -e L1-dcache-store-misses:u ./memtest -v 10000019 -c 100000000 -q 0
total Time (rdtsc) 25694935989 nano time 21412446657 vector size 240000456

Performance counter stats for './memtest -v 10000019 -c 100000000 -q 0':

29269105173 cycles # 0.000 M/sec
5869651728 instructions # 0.201 IPC
1190715446 L1-dcache-loads # 0.000 M/sec
94428062 L1-dcache-load-misses # 0.000 M/sec
1380723917 L1-dcache-stores # 0.000 M/sec
306786820 L1-dcache-store-misses # 0.000 M/sec

8.986841975 seconds time elapsed

amk@tix2:~/amir/memtest$ perf stat -c -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e L1-dcache-stores:u -e L1-dcache-store-misses:u ./memtest -v 10000019 -c 100000000 -q 0
total Time (rdtsc) 21768192992 nano time 6531111008 vector size 240000456

Performance counter stats for './memtest -v 10000019 -c 100000000 -q 0':

21811501362 cycles # 0.000 M/sec
5869557137 instructions # 0.269 IPC
1700665349 L1-dcache-loads # 0.000 M/sec
221906581 L1-dcache-load-misses # 0.000 M/sec
1130278735 L1-dcache-stores # 0.000 M/sec
0 L1-dcache-store-misses # 0.000 M/sec

6.616472433 seconds time elapsed

test 1

amk@tix8:~/amir/memtest$ numactl --cpunodebind=0 --membind=0 perf stat -c -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e L1-dcache-stores:u -e L1-dcache-store-misses:u ./memtest -v 10000019 -c 100000000 -q 1
total Time (rdtsc) 5628116479 nano time 4690097065 vector size 240000456

Performance counter stats for './memtest -v 10000019 -c 100000000 -q 1':

9058106175 cycles # 0.000 M/sec
6269846648 instructions # 0.692 IPC
1499386470 L1-dcache-loads # 0.000 M/sec
99173796 L1-dcache-load-misses # 0.000 M/sec
1253847318 L1-dcache-stores # 0.000 M/sec
323522565 L1-dcache-store-misses # 0.000 M/sec

3.099189926 seconds time elapsed

amk@tix2:~/amir/memtest$ perf stat -c -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e L1-dcache-stores:u -e L1-dcache-store-misses:u ./memtest -v 10000019 -c 100000000 -q 1
total Time (rdtsc) 6830226432 nano time 2049272856 vector size 240000456

Performance counter stats for './memtest -v 10000019 -c 100000000 -q 1':

9913974959 cycles # 0.000 M/sec
6269752348 instructions # 0.632 IPC
1700860367 L1-dcache-loads # 0.000 M/sec
235049597 L1-dcache-load-misses # 0.000 M/sec
1230473719 L1-dcache-stores # 0.000 M/sec
0 L1-dcache-store-misses # 0.000 M/sec

3.263528592 seconds time elapsed

test 2

amk@tix8:~/amir/memtest$ numactl --cpunodebind=0 --membind=0 perf stat -c -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e L1-dcache-stores:u -e L1-dcache-store-misses:u ./memtest -v 10000019 -c 100000000 -q 2
total Time (rdtsc) 2186068316 nano time 1821723596 vector size 240000456

Performance counter stats for './memtest -v 10000019 -c 100000000 -q 2':

2553096540 cycles # 0.000 M/sec
4869650845 instructions # 1.907 IPC
1206799012 L1-dcache-loads # 0.000 M/sec
92179 L1-dcache-load-misses # 0.000 M/sec
830236117 L1-dcache-stores # 0.000 M/sec
6999 L1-dcache-store-misses # 0.000 M/sec

0.860435270 seconds time elapsed

amk@tix2:~/amir/memtest$ perf stat -c -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e L1-dcache-stores:u -e L1-dcache-store-misses:u ./memtest -v 10000019 -c 100000000 -q 2
total Time (rdtsc) 2397898132 nano time 719441383 vector size 240000456

Performance counter stats for './memtest -v 10000019 -c 100000000 -q 2':

2462879102 cycles # 0.000 M/sec
4869556479 instructions # 1.977 IPC
1600664741 L1-dcache-loads # 0.000 M/sec
34091 L1-dcache-load-misses # 0.000 M/sec
830278129 L1-dcache-stores # 0.000 M/sec
0 L1-dcache-store-misses # 0.000 M/sec

0.805532006 seconds time elapsed

test 3

amk@tix8:~/amir/memtest$ numactl --cpunodebind=0 --membind=0 perf stat -c -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e L1-dcache-stores:u -e L1-dcache-store-misses:u ./memtest -v 10000019 -c 100000000 -q 3
total Time (rdtsc) 25908789550 nano time 21590657958 vector size 240000456

Performance counter stats for './memtest -v 10000019 -c 100000000 -q 3':

32109608407 cycles # 0.000 M/sec
11066621621 instructions # 0.345 IPC
2541732011 L1-dcache-loads # 0.000 M/sec
95383828 L1-dcache-load-misses # 0.000 M/sec
2284592360 L1-dcache-stores # 0.000 M/sec
306323402 L1-dcache-store-misses # 0.000 M/sec

10.108541779 seconds time elapsed

amk@tix2:~/amir/memtest$ perf stat -c -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e L1-dcache-stores:u -e L1-dcache-store-misses:u ./memtest -v 10000019 -c 100000000 -q 3
total Time (rdtsc) 21862175024 nano time 6559308438 vector size 240000456

Performance counter stats for './memtest -v 10000019 -c 100000000 -q 3':

24521143944 cycles # 0.000 M/sec
11066527039 instructions # 0.451 IPC
3400860832 L1-dcache-loads # 0.000 M/sec
235216594 L1-dcache-load-misses # 0.000 M/sec
2030474184 L1-dcache-stores # 0.000 M/sec
0 L1-dcache-store-misses # 0.000 M/sec

7.650797013 seconds time elapsed

I made some more runs comparing test 1 and test 3 using perf.

amk@tix8:~/amir/memtest$ numactl --cpunodebind=0 --membind=0 perf stat -c -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e L1-dcache-stores:u -e L1-dcache-store-misses:u -e l1-icache-loads:u -e l1-icache-load-misses:u ./memtest -v 10000019 -c 100000000 -q 1
total Time (rdtsc) 5632905513 nano time 4694087927 vector size 240000456
Performance counter stats for './memtest -v 10000019 -c 100000000 -q 1':

9061230941 cycles # 0.000 M/sec
6269846648 instructions # 0.692 IPC
1518818209 L1-dcache-loads # 0.000 M/sec
99191827 L1-dcache-load-misses # 0.000 M/sec
1253671516 L1-dcache-stores # 0.000 M/sec
323370488 L1-dcache-store-misses # 0.000 M/sec
7318275 L1-icache-loads # 0.000 M/sec
7262 L1-icache-load-misses # 0.000 M/sec

3.100873645 seconds time elapsed

amk@tix8:~/amir/memtest$ numactl --cpunodebind=0 --membind=0 perf stat -c -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e L1-dcache-stores:u -e L1-dcache-store-misses:u -e l1-icache-loads:u -e l1-icache-load-misses:u ./memtest -v 10000019 -c 100000000 -q 3
total Time (rdtsc) 26078419396 nano time 21732016163 vector size 240000456

Performance counter stats for './memtest -v 10000019 -c 100000000 -q 3':

32296048271 cycles # 0.000 M/sec
11066621630 instructions # 0.343 IPC
2534574377 L1-dcache-loads # 0.000 M/sec
95835472 L1-dcache-load-misses # 0.000 M/sec
2279501544 L1-dcache-stores # 0.000 M/sec
306153140 L1-dcache-store-misses # 0.000 M/sec
385461391 L1-icache-loads # 0.000 M/sec
12590 L1-icache-load-misses # 0.000 M/sec

10.168108997 seconds time elapsed

amk@tix2:~/amir/memtest$ perf stat -c -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e L1-dcache-stores:u -e L1-dcache-store-misses:u -e l1-icache-loads:u -e l1-icache-load-misses:u ./memtest -v 10000019 -c 100000000 -q 1
total Time (rdtsc) 6824750972 nano time 2047630054 vector size 240000456

Performance counter stats for './memtest -v 10000019 -c 100000000 -q 1':

9491721818 cycles # 0.000 M/sec (scaled from 62.32%)
6285584526 instructions # 0.662 IPC (scaled from 75.06%)
1705801008 L1-dcache-loads # 0.000 M/sec (scaled from 75.16%)
119200389 L1-dcache-load-misses # 0.000 M/sec (scaled from 75.16%)
1232065739 L1-dcache-stores # 0.000 M/sec (scaled from 75.16%)
62005805 L1-dcache-store-misses # 0.000 M/sec (scaled from 75.16%)
2020404631 L1-icache-loads # 0.000 M/sec (scaled from 49.69%)
951758 L1-icache-load-misses # 0.000 M/sec (scaled from 49.69%)

3.139811349 seconds time elapsed

amk@tix2:~/amir/memtest$ perf stat -c -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e L1-dcache-stores:u -e L1-dcache-store-misses:u -e l1-icache-loads:u -e l1-icache-load-misses:u ./memtest -v 10000019 -c 100000000 -q 3
total Time (rdtsc) 21801073812 nano time 6540976241 vector size 240000456

Performance counter stats for './memtest -v 10000019 -c 100000000 -q 3':

24523308462 cycles # 0.000 M/sec (scaled from 62.32%)
11103166989 instructions # 0.453 IPC (scaled from 74.90%)
3406669306 L1-dcache-loads # 0.000 M/sec (scaled from 75.03%)
117732566 L1-dcache-load-misses # 0.000 M/sec (scaled from 75.11%)
2031002083 L1-dcache-stores # 0.000 M/sec (scaled from 75.11%)
62171871 L1-dcache-store-misses # 0.000 M/sec (scaled from 75.11%)
3641162479 L1-icache-loads # 0.000 M/sec (scaled from 49.86%)
956647 L1-icache-load-misses # 0.000 M/sec (scaled from 49.79%)

7.632643782 seconds time elapsed

tix2                    test 1        test 3        ratio 3/1
cycles                  9491721818    24523308462   2.58365225321862
instructions            6285584526    11103166989   1.76644939592687
L1-dcache-loads         1705801008    3406669306    1.99710827348743
L1-dcache-load-misses   119200389     117732566     0.987686088843217
L1-dcache-stores        1232065739    2031002083    1.6484526910459
L1-dcache-store-misses  62005805      62171871      1.00267823311059
L1-icache-loads         2020404631    3641162479    1.80219468077432
L1-icache-load-misses   951758        956647        1.00513680998741

tix8                    test 1        test 3        ratio 3/1
cycles                  9061230941    32296048271   3.56420098784457
instructions            6269846648    11066621630   1.7650545940434
L1-dcache-loads         1518818209    2534574377    1.66878060980634
L1-dcache-load-misses   99191827      95835472      0.966162988408309
L1-dcache-stores        1253671516    2279501544    1.81826061684247
L1-dcache-store-misses  323370488     306153140     0.94675658837488
L1-icache-loads         7318275       385461391     52.6710722130557
L1-icache-load-misses   7262          12590         1.7336821812173

Please look at the L1-icache-loads on Sandy Bridge.

Regards
Amir

Attachment: memtest-perf.txt (11.89 KB)

Quote:

Patrick Fay (Intel) wrote:

1) change the frequency of the cpus (see the attached how_to_change_frequency_on_linux_pub.txt file) or

Hi Pat,

That file wasn't attached in your previous message. Could you please post it?

Thank you!
Matt

Try #2 at attaching how_to_change_frequency_on_linux_pub.txt

Try #2 at attaching win_main.cpp

Attachment: win-main.cpp (4.7 KB)

attached memtest

Attachment: main.cc.txt (5.58 KB)
