Understand bus utilization

Hi, I am trying to measure bus utilization on a Xeon 5400 machine, which has a 1333 MHz FSB and DDR2-667 memory, while running a simple memory copy with 8 threads (the machine has 2 processors with 4 cores each). The memcpy throughput (copying from one large chunk of memory to another) is 3000 MB/s.
I use oprofile to measure BUS_TRANS_ANY.ALL_AGENTS and CPU_CLK_UNHALTED.BUS. As suggested by the Intel Optimization Reference Manual, bus utilization can be computed as BUS_TRANS_ANY.ALL_AGENTS * 2 / CPU_CLK_UNHALTED.BUS * 100. When I do this, I get only 66%.
A simple memory copy should be able to saturate the memory bus, right? Why do I get only 66%? What is going wrong?
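
For readers who want to reproduce the arithmetic, here is a minimal sketch of the formula quoted in the question. The function name and the counter values are hypothetical, chosen only so that the result lands near the 66% reported; they are not measurements from this machine.

```python
def bus_utilization_percent(bus_trans_any_all_agents, cpu_clk_unhalted_bus):
    """Apply the formula quoted above:
    BUS_TRANS_ANY.ALL_AGENTS * 2 / CPU_CLK_UNHALTED.BUS * 100
    """
    if cpu_clk_unhalted_bus <= 0:
        raise ValueError("CPU_CLK_UNHALTED.BUS must be a positive count")
    return bus_trans_any_all_agents * 2 / cpu_clk_unhalted_bus * 100


if __name__ == "__main__":
    # Hypothetical event counts, for illustration only.
    transactions = 33_000_000   # BUS_TRANS_ANY.ALL_AGENTS
    bus_clocks = 100_000_000    # CPU_CLK_UNHALTED.BUS
    pct = bus_utilization_percent(transactions, bus_clocks)
    print(f"bus utilization: {pct:.1f}%")
```

Both counts need to come from the same sampling interval, otherwise the ratio is not meaningful.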


Hello Da,
Can you look at:
This equation tells you what percent of the time the bus was stalled and unable to accept new transactions.

And also look at:

The BUS_TRANS_ANY.ALL_AGENTS equation really reports the address bus utilization.

The bus can become too congested to accept more traffic.
From my recollection, utilizations of 60% to 70% are very high. You have probably maxed out the bus at this level of utilization.

This is one of the reasons for moving to NUMA memory, integrated memory controllers and QPI.
The QPI links separate the coherency traffic from the memory traffic.
The local NUMA memory with an integrated memory controller allows for more efficient memory accesses with lower latency and higher bandwidth.

Hello Pat,
Thanks for your reply, and sorry for my late response.
I measured Bus Not Ready Ratio and Data Bus Utilization, and they are 7.8% and 35.9%, respectively. These two values seem very low to me.
Do you have any comments on them?
Thanks,
Da
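
As a rough cross-check of the 35.9% figure, the sketch below converts a data bus utilization percentage into an implied data rate. It assumes a 64-bit (8-byte) FSB data path at 1333 MT/s and treats the percentage as a fraction of one bus's peak; the function names are hypothetical, and the result is a back-of-the-envelope estimate rather than a measurement.

```python
# Assumptions (not stated in the thread): 8-byte-wide FSB data path,
# 1333 million transfers per second, utilization applies to a single bus.
FSB_TRANSFERS_PER_SEC = 1333e6
FSB_WIDTH_BYTES = 8


def peak_fsb_bandwidth_mb_s(transfers_per_sec=FSB_TRANSFERS_PER_SEC,
                            width_bytes=FSB_WIDTH_BYTES):
    """Theoretical peak data rate of one FSB in MB/s (1 MB = 10**6 bytes)."""
    return transfers_per_sec * width_bytes / 1e6


def implied_data_bandwidth_mb_s(data_bus_utilization_percent, peak_mb_s=None):
    """Scale the peak rate by the measured data bus utilization."""
    if peak_mb_s is None:
        peak_mb_s = peak_fsb_bandwidth_mb_s()
    return peak_mb_s * data_bus_utilization_percent / 100


if __name__ == "__main__":
    peak = peak_fsb_bandwidth_mb_s()
    implied = implied_data_bandwidth_mb_s(35.9)
    print(f"peak per FSB:      {peak:.0f} MB/s")
    print(f"implied data rate: {implied:.0f} MB/s")
```

Keep in mind that a copy both reads and writes memory, so the traffic on the bus is at least twice the 3000 MB/s copy rate.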

Hello Da,
The 66% utilization you reported before is typical of bus saturation on FSB-based Core 2 systems.
You can see that the address bus is the limiter in this case.
On a 2-processor system, the FSB carries a lot of coherency traffic between the processors.
There is even more coherency traffic on 4-processor systems.
This was one of the main reasons for the demise of FSB-based memory systems.
Sorry to not have a better answer for you.
