How to use memory bandwidth analysis on old processors?

Intel(R) VTune(TM) Amplifier XE supports memory bandwidth analysis on recent Sandy Bridge, Ivy Bridge, and Haswell processors. However, a user working on older processors, for example Nehalem or Westmere-DP, will receive an error message such as:

# amplxe-cl -collect wsmex-write-bandwidth -duration 10

amplxe: Fatal error: This analysis type is only defined for Intel processors code name Beckton or Eagleton.

Memory bandwidth analysis is a key feature of VTune(TM) Amplifier XE. It uses the uncore events UNC_IMC_WRITES.FULL.ANY and UNC_IMC_NORMAL_READS.ANY to gather performance data on memory reads and writes through the IMC (Integrated Memory Controller). These events are not tied to a specific core, so event-based sampling runs in counting mode: the data collector records only the event counts and cannot record where the events happened (that is, on which core). The result is still very helpful, because it shows the overall memory throughput per second while the program is running.
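As a rough illustration of how counting-mode totals turn into a throughput figure, the sketch below converts raw event counts into GB/s. It assumes that each counted UNC_IMC_NORMAL_READS.ANY or UNC_IMC_WRITES.FULL.ANY event corresponds to one 64-byte cache line transfer; the function name and the sample counts are hypothetical and are not taken from any tool output.

```python
CACHE_LINE_BYTES = 64  # assumption: one counted IMC event = one 64-byte cache line


def bandwidth_gb_per_s(read_events, write_events, seconds, line_bytes=CACHE_LINE_BYTES):
    """Return (read, write, total) bandwidth in GB/s, with 1 GB = 1e9 bytes."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    read_bw = read_events * line_bytes / seconds / 1e9
    write_bw = write_events * line_bytes / seconds / 1e9
    return read_bw, write_bw, read_bw + write_bw


# Hypothetical event totals for a 10-second collection
read_bw, write_bw, total_bw = bandwidth_gb_per_s(500_000_000, 125_000_000, 10.0)
print(f"read {read_bw:.2f} GB/s, write {write_bw:.2f} GB/s, total {total_bw:.2f} GB/s")
```

Because the counts are system-wide, the result describes the whole memory controller, not any single core or process.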

How can this data be collected on Nehalem and Westmere-DP platforms?

There are two workarounds to choose from:

1. PTU (Performance Tuning Utility) plus a batch script can solve this issue. PTU is an old experimental tool from Intel that has reached EOL and can no longer be downloaded. However, if you still have an old version, for example PTU 3.2 Update 1, you can download lin_measurebw.tar.gz from this article and then follow the steps below to do bandwidth analysis:

1) Extract the PTU package. There is no need to install it.

2) Go to PTU/vdk/src, then build the vtune driver and install it.

3) Extract the patch file, go to the "uncore" directory, and run:

#./uncore.sh 

Please enter the path to PTU 3.2 [/opt/intel/ptu32_001_lin_intel64]: /home/peter/ptu32_001_lin_intel64

Measurement complete.  See bandwidth.txt for results.

Press enter to exit.

4) Review the output file named bandwidth.txt.

Note that PTU is an old product and was tested on old OSs only. Linux* kernel version 2.6.18 is usually recommended for PTU 3.2.

2. Use Intel PCM (Performance Counter Monitor) to solve this problem. PCM is a simple utility that provides an architecturally defined way for software agents to interact with the PMU of the processor. Here is an example of using PCM on Linux*:

1) Extract the zip file into the IntelPerformanceCounterMonitorV2.5 directory.

2) Run #make to build all utilities.

3) Run a program in one console, for example: #nbench-2.1/nbench

4) Run a utility in another console to monitor performance, for example: # ./pcm.x 1 -nc -ns. The utility will display:

EXEC  : instructions per nominal CPU cycle

 IPC   : instructions per CPU cycle

 FREQ  : relation to nominal CPU frequency='unhalted clock ticks'/'invariant timer ticks' (includes Intel Turbo Boost)

 AFREQ : relation to nominal CPU frequency while in active state (not in power-saving C state)='unhalted clock ticks'/'invariant timer ticks while in C0-state'  (includes Intel Turbo Boost)

 L3MISS: L3 cache misses

 L2MISS: L2 cache misses (including other core's L2 cache *hits*)

 L3HIT : L3 cache hit ratio (0.00-1.00)

 L2HIT : L2 cache hit ratio (0.00-1.00)

 L3CLK : ratio of CPU cycles lost due to L3 cache misses (0.00-1.00), in some cases could be >1.0 due to a higher memory latency

 L2CLK : ratio of CPU cycles lost due to missing L2 cache but still hitting L3 cache (0.00-1.00)

 READ  : bytes read from memory controller (in GBytes)

 WRITE : bytes written to memory controller (in GBytes)

 TEMP  : Temperature reading in 1 degree Celsius relative to the TjMax temperature (thermal headroom): 0 corresponds to the max temperature

 Core (SKT) | EXEC | IPC  | FREQ  | AFREQ | L3MISS | L2MISS | L3HIT | L2HIT | L3CLK | L2CLK  | READ  | WRITE | TEMP

-------------------------------------------------------------------------------------------------------------------

 TOTAL  *     0.16   1.12   0.14    1.12     196 K    516 K    0.62    0.95    0.01    0.00    0.17    0.00     N/A
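To turn the TOTAL row shown above into a bandwidth figure, a small script can pick out the READ and WRITE columns and divide by the refresh interval. The sketch below assumes that the leading "1" in the pcm.x command is the refresh interval in seconds and that READ and WRITE are reported per interval; the function name and the column positions are assumptions based only on the header printed above, so other PCM versions may need adjustments.

```python
def parse_total_row(line):
    """Extract READ and WRITE (in GBytes) from a TOTAL row of the output above."""
    fields = line.split()
    if len(fields) < 5 or fields[0] != "TOTAL":
        raise ValueError("not a TOTAL row")
    # Assumed column order, taken from the header: ... | READ | WRITE | TEMP
    read_gb = float(fields[-3])
    write_gb = float(fields[-2])
    return read_gb, write_gb


INTERVAL_S = 1.0  # assumption: matches the "1" passed to ./pcm.x

sample = (" TOTAL  *     0.16   1.12   0.14    1.12     196 K    516 K"
          "    0.62    0.95    0.01    0.00    0.17    0.00     N/A")
read_gb, write_gb = parse_total_row(sample)
print(f"read {read_gb / INTERVAL_S:.2f} GB/s, write {write_gb / INTERVAL_S:.2f} GB/s")
```

Counting columns from the right keeps the parsing independent of the "196 K" style values, which split into two whitespace-separated fields.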
