When we run a benchmark at Intel, one of the quantities we always compute is CPI - Clockticks per Instruction Retired. If we think of 'work getting done' as 'instructions retired', then the clockticks per instruction retired is a measure of how efficiently the CPU gets work done. The lower the CPI, the more work gets done per clocktick. A typical goal of the compiler is to reduce the CPI.
Use two of the fixed counters to compute the CPI: CPU_CLK_UNHALTED.THREAD and INST_RETIRED.ANY. CPU_CLK_UNHALTED.THREAD is incremented at the frequency at which the CPU is running. Work gets done at the CPU frequency, not the TSC (Time Stamp Counter) frequency. On Core2 CPUs, use CPU_CLK_UNHALTED.CORE instead of CPU_CLK_UNHALTED.THREAD.
The CPI is computed as:
CPI = CPU_CLK_UNHALTED.THREAD / INST_RETIRED.ANY
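As a minimal sketch of the formula above, the ratio can be computed in Python once the two event counts are available. The function name `cpi` and the counts in the usage line are hypothetical values chosen for illustration, not measurements from any real run.

```python
def cpi(clk_unhalted_thread: int, inst_retired_any: int) -> float:
    """Return clockticks per instruction retired.

    clk_unhalted_thread: event count of CPU_CLK_UNHALTED.THREAD
    inst_retired_any:    event count of INST_RETIRED.ANY
    """
    if inst_retired_any <= 0:
        # With no retired instructions the ratio is undefined.
        raise ValueError("INST_RETIRED.ANY must be positive")
    return clk_unhalted_thread / inst_retired_any


# Hypothetical counts: 3 billion clockticks, 2 billion instructions retired.
print(cpi(3_000_000_000, 2_000_000_000))
```

A result below 1.0 would mean that, on average, more than one instruction retires per clocktick.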
Measuring the CPI
To use the above method, you need to collect INST_RETIRED.ANY and CPU_CLK_UNHALTED.THREAD. Intel® VTune™ and Intel® Performance Tuning Utility (Intel® PTU) usually collect these events since the events are two of the fixed counters.
If you are using Intel® VTune™ or Intel® Performance Tuning Utility (Intel® PTU) data, be clear whether you are looking at 'samples' or 'events'. You need to use 'events'. You can compute the event counts from the sample counts:
Events of CPU_CLK_UNHALTED.THREAD = 'samples of CPU_CLK_UNHALTED.THREAD' * sample_after_value (SAV) and
Events of INST_RETIRED.ANY = 'samples of INST_RETIRED.ANY' * sample_after_value (SAV)
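The conversion from samples to events, followed by the CPI calculation, might be sketched as below. The helper names, sample counts, and SAV values are assumptions made for illustration. Each event can have its own SAV, so the sketch takes one SAV per event.

```python
def events_from_samples(samples: int, sav: int) -> int:
    """Estimate an event count as samples * sample_after_value (SAV)."""
    return samples * sav


def cpi_from_samples(clk_samples: int, clk_sav: int,
                     inst_samples: int, inst_sav: int) -> float:
    """Compute CPI from sample counts and the SAV used for each event."""
    clk_events = events_from_samples(clk_samples, clk_sav)     # CPU_CLK_UNHALTED.THREAD
    inst_events = events_from_samples(inst_samples, inst_sav)  # INST_RETIRED.ANY
    if inst_events <= 0:
        raise ValueError("INST_RETIRED.ANY event count must be positive")
    return clk_events / inst_events


# Hypothetical data: 1500 clocktick samples and 1000 instruction samples,
# both collected with an SAV of 2,000,000.
print(cpi_from_samples(1500, 2_000_000, 1000, 2_000_000))
```

When both events use the same SAV, the SAV cancels and the ratio of samples equals the ratio of events. When the SAVs differ, dividing raw sample counts gives the wrong CPI, which is why the conversion to events matters.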