December 28, 2008 10:30 AM PST
Vectorizing a loop sometimes produces a high Clocks Per Instruction Retired (CPI) value. This can happen when the vectorized loop drives bus utilization high enough to saturate the bus.
The subtraction loop in the following code segment shows a high CPI when it is vectorized.
// testcase.cpp
Using the VTune analyzer, we found that the hardware prefetcher is working as expected.
The high CPI is due to bus saturation. The test case consumes a large share of the bus bandwidth (78% bus utilization for the loop in our experiment), and memory latency increases exponentially as bus utilization increases.
For this test case, when bus utilization rose from 70% to 78%, the CPI increased from 4 to 6.5.
We used the following events for the experiment:
- CPU_CLK_UNHALTED.CORE
- L2_LINES_IN.SELF.ANY
- INST_RETIRED.ANY
- MEM_LOAD_RETIRED.L2_LINE_MISS
- L2_LINES_IN.SELF.PREFETCH
- BUS_DRDY_CLOCKS.ALL_AGENTS
- BUS_TRANS_ANY.ALL_AGENTS
- CPU_CLK_UNHALTED.BUS
The events are described in the VTune™ Analyzer documentation. We also used the VTPause() and VTResume() APIs to limit data collection to the loop.
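Limiting collection to a loop with VTPause() and VTResume() might look like the sketch below. It is not the original test case. The subtraction loop, the function names, and the `USE_VTUNE_API` macro are hypothetical, and the header name `VtuneApi.h` is an assumption to check against your installation. Without the macro, no-op stand-ins are used so the sketch builds on its own.

```cpp
#include <cstddef>
#include <vector>

#ifdef USE_VTUNE_API
#include "VtuneApi.h"  // assumed header name for VTPause()/VTResume()
#else
// No-op stand-ins so the sketch compiles without the analyzer installed.
static void VTPause() {}
static void VTResume() {}
#endif

// Hypothetical stand-in for the measured loop: c[i] = a[i] - b[i].
// A simple loop like this is a typical candidate for vectorization.
void subtract(const float* a, const float* b, float* c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        c[i] = a[i] - b[i];
    }
}

// Resume collection just before the loop and pause it right after,
// so that samples are attributed to the loop only.
// Assumption: a and b hold at least c.size() elements.
void measured_subtract(const std::vector<float>& a,
                       const std::vector<float>& b,
                       std::vector<float>& c) {
    VTResume();
    subtract(a.data(), b.data(), c.data(), c.size());
    VTPause();
}
```

For this pattern to isolate the loop, collection has to be paused before the loop is reached, for example by calling VTPause() at program start or by configuring the activity to start with collection paused.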

