Introduction
Vectorizing a loop sometimes results in a high Clocks Per Instruction Retired (CPI) value. This can happen when bus utilization is high and the bus becomes saturated.
The subtraction loop in our test case shows a high CPI when it is vectorized.
Using the VTune analyzer, we confirmed that the hardware prefetcher was working as expected.
The high CPI is instead due to bus saturation. The loop consumes a large share of the available bus bandwidth (78% bus utilization in our experiment), and memory latency increases exponentially as bus utilization increases.
For the test case, raising bus utilization from 70% to 78% increased the CPI from 4 to 6.5. In other words, an 8-point rise in bus utilization raised the CPI by roughly 60%.
We used the following events for the experiment:
- CPU_CLK_UNHALTED.CORE
- L2_LINES_IN.SELF.ANY
- INST_RETIRED.ANY
- MEM_LOAD_RETIRED.L2_LINE_MISS
- L2_LINES_IN.SELF.PREFETCH
- BUS_DRDY_CLOCKS.ALL_AGENTS
- BUS_TRANS_ANY.ALL_AGENTS
- CPU_CLK_UNHALTED.BUS
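As a rough illustration of how these raw counts turn into the figures quoted above, the following C sketch derives CPI and bus utilization from the event totals. The struct and function names are hypothetical. The bus utilization formula (two bus clocks per transaction) and the data bus formula are assumptions based on the ratios Intel publishes for the Core microarchitecture. The article does not say which of the two the 78% figure refers to, so check the formulas against the VTune documentation for your processor.

```c
#include <stdint.h>

/* Raw event totals for the sampled region.
 * Hypothetical container: field names simply mirror the event names. */
typedef struct {
    uint64_t cpu_clk_unhalted_core;      /* CPU_CLK_UNHALTED.CORE */
    uint64_t inst_retired_any;           /* INST_RETIRED.ANY */
    uint64_t bus_trans_any_all_agents;   /* BUS_TRANS_ANY.ALL_AGENTS */
    uint64_t bus_drdy_clocks_all_agents; /* BUS_DRDY_CLOCKS.ALL_AGENTS */
    uint64_t cpu_clk_unhalted_bus;       /* CPU_CLK_UNHALTED.BUS */
    uint64_t l2_lines_in_self_any;       /* L2_LINES_IN.SELF.ANY */
    uint64_t l2_lines_in_self_prefetch;  /* L2_LINES_IN.SELF.PREFETCH */
} event_counts;

/* Divide two counters, returning 0.0 instead of dividing by zero. */
static double ratio(uint64_t num, uint64_t den)
{
    return den ? (double)num / (double)den : 0.0;
}

/* Clocks per instruction retired. */
double cpi(const event_counts *c)
{
    return ratio(c->cpu_clk_unhalted_core, c->inst_retired_any);
}

/* Percentage of bus clocks used by transactions.
 * Assumption: each transaction occupies two bus clocks. */
double bus_utilization_pct(const event_counts *c)
{
    return 100.0 * 2.0 * ratio(c->bus_trans_any_all_agents,
                               c->cpu_clk_unhalted_bus);
}

/* Percentage of bus clocks in which data was being transferred. */
double data_bus_utilization_pct(const event_counts *c)
{
    return 100.0 * ratio(c->bus_drdy_clocks_all_agents,
                         c->cpu_clk_unhalted_bus);
}

/* Share of incoming L2 lines that were brought in by the prefetcher. */
double l2_prefetch_share_pct(const event_counts *c)
{
    return 100.0 * ratio(c->l2_lines_in_self_prefetch,
                         c->l2_lines_in_self_any);
}
```

A high prefetch share together with a high bus utilization is consistent with the situation described here: the prefetcher is doing its job, and the bus itself is the limit.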
The events are described in the VTune™ Analyzer documentation. We also used the VTPause() and VTResume() APIs to limit data collection to the loop.
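A minimal sketch of that bracketing is shown below. The loop is a hypothetical stand-in for the subtraction loop, not the original test case, and the function name and USE_VTUNE_API macro are invented for illustration. We assume VTPause() and VTResume() take no arguments and are declared in VTuneApi.h. When the macro is not defined, no-op stubs are used so the sketch builds without the VTune headers.

```c
#include <stddef.h>

#ifdef USE_VTUNE_API
#include "VTuneApi.h"            /* assumed to declare VTPause() and VTResume() */
#else
/* Stand-in no-op stubs so the sketch compiles without the VTune SDK. */
static void VTPause(void) {}
static void VTResume(void) {}
#endif

/* Hypothetical subtraction loop: dst[i] = a[i] - b[i].
 * The restrict qualifiers tell the compiler the arrays do not overlap,
 * which helps it vectorize the loop. */
void subtract_arrays(float *restrict dst,
                     const float *restrict a,
                     const float *restrict b,
                     size_t n)
{
    VTResume();                  /* start collecting just before the loop */
    for (size_t i = 0; i < n; ++i) {
        dst[i] = a[i] - b[i];
    }
    VTPause();                   /* stop collecting right after the loop */
}
```

For the bracketing to isolate the loop, the collection would presumably need to be started in a paused state so that nothing is sampled before the first VTResume() call.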
