Last Modified On: November 18, 2008 5:33 PM PST
Vectorizing a loop sometimes produces a high Clocks Per Instruction Retired (CPI) value. This happens when the vectorized loop drives bus utilization high enough to saturate the bus.
The subtraction loop in the following code segment shows a high CPI when the loop is vectorized.
// testcase.cpp
Using the VTune analyzer, we found that the hardware prefetcher works as expected.
The high CPI is due to bus saturation. The loop in the test case consumes a large share of the bus bandwidth (78% in our experiment), and memory latency increases exponentially as bus utilization increases.
For this test case, raising the bus utilization from 70% to 78% increased the CPI from 4 to 6.5.
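To make the size of that change concrete, here is a minimal sketch of the arithmetic. CPI is the number of clock cycles divided by the number of instructions retired over the same window. The function names and any raw counts passed in are illustrative assumptions, not the events or counts from our experiment.

```cpp
#include <cassert>
#include <cstdint>

// Clocks per instruction retired: clock cycles divided by instructions
// retired, both measured over the same window.
// Hypothetical helper; the parameter names are not VTune event names.
double cpi(std::uint64_t clock_cycles, std::uint64_t instructions_retired) {
    assert(instructions_retired > 0 && "no instructions retired in window");
    return static_cast<double>(clock_cycles) /
           static_cast<double>(instructions_retired);
}

// Relative change between two CPI values, as a fraction of the baseline.
double relative_increase(double baseline, double observed) {
    assert(baseline > 0.0);
    return (observed - baseline) / baseline;
}
```

With the values reported above, `relative_increase(4.0, 6.5)` evaluates to 0.625. In other words, an 8-point rise in bus utilization coincided with a CPI increase of about 62%.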
We used the following events for the experiment:
The events are described in the VTune™ Analyzer documentation. We also used the VTPause() and VTResume() APIs to limit data collection to the loop.
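The following is a rough sketch of how a loop can be bracketed with these calls. It rests on several assumptions: that the APIs are declared in a header named `VtuneApi.h`, that collection was started in the paused state, and that the loop resembles the one shown here. The `subtract` function and the `USE_VTUNE_API` macro are hypothetical, and the loop is not the original test case. Without the macro, counting stand-ins replace the real calls so the sketch builds without the analyzer installed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

#ifdef USE_VTUNE_API
#include "VtuneApi.h"  // assumption: header that declares VTPause/VTResume
#else
// Hypothetical stand-ins that only count calls, used when the analyzer
// API is not available.
static int g_resume_calls = 0;
static int g_pause_calls = 0;
static void VTResume() { ++g_resume_calls; }
static void VTPause() { ++g_pause_calls; }
#endif

// Illustrative subtraction loop (not the original test case).
// Data collection is resumed just before the loop and paused right after,
// so samples are attributed only to the loop itself.
void subtract(const std::vector<float>& a, const std::vector<float>& b,
              std::vector<float>& out) {
    const std::size_t n = out.size();
    assert(a.size() >= n && b.size() >= n);

    VTResume();  // start collecting
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = a[i] - b[i];
    }
    VTPause();   // stop collecting
}
```

Keeping the resume and pause calls directly around the loop keeps setup code, such as array allocation and initialization, out of the collected samples.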
