High Clocks Per Instruction Retired (CPI) When Vectorizing a Loop

Last Modified On :   November 18, 2008 5:33 PM PST

Introduction

Sometimes, vectorizing a loop results in a high Clocks Per Instruction Retired (CPI) value. This happens when bus utilization is high and the bus becomes saturated.

The subtraction loop in the following code segment has a high Clocks Per Instruction Retired (CPI) when the loop is vectorized.

// testcase.cpp
#include <stdio.h>
#include <stdlib.h>
#include <xmmintrin.h>  // _mm_malloc, _mm_free

int main(int argc, char** argv)
{
    // 1000*320*256 shorts (2 bytes each) per array, 16-byte aligned
    short* a = (short*)_mm_malloc(320*256*2*1000, 16);
    short* b = (short*)_mm_malloc(320*256*2*1000, 16);
    short* c = (short*)_mm_malloc(320*256*2*1000, 16);
    if (!a || !b || !c)
    {
        printf("allocation failed\n");
        return 1;
    }
    for (int i = 0; i < 1000*320*256; i++)
    {
        a[i] = (short)(rand() % 16000);
        b[i] = (short)(rand() % 16000);
    }
    // Subtraction loop under study: shows high CPI when vectorized
    #pragma unroll(4)
    for (int i = 0; i < 400*320*256; i++)
        c[i] = a[i] - b[i];
    printf("%i\n", c[200]); // make sure compiler does not eliminate
                            // loop using dead code elimination
    _mm_free(a);
    _mm_free(b);
    _mm_free(c);
    return 0;
}
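
To see why this loop depends on memory traffic rather than on computation, it helps to work out how much data it touches. The sketch below only does that arithmetic. The helper names are hypothetical and not part of testcase.cpp, and the figure of 8 elements per vector operation assumes 128-bit integer vectors, which the article does not state.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers (not part of testcase.cpp) that size the subtraction loop.
constexpr std::int64_t kLoopElements    = 400LL * 320 * 256;  // trip count of the loop
constexpr std::int64_t kBytesPerElement = 2;                  // the test case allocates 2 bytes per short

// Bytes the loop touches in each of the arrays a, b and c.
constexpr std::int64_t bytes_per_array() {
    return kLoopElements * kBytesPerElement;
}

// Bytes read (a, b) plus bytes written (c); extra traffic caused by
// write-allocate behaviour of the cache is not counted here.
constexpr std::int64_t bytes_touched() {
    return 3 * bytes_per_array();
}

// Number of vector operations needed per array.
// Assumption: 128-bit vectors, i.e. 8 shorts per operation.
inline std::int64_t vector_iterations(std::int64_t lanes = 8) {
    assert(lanes > 0);
    return kLoopElements / lanes;
}
```

Each array contributes 65,536,000 bytes, so the loop moves roughly 197 MB, far more than the processor caches hold. Under the 8-lane assumption the vectorized loop needs only 4,096,000 vector iterations for the same traffic. Fewer instructions retire per byte transferred, so if the run time is set by the bus, the clocks per retired instruction go up.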

Using the VTune analyzer, we found that the hardware prefetcher was working as expected.

The high CPI is due to bus saturation. The loop consumes a large share of the available bus bandwidth (78% bus utilization in our experiment), and memory latency increases exponentially as bus utilization increases.

For the test case, we found that when bus utilization increased from 70% to 78%, the CPI increased from 4 to 6.5.

We used the following events for the experiment:

  1. CPU_CLK_UNHALTED.CORE
  2. L2_LINES_IN.SELF.ANY
  3. INST_RETIRED.ANY
  4. MEM_LOAD_RETIRED.L2_LINE_MISS
  5. L2_LINES_IN.SELF.PREFETCH
  6. BUS_DRDY_CLOCKS.ALL_AGENTS
  7. BUS_TRANS_ANY.ALL_AGENTS
  8. CPU_CLK_UNHALTED.BUS
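
The article does not spell out how these counts were combined. The following is a minimal sketch of the derived ratios such events are commonly used for. It assumes the ratio definitions Intel documents for Intel Core microarchitecture events: CPI as core clocks per retired instruction, and bus utilization as twice the bus transactions divided by bus clocks. The struct, the function names, and the prefetch-share ratio are hypothetical conveniences, not the authors' method.

```cpp
#include <cassert>
#include <cstdint>

// Raw counts for one measured region. Field names mirror the events listed
// above; the struct itself is a hypothetical convenience type.
struct EventCounts {
    std::uint64_t cpu_clk_unhalted_core      = 0;
    std::uint64_t l2_lines_in_self_any       = 0;
    std::uint64_t inst_retired_any           = 0;
    std::uint64_t mem_load_retired_l2_miss   = 0;  // collected, not used in the ratios below
    std::uint64_t l2_lines_in_self_prefetch  = 0;
    std::uint64_t bus_drdy_clocks_all_agents = 0;
    std::uint64_t bus_trans_any_all_agents   = 0;
    std::uint64_t cpu_clk_unhalted_bus       = 0;
};

// Plain quotient of two counts; the denominator must be non-zero.
static double ratio(std::uint64_t num, std::uint64_t den) {
    assert(den != 0);
    return static_cast<double>(num) / static_cast<double>(den);
}

// Clocks per instruction retired.
double cpi(const EventCounts& e) {
    return ratio(e.cpu_clk_unhalted_core, e.inst_retired_any);
}

// Bus utilization in percent (assumed definition: transactions * 2 / bus clocks).
double bus_utilization_pct(const EventCounts& e) {
    return 100.0 * 2.0 * ratio(e.bus_trans_any_all_agents, e.cpu_clk_unhalted_bus);
}

// Data bus utilization in percent (assumed definition: data-ready clocks / bus clocks).
double data_bus_utilization_pct(const EventCounts& e) {
    return 100.0 * ratio(e.bus_drdy_clocks_all_agents, e.cpu_clk_unhalted_bus);
}

// Share of L2 line fills that were brought in by the hardware prefetcher,
// one possible way to check that the prefetcher is active.
double prefetch_share_pct(const EventCounts& e) {
    return 100.0 * ratio(e.l2_lines_in_self_prefetch, e.l2_lines_in_self_any);
}
```

The counts would come from a VTune sampling run restricted to the loop. Check the ratio definitions against the analyzer documentation for your processor before relying on them.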

The events are described in the VTune™ Analyzer documentation. We also used the VTPause() and VTResume() APIs to limit data collection to the loop.
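
A minimal sketch of how the pause and resume calls might bracket the loop. It assumes the API header is named VtuneApi.h, which should be checked against your analyzer installation. USE_VTUNE_API is a hypothetical build switch, not an Intel macro: when it is not defined, no-op stand-ins are used so the sketch still builds without the analyzer. The function name subtract_measured is also ours.

```cpp
#include <cassert>
#include <cstddef>

#ifdef USE_VTUNE_API           // hypothetical build switch, not an Intel macro
#include "VtuneApi.h"          // assumed header name declaring VTPause()/VTResume()
#else
// No-op stand-ins so the sketch compiles without the analyzer installed.
static void VTPause()  {}
static void VTResume() {}
#endif

// Runs the subtraction loop with data collection enabled only around it.
void subtract_measured(const short* a, const short* b, short* c, std::size_t n) {
    assert(a && b && c);
    VTResume();                // resume collection just before the loop
    for (std::size_t i = 0; i < n; ++i)
        c[i] = static_cast<short>(a[i] - b[i]);
    VTPause();                 // pause collection right after the loop
}
```

For the restriction to be effective, setup code such as the rand() initialization has to run while collection is paused, for example by calling VTPause() at the start of main.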

This article applies to: Tools,   Xeon,   Intel® C++ Compiler for Linux* Knowledge Base,   Intel® C++ Compiler for Mac OS X* Knowledge Base,   Intel® C++ Compiler for Windows* Knowledge Base,   Intel® Parallel Composer,   Intel® Software Development Tool Suites for Intel® Atom™ Processor Knowledge Base,   Intel® VTune™ Performance Analyzer for Linux* Knowledge Base,   Intel® VTune™ Performance Analyzer for Windows* Knowledge Base