High Clocks Per Instruction Retired When Vectorizing a Loop


Introduction

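Vectorizing a loop sometimes produces a high Clocks Per Instruction Retired (CPI) value. This can happen when the loop's memory traffic drives bus utilization high enough to saturate the bus. Because a vectorized loop retires fewer instructions for the same amount of data, the time spent waiting on memory is spread over fewer instructions and the CPI rises.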

The subtraction loop in the following code segment has a high Clocks Per Instruction Retired (CPI) when the loop is vectorized.

// testcase.cpp
#include <stdio.h>
#include <stdlib.h>
#include <xmmintrin.h>  // _mm_malloc, _mm_free

int main(int argc, char** argv)
{
    // Three arrays of 1000*320*256 shorts (2 bytes each), aligned on 16 bytes
    short* a = (short*)_mm_malloc(320*256*2*1000, 16);
    short* b = (short*)_mm_malloc(320*256*2*1000, 16);
    short* c = (short*)_mm_malloc(320*256*2*1000, 16);

    // Fill the inputs with small values so the subtraction cannot overflow
    for (int i = 0; i < 1000*320*256; i++)
    {
        a[i] = (short)(rand() % 16000);
        b[i] = (short)(rand() % 16000);
    }

    // Subtraction loop under study (vectorized by the compiler)
    #pragma unroll(4)
    for (int i = 0; i < 400*320*256; i++)
        c[i] = a[i] - b[i];

    // Use a result so the compiler does not remove the loop
    // through dead code elimination
    printf("%i\n", c[200]);

    _mm_free(a);
    _mm_free(b);
    _mm_free(c);
    return 0;
}
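To make the memory behavior of the vectorized loop easier to see, the following is a minimal sketch of what the subtraction looks like when written by hand with SSE2 intrinsics. It is not the compiler's actual output. The function name sub_epi16_sketch is hypothetical, and the sketch assumes 16-byte aligned arrays and an element count that is a multiple of 8.

```cpp
#include <cassert>
#include <cstddef>
#include <emmintrin.h>  // SSE2 intrinsics: __m128i, _mm_sub_epi16

// Hypothetical helper: computes c[i] = a[i] - b[i], eight 16-bit elements
// per iteration. Assumes a, b and c are 16-byte aligned (as returned by
// _mm_malloc(size, 16)) and that n is a multiple of 8. A compiler would
// also emit a scalar loop for any remaining elements.
void sub_epi16_sketch(const short* a, const short* b, short* c, std::size_t n)
{
    assert(n % 8 == 0);
    for (std::size_t i = 0; i < n; i += 8) {
        // Two aligned 16-byte loads
        __m128i va = _mm_load_si128(reinterpret_cast<const __m128i*>(a + i));
        __m128i vb = _mm_load_si128(reinterpret_cast<const __m128i*>(b + i));
        // One subtract instruction covers eight elements
        __m128i vc = _mm_sub_epi16(va, vb);
        // One aligned 16-byte store
        _mm_store_si128(reinterpret_cast<__m128i*>(c + i), vc);
    }
}
```

Each iteration of this sketch reads 32 bytes and writes 16 bytes while executing only a few instructions, so the loop spends most of its time moving data rather than computing.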

 

Using the VTune analyzer, we found that the hardware prefetcher is working as expected, so the prefetcher is not the cause of the high CPI.

The high CPI is due to bus saturation. The loop uses a large share of the available bus bandwidth (78% bus utilization in our experiment), and memory latency increases exponentially as bus utilization increases.

In this test case, when bus utilization increased from 70% to 78%, the CPI increased from 4 to 6.5. That is a rise of more than 60% in CPI for an 8-point increase in utilization.

We used the following events for the experiment:

  1. CPU_CLK_UNHALTED.CORE
  2. L2_LINES_IN.SELF.ANY
  3. INST_RETIRED.ANY
  4. MEM_LOAD_RETIRED.L2_LINE_MISS
  5. L2_LINES_IN.SELF.PREFETCH
  6. BUS_DRDY_CLOCKS.ALL_AGENTS
  7. BUS_TRANS_ANY.ALL_AGENTS
  8. CPU_CLK_UNHALTED.BUS
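The events above can be combined into a few ratios. The sketch below shows one plausible way to do that. The CPI formula follows directly from the definition of CPI. The two bus ratios assume the usual definitions for this processor generation, in which each bus transaction occupies two bus clocks. The prefetch share is an illustrative ratio of our own. The EventCounts struct and the function names are hypothetical and are not part of any VTune API.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical container for raw event totals collected over the loop.
// Field names mirror the event names listed above.
struct EventCounts {
    std::uint64_t cpu_clk_unhalted_core;
    std::uint64_t inst_retired_any;
    std::uint64_t bus_trans_any_all_agents;
    std::uint64_t bus_drdy_clocks_all_agents;
    std::uint64_t cpu_clk_unhalted_bus;
    std::uint64_t l2_lines_in_self_any;
    std::uint64_t l2_lines_in_self_prefetch;
};

// CPI: core clock cycles per retired instruction.
double cpi(const EventCounts& e)
{
    assert(e.inst_retired_any > 0);
    return double(e.cpu_clk_unhalted_core) / double(e.inst_retired_any);
}

// Bus utilization in percent.
// Assumption: each bus transaction occupies two bus clocks.
double bus_utilization_pct(const EventCounts& e)
{
    assert(e.cpu_clk_unhalted_bus > 0);
    return 100.0 * 2.0 * double(e.bus_trans_any_all_agents)
           / double(e.cpu_clk_unhalted_bus);
}

// Data bus utilization in percent: bus clocks spent transferring data.
double data_bus_utilization_pct(const EventCounts& e)
{
    assert(e.cpu_clk_unhalted_bus > 0);
    return 100.0 * double(e.bus_drdy_clocks_all_agents)
           / double(e.cpu_clk_unhalted_bus);
}

// Illustrative ratio: percentage of L2 line fills that were
// brought in by the hardware prefetcher.
double prefetch_share_pct(const EventCounts& e)
{
    assert(e.l2_lines_in_self_any > 0);
    return 100.0 * double(e.l2_lines_in_self_prefetch)
           / double(e.l2_lines_in_self_any);
}
```

A high prefetch share together with a high bus utilization would be consistent with a prefetcher that is doing its job while the bus itself is the bottleneck.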

 

The events are described in the VTune™ Analyzer documentation. We also used the VTPause() and VTResume() APIs to limit data collection to the loop.
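As a rough illustration of how data collection can be limited to the loop, the sketch below wraps the subtraction in VTResume() and VTPause() calls. The header name VtuneApi.h and the USE_VTUNE_API macro are assumptions, so check the VTune analyzer documentation for the exact header and library to link. Without the macro, no-op stand-ins are used so that the sketch builds on a machine without the analyzer. The function name subtract_with_collection is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

#ifdef USE_VTUNE_API
#include "VtuneApi.h"  // assumed header name; declares VTPause() and VTResume()
#else
// No-op stand-ins so the sketch builds without the VTune analyzer installed.
static void VTPause() {}
static void VTResume() {}
#endif

// Hypothetical wrapper: collects data only while the subtraction loop runs.
// Assumes collection is already paused when this function is called, for
// example by calling VTPause() at the start of main().
void subtract_with_collection(const short* a, const short* b, short* c,
                              std::size_t n)
{
    assert(a != nullptr && b != nullptr && c != nullptr);

    VTResume();  // resume collection just before the loop of interest
    for (std::size_t i = 0; i < n; i++)
        c[i] = static_cast<short>(a[i] - b[i]);
    VTPause();   // pause collection again right after the loop
}
```

With this arrangement the array initialization and cleanup do not contribute to the event counts, so the measured CPI and bus utilization describe the loop alone.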

 


For more information about compiler optimizations, see the Optimization Notice.