Identify Long Latency Instruction Impacts

Long latency instructions such as division and square-root can introduce stalls during the execution of an application. The Intel® VTune™ Amplifier XE performance profiler can help software developers analyze their application to identify algorithmic and microarchitectural performance issues. Intel VTuneAmplifier XE uses the processor's Performance Monitoring Unit (PMU) to sample processor events and can be used to statistically sample the number of computational operations.

Intel VTune Amplifier XE can help identify where such operations are taking place and if these operations are contributing to stall cycles during the execution.

Intel® Core™2 processor family (Intel®  Core™2 Duo/Quad, etc)
 
DIV
 
Counts the number of divide operations executed. This includes integer divides, floating point divides and square-root operations executed.
CYCLES_DIV_BUSY
 
Counts the number of cycles the divider is busy executing divide or square root operations. The divide can be integer, X87 or Streaming SIMD Extensions (SSE). The square root operation can be either X87 or SSE.
Intel® Core™ architecture (Intel® Core™ i7, i5, i3; a.k.a. Nehalem)
 
ARITH.DIV
 
Counts number of divide or square root operations. The divide can be integer, X87 or Streaming SIMD Extensions (SSE). The square root operation can be either X87 or SSE.
ARITH. CYCLES_DIV_BUSY
 
Counts the number of cycles the divider is busy executing divide or square root operations. The divide can be integer, X87 or Streaming SIMD Extensions (SSE). The square root operation can be either X87 or SSE.
2nd Generation Intel® Core™ architecture (a.k.a. Sandy Bridge)
 
ARITH.FP_DIV
 
Counts the number of the divide operations executed.
ARITH.FPU_DIV_ACTIVE
 
Counts the cycles when the divider is busy executing divide or square-root operations. An additional constant number of (6) cycles to be added per operation.
Table 1: PMU events used to count the specific events.

 

Example: Let's consider the N Body problem for this exercise. The N-body problem predicts the motion of a group of celestial objects that interact with each other gravitationally. The sample application proceeds over time steps and in each step computes the net force on every body and updates its position, acceleration and velocity accordingly. This implementation requires O(N2) operations in each iteration.

runSerialBodies()

{
...
// Run the simulation over a fixed range of time steps
for (double s = 0.; s < STEPLIMIT; s += TIMESTEP)
{

  // Compute the accelerations of the bodies
  for (i = 0; i < n - 1; ++i)
  {
   for (j = i + 1; j < n; ++j)
   {

     // compute the distance between them
     double dx = body[i].pos[0]-body[j].pos[0];
     double dy = body[i].pos[1]-body[j].pos[1];
     double dz = body[i].pos[2]-body[j].pos[2];

     double distsq = dx*dx + dy*dy + dz*dz;

     if (distsq < MINDIST)distsq = MINDIST;

     double dist = sqrt(distsq);

     // compute the unit vector from j to i
     double ud[3];
     ud[0] = dx / dist;
     ud[1] = dy / dist;
     ud[2] = dz / dist;

     // F = G*mi*mj/distsq, but F = ma, so ai = G*mj/distsq
     double Gdivd = GFORCE/distsq;
     double ai = Gdivd*body[j].mass;
     double aj = Gdivd*body[i].mass;
     // apply acceleration components using unit vectors
     for (int k = 0; k < 3; ++k)
     {
        body[j].acc[k] += aj*ud[k];
        body[i].acc[k] -= ai*ud[k];
      }
    }
  }

  // Apply acceleration and advance bodies
  for (i = 0; i < n; ++i)
  {
    for (j = 0; j < 3; ++j)
    {     
       body[i].vel[j] += body[i].acc[j] * TIMESTEP;
       body[i].pos[j] += body[i].vel[j] * TIMESTEP;
       body[i].acc[j] = 0.;
     }
  }

}
...
}

 

Analyzing the sample code on Intel® Core™ i7 (x980) based system (3.33GHz, 6 core + Intel® Hyper-Threading Technology enabled) with Intel VTune Amplifier XE reveals the following:

fig1.png

Figure 1: Shows the analysis of the sample code, 85% of the clockticks, 89.3% of the uops dispatched and 83.4% of the dispatch stalls are occuring in this code segment. Please check the Intel VTune Amplifier XE help file for more information on the events such as UOPS_DISPATCHED.


One way to optimize the code is to replace the division with reciprocal multiplication as shown below.

// compute the unit vector from j to i

double ud[3];
ud[0] = dx / dist;
ud[1] = dy / dist;
ud[2] = dz / dist;

to

// compute the unit vector from j to i

double ud[3];
double dd = 1.0 / dist;
ud[0] = dx * dd;
ud[1] = dy * dd;
ud[2] = dz * dd; 

The optimized code consumes 4,668 million less clockticks and reduces the dispatch stalls from 7,448 million cycles to 3,020 million cycles.

fig2.png

Figure 2: Comparison of the original and optimized versions.

For more complete information about compiler optimizations, see our Optimization Notice.