# Identify Long Latency Instruction Impacts

By Levent (Intel), published on June 15, 2012

Long-latency instructions such as division and square root can introduce stalls during the execution of an application. The Intel® VTune™ Amplifier XE performance profiler can help software developers analyze their application to identify algorithmic and microarchitectural performance issues. Intel VTune Amplifier XE uses the processor's Performance Monitoring Unit (PMU) to sample processor events and can be used to statistically sample the number of computational operations.

Intel VTune Amplifier XE can help identify where such operations are taking place and if these operations are contributing to stall cycles during the execution.

| Processor family | Event | Description |
| --- | --- | --- |
| Intel® Core™2 processor family (Intel® Core™2 Duo/Quad, etc.) | DIV | Counts the number of divide operations executed, including integer divides, floating-point divides, and square-root operations. |
| | CYCLES_DIV_BUSY | Counts the number of cycles the divider is busy executing divide or square-root operations. The divide can be integer, x87, or Streaming SIMD Extensions (SSE); the square root can be x87 or SSE. |
| Intel® Core™ microarchitecture (Intel® Core™ i7, i5, i3; a.k.a. Nehalem) | ARITH.DIV | Counts the number of divide or square-root operations. The divide can be integer, x87, or SSE; the square root can be x87 or SSE. |
| | ARITH.CYCLES_DIV_BUSY | Counts the number of cycles the divider is busy executing divide or square-root operations. The divide can be integer, x87, or SSE; the square root can be x87 or SSE. |
| 2nd Generation Intel® Core™ microarchitecture (a.k.a. Sandy Bridge) | ARITH.FP_DIV | Counts the number of divide operations executed. |
| | ARITH.FPU_DIV_ACTIVE | Counts the cycles the divider is busy executing divide or square-root operations. A constant 6 additional cycles should be added per operation. |

**Table 1:** PMU events used to count divide and square-root activity.

**Example:** Let's consider the N-body problem for this exercise. The *N-body* problem predicts the motion of a group of celestial objects that interact with each other gravitationally. The sample application proceeds over time steps; in each step it computes the net force on every body and updates its position, acceleration, and velocity accordingly. This implementation requires *O(N²)* operations in each iteration.

```cpp
runSerialBodies() {
    ...
    // Run the simulation over a fixed range of time steps
    for (double s = 0.; s < STEPLIMIT; s += TIMESTEP) {
        // Compute the accelerations of the bodies
        for (i = 0; i < n - 1; ++i) {
            for (j = i + 1; j < n; ++j) {
                // compute the distance between them
                double dx = body[i].pos[0] - body[j].pos[0];
                double dy = body[i].pos[1] - body[j].pos[1];
                double dz = body[i].pos[2] - body[j].pos[2];
                double distsq = dx*dx + dy*dy + dz*dz;
                if (distsq < MINDIST) distsq = MINDIST;
                double dist = sqrt(distsq);
                // compute the unit vector from j to i
                double ud[3];
                ud[0] = dx / dist;
                ud[1] = dy / dist;
                ud[2] = dz / dist;
                // F = G*mi*mj/distsq, but F = ma, so ai = G*mj/distsq
                double Gdivd = GFORCE / distsq;
                double ai = Gdivd * body[j].mass;
                double aj = Gdivd * body[i].mass;
                // apply acceleration components using unit vectors
                for (int k = 0; k < 3; ++k) {
                    body[j].acc[k] += aj * ud[k];
                    body[i].acc[k] -= ai * ud[k];
                }
            }
        }
        // Apply acceleration and advance bodies
        for (i = 0; i < n; ++i) {
            for (j = 0; j < 3; ++j) {
                body[i].vel[j] += body[i].acc[j] * TIMESTEP;
                body[i].pos[j] += body[i].vel[j] * TIMESTEP;
                body[i].acc[j] = 0.;
            }
        }
    }
    ...
}
```

Analyzing the sample code on an Intel® Core™ i7 (X980) based system (3.33 GHz, 6 cores, Intel® Hyper-Threading Technology enabled) with Intel VTune Amplifier XE reveals the following:

**Figure 1:** Analysis of the sample code: 85% of the clockticks, 89.3% of the uops dispatched, and 83.4% of the dispatch stalls occur in this code segment. See the Intel VTune Amplifier XE help for more information on events such as UOPS_DISPATCHED.

One way to optimize the code is to replace the divisions with reciprocal multiplication, changing

```cpp
// compute the unit vector from j to i
double ud[3];
ud[0] = dx / dist;
ud[1] = dy / dist;
ud[2] = dz / dist;
```

to

```cpp
// compute the unit vector from j to i
double ud[3];
double dd = 1.0 / dist;
ud[0] = dx * dd;
ud[1] = dy * dd;
ud[2] = dz * dd;
```

The optimized code consumes 4,668 million fewer clockticks and reduces the dispatch stalls from 7,448 million cycles to 3,020 million cycles.

**Figure 2:** Comparison of the original and optimized versions.