Minimize inefficiencies due to functional-unit latency stalls. Processors insert functional-unit latency stalls to ensure that results are computed correctly. In a chain of instructions, the output of one instruction may be used as the input of another. If too few instructions (cycles) separate the two to absorb the latency of generating the data, the second operation must stall until the data is ready. This is called a “scoreboarded” stall.
Many of the integer operations on the Itanium® processor have single-cycle latencies. Because a “generating” instruction and a “consuming” instruction must already be separated by one cycle to avoid a “read after write” dependency violation, a one-cycle instruction never causes a functional-unit stall. The most likely sources of these stall cycles are Multi-Media (MM) integer instructions and floating-point operations.
For integer instructions, functional-unit stalls are measured explicitly by the event BE_EXE_Bubble.GRGR, which counts a subset of the cycles accumulated by BE_EXE_Bubble.GRALL, so identifying this source of execution inefficiency is straightforward. It is most likely to appear in a chain of MM instructions coded with intrinsics. Interleave other instructions between the dependent ones to allow the latencies to be absorbed.
Floating-point-intensive applications are more likely to encounter functional-unit stalls, because the basic floating-point instructions used to build up complex calculations typically have longer latencies. Functional-unit stalls are also harder to identify uniquely in floating-point calculations than in integer code.
While BE_EXE_Bubble.FRALL counts both floating-point memory-access stalls and functional-unit latency stalls, there is no corresponding .FRFR subevent to distinguish them. To verify what is happening, look at the disassembly listing of the code in the source view displayed by the Intel® VTune™ Performance Analyzer.
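As a source-level illustration of what a latency-bound dependency chain looks like, and of how interleaving independent work can absorb the latency, the following C sketch sums an array two ways. The function names and the choice of four accumulators are assumptions made for illustration and do not come from this paper. The appropriate count depends on the actual add latency, and a compiler that software-pipelines the loop may already perform a similar transformation. Splitting the sum also reassociates the additions, so rounding can differ slightly from the chained version.

```c
#include <stddef.h>

/* Chained version: every add consumes the result of the previous add,
   so each iteration must wait for the floating-point add latency. */
double sum_chained(const double *x, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += x[i];
    return s;
}

/* Interleaved version (illustrative): four independent accumulators.
   The adds within one iteration do not depend on each other, so they
   can be issued while earlier results are still being produced.
   The count of four is an assumption, not a tuned value. */
double sum_interleaved(const double *x, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; ++i)      /* leftover elements when n is not a multiple of 4 */
        s0 += x[i];

    return (s0 + s1) + (s2 + s3);
}
```

In a disassembly listing, the chained loop would typically show each floating-point add reading the register written by the add immediately before it, whereas the interleaved loop shows adds to distinct registers between each write and its next use.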
If, after analyzing the code, you find that functional-unit latency is the problem, reorganize the calculations to absorb the latency. This means reducing the number of divisions and square roots, which are particularly long-latency operations, by collecting intermediate results and regrouping mathematical expressions. Lookup tables with interpolation are probably the best way to improve performance in such a case, if the required accuracy can be maintained. This strategy can yield large performance improvements beyond the reduction in stall cycles, but it trades space for speed.
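The two remedies named above, regrouping an expression to remove a division and replacing a long-latency operation with a table lookup plus interpolation, might look like the following C sketch. All names here (`ratio_sum_regrouped`, `recip_interp`, `RECIP_TABLE_SIZE`) are hypothetical, as are the table size and the restriction of the input to the range [1, 2]. Whether either change is a net gain on a given processor has to be measured. The regrouped form rounds differently and can overflow or underflow in the product `b * d`, and the table version is only approximate.

```c
#include <stddef.h>

/* Original form: two divisions. */
double ratio_sum_naive(double a, double b, double c, double d)
{
    return a / b + c / d;
}

/* Regrouped form: a/b + c/d == (a*d + c*b) / (b*d), one division.
   Assumes b*d neither overflows nor underflows. */
double ratio_sum_regrouped(double a, double b, double c, double d)
{
    return (a * d + c * b) / (b * d);
}

/* Lookup table for 1/x on [1, 2] with linear interpolation.
   The size is an illustrative assumption: a larger table costs more
   memory but reduces the interpolation error. */
#define RECIP_TABLE_SIZE 257

static double recip_table[RECIP_TABLE_SIZE];

/* Fill the table once, outside the performance-critical loop. */
void recip_table_init(void)
{
    for (size_t i = 0; i < RECIP_TABLE_SIZE; ++i)
        recip_table[i] = 1.0 / (1.0 + (double)i / (RECIP_TABLE_SIZE - 1));
}

/* Approximate 1/x for x in [1, 2]; the caller must keep x in range. */
double recip_interp(double x)
{
    double pos = (x - 1.0) * (RECIP_TABLE_SIZE - 1);
    size_t i = (size_t)pos;

    if (i >= RECIP_TABLE_SIZE - 1)   /* x == 2.0 maps to the last interval */
        i = RECIP_TABLE_SIZE - 2;

    double frac = pos - (double)i;
    return recip_table[i] + frac * (recip_table[i + 1] - recip_table[i]);
}
```

With 257 entries over [1, 2], the linear-interpolation error for 1/x stays below roughly 4e-6, which illustrates the accuracy condition: if the calculation needs more precision than that, the table must grow, and the space-for-speed trade-off shifts accordingly.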
Introduction to Microarchitectural Optimization for Itanium® Processors