March 4, 2009 11:00 PM PST
Increase the frequency with which integer data hits in the first-level data cache (L1D). This optimization is key to achieving good performance on the Intel® Itanium® processor in many memory-related situations.
Use the lfetch instruction properly. When integer loads hit in the L1D, the L1D cache filters requests to the L2: those loads are never seen by the L2, so many L2 conflicts are avoided. The fewer requests the L2 sees, the fewer requests conflict.
Note that lfetch instructions require careful use. Carelessly placing lfetch instructions may lower performance. Refer to Chapter 6 of the Intel® Itanium® Processor Reference Manual for details regarding the Itanium processor cache structures. The following guidelines were developed with regard to the memory subsystem:
- The maximum number of outstanding lfetch operations to L3 or memory, the sum of both data and instruction requests, may not exceed 16.
- lfetch instructions are restricted to memory ports M0 and M1, while FP loads (other than ldfpd or ldfps) can be issued on any of the four memory ports. Therefore, when mixing lfetch instructions with FP loads, schedule the lfetch instructions early in issue groups. For example, if two FP loads and an lfetch are to be scheduled in the same cycle, the lfetch should be scheduled in the first bundle so that it is issued on one of the first two memory ports. If the two FP loads are scheduled first, the hardware inserts an implicit stop before issuing the lfetch instruction.
- The Itanium processor lfetch.excl instruction brings data into the L2 cache in the M (Modified) state. The .excl completer should only be used when the data brought in by the lfetch will shortly be modified by store instructions.
- The Itanium processor lfetch instructions do not bring the data into the cache if a DTLB entry providing translation and protection information is not available. To ensure that the lfetch instruction completes a hardware page walker (HPW) walk and possibly generates a TLB translation or protection fault, use the .fault completer. Since these events may carry a high cost, the .fault completer should not be used for speculative addresses.
- lfetch instructions may have effects in the cache hierarchy that make their use high-cost. These effects include the following:
  - Acquiring L2 resources such as the L2 OzQ.
  - Arbitration for access to the L2 data arrays, thus becoming a candidate for an L2 bank conflict.
  - Recirculation of the lfetch in the case of a secondary L2 miss.
The cost of an L2 recirculate on a secondary L2 miss can be mitigated by placing .nt completers on the lfetch. The .nt hints keep the lfetch from causing an L1D fill and allow the lfetch to be removed from the L2 OzQ. However, the non-temporal completer is not strictly necessary: the L2 OzQ logic can recognize when any lfetch instruction is a secondary L2 miss and, in that case, does not perform an L1D fill, which prevents the lfetch from allocating in the L2 OzQ.
When an lfetch hits the L2, it takes L2 OzQ resources, causes other requests to cancel, and may get canceled itself, as if it actually reads the L2 data array, regardless of the .nt hint or any actual need to fill the L1D.
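The guidelines above can be sketched in C. This is a minimal illustration, not code from the reference manual. It uses the GCC/Clang `__builtin_prefetch` builtin as a stand-in for lfetch; how a given compiler lowers it to lfetch and its completers on Itanium is an assumption here. The helper names, the 128-byte line size, and the latency figures are hypothetical tuning values. The sketch clamps the prefetch distance to the 16-request limit quoted above, issues one prefetch per cache line rather than per element, never prefetches past the end of the array (the caution the text applies to speculative addresses), and requests write intent only when the data is about to be stored to.

```c
#include <stddef.h>
#include <stdint.h>

/* Limit quoted in the text: at most 16 outstanding lfetch requests to L3 or memory. */
enum { MAX_OUTSTANDING_PREFETCHES = 16 };

/* Assumed cache-line size in bytes (hypothetical; check the target's manual). */
enum { LINE_BYTES = 128 };

/*
 * Hypothetical helper: how many cache lines ahead to prefetch.
 * ceil(miss_latency / cycles_per_line) lines would cover the miss latency;
 * the result is clamped to [1, MAX_OUTSTANDING_PREFETCHES].
 */
static size_t prefetch_distance_lines(unsigned miss_latency_cycles,
                                      unsigned cycles_per_line)
{
    if (cycles_per_line == 0)
        return MAX_OUTSTANDING_PREFETCHES;
    size_t lines = (miss_latency_cycles + cycles_per_line - 1) / cycles_per_line;
    if (lines < 1)
        lines = 1;
    if (lines > MAX_OUTSTANDING_PREFETCHES)
        lines = MAX_OUTSTANDING_PREFETCHES;
    return lines;
}

/* Read-only streaming sum: prefetch with read intent and low temporal locality. */
static int64_t sum_with_prefetch(const int64_t *a, size_t n, size_t dist_lines)
{
    const size_t per_line = LINE_BYTES / sizeof a[0];
    const size_t ahead = dist_lines * per_line;
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        /* One prefetch per cache line, and only for in-bounds addresses. */
        if (i % per_line == 0 && i + ahead < n)
            __builtin_prefetch(&a[i + ahead], 0 /* read */, 0 /* non-temporal */);
        sum += a[i];
    }
    return sum;
}

/* In-place update: the data will shortly be stored to, so ask for write intent
 * (the role the text assigns to the .excl completer). */
static void scale_in_place(int64_t *a, size_t n, int64_t k, size_t dist_lines)
{
    const size_t per_line = LINE_BYTES / sizeof a[0];
    const size_t ahead = dist_lines * per_line;
    for (size_t i = 0; i < n; i++) {
        if (i % per_line == 0 && i + ahead < n)
            __builtin_prefetch(&a[i + ahead], 1 /* write */, 3 /* keep in cache */);
        a[i] *= k;
    }
}
```

Because the loop consumes each line before moving on, prefetching `dist_lines` ahead keeps the number of lines in flight at roughly that distance, which is why the helper clamps it. Prefetches are hints only, so both functions return the same results with or without them; whether they help should be confirmed by measurement on the target.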
Intel® Itanium® Processor Reference Manual