Handle Streaming Data Optimally on 64-Bit Architecture

March 2, 2009 11:00 PM PST



Challenge

Handle long, high-bandwidth data streams optimally with the Intel® Itanium® processor. Proper use of the lfetch (line prefetch) instruction is vital to optimal handling of streaming data.


Solution

Choose the prefetch technique that matches the kind of stream a particular implementation handles: floating-point data streams, integer data streams, or store data streams. The following discussion outlines some of the benefits and costs of each of these techniques:

  • Floating-Point Data Streams: Floating-point data resides in the L2 cache. Here, the lfetch.fault.nt1 instruction should be issued only once per L2 cache line for the source, and the lfetch.fault.excl.nt1 instruction should be issued only once per L2 cache line for the destination. The .fault completer ensures that the data enters the cache hierarchy, even if the access results in an L2 DTLB miss or VHPT miss. The .nt1 completer ensures that the floating-point data will not displace data residing in the L1D. The .nt1 completer also allows an lfetch instruction that is a secondary L2 miss to avoid allocation in the L2 OzQ. This is important for situations where the design of the data-streaming code cannot avoid additional requests to an L2 line without performance loss. The .excl completer for the destination stream ensures that the data is ready to be modified.

When data is accessed as an L2 hit, care should be taken to avoid L2 bank conflicts among request groups. This is necessary to ensure that L2 five-cycle and seven-cycle bypasses are available. Latency is not generally a concern for floating-point code. In streaming situations, however, the lifetime of an operation in the L2 OzQ, coupled with the size of the OzQ, may cause core stalls when the L2 control logic considers the OzQ full. A lower latency means a shorter lifetime in the OzQ, so effectively more OzQ entries are available.
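As a rough illustration of the once-per-L2-line rule for floating-point streams, the following C sketch scales a stream of doubles and issues one source prefetch and one destination prefetch per line. It is not taken from the reference manual. The GCC/Clang builtin __builtin_prefetch stands in for lfetch, and the 128-byte L2 line size, the eight-line prefetch distance, and the function name scale_stream are assumptions made for the example. How the builtin's locality argument maps to a completer such as .nt1 depends on the compiler, and the builtin is documented as non-faulting, so it does not reproduce the .fault completer.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed L2 line size; the article only states that an L1D line is half an L2 line. */
#define L2_LINE_BYTES 128
#define DOUBLES_PER_L2_LINE (L2_LINE_BYTES / sizeof(double))
/* Hypothetical prefetch distance, in L2 lines; a real value needs tuning. */
#define PREFETCH_LINES_AHEAD 8

/*
 * dst[i] = scale * src[i], with one read prefetch (source) and one write
 * prefetch (destination) per L2 line. "Once per line" is exact only if both
 * arrays start on an L2-line boundary.
 */
void scale_stream(double *dst, const double *src, size_t n, double scale)
{
    assert(dst != NULL && src != NULL);
    const size_t ahead = PREFETCH_LINES_AHEAD * DOUBLES_PER_L2_LINE;

    for (size_t i = 0; i < n; i++) {
        /* Issue prefetches only on the first element of each line, and only
           while the prefetched address is still inside the arrays. */
        if (i % DOUBLES_PER_L2_LINE == 0 && i + ahead < n) {
            __builtin_prefetch(&src[i + ahead], 0, 2); /* read stream */
            __builtin_prefetch(&dst[i + ahead], 1, 2); /* write stream, in the spirit of .excl */
        }
        dst[i] = scale * src[i];
    }
}
```

Guarding the prefetch with the line-boundary test is what keeps the request count at one per line per stream; issuing it on every element would create the repeated requests to the same L2 line that the text warns about.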

  • Integer Data Streams: Integer data streams are more complicated than floating-point streams, because in some instances, getting the data into the L1D will be important for performance. Streaming from the L1D presents several problems. First, every load requires integer register return resources, whether it hits or misses the L1D. This makes it difficult for L1D misses to return data to the register file without impacting the flow of new L1D misses. Second, each fill operation takes an additional cycle to complete. Third, the need to fill the L1D eliminates an opportunity for the L2 OzQ to remove secondary L2 miss lfetch instructions. This is significant because the L1D line size is half of the L2 line size, so one lfetch per L1D line results in at least one secondary L2 miss access for every L2 line, thus limiting L2 OzQ throughput.
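The line-size arithmetic behind the third point can be checked with a small counting model. This is only a sketch under stated assumptions: 64-byte L1D lines and 128-byte L2 lines, a stream that starts on an L2-line boundary, and every repeated request to an L2 line arriving while the first is still outstanding (so it counts as a secondary miss). The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define L1D_LINE_BYTES 64   /* assumed */
#define L2_LINE_BYTES  128  /* assumed; twice the L1D line, as the text states */

struct l2_requests {
    size_t primary;    /* first request to an L2 line */
    size_t secondary;  /* further requests to a line already requested */
};

/* Count L2 requests when one prefetch is issued every prefetch_stride bytes. */
struct l2_requests count_l2_requests(size_t stream_bytes, size_t prefetch_stride)
{
    assert(prefetch_stride > 0 && prefetch_stride <= L2_LINE_BYTES);
    struct l2_requests r = {0, 0};
    size_t last_line = (size_t)-1;  /* no line requested yet */

    for (size_t off = 0; off < stream_bytes; off += prefetch_stride) {
        size_t line = off / L2_LINE_BYTES;
        if (line != last_line) {
            r.primary++;
            last_line = line;
        } else {
            r.secondary++;
        }
    }
    return r;
}
```

Under this model, a 1024-byte stream prefetched once per L1D line produces eight primary and eight secondary requests, while prefetching once per L2 line produces eight primary requests and no secondary ones.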

One approach would be to use three separate lfetch instructions. An lfetch.fault.nt1 would bring the data into the L2. Later, when the data is in the L2, lfetch.fault instructions can hit in the L2 cache and bring the data into the L1D. This makes the lfetch instructions asymmetric and requires several load memory slots.

An optimization to the three-lfetch approach above would use only two separate lfetch.fault instructions, but stage them such that the first brings the data into both the L2 and the L1D. Then, once the L2 has been filled by the first request, the second lfetch can bring the data into the L1D without being a secondary L2 miss (the L2 is filled, so the lfetch is an L2 hit). This frees an additional load memory slot and makes the lfetch instructions reusable.
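One possible way to picture the two-lfetch staging is a loop that handles one L1D line per iteration and issues exactly one prefetch per iteration: even iterations request the first half of an L2 line far ahead, and odd iterations request the second half of an L2 line whose first half was requested several iterations earlier. The sketch below is an interpretation, not the manual's code. __builtin_prefetch (GCC/Clang) stands in for lfetch.fault, and the 64-byte L1D line, the two distances, and the function names are assumptions. Both distances must be even, with the far distance larger than the near one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define L1D_LINE_BYTES 64  /* assumed; two L1D lines per L2 line */
#define FAR_LINES  16      /* hypothetical distance for the first half of an L2 line */
#define NEAR_LINES 8       /* hypothetical distance for the second half */

/*
 * L1D-line index to prefetch during iteration j.
 * Even j: j + FAR_LINES is even, i.e. the first half of an L2 line (expected L2 miss).
 * Odd j:  j + NEAR_LINES is odd, i.e. the second half of an L2 line whose first
 *         half was requested FAR_LINES - NEAR_LINES + 1 iterations earlier.
 */
size_t staged_prefetch_line(size_t j)
{
    return (j % 2 == 0) ? j + FAR_LINES : j + NEAR_LINES;
}

/* Sum an integer stream of n_lines L1D lines, one staged prefetch per iteration. */
uint64_t sum_stream(const uint64_t *src, size_t n_lines)
{
    assert(src != NULL);
    const size_t per_line = L1D_LINE_BYTES / sizeof(uint64_t);
    uint64_t total = 0;

    for (size_t j = 0; j < n_lines; j++) {
        size_t target = staged_prefetch_line(j);
        if (target < n_lines)  /* stay inside the buffer */
            __builtin_prefetch(&src[target * per_line], 0, 3);
        for (size_t k = 0; k < per_line; k++)
            total += src[j * per_line + k];
    }
    return total;  /* warm-up prefetches for the first lines are omitted */
}
```

With these example distances, L1D line 16 is requested at iteration 0 and its partner, line 17, at iteration 9. Whether nine iterations is long enough for the L2 fill to complete depends on the loop body and memory latency, so the distances would need tuning on real hardware.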

An outstanding L1D fill may be invalidated by a store to the same line. Using lfetch instructions for even small data streams can result in a significant performance increase provided the lfetch fills the L1D before the store to the line is seen.

Further, since loads that hit in the L1D never allocate into the L2 OzQ, using lfetch instructions to ensure an L1D hit may also help performance by limiting the L2 OzQ to store data and lfetch requests. This relieves pressure on the limited OzQ resources and reduces the possibility of conflicts among OzQ entries.

  • Store Data Streams: Since store instructions are always seen by the L2, there is no benefit to bringing store-destination data into the L1D. There are, however, several benefits to using the lfetch.fault.excl.nt1 instruction for destination streams. For instance, the .nt1 hint allows secondary L2 misses to be removed, and the core is not slowed by L1D fills. The .excl hint ensures that the L2 data is ready to receive the store data.

 


Source

Intel® Itanium® Processor Reference Manual