Instruction Latencies in Assembly Code for 64-Bit Intel® Architecture

March 4, 2009 8:00 PM PST



Challenge

Optimize assembly-language code for the Itanium® processor family with instruction latencies in mind. The latency of an instruction is the number of cycles from the time the instruction is issued until its result can be used by another instruction. For most simple integer math operations, such as "add r32=r33,r34", the latency is a single cycle, so the results of many operations can be used in the very next set of parallel instructions. This is generally not true for floating-point operations or loads from memory.


Solution

Organize instructions so that they do not have to wait for source registers to be "ready." Up to six instructions can dispatch in parallel on the Itanium® processor, but if any source operand of any of those six instructions is not yet available, all six instructions are held up until the outstanding latency has elapsed.
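The stall rule described above can be sketched as a small C model. This is only an illustration under simplifying assumptions, not a simulator: each instruction is reduced to the cycle at which its latest source operand becomes ready, and the name `group_dispatch_cycle` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Up to six instructions dispatch together (figure taken from the text). */
#define MAX_GROUP 6

/*
 * Returns the cycle at which a whole group can dispatch.
 * earliest_cycle: first cycle at which the group could otherwise issue.
 * operand_ready:  for each instruction, the cycle at which its slowest
 *                 source operand becomes available.
 * The group waits for the slowest operand of any of its members.
 */
int group_dispatch_cycle(int earliest_cycle, const int *operand_ready, size_t n)
{
    assert(n <= MAX_GROUP);
    int cycle = earliest_cycle;
    for (size_t i = 0; i < n; i++) {
        if (operand_ready[i] > cycle)
            cycle = operand_ready[i];
    }
    return cycle;
}
```

For example, if five instructions in a group have operands ready at cycle 1 but the sixth consumes a value that is not ready until cycle 7, the model reports that all six dispatch at cycle 7. Moving the late consumer into a later group would let the other five proceed.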

When the result of an operation is ready to be used on the very next cycle, it is said to exhibit one-cycle latency. In similar terms, the following table shows the latencies of some of the more important assembly-language instructions:

| Instruction Type | Instruction | Latency (cycles) |
|---|---|---|
| Floating Point | multiply-and-add (fma) | 5 |
| Floating Point | convert integer to fixed-point floating point (setf) | 9 |
| Floating Point | convert fixed-point floating-point to integer (getf) | 2 |
| Floating Point | fixed-point to/from floating-point conversion (fcvt) | 7 |
| Floating Point | fixed-point floating point multiply-and-add (xma) | 7 |
| Memory | Load integer from L1 cache | 2 |
| Memory | Load integer from L2 cache | 6 |
| Memory | Load integer from L3 cache | 22 |
| Memory | Load integer from main memory | ~200 |
| Memory | Load floating point from L2 cache | 9 |
| Memory | Load floating point from L3 cache | 24 |
| Integer | Compare latency to dependent branch | 0 |
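To see how these figures add up along a dependency chain, the following C sketch sums the latencies of operations that each consume the previous result. The latency values are copied from the table above; the enum and the function `chain_latency` are illustrative names, not part of any Intel tool.

```c
#include <assert.h>
#include <stddef.h>

/* Operation kinds; illustrative names only. */
enum op {
    OP_INT_ALU,    /* simple integer math such as add */
    OP_FMA,
    OP_SETF,
    OP_GETF,
    OP_FCVT,
    OP_XMA,
    OP_LD_INT_L1,
    OP_LD_INT_L2,
    OP_LD_FP_L2,
    OP_COUNT
};

/* Latency in cycles for each operation, taken from the table. */
static const int latency[OP_COUNT] = {
    [OP_INT_ALU]   = 1,
    [OP_FMA]       = 5,
    [OP_SETF]      = 9,
    [OP_GETF]      = 2,
    [OP_FCVT]      = 7,
    [OP_XMA]       = 7,
    [OP_LD_INT_L1] = 2,
    [OP_LD_INT_L2] = 6,
    [OP_LD_FP_L2]  = 9,
};

/* Sum the latencies of a chain in which every operation depends on
 * the result of the one before it. */
int chain_latency(const enum op *chain, size_t n)
{
    int total = 0;
    for (size_t i = 0; i < n; i++) {
        assert((int)chain[i] < OP_COUNT);
        total += latency[chain[i]];
    }
    return total;
}
```

Under this model, a chain of setf, xma, and getf costs 9 + 7 + 2 = 18 cycles before its result can be used, so a scheduler would want at least that much independent work to place alongside it.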

 

Up to this point, it has been largely assumed that most memory loads can be satisfied by the L1 (Level 1) data cache in two cycles. In practice, this is by no means guaranteed. It is important to keep in mind not only the latencies of the various levels of cache, but also their respective sizes, as shown here:

| Cache Level | Size | Integer Latency (cycles) | FP Latency (cycles) |
|---|---|---|---|
| L1 Instruction | 16KB | NA | NA |
| L1 Data | 16KB | 2 | NA |
| L2 | 96KB | 6 | 9 |
| L3 | 4MB | 22 | 24 |
| Main Memory | any | ~200 | ~200 |
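As a rough way to reason about these sizes, the C sketch below estimates the integer load latency for a repeatedly traversed working set, assuming it is served by the smallest cache level that can hold all of it. That assumption is deliberately crude: it ignores associativity, cache sharing with instructions, and access pattern. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Cache sizes copied from the table above. */
#define L1D_BYTES (16u * 1024u)
#define L2_BYTES  (96u * 1024u)
#define L3_BYTES  (4u * 1024u * 1024u)

/*
 * Crude estimate of integer load latency (in cycles) for a working set
 * of the given size. Assumes the data lives in the smallest level that
 * can contain the entire working set.
 */
int expected_int_load_latency(size_t working_set_bytes)
{
    assert(working_set_bytes > 0);
    if (working_set_bytes <= L1D_BYTES) return 2;
    if (working_set_bytes <= L2_BYTES)  return 6;
    if (working_set_bytes <= L3_BYTES)  return 22;
    return 200; /* "~200" in the table: main memory, approximate */
}
```

Under this model, growing a table from 16 KB to 17 KB triples the estimated load latency from 2 to 6 cycles, which is why the sizes matter as much as the latencies.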

 

At any given time, only 16 KB of the highest-speed data cache memory is available. It is organized in cache lines of 32 bytes each. It is generally useful to organize data so that a reference to one value in a cache line is accompanied by references to other data in the same cache line.

To the extent that data references are scattered randomly through memory, the cache is defeated: each reference is likely to land in a different cache line, so few loads are satisfied by L1.
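The effect of data layout on cache-line usage can be illustrated with a short C sketch that counts how many distinct 32-byte lines a regular access pattern touches. It assumes ascending offsets and elements that each fit within a single line; `lines_touched` is a hypothetical helper, not an existing API.

```c
#include <assert.h>
#include <stddef.h>

/* L1 data cache line size given in the text. */
#define LINE_BYTES 32u

/*
 * Count the distinct cache lines touched when reading `count` elements,
 * the first at byte offset `base` and each subsequent one `stride`
 * bytes further on. Because offsets only increase, accesses to the same
 * line are always adjacent, so comparing with the previous line suffices.
 */
size_t lines_touched(size_t base, size_t stride, size_t count)
{
    assert(stride > 0);
    size_t lines = 0;
    size_t prev_line = 0;
    for (size_t i = 0; i < count; i++) {
        size_t line = (base + i * stride) / LINE_BYTES;
        if (i == 0 || line != prev_line) {
            lines++;
            prev_line = line;
        }
    }
    return lines;
}
```

Sixty-four 4-byte integers stored contiguously touch 8 lines, whereas the same 64 values spaced 32 bytes apart touch 64 lines: eight times as many line fills for the same amount of useful data.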

Most compiled code assumes that integer values are loaded from L1 cache, and prefetching is sometimes used to help ensure that data will be in L1 when it is needed. However, if an integer load misses the L1 cache, it can take much longer to complete, as shown by the cache and memory latencies given above. The latencies shown are typical best-case values; actual latencies can be longer, depending on the mix of operations occurring on the processor and in the memory subsystems.
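A minimal C sketch of software prefetching follows. It relies on `__builtin_prefetch`, a GCC and Clang extension that acts only as a hint; how, or whether, a given compiler turns it into a prefetch instruction on Itanium is an assumption here. The distance `PREFETCH_AHEAD` is a made-up value for illustration and would need tuning against the latencies above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical prefetch distance in elements; not a recommendation. */
#define PREFETCH_AHEAD 16

/* Sum an array while hinting that data a few iterations ahead will be
 * needed soon, so that it has a chance to reach cache before use. */
long sum_with_prefetch(const int *data, size_t n)
{
    assert(data != NULL || n == 0);
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n) {
            /* Arguments: address, 0 = prefetch for read,
             * 3 = high temporal locality. This is a hint only. */
            __builtin_prefetch(&data[i + PREFETCH_AHEAD], 0, 3);
        }
        total += data[i];
    }
    return total;
}
```

The prefetch does not change the result, only (potentially) the time at which the data arrives. Issuing one hint per element is wasteful when several elements share a cache line, so a tuned version would prefetch once per line.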


Source

Recognizing Efficient Use of Caches in Code for the Itanium® Processor Family