Last Modified On: October 3, 2008 4:55 PM PDT
By Walt Triebel and Joe Bissell
The Itanium™ processor achieves high performance by employing parallelism: it issues multiple instructions for execution at the same time. The architecture is equipped with nine independent execution units, so that it can ideally disperse six instructions for execution every clock cycle. However, depending on the instruction mix and how well it matches the execution units, the Itanium processor's instruction dispersal mechanism allows a varying number of instructions to be issued each clock cycle. This versatile instruction dispersal capability, along with a number of other advanced architectural capabilities, enables the Itanium processor to sustain a high ratio of instructions per clock. This article examines the processor's execution resources, how instruction types map to those resources, how bundles are dispersed, and how the Intel Itanium 2 processor extends dispersal.
Figure 1 identifies the execution resources available in the Itanium processor: two memory units (M0 and M1), two integer units (I0 and I1), two floating-point units (F0 and F1), and three branch units (B0, B1, and B2).
Each of the Itanium processor's execution units is designed to efficiently perform a specific type of operation: memory access, integer computation, floating-point computation, or branching (control flow). The Itanium processor can issue six instructions each clock, but the types of instruction must match these resources.
Figure 1. Full utilization of execution resources
For the Itanium processor, the compiler determines how instructions should be ordered for best execution. As part of the compilation process, the program operation may be profiled to gather information about how it runs. This profile information is used to create efficient code for parallel execution on the Itanium processor. As part of this process, the compiler may reorder instructions and apply advanced architectural capabilities—such as predication, speculation, and software pipelining—to produce a code stream that more effectively employs the execution resources of the Itanium architecture.
Table 1 relates the mapping of Itanium instruction types (A, I, M, F, B, and L+X) to the available execution units. The instruction types are reasonably self-explanatory, with the exception of L+X. This is a special instruction type that permits coding of a 64-bit immediate operand.
Table 1. Relationship Between Instruction Type and Execution Unit Type
Analysis of the frequency of instruction use in SPECint benchmark programs has shown that as many as 40% of the instructions in the individual suite programs are classified as Itanium integer types (compare, add, subtract), about 35% as Itanium memory types (load/store), and about 20% as Itanium branch types (branch). To simplify the compiler's instruction scheduling and improve performance, the Itanium architecture allows integer ALU (A-type) instructions to execute on either I or M units. Therefore, four of the nine execution resources can perform integer operations.
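The type-to-unit mapping can be sketched as a small lookup table. This is a minimal Python model for illustration, not Intel's implementation: the dictionary and function names are hypothetical, and treating L+X as occupying an I unit is an assumption of the sketch.

```python
# Simplified model of the Itanium instruction-type -> execution-unit mapping.
# Names and the L+X entry are assumptions made for this sketch.
ITANIUM_UNITS = {
    "M": ["M0", "M1"],
    "I": ["I0", "I1"],
    "F": ["F0", "F1"],
    "B": ["B0", "B1", "B2"],
}

# Which kinds of unit can execute each instruction type.
TYPE_TO_UNIT_KINDS = {
    "A": ["I", "M"],   # integer ALU: either an I unit or an M unit
    "I": ["I"],        # non-ALU integer (for example extr): I units only
    "M": ["M"],
    "F": ["F"],
    "B": ["B"],
    "L+X": ["I"],      # assumption: the long-immediate form uses an I unit
}

def units_for(instr_type, units=ITANIUM_UNITS):
    """Return every unit that can execute an instruction of instr_type."""
    return [u for kind in TYPE_TO_UNIT_KINDS[instr_type] for u in units[kind]]

total_units = sum(len(names) for names in ITANIUM_UNITS.values())
integer_capable = units_for("A")

print(total_units)       # 9
print(integer_capable)   # ['I0', 'I1', 'M0', 'M1']
```

Counting the units returned for type A gives the four integer-capable resources out of nine mentioned in the text.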
The Itanium compiler bundles instructions into groups of three for the Itanium processor to execute. Because the Itanium processor can disperse six instructions at one time, the dispersal mechanism always has access to two bundles of instructions. Figure 1 shows two bundles of instructions in position to be dispersed for execution, and two more bundles entering from the I-cache. To achieve efficient execution, the instruction sequence in these bundles must match the available execution units. The first bundle is identified as an MFI template, which means that it contains a memory or integer instruction in the first slot, a floating-point instruction in the second slot, and an integer instruction in the third slot (for further details on instruction templates, bundle organization, and identifying bundles in Itanium code, see the book Programming Itanium-based Systems).
The instruction dispersal diagram of Figure 1 demonstrates ideal matching of instructions to the execution resources. Notice that the MFI and MIB bundles in the Itanium processor's dispersal window match the available execution units. Here is an example instruction sequence coded with these templates.
{ .mfi //two bundles dispersed in one clock cycle
ld4 r34=[r5] // Maps to M0, cycle 1
fadd f39=f37, f6 // Maps to F0, cycle 1
extr r36= r37,5,1 // Maps to I0, cycle 1
} { .mib
ld4 r40=[r45] // Maps to M1, cycle 1
add r39=r37, r6 // Maps to I1, cycle 1
br.sptk.many b0 ;; // Maps to B2, cycle 1
}
Since both bundles are filled with instructions that have no dependencies, six instructions are assigned to execute in the current clock cycle. Dispersing six instructions in a single clock is referred to as a full issue.
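Whether a bundle pair can reach full issue can be approximated by counting free units. The following Python sketch is a deliberately simplified model: it assumes each slot is served by a unit of the same kind as the slot, and it ignores the processor's actual port-assignment rules. The names `UNITS` and `disperse` are hypothetical.

```python
# Free execution units of each kind on the Itanium processor (nine in total).
UNITS = {"M": 2, "I": 2, "F": 2, "B": 3}

def disperse(templates, units=UNITS):
    """Return how many slots issue before some unit kind runs out.

    templates: template strings for the bundles in the dispersal window,
    for example ["MFI", "MIB"]. Assumption: each slot needs one free unit
    of the same kind as the slot.
    """
    free = dict(units)
    issued = 0
    for template in templates:
        for slot in template:
            if free[slot] == 0:
                return issued      # split issue: no free unit for this slot
            free[slot] -= 1
            issued += 1
    return issued

print(disperse(["MFI", "MIB"]))   # 6
print(disperse(["MMI", "MMI"]))   # 3
```

Under these assumptions the MFI and MIB pair uses two M units, two I units, one F unit, and one B unit, so all six slots issue, while a pair of MMI bundles runs out of M units after the first bundle.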
Even though the Itanium processor dispersal window is six instructions wide, it may not issue six instructions every clock cycle. For instance, the set of Itanium bundle templates does not cover every combination of six instruction types. The compiler applies upward and downward code motion in an effort to rearrange the instructions to match available templates. However, in some cases, a match is just not possible. The architecture's solution to this issue is for the compiler to select an appropriate template and form a full bundle by filling the unused instruction slot(s) with nops (nop.i, nop.m, nop.f). The instruction sequence that follows employs a nop.i in the MIB bundle.
{ .mfi // six slots dispersed in one clock cycle, one of them a nop
ld4 r34=[r5] // Maps to M0, cycle 1
fadd f39=f37, f6 // Maps to F0, cycle 1
extr r36= r37,5,1 // Maps to I0, cycle 1
} { .mib
ld4 r40=[r45] // Maps to M1, cycle 1
nop.i // Maps to I1, cycle 1, but wastes a slot
br.sptk.many b0 ;; // Maps to B2, cycle 1
}
These two bundles disperse five useful instructions. This partial issue is less than optimal, but still represents relatively efficient use of Itanium processor resources.
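The nop-filling step can be sketched in a few lines of Python. This is an illustrative model only: `FITS` and `fill_template` are hypothetical names, and the rule that an A-type instruction fits either an I or an M slot is the only placement flexibility the sketch models.

```python
# Slot kinds that may hold each instruction type (A-type fits I or M slots).
FITS = {"A": "IM", "I": "I", "M": "M", "F": "F", "B": "B"}

def fill_template(template, ops):
    """Place ops, in order, into the template's slots; pad gaps with nops.

    template: string such as "MIB"; ops: list of (instr_type, text) pairs.
    Returns the bundle as a list of instruction strings, or None if some
    instruction could not be placed in this template.
    """
    pending = list(ops)
    bundle = []
    for slot in template:
        if pending and slot in FITS[pending[0][0]]:
            bundle.append(pending.pop(0)[1])
        else:
            bundle.append("nop." + slot.lower())
    return None if pending else bundle

ops = [("M", "ld4 r40=[r45]"), ("B", "br.sptk.many b0")]
bundle = fill_template("MIB", ops)
useful = sum(1 for text in bundle if not text.startswith("nop."))

print(bundle)   # ['ld4 r40=[r45]', 'nop.i', 'br.sptk.many b0']
print(useful)   # 2
```

With no integer instruction available for the middle slot, the sketch pads the MIB bundle with a nop.i, which mirrors the second bundle of the listing above.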
Dependencies between the instructions in a program also limit the compiler's ability to structure code for parallel execution. For example, a data dependency exists between an instruction that accesses a register or memory location and another instruction that alters the same register or storage location. Interchanging the order of these two instructions or executing them at the same time may produce incorrect results.
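The data dependencies that affect code scheduling by the Itanium compiler are instruction sequences that perform a read after write (RAW), a write after write (WAW), or a write after read (WAR) on the same register or memory location.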
(For further details on dependencies and templates with stops, see the book Programming Itanium-based Systems.)
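A register data dependency between two instructions can be detected by comparing the registers each one reads and writes. The sketch below is a minimal Python illustration using the conventional names read after write (RAW), write after read (WAR), and write after write (WAW); the function name and the instruction pair are hypothetical.

```python
def classify_dependency(first, second):
    """Classify register dependencies from `first` to a later `second`.

    Each instruction is a dict with 'reads' and 'writes' sets of register
    names. Returns a list drawn from 'RAW', 'WAR', 'WAW' (empty if the two
    instructions are independent).
    """
    kinds = []
    if first["writes"] & second["reads"]:
        kinds.append("RAW")   # second reads a value that first produces
    if first["reads"] & second["writes"]:
        kinds.append("WAR")   # second overwrites a value that first still needs
    if first["writes"] & second["writes"]:
        kinds.append("WAW")   # both write the same register
    return kinds

# Hypothetical pair: add r39=r37,r6 followed by sub r40=r39,r5
add = {"reads": {"r37", "r6"}, "writes": {"r39"}}
sub = {"reads": {"r39", "r5"}, "writes": {"r40"}}

print(classify_dependency(add, sub))   # ['RAW']
```

This comparison only works for registers. Memory references made through different address registers cannot be compared by name, which is why possible aliasing has to be treated conservatively.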
The following shows a fragment of code that contains an aliasing condition:
{ .mii // split issue dispersal
sub r31=r33,r32 // Maps to M0, cycle 1
extr r36=r37,5,1 // extr maps only to I0, cycle 1
add r39=r37,r38 // Maps to I1, cycle 1
} { .m_mi
ld8 r2=[r3] ;; // Maps to M1, cycle 1; the stop marks the possible alias
st8 [r4]=r5 // Maps to M0, cycle 2
add r6=r6,r7 // Maps to I0, cycle 2
} { .mib
sub r45=r44,r46 // Maps to M1, cycle 2
add r47=r48,r8 // Maps to I1, cycle 2
br.sptk.many b1 // Maps to B1, cycle 2
}
The second bundle (.m_mi) involves a load instruction and a store instruction that reference memory indirectly through registers. The compiler cannot guarantee that registers r3 and r4 do not contain the same address. For this reason, it does not move the store before the load or execute them at the same time. The compiler marks the dependency between these instructions with ;; after the first instruction (ld8).
Since bundle two has a dependency between the two memory access instructions, the bundle is coded with another variant of the MMI template (.m_mi). The underscore between the two memory instruction slots represents a stop and causes the instructions to issue with a split dispersal.
The split dispersal in Figure 2 shows how the Itanium architecture handles data dependencies. In the dispersal window of the upper diagram, the stop in the second bundle falls between the memory slots of the ld8 and st8 instructions. Therefore, only the four instructions before the stop are dispersed to execution units during the first clock cycle; this is the split issue. The lower diagram shows what happens during the next clock cycle: the original bundle 2 is rotated forward in the dispersal window into the previous position of bundle 1; the next bundle from the instruction stream fills in as bundle 2; and the remaining unexecuted instructions of the new bundle 1 are dispersed together with those of the new bundle 2, provided they are independent of them.
Figure 2. Split issue instruction dispersal due to data dependency.
This sequence of bundles demonstrates that, in spite of the data dependency, the Itanium processor's execution resources are still employed at 75% average utilization: nine instructions issue during the two cycles, out of a possible twelve.
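The split issue and the 75% figure can be reproduced with a small Python model of the dispersal window. The rules coded here are simplifying assumptions of the sketch: instructions issue in order, issue ends at a stop or at the end of the two-bundle window, and execution-unit conflicts are not modelled. All names are hypothetical.

```python
WINDOW_BUNDLES = 2      # the dispersal window holds two bundles
SLOTS_PER_BUNDLE = 3

def issue_cycles(bundles):
    """Return the number of instructions issued in each clock cycle.

    bundles: list of bundles; each bundle is a list of (text, stop_after)
    pairs, where stop_after is True if a stop (;;) follows the instruction.
    """
    counts = []
    bundle_idx, slot_idx = 0, 0          # next instruction to issue
    while bundle_idx < len(bundles):
        issued = 0
        window_end = min(bundle_idx + WINDOW_BUNDLES, len(bundles))
        while bundle_idx < window_end:
            _text, stop_after = bundles[bundle_idx][slot_idx]
            issued += 1
            slot_idx += 1
            if slot_idx == len(bundles[bundle_idx]):
                bundle_idx, slot_idx = bundle_idx + 1, 0   # bundle retired
            if stop_after:
                break                    # later instructions wait a cycle
        counts.append(issued)
    return counts

# The three-bundle fragment above, reduced to mnemonics and stop positions.
stream = [
    [("sub", False), ("extr", False), ("add", False)],   # .mii
    [("ld8", True), ("st8", False), ("add", False)],     # .m_mi
    [("sub", False), ("add", False), ("br", False)],     # .mib
]

counts = issue_cycles(stream)
utilization = sum(counts) / (len(counts) * WINDOW_BUNDLES * SLOTS_PER_BUNDLE)

print(counts)        # [4, 5]
print(utilization)   # 0.75
```

Four instructions issue up to the stop in the first cycle and the remaining five in the second, giving nine of twelve possible slots.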
Scalability is another important asset of the Itanium architecture. Future Itanium family microprocessors will be more flexible about what they can do in parallel and how many things can be done in parallel. The second generation Itanium processor, the Intel Itanium 2 processor, has additional execution units, and their availability permits more instruction combinations to be executed in parallel without causing a split issue. In the Intel Itanium 2 processor, the execution environment is expanded with two additional memory units. Four memory units (M0, M1, M2, and M3) provide more memory access and more ability to perform integer operations—6 of the 11 execution units are available to execute integer instructions.
What does this mean to instruction dispersal? The Intel Itanium 2 processor still disperses a maximum of six instructions per clock cycle, but the compiler has greater choice in matching instructions to the execution units. Figure 3 is a matrix comparing the possible full dispersal bundle-pair configurations supported by the Itanium processor and the Intel Itanium 2 processor.
Figure 3. The Intel Itanium 2 processor instruction dispersal matrix
The templates down the left column represent the first bundle in the dispersal window and those across the top represent the second bundle. Bundle configurations dispersed by either the Itanium processor or the Intel Itanium 2 processor are highlighted in orange; those that are permitted only for the Intel Itanium 2 processor are in green. The availability of six memory-integer execution resources more than doubles the bundle combinations that represent full dispersal for the Intel Itanium 2 processor. Code generated for the Intel Itanium 2 processor has fewer nops, fewer split dispersals, and a boost in performance.
The Itanium processor does not support full dispersal of the MII-MII code sequence that follows:
{ .mii // on the Intel Itanium 2 processor, both bundles disperse in one clock
alloc r33=ar.pfs,1,9,2,0 // cycle 1
mov r32=b0 // cycle 1
add r41=41,r40 // cycle 1
} { .mii
add r42=0x0,r0 // cycle 1
add r38=r37,r39 // cycle 1
add r36=r35,r34 // cycle 1
}
On the Itanium processor, the second bundle must instead be compiled with an alternate template and padded with nops, so that just four instructions issue during a clock cycle. The dispersal matrix shows that the Intel Itanium 2 processor supports full dispersal for this bundle-pair combination. Figure 4 illustrates how the Intel Itanium 2 processor disperses the individual instructions for execution in the same clock cycle.
Figure 4. The Intel Itanium 2 processor full dispersal.
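The effect of the two extra memory units on the MII-MII pair can be approximated by counting units. This Python sketch is a rough model, not the processors' documented dispersal rules: it assumes that an A-type instruction may take any free I or M unit and that instructions issue strictly in order. The names are hypothetical.

```python
ITANIUM = {"M": 2, "I": 2, "F": 2, "B": 3}     # 9 execution units
ITANIUM_2 = {"M": 4, "I": 2, "F": 2, "B": 3}   # 11 execution units

# Unit kinds each instruction type may use, in order of preference.
# Assumption of this sketch: A-type may take any free I or M unit.
CAN_USE = {"A": "IM", "I": "I", "M": "M", "F": "F", "B": "B"}

def issued_in_one_clock(instr_types, units):
    """Count the leading instructions that each find a free, suitable unit."""
    free = dict(units)
    issued = 0
    for instr_type in instr_types[:6]:          # dispersal window: six slots
        kind = next((k for k in CAN_USE[instr_type] if free[k] > 0), None)
        if kind is None:
            break                               # remaining instructions wait
        free[kind] -= 1
        issued += 1
    return issued

# alloc (M-type), mov from b0 (I-type), then four adds (A-type)
mii_mii = ["M", "I", "A", "A", "A", "A"]

print(issued_in_one_clock(mii_mii, ITANIUM))     # 4
print(issued_in_one_clock(mii_mii, ITANIUM_2))   # 6
print(ITANIUM_2["M"] + ITANIUM_2["I"])           # 6 integer-capable units
```

Under these assumptions the original processor runs out of integer-capable units after four instructions, while the six memory and integer units of the Intel Itanium 2 processor accept all six.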
Code compiled for the original Itanium processor executes on future family members without recompiling. However, recompiling makes better use of the additional architectural resources. The result is higher performance; in fact, the estimate is that systems running the Intel Itanium 2 processor should deliver a 1.5X to 2X performance improvement.
Hammond, Gary and Sam Naffziger. Next Generation Itanium™ Processor Overview, Santa Clara, CA: Intel Corporation, August 2001.
Hennessy, John L. and David A. Patterson. Computer Architecture: A Quantitative Approach, 2nd Ed., San Francisco, CA: Morgan Kaufmann Publishers, Inc., 1996.
Intel Corporation, Intel Itanium™ Architecture Software Developer's Manual, Volumes 1-4, Documents 245317-001, 245318-001, 245319-001, Santa Clara, CA: Intel Corporation, 2000.
Intel Corporation, The Itanium™ Reference Manual for Software Optimization, Document 245473-003, Santa Clara, CA: Intel Corporation, November 2001.
Triebel, Walter. Itanium™ Architecture for Software Developers, Hillsboro, OR: Intel Press, 2000.
Triebel, Walter, Joseph Bissell, and Rick Booth. Programming Itanium™-based Systems, Hillsboro, OR: Intel Press, 2001.
