Porting Code to Intel® EM64T-Based Platforms

by Robert Y. Geva
Principal Engineer, Intel Software and Solutions Group

Introduction

Porting code from the IA-32 architecture to Intel EM64T to take advantage of 64-bit capabilities involves performance tradeoffs.

Intel® Extended Memory 64 Technology (Intel® EM64T) is a 64-bit extension to Intel’s IA-32 architecture. Data can be accessed in 64-bit chunks, and large memory is addressable without special OS calls. This white paper introduces the architecture extensions and discusses performance tradeoffs when porting software from 32-bit to 64-bit. While the ability to operate on 64-bit rather than 32-bit values is an advantage for software that requires it, not all software can, in practice, take advantage of the additional computation bandwidth. Conversely, widening all data addresses to 64 bits increases pressure on hardware resources such as the data cache. When porting software from the 32-bit Intel IA-32 architecture to the Intel EM64T architecture, performance implications arise from differences in the architecture, from the microarchitecture, and from differences in software conventions. This white paper does not attempt to characterize the applications that can benefit from porting to 64-bit; it merely discusses the tradeoffs.

We also note that several compilers are available that generate native code for EM64T, including the Intel® Compiler and Microsoft Visual Studio* 2005. Both make strong efforts to optimize code for performance on Intel EM64T-based platforms.


Intel® EM64T Architecture

Intel® Extended Memory 64 Technology extends Intel IA-32 from a 32-bit architecture to a 64-bit architecture. EM64T introduces a new mode (referred to as “long mode”) which supports running 32-bit or 64-bit applications. For the purposes of this paper, we will concentrate on the three modes in which most software will run:

  • Legacy mode: In this mode the architecture executes 32-bit SW without 64-bit capabilities. A 32-bit operating system is executing.
  • Compatibility mode: In this mode, a 64-bit operating system executes a 32-bit application. This is a subset of long mode.
  • 64-bit mode: A 64-bit operating system executes code that was compiled for 64-bit and takes advantage of the new architecture. This is also a subset of long mode.

 

Extensions of the Register Set

Intel IA-32 architecture features eight general purpose registers, each of which is 32 bits wide. Intel EM64T architecture extends each of those registers to 64 bits; the extended registers are referred to as RAX, RBX, RCX, RDX, RSP, RBP, RSI, and RDI. It also adds eight new registers, named R8 through R15. Each of the registers is addressable as a 64-bit register, a 32-bit register, a 16-bit register and an 8-bit register. For example, R11 is the 64-bit version, R11d is the lower 32 bits of the same register, R11w is the lower 16 bits and R11l is the lower byte. The low bytes of ESP, EBP, ESI and EDI, which are not 8-bit addressable in IA-32, are 8-bit addressable in EM64T. For example, SIL is the lower 8 bits of ESI. The 8-bit registers AH, BH, CH and DH remain available in EM64T, but they cannot be used in the same instruction as the new 8-bit registers.

IA-32 features eight 128-bit XMM registers. EM64T doubles their number to 16, named XMM0 through XMM15; their size remains unchanged at 128 bits.

The X87 registers are the same as they are in IA-32: ST(0) through ST(7).

The instruction pointer, EIP, changes from 32-bit to 64-bit and is renamed to RIP.

New Instructions

Most of the IA-32 instructions can take 64-bit operands, so from an encoding perspective there are many new instructions. That extension is somewhat obvious. Only two other noteworthy instructions are added: movsxd, which sign extends from 32-bit to 64-bit, and cmpxchg16b, an extension of the IA-32 cmpxchg8b. All the instruction set extensions that were added to IA-32 up to SSE3 are available in EM64T. While using them in IA-32 software applications poses problems with respect to backwards compatibility, no such issues arise in using them in EM64T programs.

64-bit Mode Operation

In 64-bit mode, the default address size is 64-bit and the default data size is 32-bit. This means that a register encoding used as a base or as an index register is interpreted by default as the full 64-bit version of the register. Any other use of a register, i.e. as a source or destination data operand, is interpreted by default as the 32-bit version of the register. The default can be overridden by using one of 16 available REX prefixes. The REX prefixes are an addition in the EM64T architecture and are not available in the IA-32 architecture. They are one-byte prefixes with values ranging from 0x40 to 0x4F (see footnote 1), and they have to be encoded as the last prefix, immediately before the opcode. For example, the instruction add EAX, 8 is encoded as 83 C0 08, and the instruction add RAX, 8 is encoded as 48 83 C0 08.
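
The two encodings quoted above can be reproduced with a short C sketch. The bit layout of the REX byte (a fixed 0100 high nibble followed by the W, R, X and B bits) comes from the architecture manuals rather than from this paper, and the helper names rex_prefix and encode_add_imm8 are hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Build a REX prefix byte: fixed high nibble 0100, then the W, R, X, B bits.
   W = 1 selects a 64-bit operand size; R, X and B extend the register
   fields of the instruction so that R8 through R15 can be encoded. */
static uint8_t rex_prefix(int w, int r, int x, int b)
{
    return (uint8_t)(0x40 | ((w & 1) << 3) | ((r & 1) << 2)
                          | ((x & 1) << 1) | (b & 1));
}

/* Hypothetical helper: encode "add reg, imm8" for one of the eight legacy
   registers (reg 0 is EAX/RAX). Returns the number of bytes written. */
static size_t encode_add_imm8(uint8_t *out, int reg, int8_t imm, int is_64bit)
{
    size_t n = 0;
    if (is_64bit)
        out[n++] = rex_prefix(1, 0, 0, 0);    /* REX.W, i.e. 0x48 */
    out[n++] = 0x83;                          /* opcode: add r/m, imm8 */
    out[n++] = (uint8_t)(0xC0 | (reg & 7));   /* ModRM: register direct, /0 */
    out[n++] = (uint8_t)imm;                  /* 8-bit immediate */
    return n;
}
```

With reg = 0 and imm = 8, the 32-bit form yields the three bytes 83 C0 08 and the 64-bit form yields 48 83 C0 08, so the only cost of the wider operation in this case is the one-byte prefix.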

When an operation assigns a value to a 32-bit register but not to the full 64-bit register, the architecture requires that the upper 32 bits of the 64-bit register be set to zero. For example, if an instruction loads a 32-bit value from memory to EAX, then the upper 32 bits of RAX are zeroed.
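
The rule can be modeled in a few lines of C. This is only a software model of the register semantics, not code that touches real registers; the function names write_reg32 and write_reg16 are hypothetical. The 16-bit case is included for contrast: narrower writes keep the remaining bits of the old value, which is not stated in this section but is how the architecture behaves.

```c
#include <assert.h>
#include <stdint.h>

/* Model of writing a 32-bit sub-register (e.g. EAX) in 64-bit mode:
   the old contents are irrelevant and the upper 32 bits become zero. */
static uint64_t write_reg32(uint64_t old_full, uint32_t value)
{
    (void)old_full;            /* no dependence on the previous value */
    return (uint64_t)value;
}

/* Model of writing a 16-bit sub-register (e.g. AX): the upper 48 bits
   of the old value are preserved, so the result depends on it. */
static uint64_t write_reg16(uint64_t old_full, uint16_t value)
{
    return (old_full & ~(uint64_t)0xFFFF) | value;
}
```

In this model a 32-bit write never needs the old register value, while a 16-bit write has to merge with it.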

New SW Conventions

Application Binary Interfaces (ABIs) and software conventions, including function calling conventions, exist in order to ensure interoperability between software modules that were generated using different tools. Those could be different compilers, assemblers, low-level programming, etc. The conventions only need to be consistent within a given instruction set architecture and operating system. While for IA-32 the ABI is the same for the Windows* and Linux* platforms, for EM64T they are different. The full specification for both can be found in several locations on the Web.

Some of the notable differences between the IA-32 ABI and the EM64T Windows ABI include:

  • In IA-32, all function arguments are passed on the user stack. In EM64T, the first four arguments are passed in registers. Integer arguments are passed in RCX, RDX, R8 and R9 (32-bit values occupy the lower halves, e.g. ECX and EDX). Floating point arguments are passed in XMM0 through XMM3. Only the first four arguments can be passed in registers, and each argument position has a fixed register. For example, in a calling sequence of five arguments, where the first, third and fifth are integer and the second and fourth are floating point, the first will be passed in RCX and XMM0 will be unused, the second will be passed in XMM1 and RDX will be unused, and so on. The fifth argument will be passed on the stack.
  • In IA-32, EBP, EBX, EDI and ESI are callee saved. In EM64T on Windows, their 64-bit equivalents are also callee saved, as are R12, R13, R14 and R15 and the registers XMM6 through XMM15.
  • While in IA-32 the stack is aligned on a 4-byte boundary, in EM64T the stack is aligned on a 16-byte boundary. Data items are laid out so that they are aligned on their natural data size, in particular when they are fields of structures and classes.
  • The Linux ABI is different.
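
As a rough illustration of the Windows argument-passing rule in the first bullet, the following C function takes the five-argument mix described there. The function name mixed_args is hypothetical, and the register assignments in the comment restate the convention described in the text; they are not something the C code itself controls, and a compiler targeting a different ABI will assign them differently.

```c
#include <assert.h>

/* Five arguments: integer, floating point, integer, floating point, integer.
   Under the Windows EM64T convention, each of the first four positions has
   a fixed register, chosen by position and by argument type:
     a -> RCX   (position 1, integer;        XMM0 unused)
     b -> XMM1  (position 2, floating point; RDX unused)
     c -> R8    (position 3, integer;        XMM2 unused)
     d -> XMM3  (position 4, floating point; R9 unused)
     e -> stack (position 5) */
static double mixed_args(int a, double b, int c, double d, int e)
{
    return a + b + c + d + e;
}
```

Inspecting the compiler-generated assembly for a call such as mixed_args(1, 2.5, 3, 4.5, 5) is one way to confirm which registers a particular toolchain actually uses.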

 

1. In legacy and compatibility modes, the byte values 0x40 through 0x4F encode the single-byte inc and dec instructions.


Performance Considerations

The term architecture refers to the instruction set, including the encoding of instructions, the number and names of registers, and in general the aspects of the hardware platform that are exposed to the software. By contrast, microarchitecture is a particular implementation of an architecture. Microarchitectural aspects include latencies of operations, sizes of caches, etc.

Architectural Considerations

The Intel® EM64T architecture makes more registers available to the assembly coder or the compiler. Registers are faster than memory and therefore being able to make good use of the additional registers is an advantage.

The general purpose registers are 64-bit. Therefore, programs that perform arithmetic operations on 64-bit integer values can benefit from the width of the register, whereas on IA-32, 64-bit operations have to be emulated with sequences of 32-bit operations. For example, in order to add two 64-bit values on IA-32, the required code sequences consist of adding the two lower 32-bit halves, adding the upper two 32-bit halves, and adding the carry bit from the first addition.
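
The add-with-carry sequence described above can be sketched in C. The type u64_pair and the function names are hypothetical, introduced only to make the two 32-bit halves explicit; on EM64T the whole operation collapses into a single 64-bit add.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical representation of a 64-bit value as two 32-bit halves. */
typedef struct { uint32_t lo, hi; } u64_pair;

/* What 32-bit code has to do: add the low halves, detect the carry,
   then add the high halves plus the carry (add followed by adc). */
static u64_pair add64_via_32(u64_pair a, u64_pair b)
{
    u64_pair r;
    uint32_t carry;
    r.lo = a.lo + b.lo;             /* wraps modulo 2^32 */
    carry = (r.lo < a.lo) ? 1u : 0u; /* wrapped result means carry out */
    r.hi = a.hi + b.hi + carry;
    return r;
}

/* What 64-bit code can do: one native addition. */
static uint64_t add64_native(uint64_t a, uint64_t b)
{
    return a + b;
}
```

Both functions compute the same sum; the difference is that the first needs two dependent additions and a carry, while the second maps onto one instruction on a 64-bit register.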

Access to the new registers and to the 64-bit flavors of registers requires adding a REX prefix to the encoding of instructions. This creates larger encodings and reduces the effectiveness of hardware resources that are sensitive to code size, such as instruction caches. A potential optimization to alleviate the problem, when possible, is to prefer the general purpose registers that are available in IA-32 (EAX, EBP, etc.) over the new ones (R8 through R15) and to strength reduce operations from 64-bit to 32-bit. For example, if a value is known to be within the bounds of 0 through 100, it is safe to use a 32-bit register to represent it, even when the context implies use of a 64-bit register. (It is generally not recommended to strength reduce to 16 bits or 8 bits because those also create larger encodings.)

Software Conventions Related Considerations

In EM64T, fields of structures are aligned on their natural boundaries. For example, 64-bit floating point values and 64-bit pointers are all aligned on 64-bit boundaries. These alignment requirements can create “holes” inside structures, causing their size to potentially grow quite considerably. Consider for example the following structure:

      struct node {
          char *l;
          char s;
          struct node *prev;
          int i;
          struct node *next;
      };

 

In IA-32, the size of this structure is five times 32 bits, which is 20 bytes, including 3 bytes of padding needed to align the pointer “prev,” which follows the character “s.” In EM64T, the overall size is 40 bytes. Not only do the pointers double in size, but the padding required to align the pointer “prev” is now 7 bytes, and there are 4 bytes of padding required to align the pointer “next,” which follows the integer “i.” The conclusion is that while ordering the fields of structures and classes in decreasing order of size is a worthwhile IA-32 optimization, it is potentially a lot more beneficial for EM64T.
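
The effect of reordering can be checked with sizeof. In the sketch below, struct node_reordered is a hypothetical variant with the same fields sorted in decreasing order of size; the exact numbers depend on the compiler and ABI, so the values in the comments assume 8-byte pointers and a 4-byte int.

```c
#include <assert.h>
#include <stddef.h>

struct node {                      /* field order from the text */
    char *l;
    char s;                        /* followed by 7 bytes of padding */
    struct node *prev;
    int i;                         /* followed by 4 bytes of padding */
    struct node *next;
};                                 /* typically 40 bytes on EM64T */

struct node_reordered {            /* hypothetical: largest fields first */
    char *l;
    struct node_reordered *prev;
    struct node_reordered *next;
    int i;
    char s;                        /* followed by 3 bytes of tail padding */
};                                 /* typically 32 bytes on EM64T */

static size_t node_size(void)           { return sizeof(struct node); }
static size_t node_reordered_size(void) { return sizeof(struct node_reordered); }
```

Under those assumptions the reordered layout saves 8 bytes per node, since the only padding left is the 3 bytes at the end that keep the structure size a multiple of the pointer alignment. On IA-32 both layouts come to 20 bytes, which is why the reordering matters more after porting.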

Microarchitectural Considerations

Obviously, the microarchitectural considerations can (and will) vary from one microprocessor to the other. This white paper can’t possibly list microarchitectural considerations for future Intel microprocessors supporting EM64T. Instead, it will list a few considerations that hold for Intel® microprocessors that implement the Intel NetBurst® microarchitecture.

Probably the single most effective software optimization consideration for the Intel NetBurst microarchitecture is to follow the store forwarding guidelines. When a code sequence stores a value from a register into memory and later, in close temporal proximity, reloads the value or part of the value, the hardware attempts to forward the value from the store operation directly to the load operation. When the internal forwarding is not possible, the load operation has to wait for the store instruction to complete its operation and write the value into the memory hierarchy, and only then can it retrieve the value from memory. This can delay the load, as well as subsequent operations that depend on the load, by dozens of machine cycles. Fundamentally, the store forwarding rules in the Intel NetBurst microarchitecture state that a value will forward from a store to a subsequent load only if the start address of the two operations is the same, and either the widths are the same or the load is a proper subset of the store. Conversely, if the load is attempting to load from a memory location that is a superset of the value being stored, or is a part of the store which is not the least significant part, forwarding cannot happen.

Example:

movsd QWORD PTR _x, xmm0   // store the lower 64 bits of register XMM0 to
                           // memory location x

mov eax, DWORD PTR _x      // load 32 bits from memory location x into
                           // register EAX. This will forward from the
                           // earlier store operation.

mov edx, DWORD PTR _x+4    // load 32 bits from memory location x+4. This was
                           // stored by the previous store operation, but it
                           // cannot forward.
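
The same access pattern can arise from C source. The sketch below shows two ways to read the upper 32 bits of a double that was just stored; the function names are hypothetical, the code assumes a little-endian machine with an 8-byte IEEE double, and whether a given compiler emits exactly the loads described in the comments is not guaranteed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reads the upper half with a 4-byte load at offset 4. This mirrors the
   "mov edx, DWORD PTR _x+4" pattern: the load starts at a different
   address than the 8-byte store, so it cannot forward. */
static uint32_t upper_half_narrow_load(const double *x)
{
    uint32_t hi;
    memcpy(&hi, (const unsigned char *)x + 4, sizeof hi);
    return hi;
}

/* Reads all 8 bytes from the same start address as the store and then
   shifts. Same address and same width as the store, so by the rules
   above the load is eligible for forwarding. */
static uint32_t upper_half_wide_load(const double *x)
{
    uint64_t bits;
    memcpy(&bits, x, sizeof bits);
    return (uint32_t)(bits >> 32);
}
```

Both functions return the same value; the second trades one extra shift for a load that matches the store. Checking the generated assembly in hot code is the only reliable way to see which pattern the compiler chose.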

 

Another effective set of software optimization techniques for the Intel NetBurst microarchitecture is to avoid false dependencies. False dependencies occur when software uses machine resources that the hardware views as parts of a single resource, such as parts of registers. Intel® architecture offers several ways to load a value from memory into a register that is a proper part of another register, for example loading a byte into AL rather than into EAX. Since the hardware maintains EAX as a single-entity 32-bit register, it has to follow a load into AL by an implicit operation that merges the new value in the lower 8 bits with the previous value in the upper 24 bits. It is therefore preferable to use a zero extending load that assigns a value to the whole register.

Another example is loading a 64-bit value from memory to XMM0 without affecting the upper 64 bits of the 128-bit register. Note that in SSE2, the instruction movlpd xmm0, QWORD PTR _x loads 64 bits from memory to the lower 64 bits of XMM0, leaving the upper 64 bits unchanged. Again, the hardware will have to merge the new, lower 64 bits with the old, upper 64 bits, creating a false dependence on the previous operation that had set that value. The instruction movsd xmm0, QWORD PTR _x also loads 64 bits from memory into the register XMM0; however, it also zeroes the upper 64 bits of the register, preventing the false dependence on the previous instruction which set the value in those upper 64 bits. The movsd load is therefore preferable to the movlpd load. Finally, the instruction movsd xmm1, xmm2 copies the lower 64 bits from XMM2 to XMM1, leaving the upper 64 bits of XMM1 unchanged. Again, the hardware will have to merge the new value with the previously set upper 64 bits, creating a false dependence on the previous instruction that had set a value in those upper 64 bits. Therefore, it is recommended to use an instruction such as movapd xmm1, xmm2 and copy all 128 bits, even in contexts that only require the lower 64 bits.

When software needs to convert a value from a narrow value to a wider value, the value needs to be either sign extended or zero extended. For example, when a character value is combined in an operation with an integer value, it first needs to be sign extended to the size of an integer. The instruction movsx eax, al sign extends the 8-bit value in AL to a 32-bit value in EAX. However, as required by the architecture, the upper 32 bits of the register RAX have to be zeroed. This requires the hardware to perform two operations: first sign extend the 8 bits to 32 bits, then zero the upper 32 bits. It is more efficient to sign extend AL into RAX instead, i.e. sign extend to 64 bits even in contexts where this is not required, as then the hardware can perform the sign extension in one operation.


Conclusion

This white paper introduced the Intel® Extended Memory 64 Technology architecture, which extends the IA-32 architecture to 64 bits. While intuitively, 64-bit operations should be more efficient than 32-bit operations and the architecture is an innovation and perceived as moving forward, this paper has shown that in practice, there are non-trivial tradeoffs. Performance impact of porting from 32-bit to 64-bit can lead to mixed results.

We are unable at this time to provide a usable characterization of those applications that are guaranteed to benefit from porting to 64-bit. Instead, we have shown some of the key tradeoffs and in some cases recommended software optimizations to alleviate them.

Unfortunately, the best advice for those who are interested in porting is to measure the performance of their application to ensure their customers benefit from moving to the 64-bit architecture. In addition, readers are advised to make sure their code follows the guidelines in this white paper. Assembly language programmers should write their code accordingly. High-level language programmers who explore slow-downs in their code may benefit from analyzing the compiler-generated code in key parts of their application and ensure that it follows these recommendations.

Acknowledgements

The author is thankful to many colleagues for joint work, useful conversations and insights into performance, specifically on Intel® platforms supporting EM64T, and in particular to Kevin B. Smith, John Holm, Ronak Singhal, CJ Newburn, Bob Valentine, Zeev Sperber, Patrice Roussel and Walter Shands.

