Last Modified On: October 20, 2008 2:06 PM PDT
By Scott Townsend
The Race Track Analogy
If your application program is a racecar and the computer it runs on is the racetrack, then you'd like to see as many straight-aways as possible and very few curves. After all, we always want our programs to run faster. Think of memory accesses as the curves, because the computer always has to slow down to access memory. Processor register accesses, on the other hand, are the fat and wide straight-aways where the processor really flies. Itanium architecture turns your curvy Laguna Seca track into a virtual drag strip and you may not even have to re-write your code. How? By using registers where other processors must use memory.
A feature of processors in the Itanium® processor family is the Register Stack Engine (RSE). The RSE is a hardware implementation inside the processor that helps a subset (the last 96) of the General Registers (Gr) implement the register stack by handling register overflows. Its main function is to act like the traditional memory stack, except much faster. Without the RSE implementation, code in a program calls a function, puts the parameters to pass to the function on the stack (in memory), and the receiving function must retrieve them from the stack into registers to manipulate them. But in the Itanium processor family, the register stack enables extremely fast switching of the function call process with little to no overhead. Each time a function is called, it is allocated a group of registers, called its "Register Stack Frame", from these 96 registers. The allocated registers in a function's Register Stack Frame are temporary locations that store operands local to the function and those that are input and output from the function.
To Illustrate...
Assume a case where Function A calls function B. The hardware allocates Function B its own Register Stack Frame. The hardware then creates overlaps between the registers which hold the outputs of function A and those that serve as the inputs of function B. By writing information to these output registers A can now pass parameters to B without any overhead and once done, B can pass values back to A by writing in these common shared registers as well. When nested calls exhaust all the available 96 stacked registers the RSE comes into play. The RSE is responsible for automatically spilling stack information from the register stack to memory (called the "register stack backing store") to free up register space for newly called nested functions. When the register resources on the processor are available again, the RSE takes data that was temporarily held in the register stack backing store and moves it into the registers.
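The overlap between A's outputs and B's inputs can be pictured with a toy model in C. This is only a sketch: the names `frame_t`, `call_frame`, `output_reg` and `input_reg` are hypothetical, the register file is a plain array, and spilling to the backing store is deliberately left out. It is not how the hardware is implemented.

```c
#include <assert.h>
#include <stdint.h>

#define STACKED_REGS 96   /* the 96 stacked registers described above */

/* Hypothetical model of one function's Register Stack Frame. */
typedef struct {
    int base;     /* index of the frame's first register in the toy register file */
    int inputs;   /* registers holding incoming parameters */
    int locals;   /* registers holding local variables */
    int outputs;  /* registers holding outgoing parameters */
} frame_t;

static int64_t regs[STACKED_REGS];   /* toy register file, no spilling modeled */

/* Build the callee's frame so that its inputs sit on top of the caller's outputs. */
frame_t call_frame(frame_t caller, int callee_locals, int callee_outputs)
{
    frame_t callee;
    callee.base    = caller.base + caller.inputs + caller.locals; /* first caller output */
    callee.inputs  = caller.outputs;
    callee.locals  = callee_locals;
    callee.outputs = callee_outputs;
    /* This sketch assumes everything still fits in the 96 registers. */
    assert(callee.base + callee.inputs + callee.locals + callee.outputs <= STACKED_REGS);
    return callee;
}

/* Address of the n-th output register of a frame. */
int64_t *output_reg(frame_t f, int n)
{
    assert(n >= 0 && n < f.outputs);
    return &regs[f.base + f.inputs + f.locals + n];
}

/* Address of the n-th input register of a frame. */
int64_t *input_reg(frame_t f, int n)
{
    assert(n >= 0 && n < f.inputs);
    return &regs[f.base + n];
}
```

In this model, a value that A writes through `output_reg(a, 0)` is the same storage that B reads through `input_reg(b, 0)`, so passing the parameter involves no copy at all.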
What is the advantage of this design?
Traditional processors have very few registers inside the processor and are unable to implement a stack frame architecture. Calling a new function requires an expensive task switch operation in which the current register information needs to be stored on the main memory stack of the calling application. Once the called function exits, register data from the calling function is repopulated inside the processor from the main memory stack before execution continues. The return values of the called function are largely stored in memory, with retrieval requiring expensive main memory reads. In the Register Stack model all this overhead is eliminated, with complete parameter passing occurring in registers. Overhead is only incurred by RSE operation when all 96 stacked registers are exhausted, which still comes in at a lower time cost than the switch on traditional processors.
The result is a dramatic performance gain over traditional stack-based computing, because memory accesses are greatly reduced. Code basically runs in two ways: sequential execution and branching. It's the function calls that often take up the majority of the processing time because of the number of memory accesses needed to store and retrieve parameters. Programs for the Itanium processor family find a smooth speedy path because they take full advantage of the fast processor registers and minimize accesses to memory.
As noted earlier, a stack is a memory area used to store data for local variables, function parameters, and the return address for function calls. High-level language compilers typically generate code which uses the stack to pass parameters to a function and to retrieve the function's return value for the caller. Code can use processor registers instead of the stack, but that would require a very sophisticated register management scheme to preserve the caller's register contents while the called function did its work. And with only a small number of processor registers at its disposal, the compiler would soon run out of registers while the nesting level of function calls would be very shallow.
Using the stack solves this problem, because the stack is large compared to the number of available registers, and the last-in-first-out (LIFO) nature of a stack fits well with nested function calls.
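To make the last-in-first-out behavior concrete, here is a minimal sketch of a memory stack in C. The type `mem_stack_t` and the functions `stack_push` and `stack_pop` are hypothetical names chosen for illustration, and the fixed size of 256 words is an arbitrary assumption.

```c
#include <assert.h>
#include <stddef.h>

#define STACK_WORDS 256   /* arbitrary size for this sketch */

/* A toy memory stack: every push and pop touches the array, i.e. memory. */
typedef struct {
    int    words[STACK_WORDS];
    size_t top;               /* number of words currently on the stack */
} mem_stack_t;

/* Store a value on top of the stack. */
void stack_push(mem_stack_t *s, int value)
{
    assert(s->top < STACK_WORDS);
    s->words[s->top++] = value;
}

/* Remove and return the most recently pushed value. */
int stack_pop(mem_stack_t *s)
{
    assert(s->top > 0);
    return s->words[--s->top];
}
```

Values come back in the reverse of the order in which they were pushed, which is exactly the order in which nested function calls return.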
Now let's take a look at how to perform stack-based parameter passing by inspecting the assembly code generated by a 'C' compiler. Figure 1 shows a snippet of code that calls the function test() and passes in two integer parameters. Variable 'a' is initially stored in the AX register. Looking at the assembly code, notice that 'a' is "pushed onto the stack" so that test() can retrieve it for calculating. This copies 'a' into a memory location indexed by the ESP register. Function test() would then copy the parameter into a register or manipulate it directly in its memory location on the stack.
// Function call passing two parameters
test(a, b);
push ax ; memory
push dx ; memory
call test
add esp,8
// Function
int test(int a, int b)
{
int c = a;
return c;
}
push ebp ; memory
mov ebp,esp
sub esp,44h
push ebx ; memory
push esi ; memory
push edi ; memory
lea edi,[ebp-44h]
mov eax,dword ptr [ebp+8] ; memory
mov dword ptr [ebp-4],eax ; memory
mov eax,dword ptr [ebp-4] ; memory
pop edi ; memory
pop esi ; memory
pop ebx ; memory
mov esp,ebp
pop ebp ; memory
ret ; memory
Figure 1 - Assembly code for a traditional 32-bit x86 processor, generated by a 'C' compiler for a simple function.
It's interesting to note that all of the instructions in Figure 1 access memory, except lines 5, 6, and 10. Since all function parameters and local variables reside on the stack, the number of memory accesses in this code is very high. Imagine if this were a more complex function with lots of local variables and more parameters passed in. This simple example demonstrates how high level language compilers can generate very memory-intensive code. Virtually all compiled languages including Microsoft Visual Basic*, Java*, C++, etc. use stack-based parameter passing.
How Does the Register Stack Architecture Work?
Of the 128 general-purpose registers of the Itanium processor family, the first 32 registers (GR0-GR31) are static and the remaining 96 registers (GR32-GR127) are stacked. The register stack is used for input parameters, local variables, and return values of functions (see Figure 2). The register stack serves a similar function as the memory stack in traditional processors, but the register stack is managed internally by the processor.
Figure 2 - Each function has its own Register Stack Frame.
A function creates a Register Stack Frame (set of adjacent registers) in the register stack based on its needs for inputs, locals, and outputs. Yes, that's multiple outputs! In an Itanium or Itanium 2 processor a function can return more than one parameter. An alloc instruction at the beginning of the function specifies how many input parameters, local variables, and output parameters are needed (see Figure 3). The processor then reserves these required logical registers as a contiguous block from the register stack set, thus creating a stack frame. When function test() is called, a new stack frame is created for test(). If the caller passes any parameters into test(), then the stack frames of the caller and test() will overlap such that the output parameter of the caller becomes the input parameter of test().
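The bookkeeping behind a frame request can be sketched in a few lines of C. This is an illustration only: `layout_t` and `frame_layout` are hypothetical names, and the function simply computes which logical registers the inputs, locals, and outputs would occupy if the frame starts at R32, as described in this article.

```c
#include <assert.h>

#define FIRST_STACKED 32   /* first stacked register, GR32 */
#define STACKED_REGS  96   /* GR32-GR127 */

/* Hypothetical description of where each region of a frame lives. */
typedef struct {
    int first_input;    /* logical register number of the first input */
    int first_local;    /* logical register number of the first local */
    int first_output;   /* logical register number of the first output */
    int size;           /* total registers in the frame */
} layout_t;

/* Lay out a frame with the requested number of inputs, locals, and outputs. */
layout_t frame_layout(int inputs, int locals, int outputs)
{
    layout_t l;
    assert(inputs >= 0 && locals >= 0 && outputs >= 0);
    l.size = inputs + locals + outputs;
    assert(l.size <= STACKED_REGS);   /* a frame cannot exceed the stacked set */
    l.first_input  = FIRST_STACKED;
    l.first_local  = FIRST_STACKED + inputs;
    l.first_output = FIRST_STACKED + inputs + locals;
    return l;
}
```

For a function with one input, one local, and one output, `frame_layout(1, 1, 1)` places the input in R32, the local in R33, and the output in R34, which matches the register usage in Figure 3.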
Notice that the code in Figure 3 has only 7 instructions as compared to 12 instructions in the code in Figure 1. You can see how much simpler the code becomes when there is only one line per function (lines 1, 3) to set up and manage parameter passing, local variables and return parameters. Finally, we see that none of the instructions in Figure 3 access memory. This code is going to execute at maximum speed because it doesn't have to wait for memory.
Looking again at the example of Figure 1 we see instructions, in both the calling and called sections, to set up the stack, access parameters, prepare return values, as well as manage stack usage so it stays correctly aligned. All of this generates code overhead. Take a look at lines 3, 4, 5, 6, 10, and 11 in Figure 1. These types of instructions aren't needed for the Itanium processor family due to the "Register Stack" architecture of the General Registers, which allocates individual "Register Stack Frames" to every called function rather than using the application stack in Main Memory to store Local, Input and Output parameters. Fewer instructions to execute from Main Memory means that the work gets done faster and more efficiently.
There's a performance hit associated with using the stack (or any other memory area) in a traditional fashion, compared to using processor registers, that is not often exposed. Any processor instruction that uses memory takes more time to execute than the same instruction that uses a processor register. It's mainly because the processor has to go out to the memory bus to read or write the memory location and wait until the operation completes before moving on to the next instruction. Instructions that use processor registers exclusively don't have to wait for the memory cycles to complete; they perform all their work at full speed inside the processor. In addition, the memory bus often runs at a slower speed than the processor itself, so this adds more time to do the read or write to memory. There are other side effects of accessing memory that can cause further performance hits such as cache misses, but we'll discuss those later.
// Function call passing a parameter
test(a);
1: (p0) br.call b0=test // Call test(a), 'a' is in R32
// Function receiving a parameter
int test(int a)
{
int c = a;
return c;
}
2: .proc
3: test:
4: alloc r4=ar.pfs,1,1,1,2 // Set up register stack - 1 input, 1 local, 1 output.
5: (p0) mov r33=r32 // Get 'a' into 'c'.
6: ;; // End of instruction bundle.
7: (p0) mov r34=r33 // Set 'c' as return value.
8: (p0) mov ar.pfs=r4 // Restore R4.
9: (p0) br.ret.sptk.few b0 // Return to caller.
10: .endp
Figure 3 - Assembly code for an Itanium or Itanium 2 processor, generated by 'C' compiler for the same function as shown in Figure 1.
To simplify keeping track of logical registers versus physical registers, Itanium and Itanium 2 processors rename registers to allow the logical registers in a given function's stack frame to always be addressed starting at R32. Though the register stack frame of any called function may start from any one of the physical registers Gr32-Gr127, every called function assumes that it has all 96 registers at its disposal and logically renames its starting physical register to Gr32. The naming of subsequent physical registers of the function's register stack frame continues, logically renaming to Gr33, Gr34 and so on. This mechanism is known as register renaming and it makes your code more consistent and easier to understand. It also enables certain register-specific operations known as "Register Rotation", which is used in specialized loop-optimizing constructs known as Software Pipelining.
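The renaming step amounts to adding the frame's starting position to the logical register number. The C sketch below shows that arithmetic. The function name `physical_reg` is hypothetical, and the wrap-around from Gr127 back to Gr32 is an assumption of this sketch (it treats the 96 stacked registers as a circular set) rather than something stated above.

```c
#include <assert.h>

#define FIRST_STACKED 32   /* Gr32 */
#define STACKED_REGS  96   /* Gr32-Gr127 */

/*
 * Map a logical register (what the function's code names) to a physical
 * register. frame_base is the physical register that the current function
 * sees as Gr32.
 */
int physical_reg(int frame_base, int logical)
{
    assert(frame_base >= FIRST_STACKED && frame_base < FIRST_STACKED + STACKED_REGS);
    assert(logical >= FIRST_STACKED && logical < FIRST_STACKED + STACKED_REGS);
    int offset = (frame_base - FIRST_STACKED) + (logical - FIRST_STACKED);
    return FIRST_STACKED + offset % STACKED_REGS;   /* assumed circular wrap */
}
```

With a frame that starts at physical register 40, the function's Gr32 is physical register 40 and its Gr34 is physical register 42, yet the function's code only ever mentions Gr32 and Gr34.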
What happens when the register stack gets full in deep level nesting like recursive procedure calls? When the register stack gets full, the RSE automatically does a burst write to memory (spill) and that frees up the register stack for further function nesting. Later as functions return, the previously stored copy of the register stack is read back in from memory (fill) as code continues unwinding.
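A rough feel for when spills and fills happen can be had from a small counting model in C. Everything here is a simplification made for illustration: `rse_t`, `rse_call`, `rse_return` and `simulate_nesting` are hypothetical names, frames do not overlap, only whole registers are counted, and a fill happens only at the moment a caller's frame is needed again. The real RSE is free to schedule its memory traffic differently.

```c
#include <assert.h>

#define STACKED_REGS 96

/* Counters for a toy Register Stack Engine. */
typedef struct {
    int  in_regs;      /* registers of active frames currently on chip */
    int  in_backing;   /* registers currently saved in the backing store */
    long spilled;      /* running total of registers written to memory */
    long filled;       /* running total of registers read back from memory */
} rse_t;

/* A call that needs `frame` fresh registers; spill the oldest ones if needed. */
void rse_call(rse_t *r, int frame)
{
    assert(frame >= 0 && frame <= STACKED_REGS);
    int overflow = r->in_regs + frame - STACKED_REGS;
    if (overflow > 0) {
        r->in_regs    -= overflow;
        r->in_backing += overflow;
        r->spilled    += overflow;
    }
    r->in_regs += frame;
}

/* Return from a frame of `frame` registers to a caller needing `caller_frame`. */
void rse_return(rse_t *r, int frame, int caller_frame)
{
    assert(frame <= r->in_regs);
    r->in_regs -= frame;
    int missing = caller_frame - r->in_regs;
    if (missing > 0) {                 /* caller's registers were spilled earlier */
        assert(missing <= r->in_backing);
        r->in_regs    += missing;
        r->in_backing -= missing;
        r->filled     += missing;
    }
}

/* Nest `depth` calls of equal frame size, then unwind; return the counters. */
rse_t simulate_nesting(int depth, int frame)
{
    rse_t r = {0, 0, 0, 0};
    for (int i = 0; i < depth; i++)
        rse_call(&r, frame);
    for (int i = depth; i > 0; i--)
        rse_return(&r, frame, i > 1 ? frame : 0);
    return r;
}
```

In this model, twelve nested calls of eight registers each fit exactly in the 96 stacked registers and cause no memory traffic, while a thirteenth call spills eight registers that are filled again during unwinding.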
Modern code is deeply nested both at the application level as well as the system level. Component interfaces, component binding, bounds checking and other housekeeping all rely on function calls to do their work. Often layers upon layers of function calls lie between an application program's attempt to simply open a file; all the way down through the operating system layers and device driver model to the low level code that actually does the work. If the majority of all this code used processor registers instead of the stack to pass parameters back and forth, the performance increase would be dramatic. The Register Stack Architecture allows faster execution as processor registers are utilized instead of memory for all these calls.
Multi-Tiered Development
With the increase of distributed application development, multi-tiered software layers are common nowadays. The business rules layer components in the middle tier require that they be state-less and so typically have a lot of parameters passed into their functions. Interfaces to state-less components pass more parameters because they must complete a transaction in its entirety during the function call. Having so many parameters passed in is a prime candidate for performance improvement by taking advantage of the Register Stack Architecture.
Cutting the Overhead
Those well schooled in software optimization know that most of the overhead in a set of recursive function calls resides in the calling itself. The actual work inside a recursive function is usually minimal. Passing the function parameters, allocating local variables, and setting up the return value all heavily use the stack, and this adds up when a function is called hundreds of times deep. It's the epitome of deep function nesting and a good target for performance improvement using the Register Stack Architecture.
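A tiny recursive function in C shows how lopsided the ratio of calling to working can be. The function `sum_to` and its `calls` counter are hypothetical, written only to illustrate the point: each call performs a single addition, yet it must still receive its parameters, keep its state, and hand back a return value.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Recursively compute 0 + 1 + ... + n.
 * `calls` counts how many times the function was entered.
 */
unsigned long sum_to(unsigned n, unsigned long *calls)
{
    assert(calls != NULL);
    (*calls)++;
    if (n == 0)
        return 0;
    return n + sum_to(n - 1, calls);   /* one addition of real work per call */
}
```

Summing the numbers up to 100 this way takes 101 calls, so the cost of passing parameters and return values is paid 101 times for 100 additions.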
In performance-critical parts of a program it is common to re-write the code in assembly language to reduce the execution time and hand-tune the performance. One of the biggest reasons for this is that hand-written assembly language can utilize processor registers to store and manipulate variables for all of the speed-sensitive sections of code. Compiler-generated code doesn't do this by default. It uses the stack for function parameters and most variables. Code compiled for an Itanium or Itanium 2-based system will use registers instead of the stack most of the time, resulting in a built-in program optimization. It would be nice for us programmers if we could do a lot less assembly language coding and let the compiler do some of the work for us.
Any code that deeply nests function calls can benefit from the speed up that the Register Stack Architecture gives. Some examples include code that traverses search trees; calling a function repeatedly to return a pointer to an object or a property of an object; recursive function calls; overhead for component interfaces.
More is Better
The enormous register set of the Itanium processor family allows compilers to generate code that assigns many more program variables to processor registers. When the back end of a compiler is generating object code for a given section of source code, it keeps track of which variables are stored in registers. If there are more variables than there are available registers, the rest of those variables must be assigned to memory. More available registers means there's less chance the compiler will have to assign a variable to memory. As we've seen earlier, code execution is much faster when registers are used for program variables instead of memory.
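The arithmetic behind this is simple enough to write down in C. The helper `vars_in_memory` is a hypothetical name, and the model is deliberately crude: it assumes every variable is live at the same time, whereas a real compiler tracks when each variable is actually in use.

```c
#include <assert.h>

/*
 * Number of variables that cannot be kept in registers and must be
 * assigned to memory, assuming all variables are live at once.
 */
int vars_in_memory(int live_vars, int available_regs)
{
    assert(live_vars >= 0 && available_regs >= 0);
    return live_vars > available_regs ? live_vars - available_regs : 0;
}
```

With 20 variables and 8 registers, 12 variables end up in memory. With 128 registers, none do.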
The Cache Factor
The memory cache can help reduce the performance hit when stack-based parameter passing is used, particularly if there are repeated accesses to the same or nearby memory locations. But the cache can't help out all the time and inevitably there are cache misses. A cache miss forces the processor to wait while the cache contents are synchronized with (copied to or from) an entire section of memory. Memory cache misses really hold up processing while the cache takes time to refill. A side benefit of the RSE is that since memory isn't accessed nearly as much in computing on Itanium and Itanium 2-based systems, memory cache misses are reduced proportionately. The net result is significantly faster program execution, because registers are used more often and memory is only accessed when it's really necessary.
Working Side-By-Side
Register spill and fill operations with the help of the RSE can run concurrent with program code. When the register stack becomes full as function calls nest deeper, the processor will spill the contents in a high-speed burst write to memory (known as the "Backing Store") while the program continues to run. The same is true for unwinding as code returns from a set of deeply nested function calls and the register stack becomes empty. It will fill back up in a burst read from memory at the same time as program code executes so as not to waste time.
Although Itanium and Itanium 2 processors have the ability to directly execute 32-bit code, the Register Stack Architecture is disabled in this mode and therefore these programs aren't able to take advantage of the register stack. Code written in high-level languages such as C++ and Java only needs to be re-compiled for Itanium or Itanium 2 processors to gain the performance benefit, since it is the nature of the compilers to use the Register Stack Architecture.
Looking toward the future, software development is becoming more abstract and layered. The result is a greater burden on code to execute functions and abstraction layer translations faster. The Register Stack Architecture is a sure bet to boost performance in these critical areas.
Scott Townsend is Senior Principal Software Engineer at AudioRamp, an internet audio company, where he develops Windows CE and server applications. Prior to that Scott was the lead architect on Phoenix Technologies' IA-64 BIOS development project working closely with Intel on IA-64 technology. An avid musician and surfer, Scott resides in Southern California with his wife Riri (when he's not surfing in Bali).
| November 27, 2009 1:06 AM PST
lhardy4
|
I have been running a Pentium 4 processor that is slowing down. I need something with more speed and accurate registry paths.

Vishal K
Lucid explanation. I'd like to know more about RSE spill mechanism and bursting out data to main memory. Thanks for the article.