Register-Stack Engine Stalls on 64-Bit Architecture


March 10, 2009 12:00 AM PDT



Challenge

Identify the source of stall cycles due to invocation of the Register Stack Engine (RSE). There are 96 general registers used for the register stacks. A deep call stack or a call stack through functions with heavy register needs can exceed this resource. Such situations require the RSE to spill the values stored in these registers for higher levels of a call chain to a backing store. The RSE then recovers the values as the call stack is unwound. This occurs automatically as the need arises.

When the RSE is invoked, the core pipeline stalls. The BE_RSE_Bubble event accumulates these stall cycles. The RSE pushes out only the general registers to the backing store. FP registers are not swapped out to the backing store in this manner. FP registers must be spilled explicitly by the generated code. This is a rare occurrence, since large numbers of FP registers are required only for local usage, as in the execution of heavily unrolled pipelined loops. However, the general registers serve as the basis of the call-return mechanism of argument passing, and as such must be recoverable to unwind a call stack chain.


Solution

Use the Intel® VTune™ Performance Analyzer to analyze the subevents of the BE_RSE_Bubble counter. The complete list of subevents for this counter is given in the following table. However, the fact that the RSE was invoked can, in itself, tell you what you need to know.

Extension      PMC.umask      Description
ALL            bx000          Back-end was stalled by RSE.
BANK_SWITCH    bx001          Back-end was stalled by RSE due to bank switching.
AR_DEP         bx010          Back-end was stalled by RSE due to AR dependencies.
OVERFLOW       bx011          Back-end was stalled by RSE due to need to spill.
UNDERFLOW      bx100          Back-end was stalled by RSE due to need to fill.
LOADRS         bx101          Back-end was stalled by RSE due to loadrs calculations.
---            bx110-bx111    (* nothing will be counted *)

 

If BE_RSE_Bubble is a significant source of stall cycles contributing to the primary sum rule (see the separate item Analyze Pipeline Stalls on 64-Bit Intel® Architecture), then you can determine the appropriate reprogramming actions to take in order to remove the stalls.

The RSE is invoked when there is an excessive use of general registers. Excessive use of general registers may occur in a few routines or in a heavily used kernel routine with complex recursive algorithms. A recursive chain that requires many registers at each level quickly saturates the physical limitations of the 96-deep register file.

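You can generate an example of how to invoke the RSE by writing a simple recursive algorithm to calculate the sum of 1/2**n:

double recursive(double x)
{
    double temp, epsilon = 0.001;      /* recursion stops once a term drops below epsilon */
    temp = x / 2.;                     /* next term of the series */
    if (temp < epsilon) return 0.0;
    return temp + recursive(temp);     /* one nested call per term */
}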

 

For an initial argument of 1.0, the above function is invoked 10 times before the recursion terminates. If you compile the function with the /Fa option (Windows*), the compiler produces an assembler listing that you can edit. Modify the alloc statement from:

alloc r33=ar.pfs,1,2,1,0

to one that allocates 66 local registers at each level:

alloc r33=ar.pfs,1,65,1,0

Using the BE_RSE_Bubble.ALL counter, you can verify that the RSE is invoked at each recursive level and that it consumes approximately one-third of the total cycles.
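The register pressure in this example can be estimated with a short sketch. The helper names recursive_counted and stacked_register_demand are hypothetical, and the model is deliberately simplified: it assumes each nested call adds its input and local registers to the register stack (the caller's output registers overlap the callee's inputs), and it ignores any frames already on the stack when the recursion starts. The 96-register limit is the figure quoted in this article.

```c
#include <stdio.h>

/* Stacked general registers available, as stated in the article. */
#define PHYSICAL_STACKED_REGS 96

/* Same recursion as the article's example, with an invocation counter
 * added (the counter is an illustrative addition). */
double recursive_counted(double x, int *calls)
{
    double temp, epsilon = 0.001;
    (*calls)++;
    temp = x / 2.0;
    if (temp < epsilon) return 0.0;
    return temp + recursive_counted(temp, calls);
}

/* Simplified estimate of stacked registers live at the deepest level:
 * every level contributes its inputs and locals, and the deepest frame
 * also holds its outputs. Arguments mirror the i,l,o fields of alloc. */
int stacked_register_demand(int inputs, int locals, int outputs, int depth)
{
    return (inputs + locals) * depth + outputs;
}

/* Prints the estimate for one alloc configuration at a given depth. */
void report_demand(const char *label, int inputs, int locals, int outputs, int depth)
{
    int demand = stacked_register_demand(inputs, locals, outputs, depth);
    printf("%s: %d registers needed, %d available, %s\n",
           label, demand, PHYSICAL_STACKED_REGS,
           demand > PHYSICAL_STACKED_REGS ? "RSE must spill" : "fits");
}
```

Under these assumptions, calling recursive_counted(1.0, &calls) leaves calls at 10. At that depth the original alloc (1 input, 2 locals, 1 output) needs about 31 stacked registers, which fits in 96, while the modified alloc (1 input, 65 locals, 1 output) needs about 661 and already exceeds 96 at the second level, which is consistent with the RSE being invoked at each recursive level.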

It must be emphasized that heavy usage of general registers is the cause for invoking the RSE. If BE_RSE_Bubble contributes significantly to CPU_Cycles, then use one of the following strategies:

  • Simplify the algorithms so they do not use so many general registers.
  • Change the compiler options.

 

As indicated in the example (even though it is very artificial), using recursive algorithms in CPU-intensive parts of the program can result in very deep call stacks, creating the need to free up registers. Recursion should be avoided in kernels on any architecture, as winding and unwinding deep call stacks is never very efficient.
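As one illustration of avoiding recursion, the example function can be rewritten as a loop. This is a minimal sketch rather than a prescribed fix; the name iterative is hypothetical. It uses a single register frame regardless of how many terms are summed, so the call depth no longer drives register-stack growth.

```c
/* Loop-based equivalent of the recursive example: sums x/2, x/4, ...
 * until a term falls below epsilon. No nested calls are made. */
double iterative(double x)
{
    const double epsilon = 0.001;
    double sum = 0.0;
    double temp = x / 2.0;

    while (temp >= epsilon) {
        sum += temp;
        temp /= 2.0;
    }
    return sum;
}
```

The loop adds the terms from largest to smallest, whereas the recursive version adds them from smallest to largest as the calls return, so the two results can differ in the last bits for some inputs. For an argument of 1.0 every term is an exact power of two and both versions return the same value.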

For other strategies to remove RSE activity, see Chapter 9 of the manual listed below.


Source

Introduction to Microarchitectural Optimization for Itanium® Processors