Register-Stack Engine Stalls on 64-Bit Architecture


Challenge

Identify the source of stall cycles due to invocation of the Register Stack Engine (RSE). There are 96 general registers used for the register stacks. A deep call stack or a call stack through functions with heavy register needs can exceed this resource. Such situations require the RSE to spill the values stored in these registers for higher levels of a call chain to a backing store. The RSE then recovers the values as the call stack is unwound. This occurs automatically as the need arises.

When the RSE is invoked, the core pipeline stalls. The BE_RSE_Bubble event accumulates these stall cycles. The RSE pushes out only the general registers to the backing store. FP registers are not swapped out to the backing store in this manner. FP registers must be spilled explicitly by the generated code. This is a rare occurrence, since large numbers of FP registers are required only for local usage, as in the execution of heavily unrolled pipelined loops. However, the general registers serve as the basis of the call-return mechanism of argument passing, and as such must be recoverable to unwind a call stack chain.


Solution

Use the Intel® VTune™ Performance Analyzer to analyze the subevents of the BE_RSE_Bubble counter. The complete list of subevents for this counter is given in the following table. However, the fact that the RSE was invoked can, in itself, tell you what you need to know.

Extension     PMC.umask     Description
ALL           bx000         Back-end was stalled by RSE.
BANK_SWITCH   bx001         Back-end was stalled by RSE due to bank switching.
AR_DEP        bx010         Back-end was stalled by RSE due to AR dependencies.
OVERFLOW      bx011         Back-end was stalled by RSE due to need to spill.
UNDERFLOW     bx100         Back-end was stalled by RSE due to need to fill.
LOADRS        bx101         Back-end was stalled by RSE due to loadrs calculations.
---           bx110-bx111   (* nothing will be counted *)
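The table can be expressed as a small lookup, which may be handy when post-processing raw counter configurations. This is only a sketch: the function name be_rse_bubble_name is hypothetical, and it assumes that the "bx" notation means a binary value whose leading bit is ignored, so only the low three bits select the subevent.

```c
#include <stddef.h>

/* Subevent names indexed by the low three bits of PMC.umask,
 * taken from the table above. NULL marks encodings for which
 * nothing is counted (bx110 and bx111). */
static const char *const be_rse_bubble_subevents[8] = {
    "ALL",         /* bx000 */
    "BANK_SWITCH", /* bx001 */
    "AR_DEP",      /* bx010 */
    "OVERFLOW",    /* bx011 */
    "UNDERFLOW",   /* bx100 */
    "LOADRS",      /* bx101 */
    NULL,          /* bx110 */
    NULL           /* bx111 */
};

/* Hypothetical helper: return the subevent name for a umask value.
 * Assumption: the leading "x" bit is a don't-care, so it is masked off. */
const char *be_rse_bubble_name(unsigned umask)
{
    return be_rse_bubble_subevents[umask & 0x7u];
}
```

For example, be_rse_bubble_name(3) yields "OVERFLOW", the subevent that counts stalls caused by spills.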

 

If BE_RSE_Bubble is a significant source of stall cycles contributing to the primary sum rule (see the separate item Analyze Pipeline Stalls on 64-Bit Intel® Architecture), then you can easily determine the appropriate reprogramming actions to take in order to remove the stalls.

The RSE is invoked when there is an excessive use of general registers. Excessive use of general registers may occur in a few routines or in a heavily used kernel routine with complex recursive algorithms. A recursive chain that requires many registers at each level quickly saturates the physical limitations of the 96-deep register file.

You can generate an example of how to invoke the RSE by writing a simple recursive algorithm to calculate the sum of 1/2**n:

double recursive(double x)
{
    /* Recursion stops once the next term drops below epsilon. */
    double temp, epsilon = 0.001;

    temp = x / 2.;
    if (temp < epsilon) return 0.0;
    return temp + recursive(temp);
}

 

Called with x = 1.0, for example, the above function is entered 10 times before temp falls below epsilon. If you compile the function with the /Fa (Windows*) option, the compiler produces an assembler listing that you can edit. Modify the alloc statement from:

alloc r33=ar.pfs,1,2,1,0

to one that allocates 66 registers in the local area (1 input plus 65 locals) at each level:

alloc r33=ar.pfs,1,65,1,0
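The effect of the larger frame can be estimated with a simplified model. The sketch below assumes that each caller in the chain keeps its inputs and locals live, that the innermost frame additionally holds its outputs, and that nothing else occupies the 96 stacked registers. Real hardware may spill earlier or later than this, and frames of the functions above the recursion are ignored. The function names are hypothetical.

```c
#include <assert.h>

enum { STACKED_REGS = 96 }; /* stacked general registers, per the text */

/* Registers needed by a chain of `depth` identical frames whose alloc
 * requests `inputs`, `locals`, and `outputs` registers.
 * Assumption: each caller retains inputs + locals; only the innermost
 * frame also holds its outputs. */
int stacked_regs_needed(int depth, int inputs, int locals, int outputs)
{
    assert(depth >= 1);
    int size_of_locals = inputs + locals;
    int size_of_frame = size_of_locals + outputs;
    return (depth - 1) * size_of_locals + size_of_frame;
}

/* Registers that no longer fit and would have to be spilled. */
int regs_to_spill(int depth, int inputs, int locals, int outputs)
{
    int needed = stacked_regs_needed(depth, inputs, locals, outputs);
    return needed > STACKED_REGS ? needed - STACKED_REGS : 0;
}
```

Under these assumptions, the original frame (1, 2, 1) needs only 31 registers at a depth of 10 and never overflows, whereas the modified frame (1, 65, 1) needs 133 registers at a depth of 2, so 37 registers would have to be spilled at the very first recursive call.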

Using the BE_RSE_Bubble.ALL counter, you can verify that the RSE is invoked at each recursive level. In this example, the RSE stalls consume approximately one-third of the total cycles.
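The share of cycles lost to the RSE is simply the ratio of the two counters. A minimal sketch follows; the function name is hypothetical and the arguments stand for sample counts that you have collected yourself for BE_RSE_Bubble.ALL and CPU_Cycles.

```c
/* Fraction of all cycles spent in RSE stalls.
 * Both arguments are counter totals from the same measurement run;
 * the helper itself is an illustration, not part of any tool. */
double rse_stall_fraction(unsigned long long be_rse_bubble_all,
                          unsigned long long cpu_cycles)
{
    if (cpu_cycles == 0) {
        return 0.0; /* no samples collected: avoid dividing by zero */
    }
    return (double)be_rse_bubble_all / (double)cpu_cycles;
}
```

A result near 0.33 would correspond to the one-third figure quoted for the modified example.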

It must be emphasized that heavy usage of general registers is what causes the RSE to be invoked. If BE_RSE_Bubble contributes significantly to CPU_Cycles, use one of the following strategies:

  • Simplify the algorithms so they do not use so many general registers.
  • Change the compiler options.

 

As indicated in the example (even though it is very artificial), using recursive algorithms in CPU-intensive parts of the program can result in very deep call stacks, creating the need to free up registers. Recursion should be avoided in kernels on any architecture, as winding and unwinding deep call stacks is never very efficient.
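As an illustration of avoiding recursion, the example function can be rewritten as a loop that uses a single register frame regardless of how many terms are summed. This is one possible rewrite, not a prescribed one, and the name iterative_sum is hypothetical. It adds the terms in the opposite order from the recursive version, so results for arbitrary inputs may differ in the last bits of rounding.

```c
/* Loop-based equivalent of the recursive example: sums x/2, x/4, ...
 * until the next term drops below epsilon. No call chain is built,
 * so the register stack does not grow with the number of terms. */
double iterative_sum(double x)
{
    const double epsilon = 0.001;
    double sum = 0.0;
    double temp = x / 2.0;

    while (temp >= epsilon) {
        sum += temp;
        temp /= 2.0;
    }
    return sum;
}
```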

For other strategies to remove RSE activity, see Chapter 9 of the manual listed below.


Source

Introduction to Microarchitectural Optimization for Itanium® Processors

 

