Adjusting Thread Stack Address To Improve Performance on Intel® Xeon® Processors

by Phil Kerly
Senior Software Engineer
Intel Corporation, Architecture Performance Engineering

Introduction

Intel® Xeon® processors with Hyper-Threading Technology enabled contain multiple logical processors per physical processor package, which facilitates simultaneous multi-threading on a single processor. The state information necessary to support each logical processor is replicated while sharing the underlying physical processor resources. Given that processor resources are generally under-utilized by most applications, processors enabled with Hyper-Threading Technology can improve overall application performance. Multiple threads running in parallel can achieve higher utilization and increased throughput. But given the nature of the shared physical resources, multiple threads vying for the same physical processor resources can impact your application's performance.

A Shared Resource: First Level Data Cache Address Resolution Unit

In most instances, sharing this resource is beneficial for overall application performance. In a few cases, however, sharing can negatively affect performance when two threads conflict. One such case is data domain decomposition, which subdivides a task's data into smaller pieces and then performs the identical operation on each. In particular, threads created for data domain decomposition are susceptible to cache line alias conflicts in the first level data cache.

When two data virtual addresses reside on cache lines that are a multiple of 64 KB apart, they compete for the same cache line in the first level data cache; this is referred to as a 64-KB alias conflict. It can degrade first level data cache performance and can also impact the branch prediction unit.

In addition to 64-KB alias conflicts, branch mispredictions may increase when the logic uses speculative data with addresses a multiple of 1 megabyte (MB) apart. Under Microsoft Windows* operating systems, thread stacks are currently created on multiples of a 1-MB boundary by default. Two threads with very similar stack frame images and access patterns to local variables on the stack are very likely to cause alias conflicts, resulting in significant performance degradation. This kind of performance degradation has been measured on real-world applications. Future implementations of the Intel Xeon processor with Hyper-Threading Technology will likely address both sources of alias conflicts.

But for now...

In the meantime, if you encounter performance degradation, a simple work-around exists. By adjusting the initial thread stack address of each of your threads, you can restore significant performance to your application on Intel Xeon processors with Hyper-Threading Technology. This work-around is explained in the next section, Adjusting Initial Thread Stack Address.

There are two ways to determine if your application performance on Hyper-Threading Technology enabled processors is suffering from these alias conflicts. The first, most definitive method is to try the suggested work-around across your application's performance workloads. By comparing the resulting performance with and without Hyper-Threading Technology enabled, you can directly measure the relative performance difference. The second method is to use Intel's VTune™ Performance Analyzer. You can find out more about this product by visiting the Intel® Software Development Products Web site.

To use the second method, collect both clock-tick events and 64-KB alias conflict events across your application's performance workloads with and without Hyper-Threading Technology enabled. After sorting the modules and functions in your application by clock ticks from highest to lowest, compare the number of 64-KB alias events. It is not unusual to see on the order of 3 times as many 64-KB alias events with Hyper-Threading Technology enabled. However, applications with 8 times or more the number of 64-KB alias events at a module or function level generally show significant performance improvement at that level using the optimization described in this paper. If a significant portion of the total execution time is spent in the module or function, this translates directly into overall application-level performance improvement.

Note that enabling or disabling Hyper-Threading Technology on Intel Xeon processors requires support in the system's BIOS. Some vendors' BIOS implementations may not expose a user-level option to enable or disable the feature.


Adjusting Initial Thread Stack Address

Typically, threads are created using an application programming interface specific to a particular operating system, which is passed a pointer to a function as well as a pointer to a block of data specific to the thread. The key to adjusting the initial thread stack address is to replace the original function pointer with an intermediate function that adjusts the stack by a variable amount depending on the number of threads created; see the Example Code below for an illustration. A new, intermediate parameter block is needed that contains a pointer to the original thread function, a thread id, and a pointer to the original parameter data block. The intermediate function then calls the original function, passing on the original, thread-specific parameter data. Using the new parameter block with a function pointer is a generic implementation that can serve a pool of threads that may need to invoke different functions.

Using an alternative method, you could avoid the function pointer technique and have the intermediate function call the original function directly although this would be less generic. However, be careful that the compiler does not in-line the original thread function within the alternative thread function. If the original thread function is 'in-lined' within the alternative function, you will lose the benefit of the adjusted stack address for the original function. Using the intermediate function method with a function pointer avoids this possibility because the compiler cannot determine which function to in-line at compile time.

The easiest way to adjust the initial stack address for each thread is to call the memory allocation function, _alloca, with varying byte amounts in the intermediate or alternative thread function. The _alloca function, declared in the malloc.h header file, allocates memory directly on the stack; by adjusting the number of bytes passed to it, you can adjust the next function's starting stack address. Using this technique to adjust the stack address allocates virtual memory in each thread's stack frame that will go unused. In the sample code at the end of this article, a 1-KB offset multiplied by the thread number is used to vary the stack frame starting address. 1 KB is not a magic number, but one that has generally worked across various applications. However, the Microsoft Windows* operating system currently limits the amount of virtual memory accessible to a given process. If this limit is an important consideration for your application, you will need to determine the best offset or modify the technique within this constraint.

Example Code: Original, Intermediate, and Alternative thread functions

      DWORD WINAPI OriginalThreadProc (LPVOID ptr)
      {
         // This would have been the original thread function
         return 0;
      }

      DWORD WINAPI IntermediateThreadProc (LPVOID ptr)
      {
         struct FunctionBlk* parameter = (struct FunctionBlk*) ptr;
         // Adjusting stack address
         _alloca (parameter->thread_number * STACK_OFFSET);
         return (*parameter->ThreadFuncPtr)(parameter->function_parameters);
      }

      DWORD WINAPI AlternateThreadProc (LPVOID ptr)
      {
         struct FunctionBlk* parameter = (struct FunctionBlk*) ptr;
         _alloca (parameter->thread_number * STACK_OFFSET);
         return OriginalThreadProc (parameter->function_parameters);
      }

 

When creating threads to subdivide a problem among multiple processors, consider using the main thread to do a portion of the work. The main thread is already likely to have a very different stack frame image and data access pattern from the child threads, which start with a clean stack frame aligned on a 1-MB boundary. Plus, there is one less thread to synchronize and manage. Note that this may not be desirable if the main thread must manage other tasks or remain responsive to user input.

Go to the end of this article for a complete source code listing that illustrates this basic technique.

Summary

The Intel Xeon processor with Hyper-Threading Technology shares the first level data cache among logical processors. Two data virtual addresses that reside on cache lines a multiple of 64 KB apart will conflict for the same cache line in the first level data cache. This can degrade first level data cache performance and can also impact the branch prediction unit. The alias conflict is particularly troublesome for applications that create multiple threads to perform the same operation on different data; subdividing a task into smaller tasks that perform the identical operation is often referred to as data domain decomposition. Under Microsoft Windows operating systems, thread stacks are created on 1-MB boundaries. Threads performing similar tasks and accessing local variables on their respective stacks will encounter address alias conflicts, resulting in significant overall application performance degradation. By adjusting each thread's starting stack address by a variable amount as described in this application note, an application can recover the degraded performance and will likely show overall application performance improvement on Intel Xeon processors with Hyper-Threading Technology. Note that this is a work-around; future Intel Xeon processors with Hyper-Threading Technology will likely address this limitation.

For more information:

Read about Per Processor Licensing: "Multi-core processors raise software licensing questions"
Visit the Intel® Xeon® processor and Intel® Multi-Core Developer Community developer areas to learn about how to integrate the benefits of these technologies into your applications now.


Sample Code

The following source code was compiled with the Microsoft Visual C++* 6.0 compiler.

      #include "windows.h"
      #include "malloc.h"

      #define NUM_THREADS    4
      #define STACK_OFFSET   1024
      // Pentium 4 2nd Level Cache Line Size
      #define CACHE_LINE_SZ  128

      struct ParameterBlk {
         int thread_specific_data;
         char padding[2*CACHE_LINE_SZ - sizeof(int)];
         // Keep padding to this data at least a cache line apart
         // Or copy the data to local variables within the thread
         // Repeated access to these variables may cause cache thrashing
      };

      typedef DWORD (*PFI) (void*);
      // function block that contains arguments that will be provided to each thread
      struct FunctionBlk {
         PFI ThreadFuncPtr;
         struct ParameterBlk* function_parameters;
         unsigned int thread_number;
         // Keep padding to this data at least a cache line apart
         // Or copy the data to local variables within the thread
         // Repeated access to these variables may cause cache thrashing
         // If we can't enforce the array to start on a cache line boundary
         // then we should pad to 2*CACHE_LINE_SZ == sizeof(struct FunctionBlk)
         char padding[2*CACHE_LINE_SZ - sizeof(PFI)
                                      - sizeof(struct ParameterBlk*)
                                      - sizeof(unsigned int)];
      };

      struct FunctionBlk gThreadParameters[NUM_THREADS];
      struct ParameterBlk gOriginalParameters[NUM_THREADS];

      DWORD WINAPI OriginalThreadProc (LPVOID ptr)
      {
         // This would have been the original thread function
         return 0;
      }

      DWORD WINAPI IntermediateThreadProc (LPVOID ptr)
      {
         struct FunctionBlk* parameter = (struct FunctionBlk*) ptr;
         // Adjusting stack address
         _alloca (parameter->thread_number * STACK_OFFSET);
         // Calling original thread procedure using a function pointer
         // You could call the function directly but be careful that
         // the function doesn't get inlined.
         return (*parameter->ThreadFuncPtr)(parameter->function_parameters);
      }

      void main (void)
      {
         int i;
         for (i = 0; i < NUM_THREADS - 1; i++)
         {
            gThreadParameters[i].ThreadFuncPtr = (PFI) OriginalThreadProc;
            // pointer to the original parameters
            gThreadParameters[i].function_parameters = &(gOriginalParameters[i]);
            gThreadParameters[i].thread_number = i;

            // This was the original create thread call:
            // CreateThread (NULL, 1024000, OriginalThreadProc,
            //               (void*) &(gOriginalParameters[i]), 0, NULL);

            // If you specify the initial stack memory commit size (2nd parameter),
            // you may want to adjust the commit size by the offset amount
            // as well. Minimum stack size for Microsoft Windows OS is 1 MB.
            CreateThread (NULL, 1024000, IntermediateThreadProc,
                          (void*) &(gThreadParameters[i]), 0, NULL);
         }

         // Use the main thread as one of the child threads if possible
         // pointer to the original parameters
         gThreadParameters[NUM_THREADS - 1].function_parameters =
            &(gOriginalParameters[NUM_THREADS - 1]);
         // We don't need the intermediate thread function. The main thread
         // is likely to have a different stack frame image and access pattern
         // already.
         OriginalThreadProc (
            (void*) gThreadParameters[NUM_THREADS - 1].function_parameters);
      }

 


For more complete information about compiler optimizations, see our optimization notice.