Cache Blocking Technique on Hyper-Threading Technology Enabled Processors

by Phil Kerly

Introduction

Hyper-Threading Technology-enabled processors contain multiple logical processors per physical processor package. The state information necessary to support each logical processor is replicated, while the underlying physical processor resources are shared. Because most applications generally underutilize processor resources, Hyper-Threading Technology-enabled processors can improve overall application performance: multiple threads running in parallel can achieve higher utilization and increased throughput. Since logical processors share underlying physical processor resources, interactions between multiple threads utilizing those resources can have synergistic effects.

The various levels of cache are one such shared physical processor resource. This paper describes the levels of cache within the Intel NetBurst® microarchitecture, how effective use of data locality through a cache data blocking technique can improve application performance, and how to tune this technique for Hyper-Threading Technology-enabled processors. Note that this paper is not targeted at a specific processor, so the cache sizes and implementation details described in this document are subject to change in future processors. At the time this article is being published, Hyper-Threading Technology is enabled only in Intel® Xeon® processors, but Intel announced at the Spring Intel Developer Forum that this technology will be available on desktop processors sometime in 2003.

Intel NetBurst Micro-architecture: Cache

Intel NetBurst microarchitecture has two levels of cache (see Figure 1). Note that the Intel® Xeon® processor MP has an additional 3rd level cache, not shown in the figure. The first level cache is separated into an execution trace cache and a first level data cache. The execution trace cache stores previously decoded micro-ops to remove decoder latencies from main execution loops. The trace cache, as well as the first level data cache, is closely coupled to the processor pipeline. The second level Advanced Transfer Cache ensures a steady supply of instructions to the front-end and data to the execution pipeline through the first level data cache.

Figure 1: Intel NetBurst MicroArchitecture

In the current Intel Xeon processor, all levels of cache are a shared resource between logical processors within a physical Hyper-Threading Technology-enabled processor package. Although the execution trace cache is a shared resource between logical processors, individual traces within the execution trace cache are tagged to specific logical processors so that traces from one logical processor are not used by the other.

The 1st level data cache is virtually addressed and is also shared between logical processors. The 2nd level and, if present, 3rd level caches are unified caches holding both data and instructions. They are physically addressed and are shared resources as well. Separate threads of execution within an application, as well as separate applications, accessing the same physical address will resolve to the same cache lines. Note that the implementations of the trace cache, 1st level data cache, and 2nd/3rd level unified caches as described are specific to the Hyper-Threading Technology-enabled Intel Xeon and Xeon MP processors. The implementation on future Hyper-Threading Technology-enabled processors is subject to change.


Cache Blocking Technique

Many factors impact cache performance, and effective use of data cache locality is one of the most significant. The well-known data cache blocking technique is used to take advantage of data cache locality. The technique restructures loops that iterate frequently over large data arrays by sub-dividing the large array into smaller blocks, or tiles. Each data element in the array is then reused within its data block, which is sized to fit within the data cache, before the loop moves on to the next block or tile.
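As a minimal sketch of the transformation (illustrative names only; the complete, threaded version is in Appendix A), the untiled loop applies every iteration to the whole array, while the blocked loop applies all iterations to one cache-sized block before moving on:

// Hypothetical sketch of the cache blocking transformation.
// Without blocking, every pass touches the entire array and evicts data
// before it can be reused:
//
//   for (i = 0; i < ITERATIONS; i++)
//       for (j = 0; j < ARRAY_SZ; j++)
//           sum += array[j];
//
// With blocking, all ITERATIONS passes are applied to one block at a time,
// so the block stays resident in the data cache while it is being reused.
void BlockedSum(unsigned int* results, unsigned int* array,
                unsigned int ARRAY_SZ, unsigned int BLOCK_SZ,
                unsigned int ITERATIONS)
{
    unsigned int sum = 0;
    for (unsigned int index = 0; index < ARRAY_SZ; index += BLOCK_SZ) {
        // Handle the final, possibly partial, block.
        unsigned int count = (index + BLOCK_SZ > ARRAY_SZ) ? ARRAY_SZ - index
                                                           : BLOCK_SZ;
        for (unsigned int i = 0; i < ITERATIONS; i++)
            for (unsigned int j = 0; j < count; j++)
                sum += array[index + j];
    }
    *results = sum;
}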

Depending on the application, the cache data blocking technique can be very effective. It is widely used in linear algebra and is a common transformation applied by compilers and application programmers. Since the 2nd level unified cache contains instructions as well as data, compilers also often try to take advantage of instruction locality by grouping related blocks of instructions close together. Typical applications benefiting from cache data blocking are image or video applications, where the image can be processed in smaller portions of the total image or video frame. The effectiveness of the technique, however, is highly dependent on the data block size, the processor's cache size, and the number of times the data is reused.

By way of example, a sample application is provided to demonstrate the performance impact of this technique (see Appendix A). Figure 2 shows the results of cache blocking with varying block sizes on the sample application. At the sweet spot, a tile size of around 450-460 KB that closely matches the unified L2 cache size, the application almost doubles in performance. This is only an example; the block size sweet spot for any given application will vary based on how much of the L2 cache is used by other cached data within the application as well as by the application's cached instructions. Typically, an application should target a block size of approximately one-half to three-quarters of the cache size. In general, it is better to err on the side of a block size that is too small rather than too large.

Additionally, the performance of the data cache blocking technique scales well with multiple processors if the algorithm is threaded for data decomposition. Because each block of data can be processed independently of the other blocks, the work decomposes naturally into separate blocks that can be processed in separate threads of execution. Figure 2 also shows the performance improvement of the cache blocking algorithm for two threads running on a dual-processor system with two physical processors. The performance curve for two threads matches very closely the performance curve for a single-processor system, with the block size sweet spot at around 450-460 KB per thread but at approximately twice the performance. Assuming there is very little synchronization necessary between the two threads, as in this example, it is reasonable to expect that the block size sweet spot would not vary significantly: both processors have independent caches of equal size, in this case 512 KB of L2 cache each.



Figure 2: Effects of Cache Block Size on Performance

Since the threaded data cache blocking technique can provide significant performance opportunities on multi-processor systems, applications should detect the number of processors in order to dynamically create a thread pool so that the number of threads available for the cache blocking technique matches the number of processors. On Hyper-Threading Technology-enabled Intel processors, the application should detect the number of logical processors supported across all physical processors in the system. Be aware that a minimum block size should be established such that the overhead of threading and synchronization does not exceed the benefit from threading your algorithm.
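As an illustration, the thread pool could be sized with the Win32 GetSystemInfo call, consistent with the Windows APIs used in Appendix A; on a Hyper-Threading Technology-enabled system the reported processor count includes logical processors. The program below is a sketch, not part of the sample application:

#include <windows.h>
#include <stdio.h>

int main()
{
    SYSTEM_INFO si;
    GetSystemInfo(&si);

    // dwNumberOfProcessors reports the number of logical processors in the
    // system, so it already accounts for Hyper-Threading Technology.
    unsigned int num_threads = si.dwNumberOfProcessors;
    printf("Sizing worker thread pool to %u threads\n", num_threads);

    // ... create num_threads worker threads, one per logical processor,
    //     and hand each thread its own blocks of data to process ...
    return 0;
}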

Actual performance for a given application depends on the relative impact of the L2 cache misses, and their associated memory latencies, that would be incurred without cache blocking. For an application whose execution time is large relative to those memory latencies, the performance benefit of cache blocking is correspondingly smaller.

Intel processors support the capability to determine cache information including cache size. Cache information is available on the following processors:

  • Pentium® II processor
  • Pentium® Pro processor
  • Pentium® III processor
  • Celeron® processor
  • Pentium® 4 processor
  • Xeon® processor
  • Xeon® MP processor

 

Refer to the Intel® 64 and IA-32 Architectures Software Developer's Manual, Vol. 2A for information on how to determine the processor family and how to gather cache size information using the CPUID instruction. Applications can use this information at run-time to dynamically adjust cache block sizes, targeting approximately one-half to three-quarters the size of the physical cache, and thereby maximize application performance across the various processors that do not support Hyper-Threading Technology.
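For example, the raw cache descriptors can be read from CPUID leaf 2; decoding each descriptor byte into a cache size still requires the descriptor table in the manual. The sketch below uses the __cpuid intrinsic available in later Microsoft compilers (earlier compilers can issue CPUID with inline assembly):

#include <intrin.h>
#include <stdio.h>

int main()
{
    int regs[4];  // EAX, EBX, ECX, EDX

    // CPUID leaf 2 returns cache and TLB descriptor bytes packed into the
    // four registers.  The least-significant byte of EAX indicates how many
    // times leaf 2 must be queried (typically 1); each remaining non-zero
    // byte maps to a cache or TLB property via the descriptor table in the
    // Software Developer's Manual, Vol. 2A.
    __cpuid(regs, 2);

    for (int r = 0; r < 4; r++)
        printf("descriptor register %d: 0x%08X\n", r, regs[r]);
    return 0;
}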


Cache Block Size on Hyper-Threading Technology Processors

With the introduction of Hyper-Threading Technology in the Intel® Xeon® processor, in which the cache is shared between logical processors, the relationship between block size and cache size still holds, but it is now also relative to the number of logical processors supported by the physical processor. Therefore, the performance curve on a Hyper-Threading Technology-enabled processor is different. Figure 3 compares the performance of a single processor with one thread, a Hyper-Threading Technology-enabled processor with two threads, and a dual-processor system with two threads. Even for large per-thread block sizes greater than the cache size, application performance is significantly faster than the single-processor version. Threading in this case helps hide the data dependency latencies between instructions: more loads can be issued simultaneously and more instructions can retire per clock.

But once L2 cache misses and memory latencies are no longer an issue for the single-processor version, performance on a Hyper-Threading Technology-enabled processor degrades relative to the single-processor version. That is because the total block size across all threads still exceeds the shared cache size: L2 cache misses and memory latencies now penalize the Hyper-Threading Technology processor, whereas the data blocks of the single processor now fit in the L2 cache. Once the block size reaches the sweet spot on Hyper-Threading Technology-enabled processors, it begins to outperform the single processor even at the single processor's ideal block size. For this example, note that the block size sweet spot for Hyper-Threading Technology-enabled processors is ~125-133 KB, which is a little more than one quarter of the block size sweet spot for the single- and dual-processor configurations. In general, applications running on a Hyper-Threading Technology-enabled processor supporting two logical processors should target a cache block size of one-quarter to one-half the actual cache size of the physical processor. Again, it is better to err on the side of a block size that is too small rather than too large.
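As a rough illustration, a hypothetical helper (not part of the sample code) could derive the per-thread block size from the cache size and the number of logical processors sharing it:

// Hypothetical helper: choose a per-thread block size (in array elements)
// from the physical cache size and the number of logical processors sharing
// that cache.  Targets roughly half of the cache per package, divided among
// the threads that share it.
unsigned int PickBlockSize(unsigned int cache_bytes,
                           unsigned int logical_per_package,
                           unsigned int element_bytes)
{
    unsigned int per_thread_bytes = (cache_bytes / 2) / logical_per_package;
    return per_thread_bytes / element_bytes;
}

// Example: a 512 KB L2 cache shared by two logical processors, with 4-byte
// elements, gives 32768 elements (128 KB) per thread, in line with the
// ~125-133 KB sweet spot measured above.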



Figure 3: Effects of Cache Block Size on Hyper-Threading Technology Enabled Processors

If you are already using the cache blocking technique and your application degrades on a Hyper-Threading Technology-enabled processor versus a single-processor system, block size may be the cause. If the number of threads created times the data block size exceeds the known cache size, you have your first clue that this is the source of the performance degradation. You can also use the Intel® VTune™ Performance Analyzer to collect L2 or L3 Cache Read Misses event counts on single-processor, dual-processor, and Hyper-Threading Technology-enabled processors for a comparative analysis. If the event count differs significantly between the systems, further investigation into cache block size is warranted; if the event count with cache blocking approaches the count without cache blocking, this is a likely source of the performance degradation.


Conclusion

Hyper-Threading Technology-enabled processors can improve overall application performance. Multiple threads running in parallel can achieve higher utilization and increased throughput. Since logical processors share underlying physical processor resources, interactions between multiple threads utilizing those resources can have synergistic effects. The cache hierarchy in Intel Xeon processors is one such shared resource.

Applications should be designed or modified to effectively take advantage of shared cache resources on Hyper-Threading Technology-enabled processors. The cache blocking technique is one very effective way to take advantage of data locality and can improve overall application performance on single-processor, multi-processor, or Hyper-Threading Technology-enabled systems. Applications should detect the data cache size using Intel's CPUID instruction and dynamically adjust cache blocking tile sizes to maximize performance across processor implementations. In addition, applications should detect the total number of logical processors in the system in order to dynamically create a thread pool so that the number of threads available for the cache blocking technique matches the number of processors. Be aware that a minimum block size should be established so that the overhead of threading and synchronization does not exceed the benefit from threading. Cache block sizes should target approximately one-half to three-quarters the size of the physical cache for non-Hyper-Threading Technology processors and one-quarter to one-half the physical cache size for a Hyper-Threading Technology-enabled processor supporting two logical processors.

References

The following documents are referenced in this application note, and provide background or supporting information for understanding the topics presented in this document.

  • Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 2A: Instruction Set Reference


Appendix A: Sample Source Code

The sample code creates two threads. Each thread has its affinity set to a particular processor. The threads wait on an Event object for the main program to set up the parameters before setting the Event object. The main thread then waits for both worker threads to complete their operation and set another Event object signaling completion of the block operation. The block size that each thread operates on, as well as the number of iterations over the block, is parameterized. The actual operation on the block in the sample code is arbitrary and is used just for illustration: each element of the array is added to itself plus a constant value, the results are totaled across all elements in the array, and the whole operation is repeated the number of times defined by the iteration variable. The code for this operation is in the "CacheBlocking" function. The important features of this sample application are the method of constructing the threads, the method of synchronization between the threads and the main program, and the method of block size parameterization. These techniques could be used in applications to allow dynamic tuning of the cache blocking technique based on the processor cache size, the number of physical processors, and the number of logical processors per physical processor.

//Microsoft® 32-bit C/C++ Optimizing Compiler Version 12.00.8804 for 80x86
//Copyright (C) Microsoft Corp 1984-1998. All rights reserved.
//
// Compile command (link with winmm.lib for timeGetTime):
// cl /ML /W3 /GX /O2 /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_MBCS" /Fo"Release/" /Fd"Release/" /FD /c LoopTest.cpp

// LoopTest.cpp : Defines the entry point for the console application.
//

#include "stdio.h"
#include "malloc.h"
#include "windows.h"
#include "MMsystem.h"

typedef unsigned int uint;
typedef unsigned __int64 uint64;

DWORD start, stop;

void get_time_start()
{
    start = timeGetTime();
}

DWORD get_time_stop()
{
    DWORD time;
    stop = timeGetTime();
    time = DWORD(stop - start);
    return time;
}


struct ThreadParams
{
    HANDLE hThread;
    HANDLE hStart;
    HANDLE hEnd;
    uint*  data;
    uint   results;
    uint   id;
    uint   array_sz;
    uint   block_sz;
    uint   iterations;
    uint   padding[64];   // padding to avoid false sharing
} thread_parameters[2];

HANDLE hEvents[2];


void CacheBlocking (uint* results, uint* array, uint ARRAY_SZ, uint BLOCK_SZ, uint ITERATIONS)
{
    uint sum = 0;
    uint index = 0;
    unsigned int i, j;

    // Process the array one block at a time, reusing each block
    // ITERATIONS times before moving on to the next block.
    for (index = 0; index < ARRAY_SZ;) {
        uint* data = &array[index];
        index += BLOCK_SZ;
        if (index > ARRAY_SZ) BLOCK_SZ = ARRAY_SZ - (index - BLOCK_SZ);
        for (i = 0; i < ITERATIONS; i++)
            for (j = 0; j < BLOCK_SZ; j++)
                sum += data[j] + data[j] + ITERATIONS;
    }
    *results = sum;
}


void NonCacheBlocking (uint* results, uint* array, uint ARRAY_SZ, uint ITERATIONS)
{
    uint sum = 0;
    unsigned int i, j;

    for (i = 0; i < ITERATIONS; i++)
        for (j = 0; j < ARRAY_SZ; j++)
            sum += array[j] + array[j] + ITERATIONS;
    *results = sum;
}


DWORD WINAPI ThreadFunction(void *vThread)
{
    struct ThreadParams *p = (struct ThreadParams *)vThread;

    // software fix for 64K/1M aliasing
    int *stackoffset = (int *)_alloca(512*p->id);

    while (1) {
        WaitForSingleObject (p->hStart, INFINITE);
        CacheBlocking(&(p->results), p->data, p->array_sz,
                      p->block_sz, p->iterations);
        ResetEvent (p->hStart);
        SetEvent (p->hEnd);
    }
    return 0;
}


void InitializeThreads ()
{
    thread_parameters[0].hThread = (HANDLE)CreateThread (NULL,   // security
                        0,                                        // default stack_size
                        (LPTHREAD_START_ROUTINE) ThreadFunction,  // function
                        &(thread_parameters[0]),                  // arglist
                        CREATE_SUSPENDED,                         // initflag
                        NULL);                                    // threadaddr
    thread_parameters[0].hStart = (HANDLE)CreateEvent (NULL, TRUE, FALSE, NULL);
    thread_parameters[0].hEnd   = (HANDLE)CreateEvent (NULL, TRUE, FALSE, NULL);
    thread_parameters[0].id = 0;
    SetThreadAffinityMask (thread_parameters[0].hThread, 1 << 0);

    thread_parameters[1].hThread = (HANDLE)CreateThread (NULL,   // security
                        0,                                        // default stack_size
                        (LPTHREAD_START_ROUTINE) ThreadFunction,  // function
                        &(thread_parameters[1]),                  // arglist
                        CREATE_SUSPENDED,                         // initflag
                        NULL);                                    // threadaddr
    thread_parameters[1].hStart = (HANDLE)CreateEvent (NULL, TRUE, FALSE, NULL);
    thread_parameters[1].hEnd   = (HANDLE)CreateEvent (NULL, TRUE, FALSE, NULL);
    thread_parameters[1].id = 1;
    SetThreadAffinityMask (thread_parameters[1].hThread, 1 << 1);

    hEvents[0] = thread_parameters[0].hEnd;
    hEvents[1] = thread_parameters[1].hEnd;

    ResumeThread(thread_parameters[0].hThread);
    ResumeThread(thread_parameters[1].hThread);
}


void TimeThreadedCacheBlocking (uint* array, uint ARRAY_SZ, uint BLOCK_SZ, uint ITERATIONS)
{
    uint array_sz = ARRAY_SZ/2;
    uint sum = 0;

    get_time_start();

    // Give each worker thread half of the array to process.
    thread_parameters[0].data       = array;
    thread_parameters[0].array_sz   = array_sz;
    thread_parameters[0].block_sz   = BLOCK_SZ;
    thread_parameters[0].iterations = ITERATIONS;
    SetEvent(thread_parameters[0].hStart);

    thread_parameters[1].data       = &(array[array_sz]);
    thread_parameters[1].array_sz   = array_sz;
    thread_parameters[1].block_sz   = BLOCK_SZ;
    thread_parameters[1].iterations = ITERATIONS;
    SetEvent(thread_parameters[1].hStart);

    // Wait for both worker threads to signal completion.
    WaitForMultipleObjects(2, hEvents, TRUE, INFINITE);
    ResetEvent(thread_parameters[0].hEnd);
    ResetEvent(thread_parameters[1].hEnd);

    sum = thread_parameters[0].results + thread_parameters[1].results;
    printf ("%u msec\t", get_time_stop());
    printf ("Block Size: %u K\t", BLOCK_SZ*sizeof(uint)/1024);
    printf ("Results: %u\n", sum);
}


void TimeCacheBlocking (uint* array, uint ARRAY_SZ, uint BLOCK_SZ, uint ITERATIONS)
{
    uint sum = 0;
    get_time_start();
    CacheBlocking (&sum, array, ARRAY_SZ, BLOCK_SZ, ITERATIONS);
    printf ("%u msec\t", get_time_stop());
    printf ("Block Size: %u K\t", BLOCK_SZ*sizeof(uint)/1024);
    printf ("Results: %u\n", sum);
}

void TimeNonCacheBlocking (uint* array, uint ARRAY_SZ, uint ITERATIONS)
{
    uint sum = 0;
    get_time_start();
    NonCacheBlocking (&sum, array, ARRAY_SZ, ITERATIONS);
    printf ("%u msec\t", get_time_stop());
    printf ("Block Size: 0 K\t\t");
    printf ("Results: %u\n", sum);
}


int main(int argc, char* argv[])
{
    uint ITERATIONS = 1000;
    uint ARRAY_SZ   = 4096000;
    uint* array = (uint*) malloc(sizeof(uint)*ARRAY_SZ);

    // Pin the main thread to the second processor.
    SetThreadAffinityMask (GetCurrentThread(), 2);

    for (unsigned int i = 0; i < ITERATIONS; i++)
        for (unsigned int j = 0; j < ARRAY_SZ; j++)
            array[j] = 3;

    // Block sizes below are in array elements (multiply by sizeof(uint) for bytes).
    printf ("No Cache Blocking\n");
    TimeNonCacheBlocking (array, ARRAY_SZ, ITERATIONS);

    printf ("Single Threaded Cache Blocking\n");
    TimeCacheBlocking (array, ARRAY_SZ, 204800, ITERATIONS);
    TimeCacheBlocking (array, ARRAY_SZ, 136534, ITERATIONS);
    TimeCacheBlocking (array, ARRAY_SZ, 117029, ITERATIONS);
    TimeCacheBlocking (array, ARRAY_SZ, 102400, ITERATIONS);
    TimeCacheBlocking (array, ARRAY_SZ, 68267, ITERATIONS);
    TimeCacheBlocking (array, ARRAY_SZ, 34134, ITERATIONS);
    TimeCacheBlocking (array, ARRAY_SZ, 25600, ITERATIONS);
    TimeCacheBlocking (array, ARRAY_SZ, 12800, ITERATIONS);
    TimeCacheBlocking (array, ARRAY_SZ, 6400, ITERATIONS);
    TimeCacheBlocking (array, ARRAY_SZ, 3200, ITERATIONS);
    TimeCacheBlocking (array, ARRAY_SZ, 1600, ITERATIONS);

    InitializeThreads();

    printf ("2 Threads Cache Blocking\n");
    TimeThreadedCacheBlocking (array, ARRAY_SZ, 204800, ITERATIONS);
    TimeThreadedCacheBlocking (array, ARRAY_SZ, 136534, ITERATIONS);
    TimeThreadedCacheBlocking (array, ARRAY_SZ, 117029, ITERATIONS);
    TimeThreadedCacheBlocking (array, ARRAY_SZ, 102400, ITERATIONS);
    TimeThreadedCacheBlocking (array, ARRAY_SZ, 68267, ITERATIONS);
    TimeThreadedCacheBlocking (array, ARRAY_SZ, 34134, ITERATIONS);
    TimeThreadedCacheBlocking (array, ARRAY_SZ, 25600, ITERATIONS);
    TimeThreadedCacheBlocking (array, ARRAY_SZ, 12800, ITERATIONS);
    TimeThreadedCacheBlocking (array, ARRAY_SZ, 6400, ITERATIONS);
    TimeThreadedCacheBlocking (array, ARRAY_SZ, 3200, ITERATIONS);
    TimeThreadedCacheBlocking (array, ARRAY_SZ, 1600, ITERATIONS);

    free(array);
    return 0;
}

 


About the Author

Phil Kerly is a Senior Software Engineer with Intel's Software and Solutions Group, and has been with Intel for 11 years. Over the last 5 years, Phil has worked on various software application optimization efforts for the Pentium® processor family and the Itanium® processor family. He is currently developing methods to optimize applications for Hyper-Threading Technology-enabled processors. Phil also chairs Intel's team to define best known methods to improve application performance on Hyper-Threading Technology-enabled processors. He is a frequent contributor to Intel Developer Zone, and holds a master's degree in Computer Science from the University of Delaware, where he also received bachelor's degrees in Computer Science and Chemical Engineering.


For more complete information about compiler optimizations, see our Optimization Notice.