Software Techniques for Shared-Cache Multi-Core Systems

by Tian Tian, Technical Marketing Engineer
and Chiu-Pi Shih, Technical Marketing Engineer


Introduction

Design and fine-tune applications for better performance

As the trend moves from single-core to multi-core processors for the next computing performance leap, system architects have several options for organizing one of the most important system resources: the cache. Some architectures keep the last-level cache private to each core for simplicity, while others share the last-level cache among the cores for a better performance/cost ratio and improved resource allocation.

As an example, the Intel® Core™ Duo processor is one of the first processors to introduce this shared-cache architecture into a multi-core environment with its Intel® Smart Cache offerings. This organization allows both cores to have access to the entire 2MB last-level cache (in this case it is L2 cache), reducing possible resource underutilization. The enhanced data pre-fetch logic improves efficiency by pre-fetching data to the L2 cache even before cache requests occur. Intel Smart Cache also features a new power-saving mechanism that enables the L2 Intel Smart Cache to dynamically flush its ways into system memory, based on demand, or during periods of inactivity.

The shared cache architecture brings exciting new opportunities to software and system designers. This document first describes the overall benefits of the architecture and explains why it is generally better than architectures that provide individual caches for each of the processors in the system. It then captures software architecture and programming techniques that take advantage of this new hardware feature, helping users design and fine-tune applications for better performance. Without proper use of these techniques, the benefits of the shared cache architecture may not be fully realized.

Benefits of the shared cache architecture

Figure 1 shows two processors (processor 0 and processor 1) sharing the same system bus and system memory. Inside each processor there are two CPU cores; each has its own private L1 cache while sharing the L2 cache. The benefits of such a shared cache system are many:

  • Efficient usage of the last-level cache
    • If one core idles, the other core takes all the shared cache
    • Reduces resource underutilization
  • Flexibility for programmers
    • Allows more data-sharing opportunities for threads running on separate cores that are sharing cache
    • One core can pre-/post-process data for the other core
    • Alternative communication mechanisms between cores
  • Reduced cache-coherency complexity
    • Reduced false sharing because of the shared cache
    • Less workload to maintain coherency, compared to the private-cache architecture
  • Reduced data-storage redundancy
    • The same data only needs to be stored once
  • Reduced front-side bus traffic
    • Effective data sharing between cores allows data requests to be resolved at the shared-cache level instead of going all the way to system memory

 

Figure 1 Shared Cache Example: A dual-core, dual-processor system

 

For hardware-specific benefits such as Intel Smart Cache power savings and pre-fetch logic features, please refer to Ref [4]. This article focuses on the benefits of this architecture, mostly from a software perspective. The discussions throughout this white paper use a dual-core/dual-processor system (Figure 1) as the example; in this case, the shared last-level cache is the L2 cache. However, the benefits and software techniques described in this article apply generically to other shared-cache systems. There are challenges in making the best use of a shared cache, and they are addressed in the following sections.


Software techniques for shared-cache multi-core systems

There are quite a few well-known techniques for using cache effectively. In this article we will focus on those that are particularly relevant to multi-core systems with the shared cache architecture described in the previous section.

Processor Affinity

In a multi-core environment, the control over which core to run a specific thread or application on is essential. Without this control, the threads/applications may get assigned to the wrong processors or cores and cause unnecessary performance degradation.

It may be worthwhile to separate contention-only threads (threads that are not sharing any data but need to use the cache) from each other and assign them to cores that are not sharing the cache. On the other hand, it is also important to consider assigning data-sharing threads to cores that are sharing the cache, or even assigning them to the same core if they are closely tied to each other.

In different operating systems, the functions to control process affinity may be different.

Windows* example:

/* Set processor affinity */
BOOL WINAPI SetProcessAffinityMask(HANDLE hProcess, DWORD_PTR dwProcessAffinityMask);

/* Set thread affinity */
DWORD_PTR WINAPI SetThreadAffinityMask(HANDLE hThread, DWORD_PTR dwThreadAffinityMask);

 

Linux* example:

/* Get the CPU affinity for a task */
extern int sched_getaffinity (
pid_t pid, size_t cpusetsize, cpu_set_t *cpuset);

/* Set the CPU affinity for a task */
extern int sched_setaffinity (
pid_t pid, size_t cpusetsize,
const cpu_set_t *cpuset);
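
To make the usage concrete, here is a minimal sketch (not from the original article) that pins the calling process to core 0 on Linux using sched_setaffinity(); the choice of core 0 and the error handling are illustrative assumptions.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t mask;

    /* Build a CPU set containing only core 0 (an illustrative choice) */
    CPU_ZERO(&mask);
    CPU_SET(0, &mask);

    /* pid 0 means "the calling task"; restrict it to the cores in the mask */
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    /* ... run the cache-sensitive work here, now confined to core 0 ... */
    return 0;
}

A similar effect can be achieved on Windows by passing the desired core bitmask to SetThreadAffinityMask() for the thread in question.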

 

Cache blocking (data tiling)

The cache blocking technique (also called data tiling) aims to keep data in the cache while it is being processed by data loops. By reducing unnecessary cache traffic (fetching and evicting the same data throughout the loops), a better cache hit rate is achieved.

Possible candidates for the cache blocking technique include large data sets that are operated on many times. If the application makes one pass through the entire data set for the first operation, another pass for the second operation, and so on, and the entire data set does not fit into the cache, the first elements in the cache will be evicted to fit the last elements in. Then, on each subsequent pass through the loop, loading new data causes the older data to be evicted, possibly creating a domino effect where the entire data set needs to be reloaded for each pass. By subdividing the large data set into smaller blocks and running all of the operations on each block before moving on to the next block, there is a good likelihood that the entire block will remain in cache through all the operations.

Cache blocking sometimes requires software designers to “think outside the box” and choose the flow or logic that is not the most obvious or natural implementation, but is more beneficial to the cache utilization.

Figure 2: Improve cache hit rate with the cache blocking technique

 

The flow on the left of Figure 2 shows a data loop over a large data set whose size is greater than the cache size. In order to complete the loop once, some of the data fetched into the cache at the beginning of the loop has to be evicted in order to process the later portion of the data. On the next iteration of the loop, the data that was evicted earlier needs to be re-fetched into the cache. As a result, this generates more traffic in and out of the cache and introduces more cache-request misses.

The flow shown on the right of Figure 2 is broken down into four data loops over smaller chunks of the data set, with each chunk having a size smaller than the cache size. As a result, during each loop, the smaller chunk is able to stay in the cache longer, resulting in less traffic into and out of the cache, and reduced cache-request miss ratio. The example described here is probably the simplest cache blocking technique. There are more advanced cache blocking schemes available, depending on the nature of the applications and algorithms.
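
As a minimal sketch (not from the original article), the following loop-blocking example in C assumes a data set of doubles that is larger than the cache and two hypothetical per-element operations; the block size is an assumption that would normally be tuned to the target cache.

#include <stddef.h>

#define TOTAL_ELEMENTS (4 * 1024 * 1024)  /* data set larger than the cache */
#define BLOCK_ELEMENTS (16 * 1024)        /* block sized to stay resident in cache */

/* Hypothetical per-element operations, stand-ins for real processing passes */
static void operation_a(double *x) { *x = *x * 2.0 + 1.0; }
static void operation_b(double *x) { *x = *x - 0.5; }

void process_blocked(double *data)
{
    /* Run every operation on one cache-sized block before moving to the next,
       so the block stays in cache instead of being evicted between passes. */
    for (size_t base = 0; base < TOTAL_ELEMENTS; base += BLOCK_ELEMENTS) {
        size_t end = base + BLOCK_ELEMENTS;
        if (end > TOTAL_ELEMENTS)
            end = TOTAL_ELEMENTS;

        for (size_t i = base; i < end; i++)  /* first pass over the block */
            operation_a(&data[i]);

        for (size_t i = base; i < end; i++)  /* second pass, data still cached */
            operation_b(&data[i]);
    }
}

Without blocking, each pass would sweep the full data set and evict it before the next pass; with blocking, both passes reuse the same resident block.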

Hold approach

The hold approach involves reducing the frequency of access to the shared data by letting each thread maintain its own private copy of data, only updating the shared copy when it is necessary.

Figure 3: Holding the private copy and only updating the shared copy when necessary

 

In Figure 3 A, thread 0 and thread 1 update the shared data copy frequently, resulting in more traffic in and out of the shared cache. In Figure 3 B, the threads are more disciplined about accessing the shared copy, updating it only when necessary. One way to achieve this is to keep a private data copy for tracking until updates must be made to the shared copy. A practical example is to use the OpenMP* reduction clause, so that each thread keeps a private copy and the results are consolidated only when all threads are complete.
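
For illustration only (this snippet is not from the original article), the following C/OpenMP sketch shows the reduction clause mentioned above: each thread accumulates into a private copy of sum, and the shared copy is touched only once, when the threads finish.

#include <omp.h>
#include <stdio.h>

#define N (1 << 20)

int main(void)
{
    static double data[N];
    double sum = 0.0;

    for (int i = 0; i < N; i++)
        data[i] = (double)i;

    /* Each thread keeps a private partial sum; OpenMP consolidates the
       partial sums into the shared 'sum' only after the parallel loop ends. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += data[i];

    printf("sum = %f\n", sum);
    return 0;
}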

Delayed approach

The delayed approach takes advantage of natural L1 cache eviction by inserting a wait so that data gets pushed to the shared L2 cache before the other core requests access to it. When two threads are running on two cores that share the L2 cache and the threads share data, it is possible to implement such a scheme to fine-tune the performance.


Figure 4: Conceptual flow of the delayed approach

 

Shown in Figure 4 A is a design that uses the shared L2 cache to implement simple data sharing between thread 0 on core 0 and thread 1 on core 1. Detailed operations for Figure 4 A are:

  • Steps 1 and 2: thread 0 grabs data into the L1 cache and modifies it
  • Step 3: thread 0 sends a signal to thread 1 to tell it the data is ready
  • Step 4: thread 1 checks if the data is in L2 cache, gets a miss
  • Step 5: data is evicted into L2 cache
  • Step 6: thread 1 gets data

 

By introducing a delay in Figure 4 B, a portion of the data processed by thread 0 may already have been evicted into the L2 cache. As a result, the overall cache hit rate may be improved. Detailed operations for Figure 4 B are:

  • Steps 1 and 2: thread 0 grabs data into the L1 cache and modifies it
  • Step 3: instead of sending a signal immediately, thread 0 waits until a portion of the processed data is evicted into L2 cache. The amount of delay needed is determined by L1 cache usage and data size
  • Step 4: thread 0 sends a signal to thread 1 for processing
  • Step 5: thread 1 asks for data
  • Step 6: at least a portion of the data is already in L2 cache, which results in a better L2 hit rate

 

In reality, the environment is noisy and there may be multiple contenders for the caches, so it may be difficult to get crisp control of the delay needed. This technique is recommended for applications where the designers have control over which threads are running, and it is best used to fine-tune an application and squeeze out a few extra cycles.
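
As a rough sketch only (the pthreads signaling scheme, the block size, and the delay amount below are illustrative assumptions, not part of the original design), the producer can withhold its "data ready" notification for a block until it has processed a few additional blocks, giving the earlier data time to drift from L1 into the shared L2 cache before the consumer touches it.

#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE   4096   /* bytes produced per block (assumed) */
#define NUM_BLOCKS   256
#define DELAY_BLOCKS 2      /* assumed delay: announce block k only after
                               block k + DELAY_BLOCKS has been produced */

static char data[NUM_BLOCKS][BLOCK_SIZE];
static int  ready_blocks = 0;                 /* blocks handed off so far */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

static void publish(int count)                /* hand blocks [0, count) to the consumer */
{
    pthread_mutex_lock(&lock);
    if (count > ready_blocks)
        ready_blocks = count;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
}

static void *producer(void *arg)
{
    (void)arg;
    for (int b = 0; b < NUM_BLOCKS; b++) {
        memset(data[b], b & 0xFF, BLOCK_SIZE);    /* "process" block b in L1 */
        if (b >= DELAY_BLOCKS)
            publish(b - DELAY_BLOCKS + 1);        /* delayed hand-off */
    }
    publish(NUM_BLOCKS);                          /* flush the remaining blocks */
    return NULL;
}

static void *consumer(void *arg)
{
    long total = 0;
    (void)arg;
    for (int b = 0; b < NUM_BLOCKS; b++) {
        pthread_mutex_lock(&lock);
        while (ready_blocks <= b)                 /* wait until block b is announced */
            pthread_cond_wait(&cond, &lock);
        pthread_mutex_unlock(&lock);
        for (int i = 0; i < BLOCK_SIZE; i++)      /* read block b, ideally from L2 */
            total += data[b][i];
    }
    printf("total = %ld\n", total);
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}

Pinning the two threads to the two cores that share the L2 cache (see the processor-affinity section) is what makes a scheme like this meaningful in the first place.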

Avoid false sharing

False sharing is a well-known problem in multiprocessor environments. It happens when threads on different processors write to the same cache line, but not to the same location within it. Since the processors write to different locations, there is no real coherency problem; however, the cache-coherency protocol marks the cache line dirty, and when the other location receives a read or write request, the hardware logic forces a reload of an up-to-date copy of the cache line. Frequent updates of the data in a falsely shared cache line can cause severe performance degradation. As a result, false sharing should be avoided, particularly in heavily exercised areas of the code.

Although the shared-cache architecture has reduced cache-coherency issues at the shared-cache level, the multi-core system overall may still have false sharing issues. For example, to maintain cache coherency in the private L1 cache, false sharing could happen. As illustrated in Figure 5, thread 0 in core 0 brings a cache line to L1 and modifies a portion of it. Thread 1 in core 1 wants to modify a different portion of the same cache line, as well. The hardware coherency logic detects that the cache line has been modified by core 0. As a result, the L1 cache line from core 0 must be evicted to L2 cache, and then loaded into core 1’s L1 cache before the update can be completed.

Figure 5: False sharing

Moreover, in multiprocessor systems, there could be false sharing between cores that are not sharing the cache. For example, in Figure 1, threads running on core 0 will possibly run into false sharing issues with threads running on core 2.

Tips to fix false sharing

One of the most common techniques for solving false-sharing issues is to allocate the non-shared data to different cache lines, as shown in the sketch below. Another technique is to copy the global variable to a local function variable and copy the data back before the function exits. When padding data in this way, group the data structures together and keep the size of the data elements at or just below a multiple of the cache-line size, in order to avoid wasting too much cache. Another tip is to consider moving the threads that share the cache line onto the same core.
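
As an illustrative sketch (the 64-byte line size and the GCC-style alignment attribute below are assumptions; the actual line size should be checked for the target processor and compiler), per-thread counters can be padded and aligned so that each one occupies its own cache line:

#include <stdio.h>

#define CACHE_LINE_SIZE 64   /* assumed cache-line size; verify for the target CPU */
#define NUM_THREADS     2

/* Without the padding, the per-thread counters could land on the same cache
   line and falsely share it when both threads update their own counter. */
struct padded_counter {
    volatile long value;
    char pad[CACHE_LINE_SIZE - sizeof(long)];     /* keep each counter on its own line */
} __attribute__((aligned(CACHE_LINE_SIZE)));      /* GCC-style alignment attribute */

static struct padded_counter counters[NUM_THREADS];

int main(void)
{
    /* Each thread would update only counters[thread_id].value */
    printf("sizeof(struct padded_counter) = %zu bytes\n",
           sizeof(struct padded_counter));
    return 0;
}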

More information on false sharing can be found in Ref [6] and [8].


Cache-friendly design and performance-tuning

Since the cache is such an important resource, designers may want to pay attention to it throughout all phases of development, including system design, system diagnostics, and system tuning. An understanding of the shared-cache architecture can contribute to each phase, as described in the following three sections.

System design phase – thread placement based on data sharing/contention

System/software designers need to be aware of the shared cache architecture, beginning in the system design phase, to best create a system devoid of cache issues. Look for the following scenarios when designing applications, and choose the processor affinity accordingly:

  • If two threads are completely independent of each other and their relationship in cache usage is purely contentious, they may be better off assigned to cores that are not sharing the last-level cache (for example, core 0 and core 2 in Figure 1). By doing that, we ensure that the performance is on par with the private cache architecture in this scenario.
  • If two threads are intimately sharing the data and need to communicate with each other very frequently, then the best option may be to assign them to the same core so they share the L1 cache.
  • If it is somewhere in between the above two cases (two threads need to share some data, but not with great frequency), it may be worth exploring assigning them to two cores that are sharing the L2 cache (for example, core 0 and core 1 in Figure 1). By doing that, the threads may save time by resolving shared-data requests in the shared L2 cache instead of going all the way to memory.

 

System diagnostic phase - identify bottlenecks and monitor cache events using Intel® VTune™ technologies

The next natural step is to run performance diagnostics. Intel VTune and the threading utilities (Intel® Thread Checker, Intel® Thread Profiler, etc.) provide a nice set of tools for identifying potential bottlenecks in a multi-core environment. The VTune utilities make it easier to trace call graphs, sample targeted events, and identify possible problem areas such as deadlocks or data contention. The Intel VTune Web site and the Intel® Developer Zone Web site are good resources for information about using these technologies. The hotspots and bottlenecks flagged by the tools, as well as areas with high concentrations of cache-miss events, usually require further investigation.

System tuning phase - fine-tune system performance

Once the bottlenecks are identified, it is time to use some of the techniques described in this document:

  • Check processor affinity again to see if cache contention can be avoided or cache-sharing can be facilitated by relocating the threads
  • Check for a high concentration of cache events to see if there is false sharing, and fix it by relocating or padding the data, or by assigning the threads to the same core
  • Check data loops to see if cache blocking can be used
  • Check data loops to see if a hold approach can be used to be more disciplined when updating the shared copy
  • Check if cheaper synchronization primitives can be used
  • Check if a delay approach may be suitable

 

Certainly, system designers and end-users may find more and more techniques (and interesting ways) to take advantage of the shared cache.

Examples

The following sections delve into several examples of the techniques we have discussed. The experiments were run and the results captured on a system with dual Intel® Core™ Duo T2500 processors.

Example 1: Processor affinity

To illustrate the effect of using different processor-affinity schemes, we run a simple data-sharing application with two affinity configurations. One configuration (Figure 6 A) assigns the two threads that need to share data to cores 0 and 2, and the other configuration (Figure 6 B) assigns them to the same core. Circled in green are the L2 cache-request misses and L2 cache-read misses captured by the VTune Sampling Wizard. Note that the two tests differ by nearly a factor of 2. This simple example demonstrates that processor affinity has a significant impact on multi-core application performance, and it is very important to use it properly. For information on the related function calls, please refer to the previous section on processor affinity.

Figure 6: Effect of processor affinity in a multi-core/multiprocessor environment

 

Example 2: Data tiling and cache blocking

Let us take a look at the effect of using the cache blocking technique on some simple code. To get a baseline, we will first run a data-loop processing algorithm that writes a value to a memory location, and then later reads it to perform another calculation. Each byte of a 2MB data block is processed individually, starting from byte 0 in a non-blocking fashion, and, for simplicity, the Writer and Reader loops are contained in the same thread. Then we will introduce cache blocking within each processing loop (i.e. the write loop and the read loop), using several different block sizes. The pseudocode for the case without cache blocking follows:

#define NUM_OP (400)           /* Repeat 400 times for better measurement */
#define TOTAL_SIZE_B (2097152) /* 2 MB data size */

for (i = 0; i < NUM_OP; i++)
{
    /* Writer */
    for (number = 0; number < TOTAL_SIZE_B; number++)
    {
        Process one byte of data;
    }

    /* Reader */
    for (number = 0; number < TOTAL_SIZE_B; number++)
    {
        Read and send out one byte of data;
    } /* End of one round of processing over all data */

} /* End of test loop */

 

Then we introduce cache blocking into the flow by breaking the 2MB data block into multiple chunks and having the loops run over each chunk. We tried numerous cache-blocking sizes (CACHE_BLK_SZ_B) ranging from 4KB to 64KB. The pseudocode for the cases with cache blocking is:

#define NUM_OP (400)           /* Repeat 400 times for better measurement */
#define TOTAL_SIZE_B (2097152) /* 2 MB data size */
#define CACHE_BLK_SZ_B (X)     /* X = 4, 8, 16, 32, 64 KB */
#define NUM_BLKS (TOTAL_SIZE_B/CACHE_BLK_SZ_B)

for (i = 0; i < NUM_OP; i++)
{
    for (number = 0; number < NUM_BLKS; number++)
    {
        /* Writer */
        for (j = 0; j < CACHE_BLK_SZ_B; j++)
        {
            Process one byte of data;
        }

        /* Reader */
        for (j = 0; j < CACHE_BLK_SZ_B; j++)
        {
            Read and send out one byte of data;
        }
    } /* End of one round of processing over all data */
} /* End of test loop */

 

As an example, we trace several L2 events for the original code without cache blocking, and then we trace for the modified code with various cache blocking sizes. The following table (Table 1) contains sample data points. Note that with the introduction of cache blocking, the numbers of L2 read misses and L2 request misses are reduced.

A better view of the results is shown as the percentage of reduction in terms of targeted L2 events after applying the cache blocking technique (Figure 7).

Figure 7: Reduction of L2 events as a result of cache blocking techniques

 

Two characteristics stand out in Figure 7. First, all of the cache-blocking sizes used in this experiment reduce the number of L2 miss and eviction events. Second, it is not true that the smaller the block size, the better the reduction; instead, there appears to be a sweet spot between 16KB and 32KB where cache blocking achieves most of the reduction without going to finer granularities. Of course, the range of the sweet spot may change depending on the cache-usage situation and the individual application, so it must be evaluated case by case.

We have illustrated the effect of cache blocking with a simple example. By using this method, the cache traffic is reduced and the cache hit rate is improved. For more information on cache blocking techniques, please refer to Ref [7].

Example 3: Delay

As we discussed earlier and illustrated in Figure 4, by introducing a small amount of delay between one core, which is generating data, and the other core, which is requesting data, it may be possible to improve the cache hit rate slightly. However, in reality, we find it rather difficult to determine precisely the amount of delay needed, because other running applications also contend for the cache.

We implemented the exact flow described in Figure 4 A and then inserted a certain amount of delay (measured by the size of the data generated by core 0 before notifying core 1 for processing), as described in Figure 4 B. The results (Figure 8) show slight drops in the number of “L2 Read Miss” events after introducing the delays. Certain delay sizes (1KB, for example) seem to be more effective at reducing the L2 read-miss rate in the tests we ran.

Figure 8: Effect of the delay technique in a simple example

 

We also found that the environment is noisy and the delay effect may not be obvious, depending on what other processes are running at the same time. As the benefit of this method appears to be marginal and volatile (note that there is only a 7% reduction in the 1KB case, and the number may vary from run to run), it is only recommended as a possible area for final performance-tuning. It may be effective in situations where the system has firm control over which processes are running and consistent delay parameters can be obtained.


Conclusion

Multi-core processors with a shared-cache architecture, such as the Intel Core Duo processor, take a revolutionary step toward bringing the benefits of power savings, dynamic cache utilization, and design flexibility to system designers and end-users.

In this document, we have described the advantages of the shared-cache architecture, including efficient resource usage, reduced cache-coherency complexity, reduced data-storage redundancy, and reduced front-side bus traffic. We have also summarized several software techniques to use the shared cache more effectively:

  • Use processor affinity to avoid data contention and facilitate data sharing
  • Cache blocking (or data tiling)
  • Hold approach
  • Avoid false sharing
  • Delayed approach

 

As the industry begins to adopt shared-cache multi-core processors, more and more techniques will be established by system designers and end-users to use this feature to their advantage. It is also expected that tools and compilers for multi-core systems will adopt some of these techniques, making it easier and faster to develop efficient multi-core applications in the near future.


References

 


Additional Resources

 


About the Authors

Tian Tian is a Software Technical Marketing Engineer supporting multi-core software development as well as the Intel® IXP2XXX Network Processor family. His previous experience includes developing NPE (Network Processor Engine) microcode for the IXP4xx product line of network processors. He was also the main designer of the IXP4xx product line's Operating System Abstraction Layer (OSAL) in software release v1.5. Currently he is engaged in developing methods to optimize applications for multi-core processors.

 

Chiu-Pi Shih is a Software Technical Marketing Engineer supporting multi-core software development as well as the Intel® IXP2XXX Network Processor family. Her previous experience includes working as a software engineer on Intel® IXA Software Development Kit (SDK) tools. She currently leads the support for SDK development tools. Prior to joining Intel, Chiu-Pi worked for the Compaq/Digital Alpha64 motherboard group.


 

