Intel® Time Coordinated Computing Cache Allocation Samples

Overview

 

Explore the cache allocation strategies to create performant real-time applications using Intel® Time Coordinated Computing.

Transcript

 

In this video, we're going to be talking about Intel's TCC tools: the cache allocation library, utilities, and samples. We'll start by talking about the Intel Real-Time Configuration Manager. After that, we'll introduce the cache allocation library, which contains a C-level API that allows you to allocate memory specifically for your real-time process. Next, we'll go through a typical cache allocation optimization workflow. We'll then talk about the command-line tools associated with the cache configurator. Finally, we'll follow up with cache allocation examples that show you how to measure cache allocation latencies with an internal noisy neighbor and an external noisy neighbor.

When dealing with real-time systems, cache configuration is an important concept to optimize. Cache misses negatively affect the latency of real-time applications. With Intel® TCC Tools, you can reduce cache misses by allocating buffers that are less likely to be evicted from the processor's cache.

Here, the system uses software SRAM to create low-latency buffers. Software SRAM is a software construct that uses hardware capabilities to allocate a portion of the physical address space into cache. Data at these addresses is less likely to be evicted by the same or other processes.

The Real-Time Configuration Manager (RTCM) is an EFI application that makes it possible to use software SRAM technology on non-virtualized systems. RTCM enables software SRAM technology by implementing functionality traditionally reserved for hypervisors. To use software SRAM, you must have RTCM or a hypervisor that supports software SRAM.

For real-time application development, the cache allocation library is a set of C language APIs that help reduce memory access latency by allocating buffers that cannot be evicted from the cache. The library allocates buffers in L2 cache and L3 cache to meet low latency requirements. What are the advantages of the cache allocation library?

One: The library helps bound the time needed to access data from a memory buffer to a certain maximum tolerable latency. To achieve low latency, it creates a buffer in cache using a mechanism that helps protect the data from being evicted by the same or other processes attempting to use the same cache. Protection solely from other processes may be obtained by simply using Cache Allocation Technology without the cache allocation library.

Two: The library will enable you to move your application from one Intel platform to another without refactoring your code, as long as your application is running on a supported platform. The library will assess the memory hierarchy of the platform and create a buffer based on the latency requirement.

Three: The library is useful when you need to bound the worst-case execution time of a particular function in your application and cannot achieve that worst-case execution time with conventional memory allocation methods like malloc due to memory access jitter. In that case, you can replace malloc with the cache allocation library and turn the allocation into a worst-case execution time buffer, as sketched below.
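As a rough illustration, the replacement pattern looks like the following sketch. The header name and the tcc_cache_* function names and signatures shown here are assumptions for illustration only; verify the exact API against the cache allocation library documentation shipped with Intel TCC Tools.

/* Sketch: replacing malloc with a latency-bounded cache allocation.
 * Assumed header and function names (tcc/cache.h, tcc_cache_*); check
 * the Intel TCC Tools cache allocation library documentation. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "tcc/cache.h"              /* assumed header name */

#define BUF_SIZE        4096
#define MAX_LATENCY_NS  45          /* maximum tolerable access latency */

int main(void)
{
    /* Initialize the library for the core the real-time thread runs on.
     * Core 3 is only an example value. */
    if (tcc_cache_init(3) != 0) {
        fprintf(stderr, "cache allocation init failed\n");
        return EXIT_FAILURE;
    }

    /* Instead of: char *buf = malloc(BUF_SIZE);
     * request a buffer whose access latency stays within MAX_LATENCY_NS.
     * The library chooses DRAM, L3, or L2 based on the requirement. */
    char *buf = tcc_cache_malloc(BUF_SIZE, MAX_LATENCY_NS);
    if (buf == NULL) {
        fprintf(stderr, "latency requirement cannot be met\n");
        tcc_cache_finish();
        return EXIT_FAILURE;
    }

    memset(buf, 0, BUF_SIZE);       /* real-time work would use buf here */

    tcc_cache_free(buf);
    tcc_cache_finish();
    return EXIT_SUCCESS;
}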

Intel recommends the following workflow for cache allocation.

Step 1: Determine how much cache should be reserved for the software SRAM regions. This is determined by requirements for the buffer needed by the system and applications.

Step 2: After determining how much of the cache should be reserved for the software SRAM regions in your application, use the cache configurator to reserve that portion of cache, now called software SRAM. This makes it no longer available to the rest of the system; it's only available via the cache allocation library. Determine how to partition the remaining cache between the CPU cores, GPU, and I/O. Intel provides a set of presets in the cache configurator tool, with varying levels of cache isolation and software SRAM, to cover typical scenarios.

Step 3: Prepare your real-time application to execute with specific CPU affinity based on the reserved cache region.

Step 4: Use cache allocation library APIs to allocate memory in the cache region that has been reserved.
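For step 3, pinning the process to a specific core can be done with the standard Linux affinity call shown in the minimal sketch below; the core number is just an example and should match the core associated with your reserved cache region. The cache allocation library calls that follow (step 4) are not shown here.

/* Sketch: pin the real-time process to one CPU core (step 3) before
 * allocating from the reserved cache region (step 4). The core number
 * is an example; choose the core tied to your reserved cache partition. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(3, &set);               /* example: run only on core 3 */

    /* pid 0 means "the calling process" */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return EXIT_FAILURE;
    }

    /* From here on, cache allocation library calls (step 4) allocate
     * buffers accessed with low latency from the pinned core. */
    return EXIT_SUCCESS;
}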

Typically, the tool is used during the development phase to achieve an optimal cache partitioning scheme as determined by the system integrator with feedback from the application developers. If the cache partitioning requirements change after a system has been deployed into production, it's possible to specify a new cache partitioning scheme, including software SRAM regions, simply by rerunning the tool on the target system.

The cache configurator is a command-line tool that enables you to discover and manage cache memory resources. Low-level system resources, such as cache, memory, or CPU cores, are traditionally managed by operating systems or hypervisors.

The tool can:

  • Provide low-latency buffer access (via software SRAM regions) to real-time applications running on the system (via the cache allocation library API)
  • Provide mechanisms to improve the worst-case execution time
  • Minimize the impact the GPU has on real-time applications running on the CPU cores
  • Partition the shareable cache resources among the various components using the cache (such as CPU, GPU, or I/O)

Intel recommends that those experimenting with software SRAM or cache partitioning start with the presets included in the cache configurator tool. Presets are designed to cover the majority of use cases. You can also use the cache configurator to further refine the configuration as needed.

In general, the presets partition the cache to:

  • Establish isolated cache regions dedicated to real-time applications, also known as workloads
  • Restrict the GPU from accessing the entire L3 cache, and avoid overlap with cache partitions dedicated to real-time workloads
  • Dedicate a small portion of the cache for low-latency I/O operations
  • Configure L2 and L3 software SRAM regions

Let's take a look at a couple of the demos that show the noisy neighbor effect, both using an internal noisy neighbor and an external noisy neighbor.

We’ve already run the Real-Time Readiness Checker to determine that the system is set up for real-time computing.

Now we'll look at cache allocation. We will see the impact of leveraging the cache allocation library to reduce buffer latency. You can run these samples yourself, as they are included with the Intel TCC Tools package, to see the improvements in latency.

First, when you're running these demos, make sure that you SSH into your target machine and run the uname -v command. Confirm that the kernel version string contains the PREEMPT_RT flag. This lets you know that your operating system has real-time capabilities compiled in.

For the first demo, we're going to run the internal noisy neighbor example. Go to the documentation and find the internal noisy neighbor example. The documentation will walk you through this demo, and we suggest that you begin with the parameters outlined there. Afterward, feel free to change them and experiment.

Copy the command to run the example. The collect flag enables the measurement and analysis tools to record information about the latency in the example. The stress flag runs an internal noisy neighbor. In this test run, we're going to set the maximum tolerable latency to 200 nanoseconds. This tells the cache allocation example to use DRAM, because DRAM, although slower than cache, still meets that requirement. In the output, we see that the latency requirement was 200 nanoseconds. The stress was internal, meaning that the noise came from the same process as the measurement. We did 100 iterations, and below we can see the minimum latency, the maximum latency, and the average latency. Here the minimum latency is 73 nanoseconds, the maximum latency is 89 nanoseconds, and the average latency is 81 nanoseconds. Now we run the exact same example, except this time, rather than using a maximum latency of 200 nanoseconds, we're going to reduce it to 45 nanoseconds. This tells the sample that cache is needed.

Notice that the latency per buffer access is significantly lower in this example, with the minimum being six nanoseconds, the maximum being eight, and the average being seven. This is because this time the buffers have been allocated in the L2 cache rather than in DRAM.

If you were to take the data from these two runs and graph them, you would see a chart that looks something like this. Notice that the top of the yellow data is the first run, the run that was allocated mainly in DRAM. There is a certain amount of variance in the data, as you can see from the vertical spread. Now look at the second run, which was allocated entirely within the L2 cache. One of the first things you will notice is that there is much less variance, much less deviation, in the data presented here.
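To get a feel for where numbers like these come from, here is a minimal sketch of how per-iteration buffer access latency could be measured with standard clock_gettime timing. This is an illustration only, not the actual sample code shipped with Intel TCC Tools, and the buffer here comes from a plain malloc rather than the cache allocation library.

/* Sketch: measuring min/max/average buffer access latency.
 * Illustrative only; times a simple read loop with CLOCK_MONOTONIC. */
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

#define ITERATIONS 100
#define BUF_SIZE   4096

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

int main(void)
{
    volatile char *buf = malloc(BUF_SIZE);  /* or a cache-allocated buffer */
    if (buf == NULL)
        return EXIT_FAILURE;

    uint64_t min = UINT64_MAX, max = 0, total = 0;

    for (int i = 0; i < ITERATIONS; i++) {
        uint64_t start = now_ns();
        for (size_t j = 0; j < BUF_SIZE; j += 64)   /* touch each cache line */
            (void)buf[j];
        uint64_t elapsed = now_ns() - start;

        if (elapsed < min) min = elapsed;
        if (elapsed > max) max = elapsed;
        total += elapsed;
    }

    printf("min %lu ns, max %lu ns, avg %lu ns per iteration\n",
           (unsigned long)min, (unsigned long)max,
           (unsigned long)(total / ITERATIONS));
    free((void *)buf);
    return EXIT_SUCCESS;
}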

Now let's run the same example, except this time the noise is not generated inside the process; instead, an external process on the same machine generates the load. To generate that load, we're going to use the stress-ng command, an open source tool that anyone can download and use. Next, jump back to the documentation for the external noisy neighbor cache allocation example and copy the command line that launches the example.

Notice that, just like in the internal noisy neighbor example, the latencies are all somewhat higher because these allocations have taken place in DRAM. Next, we'll run the external noisy neighbor example with a maximum latency of 45 nanoseconds. You can see that the results are again similar to the internal noisy neighbor example. This time the allocations take place in L2 cache and are significantly faster, with lower jitter.

This gives you a better understanding of what it's like to use the cache allocation library to develop real-time applications. Download and test Intel TCC Tools today.

To find out more, I'd encourage you to visit the Intel IoT Real-Time Technologies page and understand the foundation of real-time tuning that Intel offers. Please read the published tuning guides for the supported processor families, and to get hands-on experience with TCC, download the BSP and other base software packages from the Intel Resource and Design Center. Finally, download and try the TCC Tools package from the Intel® Developer Zone.

Product and Performance Information

1. Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.