Developer Guide

  • 2021.2
  • 06/11/2021
  • Public

How the System Allocates Buffers

The cache allocation library performs buffer allocation in L2 or L3 cache by accessing an already prepared software SRAM buffer.
The library processes the latency parameter by comparing it to the latency values received from an OS driver called the real-time configuration driver.
Using the following values as an example (11th Gen Intel® Core™ processors), the library allocates buffers as follows:
  • If the latency parameter is lower than 6 ns, the library cannot create a buffer because the latency is too low. The library returns a null pointer.
  • If the latency parameter is in the range of 6 to 48 ns, the library allocates a buffer in L2.
  • If the latency parameter is in the range of 49 to 113 ns, the library allocates a buffer in L3.
  • If the latency parameter is 114 ns or more, the library allocates a buffer in DRAM using standard memory functions (malloc, calloc, realloc, free).
  • If you were to move your application to a platform with only L2 cache and DRAM, the library would allocate a buffer in L2 for any latency request in the L3 range.
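The following minimal sketch shows these ranges in use. It assumes the cache allocation library exposes the tcc_cache_init(), tcc_cache_malloc(), tcc_cache_free(), and tcc_cache_finish() entry points found in Intel® TCC Tools samples; verify the exact names, signatures, and header against your release.

#include <stdio.h>
#include <stdlib.h>
#include "tcc/cache.h"  /* assumed header name for the cache allocation library */

int main(void)
{
    /* Initialize the library; the core-affinity argument (core 0 here)
     * is an assumption taken from Intel TCC Tools samples. */
    if (tcc_cache_init(0) != 0) {
        fprintf(stderr, "cache allocation library initialization failed\n");
        return EXIT_FAILURE;
    }

    void* l2_buf   = tcc_cache_malloc(4096, 30);  /* 6-48 ns   -> L2 */
    void* l3_buf   = tcc_cache_malloc(4096, 100); /* 49-113 ns -> L3 */
    void* dram_buf = tcc_cache_malloc(4096, 200); /* >= 114 ns -> DRAM */
    void* rejected = tcc_cache_malloc(4096, 3);   /* < 6 ns    -> NULL */

    if (rejected == NULL) {
        printf("latency below the L2 floor: allocation refused\n");
    }

    tcc_cache_free(l2_buf);
    tcc_cache_free(l3_buf);
    tcc_cache_free(dram_buf);
    tcc_cache_finish();
    return EXIT_SUCCESS;
}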
The library processes the buffer size parameter by comparing it to the values in the config file, config/.tcc.config. The buffer size values in the config file are the maximum size allowed for one application.
The total size of a software SRAM buffer is defined using Cache Configurator and may be shared by multiple applications. Increasing the buffer size value in the configuration file beyond the software SRAM size set in Cache Configurator will have no effect.
The configuration file is shared among all applications that access software SRAM. The buffer size value can be used to limit software SRAM consumption by one application and to prevent any application from taking all software SRAM.
Buffer size values for DRAM are an internal mechanism and have no limiting effect on the requested buffer size.
The following table shows a summary of the default buffer size values in the config file.

Maximum L2 Buffer Size (bytes) | Maximum L3 Buffer Size (bytes) | Maximum DRAM Buffer Size (bytes)
98304                          | 262144                         | Unlimited
For systems running the PR1 release of the Yocto Project*-based BSP for Intel Atom® x6000E Series processors, you need to reboot the target after changing the size in the .tcc.config file.
You can allocate multiple buffers in one application, up to the size in .tcc.config. The library reserves 32 bytes of the empty buffer for service data, plus 8 bytes per allocation. Therefore, each allocation consumes the requested buffer size + 8 bytes.
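For example, with the default L2 limit, three 16384-byte allocations consume 32 + 3 × (16384 + 8) = 49208 bytes of the 98304-byte budget, leaving 49096 bytes for further allocations.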
The current implementation does not support memory defragmentation. After allocating and freeing multiple buffers, the software SRAM may become highly fragmented, making it impossible to allocate large buffers. Intel does not address fragmentation because software designs that use software SRAM are expected to preallocate memory up front for the lifetime of the application.
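A minimal sketch of that preallocation pattern, using the same assumed tcc_cache_* entry points as the earlier example:

/* Allocate all real-time buffers once at startup and keep them for the
 * lifetime of the application so that software SRAM is never fragmented
 * by repeated allocate/free cycles. The buffer count, 8192-byte size,
 * and 30 ns latency request are illustrative assumptions. */
#define RT_BUFFER_COUNT 4
static void* rt_buffers[RT_BUFFER_COUNT];

int rt_buffers_init(void)
{
    for (int i = 0; i < RT_BUFFER_COUNT; i++) {
        rt_buffers[i] = tcc_cache_malloc(8192, 30); /* 30 ns -> L2 on this platform */
        if (rt_buffers[i] == NULL) {
            return -1; /* software SRAM exhausted or .tcc.config limit reached */
        }
    }
    return 0; /* buffers are reused, never freed, until shutdown */
}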

Buffer Performance

In Determining Cache Allocation Library Inputs, the latency values in the table indicate the worst-case buffer access times for the cache. Worst-case buffer access time refers to the longest observed time for the application to access an element of the buffer from its location in the memory hierarchy. Generally, the best-case buffer access time occurs when the buffer resides in the L1 cache and the system is nearly idle, and the worst-case buffer access time occurs when the buffer resides in DRAM and the system is extremely congested.
The latency values in the table are based on the worst possible environment. The following factors contribute to this environment:
  • Memory access pattern of the real-time application
  • Load
Memory Access Pattern
The time that it takes to access an element of the buffer is directly related to where the element currently resides in the memory hierarchy. Although the cache allocation library has the ability to “lock” buffers into specific levels of the memory hierarchy, the level where an element is actually found depends on the pattern in which the buffer is accessed. This buffer access pattern is the manner in which the real-time application reads buffer elements. The application can read buffer elements in an arbitrary order; however, linear and random access are the limit cases, and all access patterns lie on the spectrum between the two.
In addition to the spatial locality of the access pattern (linear vs. random), the delay between references plays a part in where in the memory hierarchy the buffer element would be found. Arithmetic intensity is a measure of the amount of compute done on each element retrieved from memory. A high arithmetic intensity would result in the CPU taking longer to reference the next element due to the sheer number of calculations that it must perform on the current element before referencing the next element. A low arithmetic intensity would result in near back-to-back memory references.
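Expressed as a ratio, using the standard definition:

  arithmetic intensity = compute operations performed / bytes of memory referenced

A kernel that performs many calculations per element has high arithmetic intensity, while a kernel that merely copies or sums elements sits at the low end of the scale.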
With a sufficiently high arithmetic intensity and a sufficiently linear access pattern, the hardware prefetchers would be able to “read-ahead” and prime the cache with the data.
In contrast, the worst-case access pattern is a random pointer-chase: a traversal of a linked list whose nodes are scattered randomly through the buffer. Because of the random nature of the pointer-chase, there is no spatial locality, and the data is generally not prefetched into cache. As a result, each access of an element in the buffer incurs the latency of whatever level of the memory hierarchy the library “locked” the buffer into.
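The two limit cases can be sketched as follows. This is an illustrative microbenchmark, not part of the library; the node layout and sizes are arbitrary assumptions.

#include <stddef.h>

typedef struct node {
    struct node* next;
    char payload[56];  /* pad each node to one 64-byte cache line */
} node_t;

/* Linear access: consecutive elements, high spatial locality. With enough
 * arithmetic per element, hardware prefetchers can read ahead and prime
 * the cache before the next reference. */
long sum_linear(const long* buf, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += buf[i];
    }
    return sum;
}

/* Random pointer-chase: each load supplies the address of the next, so
 * there is no spatial locality and nothing useful to prefetch. Every
 * access pays the latency of the level the buffer is "locked" into. */
const node_t* chase(const node_t* head, size_t steps)
{
    while (steps--) {
        head = head->next;  /* next address unknown until this load completes */
    }
    return head;
}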
Load
Load refers to the congestion of the system from a so-called noisy-neighbor workload. The noisy neighbor impacts the buffer access time whenever it is competing for shared resources with the function attempting to access the buffer. Some of these shared resources could include:
  • Core resource congestion due to another hardware thread (Intel® Hyper-Threading Technology)
  • Translation Lookaside Buffer (TLB) congestion
  • Congestion in any shared path or buffers between the core and the cache or memory
  • Congestion in the shared cache itself
  • Congestion in the Memory Controller
The worst-case interference exploits congestion of these shared resources.
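Such interference can be approximated with a simple streaming stressor. This sketch is illustrative only; the 64 MiB working-set size is an arbitrary assumption chosen to exceed the shared cache.

#include <stdlib.h>
#include <string.h>

#define WORKING_SET_BYTES (64u * 1024u * 1024u)  /* larger than the shared cache */

/* Continuously stream writes through a large buffer to keep the shared
 * cache, the core-to-uncore paths, and the memory controller busy. Run
 * on a sibling core (or hardware thread) of the real-time workload to
 * approximate a noisy neighbor. */
void noisy_neighbor(void)
{
    char* buf = malloc(WORKING_SET_BYTES);
    if (buf == NULL) {
        return;
    }
    for (;;) {
        memset(buf, 0xA5, WORKING_SET_BYTES);  /* evicts shared-cache lines */
    }
}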
Summary
The cache allocation library uses latency values that were collected with the worst possible combination of memory access pattern and load. It is expected that, under the conditions in your environment, the actual measured access time will never exceed the worst-case time from the table.

Product and Performance Information

1 Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.