Developer Guide

  • 2021.2
  • 06/11/2021
  • Public

Reduce Latency (Internal Noisy Neighbor)

You can run the cache allocation sample to see a general improvement in maximum latency when the cache allocation library allocates a buffer in L2 cache versus a standard buffer in DRAM. This example uses the sample’s internal noisy neighbor to load the system.
In this example, you will run the sample twice.
  1. In the first run, you will specify a high latency requirement, which will cause the sample to allocate a buffer in DRAM via standard malloc.
  2. In the second run, you will specify a low latency requirement, which will cause the sample to use the cache allocation library to lock data in L2 cache.
  3. You will compare maximum latency measurements and can expect an improvement when data is locked in L2 cache.
    To show the effect of the cache allocation library as clearly as possible, the following commands use interrupt masking during workload execution. While interrupt masking from userspace applications is not a common practice, real-time application developers use other methods for the same purpose of reducing OS impact on workload execution. These methods depend on the system configuration and often include routing interrupts to a non-real-time CPU core.
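The two runs differ only in how the sample's buffer ends up allocated: with a loose latency requirement the buffer comes from standard malloc in DRAM, and with a tight requirement the cache allocation library places it in L2 cache. The following C sketch illustrates the pattern an application would follow. It is not the sample's source code, and the tcc_cache_init, tcc_cache_malloc, tcc_cache_free, and tcc_cache_finish calls (and the tcc/cache.h header) are assumptions about the cache allocation library's C API, so verify the exact names and signatures against the headers shipped with your installation.
    /* Illustrative sketch only (not the sample's source). The tcc_cache_* calls
     * and the tcc/cache.h header are assumptions about the cache allocation
     * library's C API; verify names and signatures against your installation. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <tcc/cache.h>   /* assumed header for the cache allocation library */

    #define BUFFER_SIZE (256 * 1024)

    int main(int argc, char **argv)
    {
        /* Latency requirement in nanoseconds, e.g. 300 (DRAM) or 45 (L2 cache). */
        unsigned long latency_ns = (argc > 1) ? strtoul(argv[1], NULL, 10) : 300;

        /* Assumed: bind the allocation context to the real-time core (core 3 here). */
        if (tcc_cache_init(3) != 0) {
            fprintf(stderr, "tcc_cache_init failed\n");
            return 1;
        }

        /* Assumed: the library chooses the memory level that can satisfy the
         * requested per-access latency; a loose requirement ends up in DRAM,
         * a tight one in cache. */
        void *buffer = tcc_cache_malloc(BUFFER_SIZE, latency_ns);
        if (buffer == NULL) {
            fprintf(stderr, "tcc_cache_malloc failed\n");
            tcc_cache_finish();
            return 1;
        }

        memset(buffer, 0, BUFFER_SIZE);  /* touch the buffer (workload goes here) */

        tcc_cache_free(buffer);
        tcc_cache_finish();
        return 0;
    }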
To run this example:
  1. From your host system, connect to the target system:
    ssh <user>@<target>
  2. In the SSH session, run the sample with a latency requirement that corresponds to DRAM:
    tcc_cache_allocation_sample --collect --latency 300 --stress --mask_interrupts
    where:
    --collect
        Enable measurement result collection.
    --latency 300
        Maximum tolerable latency for a single cache line access, in nanoseconds.
    --stress
        Run the sample’s internal noisy neighbor on core 3.
    --mask_interrupts
        Enable interrupt masking. This flag clears the interrupt enable flag in the core (using the x86 cli instruction), which masks hardware maskable interrupts. For more information about masking interrupts, see the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A, Section 2.3, “SYSTEM FLAGS AND FIELDS IN THE EFLAGS REGISTER.” The interrupts are masked only for the duration of the pointer-chase task to avoid negative effects on the overall system behavior.
  3. Confirm that you see output similar to the example below (statistics may vary).
    Output example:
    Running with arguments: latency = 300 ns, stress = internal, iterations = 100, collector = libtcc_collector.so, mask interrupts = true
    Allocating memory according to the latency requirements
    Running workload. This may take a while, depending on sleep and iteration values
    Deallocating memory
    *** Statistics for workload ****************************
    Minimum total latency: 102318 CPU cycles (36500 ns)
    Maximum total latency: 125710 CPU cycles (44845 ns)
    Average total latency: 112710 CPU cycles (40207 ns)
    Minimum latency per buffer access: 200 CPU cycles (71 ns)
    Maximum latency per buffer access: 246 CPU cycles (87 ns)
    Average latency per buffer access: 220 CPU cycles (78 ns)
    ********************************************************
    The output shows the arguments in use and the status of the sample. The statistics section shows the minimum, maximum, and average total latency.
    • Minimum total latency is the execution time of the shortest iteration.
    • Maximum total latency is the execution time of the longest iteration.
    • Average total latency is the average execution time across all iterations.
    The output also shows the minimum, maximum, and average latency per data access. Latency per data access is a normalized latency that is obtained by dividing the total time of reading the buffer in one iteration by the number of accesses to the buffer.
  4. Note the maximum total latency and the maximum latency per buffer access.
  5. Run the sample again with a latency value that corresponds to L2 cache:
    tcc_cache_allocation_sample --collect --latency 45 --stress --mask_interrupts
  6. Confirm that you see output similar to the example below (statistics may vary).
    Output example:
    Running with arguments: latency = 45 ns, stress = internal, iterations = 100, collector = libtcc_collector.so, mask interrupts = true
    Allocating memory according to the latency requirements
    Running workload. This may take a while, depending on sleep and iteration values
    Deallocating memory
    *** Statistics for workload ****************************
    Minimum total latency: 8728 CPU cycles (3113 ns)
    Maximum total latency: 10029 CPU cycles (3577 ns)
    Average total latency: 9287 CPU cycles (3312 ns)
    Minimum latency per buffer access: 17 CPU cycles (6 ns)
    Maximum latency per buffer access: 20 CPU cycles (6 ns)
    Average latency per buffer access: 18 CPU cycles (6 ns)
    ********************************************************
  7. Compare the maximum total latency and the maximum latency per buffer access of this run to those of the previous run. You can expect a general improvement.
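The total and per-buffer-access figures you compared follow from a simple pointer-chase measurement: the workload walks a chain of dependent loads, the walk is timed, and the per-access value is the total time divided by the number of accesses. The following C sketch shows one way to reproduce that arithmetic. It is an illustration only, not the sample's source code; the use of iopl(3) to allow cli/sti from userspace (mirroring --mask_interrupts) requires root privileges on x86 Linux, and the 2.8 GHz cycles-to-nanoseconds factor is an assumed example value, not a measured one.
    /* Illustrative sketch, not the sample's source code. Assumes x86-64 Linux,
     * gcc, and root privileges for iopl(3) so that cli/sti may run in userspace.
     * A realistic measurement would stride by cache line and randomize the chain
     * to defeat the hardware prefetcher; this sketch keeps the chain sequential
     * for brevity. */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/io.h>     /* iopl() */
    #include <x86intrin.h>  /* __rdtscp() */

    #define NUM_ACCESSES (16 * 1024)

    /* Build a circular pointer chain so each load depends on the previous one. */
    static void **build_chain(size_t n)
    {
        void **buf = malloc(n * sizeof(void *));
        for (size_t i = 0; i < n; i++)
            buf[i] = &buf[(i + 1) % n];
        return buf;
    }

    int main(void)
    {
        void **buf = build_chain(NUM_ACCESSES);
        unsigned aux;

        if (iopl(3) != 0)
            perror("iopl");             /* without it, cli/sti will fault */

        __asm__ volatile("cli");        /* mask maskable interrupts (like --mask_interrupts) */
        uint64_t start = __rdtscp(&aux);

        void **p = buf;
        for (size_t i = 0; i < NUM_ACCESSES; i++)
            p = (void **)*p;            /* dependent loads: one buffer access per step */

        uint64_t end = __rdtscp(&aux);
        __asm__ volatile("sti");        /* unmask interrupts again */

        uint64_t total_cycles = end - start;
        /* Per-access latency is the total time divided by the number of accesses.
         * Converting cycles to nanoseconds needs the TSC frequency; 2.8 GHz is an
         * assumed example value. */
        double tsc_ghz = 2.8;
        printf("Total: %llu cycles (%.0f ns)\n",
               (unsigned long long)total_cycles, total_cycles / tsc_ghz);
        printf("Per access: %.1f cycles (%.1f ns)\n",
               (double)total_cycles / NUM_ACCESSES,
               (double)total_cycles / NUM_ACCESSES / tsc_ghz);

        /* Keep p observable so the compiler does not optimize the chase away. */
        __asm__ volatile("" :: "r"(p));

        free(buf);
        return 0;
    }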

An Example of Latency Measurements for Data Access

As an example, the following graph shows the results of data access latency measurements collected from an 11th Gen Intel® Core™ processor system. The blue dots represent measurements taken from a run where DRAM was selected. The orange dots represent measurements taken from a run where L2 cache was selected. Each data point shows normalized latency that was obtained by dividing the total time of reading the buffer by the number of accesses.
The left panel of the image shows a histogram of data access latency, where the x-axis shows the number of times the measured latency falls within a specific range and the y-axis shows time in nanoseconds. The counts are plotted on a logarithmic scale. The right panel shows the measured latency of each data access as a function of the iteration number.
These graphs are for documentation purposes only and were not generated by the sample or other provided tools. As an alternative, the Measurement Analysis Sample provides support for simple histograms.
These results show that memory access latency to the buffer allocated in L2 cache is lower than to the buffer allocated in DRAM. In both runs, the measured memory access latency stays below the latency requirement provided. This example demonstrates how the cache allocation library can help keep memory access latency within a specified maximum bound.
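If you collect the per-access latencies yourself, a histogram like the one described above can be produced by bucketing the values. The following C sketch shows minimal bucketing logic; it is not the Measurement Analysis Sample's implementation, and the bucket width and input values are arbitrary illustrative choices.
    /* Minimal histogram sketch; not the Measurement Analysis Sample's code.
     * Bucket width and the sample data are arbitrary illustrative values. */
    #include <stdio.h>

    #define BUCKET_WIDTH_NS 10
    #define NUM_BUCKETS     16

    int main(void)
    {
        /* Per-access latencies in nanoseconds (illustrative values only). */
        int latencies_ns[] = { 6, 6, 7, 6, 71, 78, 87, 6, 6, 74, 80, 6 };
        size_t n = sizeof(latencies_ns) / sizeof(latencies_ns[0]);
        unsigned counts[NUM_BUCKETS] = { 0 };

        for (size_t i = 0; i < n; i++) {
            int b = latencies_ns[i] / BUCKET_WIDTH_NS;
            if (b >= NUM_BUCKETS)
                b = NUM_BUCKETS - 1;     /* clamp outliers into the last bucket */
            counts[b]++;
        }

        for (int b = 0; b < NUM_BUCKETS; b++) {
            if (counts[b] == 0)
                continue;
            printf("%3d-%3d ns: %u\n",
                   b * BUCKET_WIDTH_NS, (b + 1) * BUCKET_WIDTH_NS - 1, counts[b]);
        }
        return 0;
    }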

Product and Performance Information

1 Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.