Developer Guide

  • 2021.2
  • 06/11/2021
  • Public

Reduce Latency (External Noisy Neighbor)

Real-time workloads running on platforms with shared L2 and L3 cache compete with non-real-time applications running on neighboring cores. To simulate this case, you can run the cache allocation sample alongside an external noisy neighbor of your choice.
This example is similar to Reduce Latency (Internal Noisy Neighbor), except that it uses the Linux tool stress-ng as a noisy neighbor instead of the sample’s internal noisy neighbor.
To run this example:
  1. From your host system, connect to the target system:
    ssh <user>@<target>
  2. In the SSH session, run the Linux* tool stress-ng as a noisy neighbor:
    stress-ng -t 0 -C 10 --cache-level 2 --taskset 2 --aggressive
    The following table briefly describes the stress-ng options (for details, see the stress-ng documentation). A conceptual sketch of this kind of noisy neighbor follows the table.
    Option             Description
    -t 0               Run without any timeout (run forever).
    -C 10              Start 10 workers (child processes) that perform random widespread memory reads and writes to load the CPU cache.
    --cache-level 2    Use the L2 cache.
    --taskset 2        Run on core 2.
    --aggressive       Apply a high level of stress.
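    The sketch below is a minimal, illustrative C equivalent of this kind of cache-thrashing noisy neighbor. It is not stress-ng itself; the buffer size and core number are assumptions chosen for illustration.

    /* Minimal sketch (not stress-ng): a noisy neighbor that thrashes the cache
     * by randomly reading and writing a buffer larger than L2, pinned to core 2.
     * The buffer size and core number are illustrative assumptions. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define BUFFER_SIZE (8 * 1024 * 1024) /* larger than a typical L2 cache */

    int main(void)
    {
        /* Pin this process to core 2, similar to "--taskset 2". */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(2, &set);
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return 1;
        }

        volatile uint8_t *buf = malloc(BUFFER_SIZE);
        if (!buf) {
            perror("malloc");
            return 1;
        }

        /* Endless random reads and writes keep pushing other cores' data
         * out of the shared cache levels. */
        for (;;) {
            size_t i = (size_t)rand() % BUFFER_SIZE;
            buf[i] = (uint8_t)(buf[i] + 1);
        }
    }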
  3. Confirm that you see the “dispatching hogs” message.
  4. Open a second terminal window and connect to the target system:
    ssh <user>@<target>
  5. In the second terminal window, run the sample with a latency value that corresponds with DRAM. For example:
    tcc_cache_allocation_sample --collect --latency 300 --sleep 100000000 --mask_interrupts
    The command-line options are similar to those of the previous example, except that the --sleep option replaces the --stress option. The --sleep option defines the inactivity period between workload iterations. During this period, the noisy neighbor has the opportunity to evict the workload’s cached data, as illustrated in the sketch below.
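    The following is a minimal C sketch of the measurement pattern that the --sleep option implies, not the sample’s actual implementation; the buffer size, iteration count, and timing method are assumptions made for illustration.

    /* Minimal sketch of the "--sleep" pattern: each iteration reads the workload
     * buffer (timed), then sleeps, giving the noisy neighbor a window to evict
     * the buffer from cache. Buffer size, iteration count, and timing method
     * are illustrative assumptions, not the sample's implementation. */
    #define _POSIX_C_SOURCE 200809L
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define BUFFER_SIZE (256 * 1024)  /* small enough to fit in L2 on many parts */
    #define ITERATIONS  100
    #define SLEEP_NS    100000000L    /* matches "--sleep 100000000" */

    static uint64_t now_ns(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
    }

    int main(void)
    {
        volatile uint8_t *buf = calloc(1, BUFFER_SIZE);
        if (!buf)
            return 1;

        for (int it = 0; it < ITERATIONS; it++) {
            uint64_t start = now_ns();
            uint64_t sum = 0;
            for (size_t i = 0; i < BUFFER_SIZE; i += 64) /* one access per 64-byte cache line */
                sum += buf[i];
            uint64_t total = now_ns() - start;
            printf("iteration %d: total latency %llu ns (checksum %llu)\n",
                   it, (unsigned long long)total, (unsigned long long)sum);

            /* Inactivity period: while the workload sleeps, the external
             * noisy neighbor can evict the buffer from the shared cache. */
            struct timespec pause_ts = { .tv_sec = 0, .tv_nsec = SLEEP_NS };
            nanosleep(&pause_ts, NULL);
        }
        return 0;
    }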
  6. Confirm that you see output similar to the example below (statistics may vary).
    Output example:
    Running with arguments: latency = 300 ns, stress = external (sleep 100000000 ns), iterations = 100, collector = libtcc_collector.so, mask interrupts = true
    Allocating memory according to the latency requirements
    Running workload. This may take a while, depending on sleep and iteration values
    Deallocating memory
    *** Statistics for workload ****************************
    Minimum total latency: 73935 CPU cycles (26375 ns)
    Maximum total latency: 115494 CPU cycles (41200 ns)
    Average total latency: 88530 CPU cycles (31581 ns)
    Minimum latency per buffer access: 144 CPU cycles (51 ns)
    Maximum latency per buffer access: 226 CPU cycles (80 ns)
    Average latency per buffer access: 173 CPU cycles (61 ns)
    ********************************************************
    The output shows the arguments in use and the status of the sample. The statistics section shows the minimum, maximum, and average total latency.
    • Minimum total latency is the execution time of the shortest iteration.
    • Maximum total latency is the execution time of the longest iteration.
    • Average total latency is the average execution time across all iterations.
    The output also shows the minimum, maximum, and average latency per buffer access. Latency per buffer access is a normalized latency obtained by dividing the time spent reading the buffer in one iteration by the number of accesses to the buffer, as shown in the sketch below.
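    The following minimal C sketch illustrates this normalization; the function name and example values are illustrative and are not taken from the sample.

    /* Normalization described above: per-access latency is the time spent reading
     * the buffer in one iteration divided by the number of buffer accesses.
     * The example values below are illustrative assumptions. */
    #include <stdint.h>
    #include <stdio.h>

    static uint64_t latency_per_access_ns(uint64_t total_read_time_ns,
                                          uint64_t num_accesses)
    {
        return total_read_time_ns / num_accesses;
    }

    int main(void)
    {
        /* For example, an iteration that read the buffer in 26375 ns with
         * 512 accesses averages roughly 51 ns per access. */
        printf("%llu ns per access\n",
               (unsigned long long)latency_per_access_ns(26375, 512));
        return 0;
    }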
  7. Note the maximum total latency and the maximum latency per buffer access.
  8. Run the sample again with a latency value that corresponds to L2 cache:
    tcc_cache_allocation_sample --collect --latency 45 --sleep 100000000 --mask_interrupts
  9. Confirm that you see output similar to the example below (statistics may vary).
    Output example:
    Running with arguments: latency = 45 ns, stress = external (sleep 100000000 ns), iterations = 100, collector = libtcc_collector.so, mask interrupts = true
    Allocating memory according to the latency requirements
    Running workload. This may take a while, depending on sleep and iteration values
    Deallocating memory
    *** Statistics for workload ****************************
    Minimum total latency: 7125 CPU cycles (2541 ns)
    Maximum total latency: 7302 CPU cycles (2604 ns)
    Average total latency: 7265 CPU cycles (2591 ns)
    Minimum latency per buffer access: 14 CPU cycles (4 ns)
    Maximum latency per buffer access: 14 CPU cycles (5 ns)
    Average latency per buffer access: 14 CPU cycles (5 ns)
    ********************************************************
  10. Note the maximum total latency and the maximum latency per buffer access.
  11. Compare the maximum total latency and the maximum latency per buffer access of this run to those of the previous run. You can expect a general improvement; in the example outputs above, the maximum total latency drops from 41200 ns to 2604 ns and the maximum latency per buffer access drops from 80 ns to 5 ns.
  12. In the first terminal window, press Ctrl+C to exit stress-ng.

Product and Performance Information

1. Performance varies by use, configuration, and other factors. Learn more at www.Intel.com/PerformanceIndex.