Developer Guide

  • 2021.2
  • 06/11/2021
  • Public

Data Stream Latency Requirements Examples

This section supplements the guidance in Generate Requirements for those who want to learn more about decomposing a full workload into individual data streams. Topics also include identifying safety margin, static overhead, and network overhead; measuring the workload and stream latencies; and finally determining the target workload and stream latencies to be entered into a requirements file.
The process of deconstructing a workload into data stream latency requirements can be complex and requires analysis of transaction flow. This section goes through two example workloads, Real-Time Compute Performance (RTCP) and Packet Transmission Latency (PTL), and breaks them down into individual data stream latency requirements.

Definitions

Each term below is listed with its source in parentheses (how the value is obtained) and a description.
  • Full Workload (Given): Name of the full workload being tuned.
  • Streams (Workload deconstruction): List of data streams that constitute the full workload.
  • Target Full Workload Latency (Given): The target latency of the full workload being tuned.
  • Measured Full Workload Latency (Measured): The measured latency of the full workload being tuned.
  • Measured Stream Max Latency (Measured): The measured maximum latency of the individual streams.
  • Measured Stream Min Latency (Measured): The measured minimum latency of the individual streams.
  • Static SW Overhead (Measured): The measured static software overhead latency of this full workload that is not tuned by the data streams optimizer (can be reduced via software tuning).
  • Safety Margin Overhead (Given): The amount of time that the workload can exceed its allotted latency budget without violating the latency deadline.
  • Network Overhead (Given): The portion of the control loop allotted to the traversal of data across the network into and out of the system.
  • % of Tunable Streams (Calculated): Per stream: Measured Stream Max Latency / (Sum of all Measured Stream Max Latencies).
  • Jitter Ratio (Calculated): Per stream: (Measured Stream Max Latency - Measured Stream Min Latency) / Measured Stream Max Latency.
  • Target Stream Max Latency (Output of this document): The target maximum latency of the individual streams. This is the per-stream latency requirement needed for the requirements file of the data streams optimizer.

Procedure for Finding Stream Latency

  1. Basic platform without data streams optimizer:
    This guide assumes you have already measured the performance of the system in a tuned software environment using the following documentation and software.
    1. Software tuning
      Tune the software environment before using this guide. For details, see Software Tuning.
    2. Intel reference software
      The board support package (BSP) and reference BIOS with Intel® TCC Mode provide a base level of performance out-of-the-box and should be applied to the system before using the data streams optimizer.
  2. Deconstruct the workload into data streams:
    Some workloads contain complex interactions between hardware and software components. Analyze the workload as a step-by-step flow to identify the real-time data streams, which can be optimized by the data streams optimizer, and the sources of static overhead, which cannot. For details, see How to Measure Individual Stream Latencies.
  3. Measure latency information if necessary:
    One of the inputs of the data streams optimizer is a requirements file. The requirements file has a required field containing a latency value for each real-time data stream that will be optimized. In real-world terms, this number represents how much time each real-time data stream in the critical path of a workload has to transmit its data before a timing violation occurs (this can take the form of missing a critical deadline, overflowing a buffer, using stale data, servicing an interrupt too slowly, etc.).
    1. When additional measurements are necessary:
      It is often the case that some specification, such as a hardware specification, protocol specification, or system requirements document, designates a deadline for some component in a system. This deadline may be explicitly defined in the documentation or may be an implicit, emergent requirement.
      • Explicit stream requirements occur when some documentation defines a deadline for a real-time data stream.
        Example: If a workload uses a 31.25 microsecond control loop, and the specification says that PCIe must write data into memory within 5 microseconds and PCIe must read data from memory within 5 microseconds, then these are two explicitly defined latency requirements that can be mapped into the requirements file: PCIe-to-memory stream at 5 microseconds, PCIe-from-memory stream at 5 microseconds.
      • Implicit stream requirements occur when the latency requirements can be gleaned indirectly from the documentation.
        Example: A workload uses a 62.5 microsecond control loop, and the specification says that PCIe must write data into memory (latency not given by the specification) and the core must read that data within 52.5 microseconds. There is an explicitly defined memory latency requirement: core reading from memory at 52.5 microseconds (this is not optimized by the data streams optimizer, but the cache allocation library or cache configurator can optimize core-to-memory traffic). There is also one implicitly defined stream latency requirement: PCIe-to-memory stream at 10 microseconds. This is calculated by subtracting the core-read latency from the full control loop duration: 62.5 - 52.5 = 10 microseconds.
      While you may arrive at the explicit and implicit requirements differently, neither of them requires any measurement to be performed. You can obtain the individual stream requirements from workload deconstruction and analysis.
      Additional measurements must be collected when individual stream latencies cannot be derived from the documentation because the amount of time taken by each stream is ambiguous.
      Example: A network device must transmit a packet within 13 microseconds, as described in Packet Transmission Latency (PTL) Example. The specification says that this packet transmission requires the core to write data to PCIe and PCIe to read data from memory. From this information alone, the requirements for the two streams are unknown. One must actually measure the performance of each stream and use those measurements to calculate the per-stream target latencies.
    2. What to measure:
      At a high level, you must measure enough information to determine the target latency for each stream. For details, see Latency Information Necessary to Calculate Stream Latency Requirements.
    3. How to measure:
      Individual stream latencies are measured through a modularization of the workload, which can be achieved through code instrumentation, analysis hardware, and proxy workloads. For details, see How to Measure Individual Stream Latencies.
  4. Calculate individual stream latencies:
    The premise behind calculating the measured performance is as follows:
    1. Measure the latency of each stream.
    2. Calculate what percentage of the full workload latency each stream uses (see the table below).
      % of Tunable Streams = Measured Stream Max Latency / Sum of all Measured Stream Max Latencies
      Full Cycle Workload | Measured Full Cycle Latency (µs) | Streams         | Measured Stream Max Latency (µs) | Measured Stream Min Latency (µs) | % of Tunable Streams
      Name                |                                  | Stream 1        |                                  |                                  |
      Name                |                                  | Stream 2        |                                  |                                  |
      Name                |                                  | Static Overhead |                                  | N/A                              | N/A
    3. Reduce each stream proportionally such that they do not cumulatively exceed the target full workload latency (see the table below).
      Target Stream Max Latency = (Target Full Workload Latency – Sum of all Static Overhead) * % of Tunable Streams
      Jitter Ratio = (Measured Stream Max Latency – Measured Stream Min Latency) / Measured Stream Max Latency
      Full Cycle Workload | Target Full Workload Latency (µs) | Streams          | Target Stream Max Latency (µs) | Jitter Ratio
      Name                |                                   | Core to PCIe     |                                |
      Name                |                                   | PCIe from Memory |                                |
      Name                |                                   | Static Overhead  |                                | N/A
      A short worked sketch of these calculations, in C, follows this procedure.
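
      A minimal worked sketch of the two formulas above, written in C; all values are illustrative placeholders, not measurements from any platform:

        #include <stdio.h>

        int main(void)
        {
            /* Illustrative inputs in microseconds; substitute real measurements. */
            const double target_full_latency = 10.0;   /* Target Full Workload Latency */
            const double static_overhead     = 2.0;    /* Sum of all Static Overhead   */
            const double stream_max[2] = { 6.0, 3.0 }; /* Measured Stream Max Latency  */
            const double stream_min[2] = { 2.0, 1.0 }; /* Measured Stream Min Latency  */

            const double sum_max = stream_max[0] + stream_max[1];
            const double tunable_budget = target_full_latency - static_overhead;

            for (int i = 0; i < 2; i++) {
                double pct    = stream_max[i] / sum_max;  /* % of Tunable Streams      */
                double target = tunable_budget * pct;     /* Target Stream Max Latency */
                double jitter = (stream_max[i] - stream_min[i]) / stream_max[i];
                printf("stream %d: %.1f%% of tunable budget, target max %.2f us, jitter ratio %.2f\n",
                       i + 1, pct * 100.0, target, jitter);
            }
            return 0;
        }

      With these placeholder numbers, stream 1 receives 66.7% of the 8 microsecond tunable budget (a 5.33 microsecond target) and stream 2 receives 33.3% (a 2.67 microsecond target), so the streams plus static overhead fit exactly within the 10 microsecond target full workload latency.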

Software Tuning

General-purpose operating systems (such as Linux or Windows) are non-deterministic (due to task switching, interrupts, the scheduler, etc.), and thus may require some portion of the real-time control loop to be dedicated to software overhead. The data streams optimizer does not tune this portion of the control loop. The more the software overhead decreases (via software tuning), the more apparent the effects of the data streams optimizer become. These examples assume there is some overhead due to software and describe how to incorporate it into the stream latency calculations.
Software jitter is different from jitter caused by cache and memory latency, though the two are interrelated because extraneous software running on the core often has not been optimized for cache residency and may even access the disk. As such, latency due to software jitter often dominates hardware jitter, and so any hardware tuning is lost in the noise until software tuning is performed.
Isolating Hardware and Software Jitter
In general, software jitter occurs when a task that is not part of the critical path is running on the real-time core. Determining how much jitter is due to software may not be straightforward, as real-time workloads often involve interactions between real-time code segments, system calls, I/O, accelerators, etc. One method to quantify software jitter is to trace function calls on the system to see how often the real-time software (and the OS or driver functions that it uses) get switched out for extraneous functions, and for how long. Additionally, record how many (unexpected) interrupts occur during the critical section. Another method is to measure how long I/O or other offloaded tasks take and subtract this from the total control loop time.
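
One concrete starting point is the following minimal sketch, assuming a Linux environment and CLOCK_MONOTONIC: it runs an otherwise-empty periodic loop and records how far each iteration overshoots its period. Because the loop body does no real work, overshoot beyond the period approximates software jitter on that core.

    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    #define ITERATIONS 100000
    #define PERIOD_NS  62500ULL   /* example: a 62.5 microsecond control loop */

    static uint64_t now_ns(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
    }

    int main(void)
    {
        uint64_t min = UINT64_MAX, max = 0;
        uint64_t prev = now_ns();

        for (long i = 0; i < ITERATIONS; i++) {
            /* Busy-wait to the next period boundary; a real workload would
               run its control computation here instead. */
            while (now_ns() - prev < PERIOD_NS)
                ;
            uint64_t t = now_ns();
            uint64_t delta = t - prev;
            if (delta < min) min = delta;
            if (delta > max) max = delta;
            prev = t;
        }
        /* With an empty loop body, (max - PERIOD_NS) approximates the
           worst-case software jitter observed during the run. */
        printf("min %llu ns, max %llu ns, worst-case jitter %llu ns\n",
               (unsigned long long)min, (unsigned long long)max,
               (unsigned long long)(max - PERIOD_NS));
        return 0;
    }

Pinning the loop to the real-time core (for example, with taskset) and repeating the run under representative background load makes the reported maximum more meaningful.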

Latency Information Necessary to Calculate Stream Latency Requirements

To calculate the latency requirements for individual streams, one must first identify the latency requirements for the full workload. This document assumes these requirements are readily available and that measurements show they have not been met. The list below shows what information should be collected in order to calculate the stream latency requirements.
  • Full workload latency targets or deadlines
  • Full workload measured maximum latency
    • To the extent possible, measurements should be collected within an OS environment similar to the expected production environment including the same or at least similar OS distribution, drivers, kernel version, and OS/kernel configuration parameters.
    • To the extent possible, measurements should be collected with concurrent traffic conditions equivalent to or more strenuous than the expected worst case in the production environment including: core-to-memory traffic, graphics and display traffic, and I/O traffic.
  • Individual stream minimum and maximum latencies
    • The software environment and concurrent traffic should be similar to the production environment as mentioned above.
    • It is not necessary to exhaustively measure every stream of the workload, but enough latency information should be collected to determine roughly what portion of the full workload is spent on the different streams. This will be expanded upon in the next section.
    • After all streams have been measured and accounted for, the latency and jitter left over is likely caused by software jitter and should be incorporated into the stream requirements as shown in the next section.

How to Measure Individual Stream Latencies

Individual stream latencies are measured through a modularization of the workload, which can be achieved through code instrumentation, analysis hardware, and proxy workloads.
  1. Deconstruct the workload into data streams:
    Deconstructing a workload involves analyzing the software and transaction flow to determine the real-time data streams. Any point in the flow where data is transmitted between a producer and a consumer, with a latency requirement placed on that transmission, identifies a real-time data stream.
    Wherever a segment of the flow either has no latency requirement or involves a producer or consumer that is not supported by the data streams optimizer, that segment is identified as static overhead.
  2. Modularize the workload:
    To perform per-stream latency measurements, the code should be modularized, which involves dividing the code into segments with enough granularity to measure the latency of each stream. Modularization can be achieved in several ways, such as code instrumentation, analysis hardware, and proxy workloads.
    • Code instrumentation:
      You can instrument the code to perform the required measurements by capturing timestamps before and after data transmission. This can be achieved with function wrappers, compiler macros, etc., that insert a timestamp at module boundaries.
      For example, a core-to-PCIe stream could be measured by:
      1. Capture Timestamp X
      2. Update pointer to MMIO address
      3. Capture Timestamp Y
      The latency of this stream would be measured as Y – X, as in the sketch below.
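      A minimal sketch of this pattern in C, assuming the device register has already been mapped into the process (for example, through a UIO or VFIO driver); the mapping and the read-back step are assumptions added for illustration:

        #include <stdint.h>
        #include <time.h>

        /* Assumed: pointer to a 32-bit device register, obtained elsewhere
           by mapping a PCIe BAR (e.g., via a UIO or VFIO driver). */
        extern volatile uint32_t *mmio_reg;

        static uint64_t timestamp_ns(void)
        {
            struct timespec ts;
            clock_gettime(CLOCK_MONOTONIC, &ts);
            return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
        }

        uint64_t measure_core_to_pcie(uint32_t value)
        {
            uint64_t x = timestamp_ns(); /* 1. Capture Timestamp X       */
            *mmio_reg = value;           /* 2. Write to the MMIO address */
            (void)*mmio_reg;             /* Read back so the posted write
                                            reaches the device before the
                                            second timestamp (assumes the
                                            register is safe to read)    */
            uint64_t y = timestamp_ns(); /* 3. Capture Timestamp Y       */
            return y - x;                /* Stream latency = Y - X       */
        }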
      In some cases, the timestamp should be captured on the device itself, if possible.
      For example, a PCIe-from-memory stream can be measured as follows (a sketch appears after these steps):
      1. Trigger a timestamp capture on the PCIe device with write to MMIO register
      2. Trigger a DMA read from memory operation on the device
      3. Issue an interrupt to the core when the DMA operation completes
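      A hedged sketch of these steps in C; the register offsets, their semantics, and the wait_for_dma_interrupt() helper are hypothetical placeholders, since real endpoints define their own register layout and completion signaling:

        #include <stdint.h>

        /* Hypothetical register offsets; real devices define their own layout. */
        enum {
            REG_TS_TRIGGER  = 0x00, /* latch device timestamps on write        */
            REG_DMA_START   = 0x04, /* kick off a DMA read from host memory    */
            REG_TS_REQUEST  = 0x08, /* device timestamp of the read request    */
            REG_TS_COMPLETE = 0x0c, /* device timestamp of the last completion */
        };

        extern volatile uint32_t *bar;            /* mapped device BAR          */
        extern void wait_for_dma_interrupt(void); /* hypothetical driver helper */

        uint32_t measure_pcie_from_memory(void)
        {
            bar[REG_TS_TRIGGER / 4] = 1; /* 1. Trigger timestamp capture on the device */
            bar[REG_DMA_START / 4]  = 1; /* 2. Trigger a DMA read from memory          */
            wait_for_dma_interrupt();    /* 3. Block until the completion interrupt    */
            /* Stream latency in device timestamp units. */
            return bar[REG_TS_COMPLETE / 4] - bar[REG_TS_REQUEST / 4];
        }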
    • Analysis hardware:
      In some cases, stream latency cannot be measured by code instrumentation alone. This often occurs when the endpoint device cannot capture a timestamp before or after a DMA operation occurs. In this case, a logic or protocol analyzer can be attached to the system to measure the latency of the stream’s data transmission.
      For example, if it is known that a PCIe-from-memory DMA operation must complete within a certain time, a PCIe interposer and analyzer can capture a bus trace and measure the time between the first PCIe Read Request TLP and the last PCIe Read Completion TLP.
      Advanced Example: Measuring PCIe-from-Memory Read Latency with a PCIe Analyzer
      This example shows how individual stream latency and the effect of the data streams optimizer on this latency can be measured with analysis hardware.
      For this test, the Teledyne LeCroy Summit™ T416 Protocol Analyzer and Teledyne LeCroy Analysis Software were used for measurement alongside an Intel® proprietary PCIe exerciser card connected to the Intel Atom® x6000E Series processors platform.
      Hardware specifications:
      • Teledyne LeCroy Summit™ T416 Protocol Analyzer
      • Intel® PCIe exerciser card: PCIe Gen 3.0 x4
      • Platform: Intel Atom® x6000E Series processors PCIe Gen 3.0 x2 (referred to as Intel platform)
      Software specifications:
      • PCIe Protocol Analysis Software Version 9.32 Build 7133
      • Intel® internal PCIe traffic generator
      • Data streams optimizer
      For this test, the PCIe endpoint generated 64-byte read requests targeting host memory at a rate of approximately 30 MB/s. An internal workload was used to generate the PCIe traffic from the endpoint. The PCIe traffic flow comprises the PCIe-from-memory (PCIe read) stream, in which a PCIe endpoint issues a read request to memory and a completion is generated back to the endpoint once the requested data has been retrieved. The workload latency of this stream is measured from the time a PCIe read request is issued on the PCIe link targeting a data buffer in memory to the time the PCIe completion with data is sent on the PCIe link. The workload latency would exceed the analyzer latency by a few hundred nanoseconds, since the endpoint must latch its internal timestamp registers for the read request and completion.
      Analyzer Trace #1: Measure PCIe-from-Memory Read Latency on the Untuned System
      For this PCIe analyzer trace, the Intel platform was configured with Intel® TCC Mode. The PCIe-from-memory latency reported by the trace is 1.936 microseconds.
      Example requirements for this scenario:
      "requirements": [ { "producer": "Memory", "consumer": "01:00.0", "traffic_class": 0, "latency_us": 2, "bytes_per_transfer": 64, "relative_priority": 0 } ]
      Analyzer Trace #2: Measure PCIe-from-Memory Read Latency on the Tuned System
      For this PCIe analyzer trace, the Intel platform was tuned with the data streams optimizer. The PCIe-from-memory latency reported by the trace is 1.265 microseconds.
      Example requirements for this scenario:
      "requirements": [ { "producer": "Memory", "consumer": "01:00.0", "traffic_class": 0 "latency_us": 1.5, "bytes_per_transfer": 64, "relative_priority": 0 } ]
      Analysis hardware is useful for measuring the tuning of individual stream latencies, such as PCIe-from-memory, because traces capture accurate hardware timestamps that eliminate the overhead of code instrumentation or proxy workloads. Code instrumentation may contribute software overhead, and proxy workloads generate complex traffic patterns that report an overall latency rather than the individual PCIe read latency measured above. This overall latency may be hundreds of nanoseconds to microseconds larger than the PCIe read latency, and as such, PCIe read latency improvements may be less distinguishable.
    • Proxy workloads:
      In some cases, neither code instrumentation nor analysis hardware is sufficient. You can instead create a proxy workload that performs the same or similar operations as the real-world use case, but does so in a way that can be measured through either code instrumentation or analysis hardware.
      For example, if a PCIe-from-memory DMA operation cannot be measured on the real-world workload with either code instrumentation or analysis hardware, a PCIe exerciser card can be configured to generate DMA reads in an identical traffic pattern to the real workload. This latency can then be measured and used to calculate the individual stream latencies.

Examples

This guide describes two sample workloads, Packet Transmission Latency and Real-Time Compute Performance, and gathers their stream latency requirements.
The examples represent different kinds of real-time workloads.
  • Workloads with real-time components:
    Not all workloads have end-to-end latency deadlines. It may be the case that only a small portion of the full workload has latency deadlines. The Packet Transmission Latency (PTL) example shows how to calculate stream latencies for a real-time component workload.
  • End-to-end real-time workloads:
    The classic real-time control loop involves input data being received by a system, some computation based on that data, and some output data being transmitted by that system. The Real-Time Compute Performance (RTCP) example shows how to calculate stream latencies for an end-to-end real-time workload.
To learn more, see the Packet Transmission Latency (PTL) Example and the Real-Time Compute Performance (RTCP) Example.
