Developer Guide

  • 2021.2
  • 06/11/2021
  • Public

Real-Time Compute Performance (RTCP) Example

This example demonstrates how to calculate data stream latencies for an end-to-end real-time workload: the classic real-time control loop, in which a system receives input data, performs computation based on that data, and transmits output data. The example uses a workload called Real-Time Compute Performance (RTCP).
RTCP is a sample workload that simulates an industrial control loop with input, compute, and output segments. This example provides both target and measured latencies for the full loop as well as for the individual streams. With this information, the workload can be divided into three streams, each with its own latency requirement.

Target User for This Example

For real-time workloads consisting of end-to-end control loops, a system integrator may be tasked with tuning a system to meet use-case-driven cycle times. As the system integrator, you would have insight into the functionality of, and interactions between, the various system components within the critical path from data reception, through the compute workload, to data transmission. You would have the knowledge and ability to:
  • Measure the full RTCP workload
  • Identify the system components and their interactions:
    • Intel® Ethernet Controller I210
    • Data Plane Development Kit (DPDK) Driver
    • Packet receive flow
    • Compute application
    • Packet transmission flow
    • Safety margins
    • Network overhead
  • Deconstruct RTCP into individual streams
  • Measure the individual stream latencies
  • Calculate the individual stream latency targets
Individual component owners, such as device driver developers, would not have the global system viewpoint to be able to understand the interactions between the various components. Additionally, user-space application developers would not have the deep functional understanding of the underlying hardware and software.

Deconstructing RTCP into Streams

Workloads often involve complex interactions between real-time code segments, system calls, drivers, I/O, etc., so deconstructing workloads into streams may be challenging. RTCP contains three major segments that are typical of real-time control loops: packet reception (Rx), compute, and packet transmission (Tx).
Packet Reception
Packet reception involves the host receiving data from a remote system over Ethernet. When the Ethernet controller receives a packet from the PHY, it issues PCIe writes to host memory, sending the packet data and informing the core that the data is available.
Depending on the Ethernet driver configuration, an Ethernet controller might send a message signaled interrupt (MSI) to initiate the packet receive function, but this example assumes a polling configuration. In this guide, the term message signaled interrupt (MSI) is used generically to cover both MSI and MSI-X.
Compute
The compute portion of the RTCP workload performs a memory buffer read workload by iterating through a random linked list.
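A minimal sketch of such a pointer-chasing loop appears below. The list size, hop count, and array-based node layout are illustrative assumptions rather than the actual RTCP implementation, which would typically be native code.

```python
import random

def build_random_chain(n):
    """Link n nodes in a random visiting order so that each hop defeats
    hardware prefetching and costs a cache/memory access."""
    order = list(range(n))
    random.shuffle(order)
    nxt = [0] * n
    for i in range(n - 1):
        nxt[order[i]] = order[i + 1]
    nxt[order[-1]] = order[0]  # close the chain into a cycle
    return nxt, order[0]

def compute_segment(nxt, start, hops):
    """Iterate through the random linked list for a fixed number of hops."""
    node = start
    for _ in range(hops):
        node = nxt[node]
    return node  # return the final node so the loop is not optimized away

nxt, head = build_random_chain(1 << 16)   # assumed list size
compute_segment(nxt, head, 1 << 20)       # assumed hop count
```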
Packet Transmission
The packet transmission section of RTCP is identical to the PTL workload discussed in the first example. A core-to-PCIe MMIO write transaction initiates PCIe reads from the Ethernet device, and the data is then transmitted across the PHY. This illustrates one source of component requirements: a smaller workload such as PTL may receive its target full workload latency from a larger, more comprehensive workload, such as RTCP, of which it is a component.
Static Safety Margin, Compute Overhead, and Network Overhead
For workloads such as RTCP, whose target full workload latency requirement equals the end-to-end control loop period, it may be necessary to include other static overhead costs such as safety margin, compute overhead, and network overhead. These sources of static overhead cannot be defined as real-time streams, so they will not be tuned by the data streams optimizer.
  • Safety margin is the amount of time that the workload can exceed its allotted latency budget without violating the latency deadline.
  • Compute overhead is the portion of the control loop allotted to loads and stores to cache and memory. Because the data streams optimizer does not optimize the cache and memory hierarchy, this time spent on compute can be considered overhead. Other Intel® TCC Tools, such as the cache allocation library, can be used to optimize the compute overhead.
  • Network overhead is the portion of the control loop allotted to the traversal of data across the network into the system (preceding packet Rx) and out of the system (following packet Tx).

RTCP Streams

Given this description, the RTCP workload divides into the following streams:
  • PCIe to memory (Packet Rx)
  • Core to PCIe (Packet Tx)
  • PCIe from memory (Packet Tx)

Calculating RTCP Stream Latency Requirements

The table below shows the latency numbers necessary to determine the stream requirements.

| Full Workload | Measured Full Workload Latency (µs) | Stream | Measured Stream Max Latency (µs) | Measured Stream Min Latency (µs) | % of Tunable Latency |
|---------------|-------------------------------------|--------|----------------------------------|----------------------------------|----------------------|
| RTCP | 173.00 | PCIe to Memory | 25.00 | 0.50 | 41.7% |
| RTCP | 173.00 | Core to PCIe | 10.00 | 0.25 | 16.7% |
| RTCP | 173.00 | PCIe from Memory | 25.00 | 1.00 | 41.7% |
| RTCP | 173.00 | Compute Overhead | 80.00 | N/A | N/A |
| RTCP | 173.00 | Static SW Overhead | 8.00 | N/A | N/A |
| RTCP | 173.00 | Safety Margin Overhead | 12.50 | N/A | N/A |
| RTCP | 173.00 | Network Overhead | 12.50 | N/A | N/A |
For this example, the RTCP target full workload latency is 125 microseconds. Latency targets for end-to-end workloads that span the duration of the real-time control loop within a system are generally bounded by the control hardware on a system. For example, this workload might drive a robotic arm that requires actuator manipulation every 125 µs.
The measured full workload latency is 173 microseconds, and through deconstruction, RTCP can be divided into three streams and four static segments:
  • PCIe-to-memory stream: 25 µs maximum and 0.5 µs minimum latency
  • Core-to-PCIe stream: 10 µs maximum and 0.25 µs minimum latency
  • PCIe-from-memory stream: 25 µs maximum and 1 µs minimum latency
  • Static compute overhead: 80 µs
  • Static software overhead: 8 µs
  • Static safety margin overhead: 12.5 µs
  • Static network overhead: 12.5 µs
The PCIe-to-memory stream makes up 41.7% of the tunable latency. The core-to-PCIe stream makes up 16.7% of the tunable latency. The PCIe-from-memory stream makes up 41.7% of the tunable latency.
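These percentages follow directly from the measured stream maxima. The following sketch reproduces the arithmetic using the values from the table above:

```python
measured_max_us = {
    "PCIe to Memory": 25.0,
    "Core to PCIe": 10.0,
    "PCIe from Memory": 25.0,
}
tunable_total_us = sum(measured_max_us.values())  # 60 µs of tunable latency

for stream, latency_us in measured_max_us.items():
    print(f"{stream}: {latency_us / tunable_total_us:.1%}")
# PCIe to Memory: 41.7%, Core to PCIe: 16.7%, PCIe from Memory: 41.7%
```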

Individual Stream Latency Calculation

The measurements described above enable the calculation of the target stream max latency for each stream.
| Full Workload | Target Full Workload Latency (µs) | Stream | Target Stream Max Latency (µs) | Jitter Ratio |
|---------------|-----------------------------------|--------|--------------------------------|--------------|
| RTCP | 125 | PCIe to Memory | 5.0 | 0.98 |
| RTCP | 125 | Core to PCIe | 2.0 | 0.975 |
| RTCP | 125 | PCIe from Memory | 5.0 | 0.96 |
| RTCP | 125 | Compute Overhead | 80.00 | N/A |
| RTCP | 125 | Static SW Overhead | 8.00 | N/A |
| RTCP | 125 | Safety Margin Overhead | 12.50 | N/A |
| RTCP | 125 | Network Overhead | 12.50 | N/A |
To reach the target in this case, the PCIe-to-memory stream must meet a target latency of 5.0 µs, the core-to-PCIe stream must meet a target latency of 2.0 µs, and the PCIe-from-memory stream must meet a target latency of 5.0 µs.
This is calculated by subtracting the static latencies (compute overhead, static software overhead, safety margin overhead, and network overhead) from the target full workload latency to get the tunable latency:
125 µs - 80 µs - 8 µs - 12.5 µs - 12.5 µs = 12 µs
Then, multiply the tunable latency by each stream's percentage of the tunable latency:
12 µs × 41.7% = 5 µs (PCIe-to-memory target stream latency)
12 µs × 16.7% = 2 µs (core-to-PCIe target stream latency)
12 µs × 41.7% = 5 µs (PCIe-from-memory target stream latency)
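The same calculation can be expressed as a short sketch; all values are taken from the tables above:

```python
target_full_us = 125.0  # target full workload latency (µs)

# Static segments that the data streams optimizer does not tune (µs).
static_overhead_us = {
    "compute": 80.0,
    "static software": 8.0,
    "safety margin": 12.5,
    "network": 12.5,
}
tunable_budget_us = target_full_us - sum(static_overhead_us.values())  # 12 µs

# Each stream's share of the tunable budget, from the measured maxima.
share = {
    "PCIe to Memory": 25.0 / 60.0,
    "Core to PCIe": 10.0 / 60.0,
    "PCIe from Memory": 25.0 / 60.0,
}
targets_us = {s: tunable_budget_us * p for s, p in share.items()}
# {'PCIe to Memory': 5.0, 'Core to PCIe': 2.0, 'PCIe from Memory': 5.0}
```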
As a user of the data streams optimizer, you would enter these latency targets in the “latency” field of the requirements file.
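For illustration only, the targets might be recorded as sketched below. Only the "latency" field is named in this guide; the stream labels, the remaining keys, and the file layout are assumptions, so consult the data streams optimizer documentation for the actual requirements file schema.

```python
import json

# Hypothetical requirements entries: only the "latency" field is taken from
# this guide; the stream names and other keys are illustrative assumptions.
requirements = [
    {"stream": "pcie-to-memory", "latency": 5.0},    # µs
    {"stream": "core-to-pcie", "latency": 2.0},      # µs
    {"stream": "pcie-from-memory", "latency": 5.0},  # µs
]

with open("requirements.json", "w") as f:
    json.dump(requirements, f, indent=2)
```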
Jitter Ratio Adjustment
As mentioned in the previous example, the jitter ratio describes the ratio of latency jitter to maximum latency. A common characteristic of Intel® Architecture is that I/O streams with high jitter ratios tend to have more headroom for latency reduction from tuning.
This RTCP example shows the PCIe-to-memory, core-to-PCIe, and PCIe-from-memory streams having jitter ratios of 0.98, 0.975, and 0.96, respectively.
Because all three tunable streams have very high jitter ratios, their target latencies can likely be set lower than they could be if the streams had very low jitter ratios.
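These ratios can be reproduced from the measured maxima and minima in the first table, taking latency jitter as the difference between the maximum and minimum latency:

```python
measured_us = {  # (max, min) measured stream latencies in µs
    "PCIe to Memory": (25.0, 0.5),
    "Core to PCIe": (10.0, 0.25),
    "PCIe from Memory": (25.0, 1.0),
}
for stream, (lat_max, lat_min) in measured_us.items():
    jitter_ratio = (lat_max - lat_min) / lat_max
    print(f"{stream}: {jitter_ratio:g}")
# PCIe to Memory: 0.98, Core to PCIe: 0.975, PCIe from Memory: 0.96
```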

Product and Performance Information

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.