Developer Guide

  • 2021.2
  • 06/11/2021
  • Public

Packet Transmission Latency (PTL) Example

This example demonstrates how to calculate data stream latencies for a real-time component workload. A real-time component workload is one where only a portion has latency deadlines. The example uses a workload called Packet Transmission Latency (PTL).
PTL is a sample workload that involves data being transmitted over Ethernet from the host system to a remote system. This example assumes a discrete PCI Express Ethernet device: Intel® Ethernet Controller I210. The time-critical portion is from the moment a Tx function is initiated to when the packet with data is transmitted to the Ethernet PHY.

Target User for This Example

For real-time workloads consisting of a component transaction flow, a component integrator may be tasked with tuning a system to meet use-case-driven or customer-defined deadlines. A component integrator is the engineer responsible for integrating a piece of code or a device into a specific production platform. This person has deep insight into the functionality of the host hardware and the software interface.
Consider a media server that is streaming 1366 x 768 video content to a remote monitor. This workload is configured such that the server transmits a vertical line every 12.5 microseconds at ~60 frames per second. The monitor expects a new frame every ~17 milliseconds, so the server must transmit the frame data over Ethernet within that deadline. If the server misses its deadline, the monitor will read stale data and video performance will degrade. The media server may have many other responsibilities, but none of its other tasks has strict real-time requirements.
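The relationship between the 12.5 microsecond line period and the ~17 millisecond frame deadline can be checked with a few lines of arithmetic. This is a minimal sketch that assumes "vertical line" means one of the 1366 pixel columns of the frame; the variable names are illustrative only.

```python
# Frame timing for the media server example (values from the text above).
# Assumption: one "vertical line" is one of the 1366 pixel columns.
COLUMNS_PER_FRAME = 1366
LINE_PERIOD_US = 12.5  # one vertical line every 12.5 microseconds

frame_period_us = COLUMNS_PER_FRAME * LINE_PERIOD_US
frame_period_ms = frame_period_us / 1000.0
frames_per_second = 1_000_000 / frame_period_us

print(f"frame period: {frame_period_ms:.3f} ms")
print(f"frame rate:   {frames_per_second:.1f} fps")
```

Under that assumption the frame period comes out at roughly 17 milliseconds, which matches the deadline quoted above and explains why each line must leave the server within 12.5 microseconds.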
The component integrator responsible for ensuring the Packet Transmission Latency falls within the latency deadlines should follow the PTL example in this guide, which covers how to:
  • Measure the full PTL workload
  • Deconstruct PTL into individual streams
  • Measure the individual stream latencies
  • Calculate the individual stream latency targets
On the other hand, a user-space application developer (for example, someone implementing an application that makes function calls to the Ethernet driver) is not the intended audience. Such a developer generally does not have the deep knowledge of the Ethernet controller hardware and software interface that is necessary to perform the workload deconstruction and stream latency calculation.

Deconstructing PTL into Streams

Some workloads seem straightforward at first glance but turn out to contain several streams on closer inspection. PTL is one such workload. Memory-mapped I/O (MMIO) operations issued by the core are slower than direct memory access (DMA) performed by I/O devices. Therefore, instead of pushing the packet data from the core to the Ethernet controller, the core prompts the device to read the data from host memory using DMA. The flow is described below:
  1. Software calls the Tx function.
  2. Core initiates packet transmission (core issues an MMIO write to the Ethernet controller).
  3. Device reads the descriptor from host memory (PCIe upstream read request to memory).
  4. Device receives the descriptor (PCIe downstream read completion with data to Ethernet controller).
  5. Device reads the packet from host memory (PCIe upstream read request to memory).
  6. Device receives the packet (PCIe downstream read completion with data to Ethernet controller).
  7. Device transmits the packet on the PHY.
Step 1 introduces software jitter into the latency measurements because during the Tx function call, the core may be interrupted or forced to switch tasks by the scheduler. Step 2 introduces a core-to-PCIe stream. Steps 3 through 6 entail a PCIe-from-memory stream as a result of the PCIe read requests and completions. Step 7 takes place entirely on the device hardware and cannot be tuned by the data streams optimizer.
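The mapping from flow steps to segments can be written down as a small data structure. The sketch below is illustrative only: the segment labels and the "tunable" classification are names chosen here to mirror the paragraph above, not identifiers used by the data streams optimizer.

```python
from collections import defaultdict

# (step, action, segment) transcribed from the seven-step flow above.
# The segment labels are illustrative names, not identifiers from any tool.
PTL_STEPS = [
    (1, "Software calls the Tx function", "software"),
    (2, "Core issues MMIO write to the Ethernet controller", "core-to-PCIe"),
    (3, "Device requests the descriptor from host memory", "PCIe-from-memory"),
    (4, "Device receives the descriptor", "PCIe-from-memory"),
    (5, "Device requests the packet from host memory", "PCIe-from-memory"),
    (6, "Device receives the packet", "PCIe-from-memory"),
    (7, "Device transmits the packet on the PHY", "device hardware"),
]

# Only these two segments are treated as tunable streams in this example.
TUNABLE_STREAMS = {"core-to-PCIe", "PCIe-from-memory"}

# Group step numbers by the segment they belong to.
steps_by_segment = defaultdict(list)
for step, _action, segment in PTL_STEPS:
    steps_by_segment[segment].append(step)

for segment, steps in steps_by_segment.items():
    kind = "tunable stream" if segment in TUNABLE_STREAMS else "not tunable"
    print(f"{segment:18} steps {steps} ({kind})")
```

Grouping the steps this way makes it explicit that four of the seven steps belong to a single PCIe-from-memory stream, while the software and device hardware segments fall outside what stream tuning can change.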

Calculating PTL Stream Latency Requirements

The table below shows the latency numbers necessary to determine the stream requirements.
Full Workload | Measured Full Workload Latency (µs) | Streams | Measured Stream Max Latency (µs) | Measured Stream Min Latency (µs) | % of Tunable Streams
PTL | 13 | Core to PCIe | 2 | 0.25 | 33.3%
PTL | 13 | PCIe from Memory | 4 | 1 | 66.7%
PTL | 13 | Static SW Overhead | 8 | N/A | N/A
For this example, assume that the PTL full workload target latency is 12.5 microseconds (such a target could come from component requirements, customer requirements, protocol specifications, and so on).
The measured full workload latency is 13 microseconds. Through deconstruction, PTL can be broken down into two streams and one static segment:
  • Core-to-PCIe stream: 2 µs maximum latency and 0.25 µs minimum latency
  • PCIe-from-memory stream: 4 µs maximum latency and 1 µs minimum latency
  • Static software overhead: 8 µs maximum latency
In this example, the sum of the measured maximum latencies of the individual segments is higher than the measured full workload latency (14 µs versus 13 µs). This is a typical result, and it can occur for two reasons: (1) it is statistically improbable that every segment hits its maximum latency in the same transaction, and (2) the instrumentation used to measure individual streams introduces overhead.
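% of Tunable Streams is an intermediate value used to calculate the individual stream latency targets. The values in the table are consistent with dividing each stream's measured maximum latency by the sum of the measured maximum latencies of all tunable streams (2 µs + 4 µs = 6 µs); the static software overhead is excluded.
The core-to-PCIe stream makes up 33.3% of the tunable latency (2 µs of 6 µs). The PCIe-from-memory stream makes up 66.7% of the tunable latency (4 µs of 6 µs).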
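The percentages and the 14 µs versus 13 µs comparison can be reproduced from the measured values. This is a minimal sketch; the formula (stream maximum divided by the sum of tunable stream maxima) is inferred from the table values rather than quoted from the tool, and all names are illustrative.

```python
# Measured values from the table above (microseconds).
measured_max_us = {"core-to-PCIe": 2.0, "PCIe-from-memory": 4.0}
static_sw_overhead_us = 8.0
measured_full_workload_us = 13.0

# Assumed formula: each stream's share of the total tunable max latency.
tunable_total_us = sum(measured_max_us.values())
tunable_share = {
    name: latency / tunable_total_us
    for name, latency in measured_max_us.items()
}

# Sum of all segment maxima, for comparison with the full workload latency.
sum_of_maxima_us = tunable_total_us + static_sw_overhead_us

for name, share in tunable_share.items():
    print(f"{name}: {share:.1%} of tunable latency")
print(f"sum of segment maxima: {sum_of_maxima_us} us "
      f"(measured full workload: {measured_full_workload_us} us)")
```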

Individual Stream Latency Calculation

The measurements described above enable the calculation of the target stream max latency for each stream.
Full Workload | Target Full Workload Latency (µs) | Streams | Target Stream Max Latency (µs) | Jitter Ratio
PTL | 12.5 | Core to PCIe | 1.50 | 0.88
PTL | 12.5 | PCIe from Memory | 3.00 | 0.75
PTL | 12.5 | Static SW Overhead | 8 | N/A
To meet the 12.5 µs target full workload latency, the core-to-PCIe stream must meet a target latency of 1.5 µs, and the PCIe-from-memory stream must meet a target latency of 3.0 µs.
These targets are calculated by first subtracting the static latency from the target full workload latency to get the tunable latency:
12.5 µs – 8 µs = 4.5 µs
The tunable latency is then multiplied by the % of Tunable Streams value for each stream:
4.5 µs * 33.3% ≈ 1.5 µs (Core-to-PCIe target stream latency)
4.5 µs * 66.7% ≈ 3.0 µs (PCIe-from-memory target stream latency)
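The same calculation can be expressed as a short script, which is convenient when the target or the measurements change. This is a sketch under the assumptions stated in this example: the static software overhead is fixed at 8 µs, and the tunable budget is split in proportion to each stream's measured maximum latency. The variable names are hypothetical.

```python
# Inputs from this example (microseconds).
target_full_workload_us = 12.5
static_sw_overhead_us = 8.0
measured_max_us = {"core-to-PCIe": 2.0, "PCIe-from-memory": 4.0}

# Budget left for the tunable streams once the static part is removed.
tunable_budget_us = target_full_workload_us - static_sw_overhead_us

# Split the budget in proportion to each stream's measured max latency.
tunable_total_us = sum(measured_max_us.values())
target_stream_max_us = {
    name: round(tunable_budget_us * latency / tunable_total_us, 2)
    for name, latency in measured_max_us.items()
}

for name, target in target_stream_max_us.items():
    print(f"{name}: target max latency {target:.2f} us")
```

The per-stream targets sum to the 4.5 µs tunable budget, so meeting both targets keeps the full workload within 12.5 µs as long as the static overhead does not grow.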
As a user of the data streams optimizer, you would enter these latency targets in the “latency_us” field of the requirements file.

Jitter Ratio Adjustment

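The jitter ratio describes the ratio of latency jitter to max latency. The values in the table above are consistent with taking jitter as the difference between the measured maximum and minimum stream latencies; for example, (2 µs – 0.25 µs) / 2 µs ≈ 0.88 for the core-to-PCIe stream. A common characteristic of Intel® Architecture is that I/O streams with high jitter ratios tend to have more headroom for latency reduction from tuning.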
If the target stream max latencies provided in the table above do not result in successful tuning, you can perform a jitter ratio adjustment to decrease the target latency of streams with higher jitter ratios, and increase the target latency of streams with lower jitter ratios.
Because this PTL example shows the core-to-PCIe stream and the PCIe-from-memory stream having jitter ratios of 0.88 and 0.75 respectively, there may be more tuning headroom for the former than the latter.
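The jitter ratios in the table can be reproduced from the measured stream latencies. The sketch below assumes that jitter is the difference between the measured maximum and minimum latency, which matches the table values but is an inference rather than a definition taken from the tool; the function and variable names are illustrative.

```python
def jitter_ratio(max_us: float, min_us: float) -> float:
    """Ratio of latency jitter (assumed to be max - min) to max latency."""
    return (max_us - min_us) / max_us

# (max, min) measured stream latencies in microseconds, from the first table.
measured_us = {
    "core-to-PCIe": (2.0, 0.25),
    "PCIe-from-memory": (4.0, 1.0),
}

ratios = {name: jitter_ratio(hi, lo) for name, (hi, lo) in measured_us.items()}

# Streams with higher ratios are listed first: per the text above, they may
# have more headroom, so their targets are candidates for tightening.
ranked = sorted(ratios, key=ratios.get, reverse=True)

for name in ranked:
    print(f"{name}: jitter ratio {ratios[name]:.2f}")
```

The ranking only indicates which stream's target to tighten first during an adjustment; how far to shift each target is left to the integrator's judgment and is not prescribed here.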
