Developer Guide

  • 2021.2
  • 06/11/2021
  • Public

Generate Requirements

Before using the data streams optimizer for your use case, you will need to decompose your workload into data streams and requirements that the tool can address.
The steps below show how to generate these data stream requirements. After you have generated the requirements, you will need to input them in the requirements file. For details about the requirements file and example values in sample requirements files, see Create a Requirements File.

Select the Data Streams

  1. Read about the Data Streams Supported in this release.
  2. Identify all of the active data streams exercised in your real-time application.

Select a Data Streams Optimizer Option

As a prerequisite for determining performance requirements, it is assumed you have run your real-time workload in a tuned software environment (see Software Tuning) and found that the workload encounters timing violations. If the system has been sufficiently tuned to minimize software jitter, such as kernel overhead and interrupts, and the workload still fails to meet real-time performance requirements, you can use the data streams optimizer to apply hardware tuning for additional optimizations such that the workload and stream requirements may be met.
The data streams optimizer requires two main input parameters:
  • Bytes per transfer for the data stream
  • Maximum tolerable latency for the number of bytes transferred
This section provides guidance on how to identify the latency and bytes-per-transfer requirements for each stream if you do not already know them. While you can perform the in-depth analysis prescribed in Data Stream Latency Requirements Examples, this could be a complex and time-consuming process. To help you get started quickly, Intel offers predetermined latency and bytes-per-transfer values as an alternative.
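As an illustration, the two input parameters could be captured in a requirements entry along these lines. This is a hedged sketch: only the "latency_us" field name appears in this guide; the other field names and the stream identifier are assumptions, and the authoritative schema is described in Create a Requirements File.

```python
import json

# Hypothetical single-stream requirement entry. Field names other than
# "latency_us" are illustrative assumptions; see "Create a Requirements
# File" for the real schema.
requirement = {
    "stream": "pcie_from_memory",  # assumed stream identifier
    "bytes_per_transfer": 1024,    # burst size of the real-time buffer, in bytes
    "latency_us": 10,              # maximum tolerable latency, in microseconds
}

print(json.dumps(requirement, indent=2))
```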
Select one of the following options:
  • Option A: Start with provided preset values and iterate as needed (“guess-and-check” approach)
    If you don’t know the bytes per transfer or maximum tolerable latency, you can use the predetermined values from Intel as your starting point. See the example values below and the sample requirements files listed in Samples. From the predetermined values, select a requirements file and edit the bytes-per-transfer and latency requirements until you reach the desired performance.
    For example, if you are targeting the core-from-PCIe stream (MMIO reads), you might select the single_corepcierd_1.json sample requirements file with 90-microsecond latency. Then, you can use the workload validation script with the MRL sample application, or replace it with your own workload, and see if the script produces a successful outcome. If so, this starting point value is sufficient.
    If the script fails and the workload exceeds the latency requirement, the tool will automatically and iteratively try tuning configurations that correspond to lower and lower latencies, until validation succeeds or there are no more tuning configurations available.
    You can also experiment with the bytes-per-transfer requirement. You can set this value to a size aligned with the PCIe specification (4096, 2048, 1024, 512, 256, 128, or 64 bytes) or to any other size, such as 500 bytes, 28000 bytes, or 67100 bytes.
    If the workload or stream requirement has not been achieved and the data streams optimizer has exhausted all tuning options, consider optimizing compute overhead with the cache allocation library, reducing network and software overhead, and isolating interfering workloads to limit their impact.
  • Option B: Deconstruct your workload to measure and calculate the stream latency requirements specific to your needs.
    See Data Stream Latency Requirements Examples for more detailed guidance on decomposing a full workload into individual data streams; identifying safety margin, static overhead, and network overhead; measuring the workload and stream latencies; and determining the target workload and stream latencies to enter into a requirements file. That in-depth explanation uses two example workloads, Real-Time Compute Performance (RTCP) and Packet Transmission Latency (PTL), to demonstrate how to quantify measured and target workload and stream latencies.
Depending on your overall workload requirements, you may need to iteratively adjust the individual stream latency value to tune the platform appropriately.
The next sections provide use case examples that give more details about the bytes-per-transfer and latency requirements for each stream. The use case examples use predetermined values as a starting point to help you determine your bytes-per-transfer and latency values.
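The iterative “guess-and-check” flow from Option A can be sketched as follows. This is a simplified model, not the tool’s implementation: `validate` stands in for running the workload validation script under a given tuning configuration, and the candidate latency values are made up, not Intel’s presets.

```python
# Model of Option A: the optimizer tries tuning configurations that
# correspond to lower and lower latencies until validation succeeds
# or the configurations are exhausted.

def validate(config_latency_us: int) -> bool:
    # Placeholder for the workload validation script: assume the
    # workload meets its deadline once tuning targets 30 us or less.
    return config_latency_us <= 30

config_latencies = [90, 60, 30, 15]  # tried from loosest to tightest

def first_passing(latencies, validate):
    for latency_us in latencies:
        if validate(latency_us):
            return latency_us  # validation succeeded; tuning complete
    return None  # no configuration satisfied the workload

print(first_passing(config_latencies, validate))  # 30
```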

Determine Bytes-per-Transfer and Latency Requirements

The tables below provide use case examples with reasonable values for the bytes-per-transfer and latency requirements. Use the data in the tables to help you determine the bytes-per-transfer and latency values for your system. Then:
  • Enter the bytes-per-transfer value in the “bytes per transfer” field of the requirements file.
  • Enter the latency value in the “latency_us” field of the requirements file.
The PCIe use cases assume the following PCIe speed and link widths:
  • 11th Gen Intel® Core™ processors: Gen 3 speed and x4 link width
  • Intel Atom® x6000E Series processors: Gen 3 speed and x2 link width
Your PCIe device may be faster or slower. If so, you may choose to scale your PCIe latency calculations to ensure greater accuracy.
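One simple way to scale, offered here as an assumption rather than a prescribed formula, is to treat transfer latency as inversely proportional to link bandwidth (speed times width):

```python
# Hedged sketch: scale a latency sized for the assumed reference link
# (Gen 3 x4 on 11th Gen Intel Core processors) to a different link,
# assuming latency is inversely proportional to link bandwidth. This
# is an approximation, not a formula from the data streams optimizer.

GEN_GTS = {1: 2.5, 2: 5.0, 3: 8.0, 4: 16.0}  # per-lane transfer rate, GT/s

def scaled_latency_us(latency_us, ref=(3, 4), dev=(3, 2)):
    """Scale latency from a reference (gen, width) link to the device link."""
    ref_bw = GEN_GTS[ref[0]] * ref[1]
    dev_bw = GEN_GTS[dev[0]] * dev[1]
    return latency_us * (ref_bw / dev_bw)

# A 10 us value sized for Gen 3 x4 roughly doubles on a Gen 3 x2 link.
print(scaled_latency_us(10, ref=(3, 4), dev=(3, 2)))  # 20.0
```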
PCIe-from-Memory Stream (PCIe Device Reads from Memory)
For the PCIe-from-memory stream, the bytes-per-transfer requirement is defined as the burst size of the buffer that is part of the real-time control loop; that is, the number of bytes read.
The latency requirement is defined as the maximum acceptable time, in microseconds, to access the buffer (transfer the bytes).
Example Use Case | Bytes-per-Transfer Value for Requirements File | Latency Value for Requirements File
An I/O device needs to retrieve 1024 bytes of a Tx buffer from main memory in less than 10 microseconds. | 1024 | 10
An I/O device needs to retrieve 256 bytes of a Tx buffer from main memory in less than 5 microseconds. | 256 | 5
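As a rough sanity check on the first row, the raw wire time for 1024 bytes on the assumed Gen 3 x4 link is a small fraction of the 10-microsecond budget; the rest of the budget covers memory access and platform latency, which is what the data streams optimizer tunes. The arithmetic below ignores protocol overhead and is only an approximation.

```python
# Approximate wire time for 1024 bytes on a Gen 3 x4 link,
# ignoring PCIe protocol overhead and platform latency.

GEN3_GTS_PER_LANE = 8.0   # Gen 3 per-lane rate in GT/s
LANES = 4
ENCODING = 128 / 130      # Gen 3 uses 128b/130b encoding

bytes_per_transfer = 1024
link_bytes_per_us = GEN3_GTS_PER_LANE * LANES * ENCODING / 8 * 1000  # bytes/us
wire_time_us = bytes_per_transfer / link_bytes_per_us

# Well under the 10 us latency value in the table's first row.
print(round(wire_time_us, 3))  # 0.26
```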
PCIe-to-Memory Stream (PCIe Device Writes to Memory)
For the PCIe-to-memory stream, the bytes-per-transfer requirement is defined as the burst size of the buffer that is part of the real-time control loop; that is, the number of bytes written.
The latency requirement is defined as the maximum acceptable time, in microseconds, to access the buffer (transfer the bytes). The latency of this stream is estimated as half of the roundtrip PCIe-from-memory stream latency because PCIe writes are posted transactions.
Example Use Case | Bytes-per-Transfer Value for Requirements File | Latency Value for Requirements File
An I/O device needs to modify 1024 bytes of an Rx buffer from main memory in less than 5 microseconds. | 1024 | 5
An I/O device needs to modify 2048 bytes of an Rx buffer from main memory in less than 10 microseconds. | 2048 | 10
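The half-roundtrip estimate for posted writes reduces to simple arithmetic; the measured read latency below is a made-up example value.

```python
# The PCIe-to-memory (write) latency is estimated as half the roundtrip
# PCIe-from-memory (read) latency, because posted writes do not wait
# for a completion to be returned.

def estimated_write_latency_us(read_roundtrip_us: float) -> float:
    return read_roundtrip_us / 2

# Example with an assumed 10 us measured read roundtrip.
print(estimated_write_latency_us(10.0))  # 5.0
```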
Core-from-PCIe Stream (Core Issues MMIO Reads to PCIe Device)
For the core-from-PCIe stream via MMIO reads, the bytes-per-transfer requirement is defined as the burst size that is part of the real-time control loop. The bytes per transfer for this stream is typically the size of a single MMIO read (4 bytes) because current platforms support only a single outstanding MMIO read per core.
The latency requirement is defined as the maximum acceptable time in microseconds needed to read the targeted MMIO addressable region and return the requested data.
Example Use Case | Bytes-per-Transfer Value for Requirements File | Latency Value for Requirements File
The core needs 2 microseconds to read a PCIe device’s MMIO addressable space. | 4 | 2
Core-from-PCIe Stream (Core Services MSI Generated by PCIe Device)
For the core-from-PCIe stream via PCIe MSIs, the bytes-per-transfer requirement is defined as the burst size of the MSI interrupts issued from a PCIe endpoint. The bytes per transfer for this stream is typically 4 bytes.
The latency requirement is defined as the maximum acceptable time in microseconds for an MSI generated by a PCIe endpoint to be serviced by the core.
Example Use Case | Bytes-per-Transfer Value for Requirements File | Latency Value for Requirements File
A network interface card (NIC) plugged into the PCIe controller processes incoming network traffic and generates an interrupt to a core periodically when new data has arrived. The core must service the interrupt via its interrupt service routine (ISR) and return the modified data within 10 microseconds. | 4 | 10
Core-to-PCIe Stream via MMIO Writes (Core Writes to PCIe Device)
For the core-to-PCIe stream, the bytes-per-transfer requirement is defined as the burst size that is part of the real-time control loop; that is, the number of bytes written via MMIO. The bytes per transfer for this stream is typically 4 bytes.
The latency requirement is defined as the maximum acceptable time, in microseconds, needed to modify the targeted MMIO addressable region. The latency of this stream is estimated as half of the roundtrip core-from-PCIe (MMIO reads) stream latency because MMIO writes are posted transactions.
Example Use Case | Bytes-per-Transfer Value for Requirements File | Latency Value for Requirements File
The core needs 1 microsecond to modify a PCIe device’s MMIO addressable space. | 4 | 1

Multiple Streams

The data streams optimizer supports tuning of more than one stream at a time.
To tune more than one stream, specify the requirements for each stream in a single requirements file.
The tuning process is as follows:
  1. The data streams optimizer selects an initial tuning configuration for each stream and then combines them into a single resulting tuning configuration.
  2. The data streams optimizer applies the resulting tuning configuration to the target system and runs the workload validation script. If the workload validation script passes, the tuning process is complete.
  3. If the workload validation script fails, the data streams optimizer selects the next configuration for each stream, combines them into one configuration, and starts the next iteration of validation. If a particular stream has no next configuration available, its last available configuration is selected again. If no stream has another configuration available, the tuning process is considered failed.
In the worst case, the total duration of the tuning process is determined by the stream with the maximum number of available tuning configurations. The duration does not depend on the number of streams.
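Steps 1 through 3 can be sketched as the loop below. The names and configurations are illustrative, and `validate` stands in for the workload validation script; note how a stream that runs out of configurations keeps its last one, and how the iteration count is bounded by the longest configuration list, matching the worst-case duration described above.

```python
# Model of the multi-stream tuning loop: combine the i-th configuration
# of each stream, validate, and advance until success or exhaustion.

def tune(stream_configs, validate):
    max_len = max(len(configs) for configs in stream_configs.values())
    for i in range(max_len):
        # Streams with no further configurations reuse their last one.
        combined = {
            stream: configs[min(i, len(configs) - 1)]
            for stream, configs in stream_configs.items()
        }
        if validate(combined):
            return combined  # tuning complete
    return None  # all configurations exhausted: tuning failed

# Example: the second stream has fewer configurations than the first.
configs = {
    "pcie_from_memory": ["cfg_a", "cfg_b", "cfg_c"],
    "pcie_to_memory": ["cfg_x", "cfg_y"],
}
result = tune(configs, lambda c: c["pcie_from_memory"] == "cfg_c")
print(result)  # {'pcie_from_memory': 'cfg_c', 'pcie_to_memory': 'cfg_y'}
```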
Consider the following example of tuning multiple streams:
Example: The requirements file for this case, multiple_pciememrd_pciememwr_0.json, is included in the package.
The individual requirements for each of the streams in this example would be calculated according to the single-stream descriptions above.
Stream | Producer | Consumer | Latency (microseconds) | Bytes Per Transfer | Relative Priority
PCIe from Memory | Memory | BDF 01:00.0, TC 0 | 5 | 64 | 1
PCIe to Memory | BDF 02:00.0, TC 1 | Memory | 15 | 64 | 2
The two streams tuned in this case will be PCIe from memory and PCIe to memory.
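The table above could map onto a two-stream requirements structure along these lines. This is a hedged sketch: only "latency_us" appears in this guide as a field name, and the other field and stream names are assumptions; consult the multiple_pciememrd_pciememwr_0.json sample for the real schema.

```python
import json

# Illustrative two-stream requirements mirroring the table above.
# Field names other than "latency_us" are assumptions.
requirements = [
    {
        "stream": "pcie_from_memory",
        "producer": "memory",
        "consumer": {"bdf": "01:00.0", "tc": 0},
        "latency_us": 5,
        "bytes_per_transfer": 64,
        "relative_priority": 1,
    },
    {
        "stream": "pcie_to_memory",
        "producer": {"bdf": "02:00.0", "tc": 1},
        "consumer": "memory",
        "latency_us": 15,
        "bytes_per_transfer": 64,
        "relative_priority": 2,
    },
]

print(json.dumps(requirements, indent=2))
```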

Product and Performance Information


Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.