Developer Guide

  • 2021.2
  • 06/11/2021
  • Public

Sample Demo: MMIO Reads

This demo is intended to show you how to use the data streams optimizer, as well as provide an example of the performance benefits that the tool offers.
You will use the data streams optimizer to improve the performance of a sample workload by tuning the core-from-PCIe stream (MMIO reads). You will use provided samples of an environment file, requirements file, test workload, and workload validation script.
After trying this demo, you will have a better sense of how the tool works. You can then start to use the tool with your real-time workload instead of the sample workload and assess how the tool can help you achieve your requirements.

Scenario

As context for this demo, consider the following scenario:
Imagine a high-speed motion control application that begins its control loop with I/O from a PCIe-attached industrial Ethernet controller, such as the Intel® Ethernet Controller I225 or an integrated Time-Sensitive Networking (TSN) controller.
About 90 µs is budgeted for the core to begin the control loop by reading sensor data from the network card (MMIO read latency), including static software overhead. In the diagram above, the overall MMIO read latency is t3’ - t0’ = 90 µs.
Imagine that you have tried to run this scenario with Intel® TCC Mode enabled in BIOS. Intel® TCC Mode settings in BIOS already meet the MMIO read latency requirement of 90 µs, but you have found that power consumption exceeds your requirement. Intel® TCC Mode disables all power management settings and may negatively affect use cases that have power consumption requirements. The data streams optimizer can balance latency and power requirements. You decide to use the data streams optimizer to try to meet the MMIO read latency requirement of 90 µs, while also improving power consumption compared to Intel® TCC Mode.
The MRL Sample Workload is a proxy for such a scenario.
This demo does not measure power consumption, but power consumption can be measured using tools outside of Intel® TCC Tools, such as Intel® SoC Watch. Get Intel® SoC Watch from Intel® VTune™ Profiler 2021.4.0.

About MMIO Read Latency (MRL) Sample

In this demo, you will use MRL as the workload.
MRL is a sample application that simulates a portion of an industrial control loop with input (packet reception, Rx), compute, and output (packet transmission, Tx) segments. In particular, MRL measures how fast a core can begin the Rx segment by reading sensor data from a network interface card (NIC).
MRL uses the core-from-PCIe (MMIO reads) stream and measures the latency of that stream.
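To make the stream concrete, the sketch below times a single 4-byte MMIO read from a PCIe device’s BAR0 through sysfs on Linux. It is illustrative only and is not how MRL itself measures latency; the BDF is a placeholder matching the sample requirements file, the script requires root privileges, and Python-level timing like this includes interpreter overhead.

import mmap
import time

BDF = "0000:aa:00.0"  # placeholder Bus:Device.Function of the NIC
resource = f"/sys/bus/pci/devices/{BDF}/resource0"

with open(resource, "rb") as f:
    # Map the first page of BAR0 (read-only).
    bar = mmap.mmap(f.fileno(), 4096, access=mmap.ACCESS_READ)
    start = time.perf_counter_ns()
    value = int.from_bytes(bar[0:4], "little")  # one 4-byte MMIO read
    latency_us = (time.perf_counter_ns() - start) / 1000
    bar.close()

print(f"MMIO read returned {value:#010x} in {latency_us:.2f} us")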

Target User for MRL

For real-time workloads representative of control loops, like MRL, a system integrator may be tasked with tuning a system to meet use-case-driven cycle times. This system integrator role has a global view of the underlying hardware and software. They should have insight into the functionality of, and interactions between, the various system components within the critical path, from data reception, through the compute workload, to data transmission. This individual should have the knowledge (and/or have gathered the data from other roles) and ability to:
  • Measure the MRL workload and individual stream latency
  • Calculate the individual stream latency target (a brief sketch follows this list)
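As an illustration only, a stream latency target might be derived from the cycle-time budget roughly as follows. The overhead value below is a made-up placeholder, not part of the demo; in this demo, the full 90 µs budget is used directly as the stream requirement.

# Illustrative calculation only; the software overhead value is assumed.
cycle_budget_us = 90          # budget for the core to begin the control loop
software_overhead_us = 10     # assumed static software overhead (placeholder)
stream_latency_target_us = cycle_budget_us - software_overhead_us
print(f"core-from-PCIe latency target: {stream_latency_target_us} us")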

Overview of the Sample Workload Validation Script

For the demo, you will use a sample workload validation script specific to the MRL workload.
The script does the following (a minimal sketch appears after this list):
  1. Runs the MRL workload on the target system, according to the options specified at the command line.
  2. Compares the maximum latency measurement from the MRL workload with the latency requirement specified in the --latency_us command-line option.
  3. Returns 0 if the maximum latency measurement is equal to or lower than the latency requirement, indicating validation was successful. Returns non-zero otherwise.
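The following is a minimal sketch of what such a script might look like; it is not the actual mrl_validation_script.py shipped with the tools. The run_workload() helper, the workload binary path, and the output parsing are all hypothetical.

import argparse
import subprocess
import sys

def run_workload(device, iterations, core):
    # Hypothetical helper: launch the MRL workload pinned to the given core and
    # return its reported maximum latency in microseconds. The binary path and
    # output format are placeholders.
    result = subprocess.run(
        ["taskset", "-c", str(core), "/path/to/mrl_workload",
         "--device", device, "--iterations", str(iterations)],
        capture_output=True, text=True, check=True)
    # Assume the workload prints the maximum latency (us) on its last output line.
    return float(result.stdout.strip().splitlines()[-1])

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--device", required=True)
    parser.add_argument("--iterations", type=int, required=True)
    parser.add_argument("--latency_us", type=float, required=True)
    parser.add_argument("--core", type=int, required=True)
    args = parser.parse_args()

    try:
        max_latency_us = run_workload(args.device, args.iterations, args.core)
    except Exception:
        # Any exit code other than 0 or 1 tells the data streams optimizer
        # that validation failed to complete.
        sys.exit(2)

    # Exit code 0: requirement met. Exit code 1: requirement not met.
    sys.exit(0 if max_latency_us <= args.latency_us else 1)

if __name__ == "__main__":
    main()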
For general specifications for all workload validation scripts and more details about this sample, see Create a Workload Validation Script.
The next section shows where the workload validation script appears in the requirements file.

Overview of the Sample MRL Requirements File

For the demo, you will use a sample requirements file specific to the MRL workload, with predefined values. The sample MRL requirements file is shown below:
{ "workload": { "command": "python3 /usr/share/tcc_tools/tools/demo/workloads/bin/mrl_validation_script.py", "arguments": [ "--device I225", "--iterations 10000000", "--latency_us 90", "--core 3" ] }, "requirements": { "compatibility": "SoftwareSRAM", "streams":[ { "producer": "aa:00.0", "traffic_class": 0, "consumer": "Core3", "latency_us": 90, "bytes_per_transfer": 4, "relative_priority": 0 } ] } }
In the requirements file, the “workload” fields specify the workload validation script command and arguments. After generating a tuning configuration, the tool will run the workload validation script. This demo uses a sample script specific to the MRL application.
  • “command” specifies the sample script that will be used to validate MRL latency during the tool flow. The script must exit with code 0 if the requirements are met and 1 if they are not met. Any other exit code means the validation failed to complete, and the tuning process will stop. Any type of program can be used as the validation command.
  • The --device argument specifies the device for the MMIO read. The device can be an Intel® Ethernet Controller I225 or an Ethernet controller with integrated Time-Sensitive Networking (TSN).
  • The --iterations argument specifies the number of iterations to run.
  • The --latency_us argument specifies the maximum latency requirement in microseconds (µs) to be verified. As defined earlier in this scenario, the requirement is 90 microseconds. The value 90 appears twice in the requirements file: once as a workload argument, where the validation script uses it to decide whether the test passes or fails, and once as the stream “latency_us” requirement, which the data streams optimizer uses to select tuning configurations that achieve a latency below this value.
  • The --core argument specifies that the sample will run on processor core 3.
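Taken together, the “workload” fields above correspond roughly to the following invocation, which the tool assembles from “command” and “arguments” when it runs the validation step:

python3 /usr/share/tcc_tools/tools/demo/workloads/bin/mrl_validation_script.py --device I225 --iterations 10000000 --latency_us 90 --core 3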
The “requirements” fields specify the data stream requirements that the tool will use to generate the aforementioned tuning configuration.
  • The compatibility option identifies which features must remain usable alongside the data streams optimizer; in this case, software SRAM is specified. Specifying the compatibility option ensures that tuning configurations selected by the data streams optimizer are compatible with the specified features. Some tuning configurations enable power management options that are not compatible with software SRAM and RTCM. For this demo, the compatibility option is set to SoftwareSRAM to ensure there will be no compatibility issues if you choose to run this demo and use the cache allocation tools on the same system. Although this setting reduces the number of tuning configurations that the data streams optimizer can use, the remaining configurations are enough to tune the system to meet the demo’s requirements. For details about compatibility between the data streams optimizer and other features, see Compatibility between Data Streams Optimizer and Cache Allocation.
  • The data stream is specified in the form of a producer/consumer pair. For the core-from-PCIe stream, the “producer” value is always the PCIe device in Bus:Device.Function (BDF) notation, with an additional Traffic Class (TC) specified. The BDF value varies by PCIe device; the specific BDF for the demo will be inserted later, depending on the target device (a short sketch for locating the BDF follows this list). The “consumer” value is always the processor core in the format CoreNumber.
  • The “latency_us” and “bytes_per_transfer” fields specify the requirements for the data stream. For the core-from-PCIe stream, latency is defined as the maximum acceptable time, in microseconds, needed to read the targeted MMIO-addressable region and return the requested data. The “bytes_per_transfer” value is defined as the burst size of the buffer that is part of the real-time control loop, in other words, the number of bytes read per transfer. For more information about generating requirements for this data stream and others, see Generate Requirements.
  • The “relative_priority” field is not used in this release.
For general specifications for all requirements files, see Create a Requirements File.
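As a convenience, the sketch below shows one way to locate the NIC’s BDF on a Linux target by reading sysfs, so that you can fill in the “producer” value. It is illustrative only: the Intel vendor ID (0x8086) is standard, but the I225 device IDs listed here are assumptions that you should confirm (for example, with lspci -nn) before editing the requirements file.

import pathlib

# Assumed PCI device IDs for Intel® Ethernet Controller I225 variants; verify
# the IDs on your system before relying on them.
ASSUMED_I225_IDS = {"0x15f2", "0x15f3"}

for dev in pathlib.Path("/sys/bus/pci/devices").iterdir():
    vendor = (dev / "vendor").read_text().strip()
    device = (dev / "device").read_text().strip()
    if vendor == "0x8086" and device in ASSUMED_I225_IDS:
        # dev.name is the full BDF, e.g. 0000:aa:00.0. The sample requirements
        # file uses the shortened bus:device.function form (aa:00.0).
        print(dev.name)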

Next Steps

Now that you have read the scenario for this sample demo, walk through the steps of this demo to see the value of tuning.
  1. Step 1: MRL Setup: Set up the hardware and make sure the real-time kernel is running on the target.
  2. Step 2: Run MRL on Untuned System: Run the MRL workload on the untuned system to get the baseline latency measurement.
  3. Step 3: Preproduction: Generate a Tuning Configuration: Walk through the data streams optimizer preproduction steps. The tool will tune the system and show a performance improvement.
  4. Step 4: Production: Apply Tuning Configuration: Walk through the data streams optimizer production steps.

Product and Performance Information

1. Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.