Developer Guide

  • 2021.2
  • 06/11/2021
  • Public

Why Use the Data Streams Optimizer

Intel engineers continuously test real-time workloads in the lab environment and experiment with tuning knobs that most affect real-time performance. Some of these knobs are provided to customers in the Intel® 64 and IA-32 Architectures Software Developer’s Manual or platform-specific documents such as the External Design Specification (EDS) or BIOS Writer’s Guide. Other knobs are not published for multiple reasons including, but not limited to, supportability and portability.
Customers use the published tuning knobs to adjust platform performance to their requirements and may engage in back-and-forth consultation with Intel for help further tuning the platform based on Intel expertise. This process can be complex and time-consuming.
The data streams optimizer simplifies this tuning process by automating it: the tool applies tuning configurations that adjust both visible and hidden tuning knobs in the form of a series of register writes.
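For illustration, a tuning configuration can be thought of as an ordered list of register writes that the tool applies on your behalf. The sketch below is purely hypothetical: the RegisterWrite structure, addresses, and values are placeholders invented for this example and do not correspond to any real tuning knob or to the tool's actual file formats.

```python
# Illustrative only: a tuning configuration modeled as an ordered list of
# register writes. The addresses and values are hypothetical placeholders,
# not real tuning knobs published by Intel.
from dataclasses import dataclass

@dataclass
class RegisterWrite:
    address: int   # hypothetical register address
    value: int     # value to program

# A "tuning configuration" is simply a sequence of writes applied in order.
example_configuration = [
    RegisterWrite(address=0x1000, value=0x1),   # e.g., adjust a power knob
    RegisterWrite(address=0x2000, value=0x3F),  # e.g., adjust a fabric knob
]

def apply_configuration(config, write_register):
    """Apply each register write through a platform-specific writer callback."""
    for reg in config:
        write_register(reg.address, reg.value)

# Example: print the writes instead of touching real hardware.
apply_configuration(example_configuration,
                    lambda addr, val: print(f"write 0x{addr:X} <- 0x{val:X}"))
```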

Finding a Balance Between Real-Time Performance and Power Management

The data streams optimizer meets workload-specific real-time performance requirements without unnecessarily sacrificing the best-effort (non-real-time) capabilities and power management of the system. It achieves this balance through a three-level platform tuning strategy that systematically reduces worst-case execution time (WCET) using an iterative process of elimination (commonly described as “knocking down the long pole in the tent”): at each iteration, the tool eliminates the largest remaining source of jitter, validates whether the optimizations applied so far are sufficient to meet the workload requirements, and repeats until it either succeeds or fails (failure indicating that the requirements exceed the hard limits of the processor).
The tool’s tuning strategy assumes that known interference vectors have been identified and can be eliminated or mitigated through platform optimizations. When these optimizations are ordered and weighted by estimated jitter reduction, a pattern emerges, showing three levels of tuning stratification (from highest to lowest estimated jitter reduction): power management, Intel® Time Coordinated Computing (Intel® TCC) features, and fabric tuning.
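A minimal sketch of this iterative strategy is shown below, assuming hypothetical apply_tuning and measure_wcet callbacks; the function names and structure are illustrative stand-ins, not the tool's actual interfaces or workflow.

```python
# Minimal sketch of the "knock down the long pole" loop. The function names
# and data here are illustrative stand-ins, not the tool's real interfaces.

# Tuning levels ordered from highest to lowest estimated jitter reduction.
TUNING_LEVELS = ["power_management", "intel_tcc_features", "fabric"]

def tune_platform(requirement_us, apply_tuning, measure_wcet):
    """Apply tuning levels one at a time until the workload requirement is met.

    requirement_us : worst-case latency the workload can tolerate (microseconds)
    apply_tuning   : callback that applies one tuning level (platform specific)
    measure_wcet   : callback that runs the workload and returns measured WCET
    """
    for level in TUNING_LEVELS:
        apply_tuning(level)        # eliminate the largest remaining jitter source
        wcet_us = measure_wcet()   # validate against the requirement
        if wcet_us <= requirement_us:
            return True            # success: requirement met
    return False                   # failure: requirement exceeds processor limits
```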

Power Management Tuning

The data streams optimizer addresses the conflict between power and performance. High throughput and low latency require the CPU to run constantly at maximum frequency, while power management features reduce energy consumption by placing the CPU into a low-power state or lowering its operating frequency.
For real-time applications that require consistent performance, power management features can negatively affect consistency by sporadically increasing latency when parts of the CPU either exit low-power states or lock phase-locked loops (PLLs) to increase frequency. However, for real-time use cases where low-power operation is also important, disabling all power management is counter-productive. The right balance of power management versus performance consistency is necessary to meet both of these goals.
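As a concrete, read-only illustration of this trade-off on Linux, the sketch below lists the idle (C-) states of a CPU and their exit latencies through the standard cpuidle sysfs interface; deeper states save more power but add more wake-up latency. This is only an observation aid and does not reflect how the data streams optimizer applies its own tuning.

```python
# Read-only sketch for Linux: list the idle (C-) states of CPU 0 and their
# exit latencies via the standard cpuidle sysfs interface. Deeper states save
# more power but add more wake-up latency, which is the trade-off described above.
from pathlib import Path

def list_cstates(cpu: int = 0):
    base = Path(f"/sys/devices/system/cpu/cpu{cpu}/cpuidle")
    for state in sorted(base.glob("state*")):
        name = (state / "name").read_text().strip()
        latency_us = (state / "latency").read_text().strip()  # exit latency in microseconds
        disabled = (state / "disable").read_text().strip()    # "1" if the state is disabled
        print(f"{state.name}: {name:8s} exit latency={latency_us:>6s} us disabled={disabled}")

if __name__ == "__main__":
    list_cstates()
```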
See the following diagram for a visual description of the balance between power and performance.

Intel® TCC Feature Tuning

In general, Intel® TCC features are processor-level optimizations that involve significant design changes across multiple subsystems on the processor. Intel® TCC features often aim to improve specific workloads or data flows (for example, PCIe reads from memory and CPU writes to memory), but can have widespread negative side effects on best-effort performance.
This trade-off is situational. High-impact, narrow-scope improvements with broad side effects make Intel® TCC features impractical to enable out of the box for non-real-time applications, but targeted Intel® TCC feature tuning can significantly improve performance for real-time applications.

Fabric Tuning

Real-time performance is bounded by the worst-case execution time or transaction latency. One major factor that contributes to worst-case performance is contention for shared hardware resources such as the processor cores, data buses, memory, and processor fabric. Real-time data streams may be forced to wait while those resources are used by best-effort data streams. Arbitration is the mechanism that manages how shared resources are allocated among the various requesters.
Some of this arbitration occurs between processor subsystems (such as arbitrating between the CPU cores and the uncore), but the majority of arbitration occurs at the microarchitecture level, between small-scale subcomponents. In extremely precise real-time control applications, where even a few microseconds of jitter can cause a deadline violation, fine-tuned control of system arbitration may be required.
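As a deliberately simplified model of why arbitration matters, the sketch below compares the worst-case wait a real-time request might see at a shared resource under round-robin arbitration versus an arbitration scheme that always grants the real-time requester first. The service times, requester counts, and policies are assumptions made for this example, not a description of actual processor fabric behavior.

```python
# Highly simplified model of shared-resource arbitration. Service times and
# policies are illustrative assumptions, not real fabric parameters.

SERVICE_TIME_US = 0.5        # assumed time to service one request at the shared resource
BEST_EFFORT_REQUESTERS = 4   # assumed number of competing best-effort requesters

def worst_case_wait_round_robin():
    # Round robin: in the worst case the real-time request arrives just after
    # its slot and must wait for every best-effort requester to be served once.
    return BEST_EFFORT_REQUESTERS * SERVICE_TIME_US

def worst_case_wait_rt_priority():
    # Real-time priority: the worst case is one in-flight best-effort request
    # that cannot be preempted once it has been granted the resource.
    return SERVICE_TIME_US

print(f"Round robin worst-case wait : {worst_case_wait_round_robin():.1f} us")
print(f"RT-priority worst-case wait : {worst_case_wait_rt_priority():.1f} us")
```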
Tuning the platform for real-time performance can affect other subsystems, such as power, thermal, and the system's ability to enter low-power states. Perform a full system analysis to determine the impact of the configuration on other performance metrics.

Product and Performance Information

1. Performance varies by use, configuration, and other factors. Learn more at www.Intel.com/PerformanceIndex.