PCIe Traffic in DPDK Apps
- Application: DPDK testpmd app running on one core and performing L2 forwarding. The application is compiled against DPDK with profiling by Intel® VTune™ Profiler enabled.
- Performance analysis tools:
- Intel® VTune™ Profiler 2019 Update 3: Input and Output analysis
- All the Cookbook recipes are scalable and can be applied to Intel VTune Amplifier 2018 and higher. Slight version-specific configuration changes are possible.
- Intel® VTune™ Amplifier has been renamed to Intel® VTune™ Profiler starting with its version for Intel® oneAPI Base Toolkit (Beta). You can still use a standalone version of the VTune Profiler, or its versions integrated into Intel Parallel Studio XE or Intel System Studio.
- System setup: a traffic generator and a system under test, where the testpmd app performs packet forwarding and where Intel® VTune™ Profiler collects performance data.
- CPU: Intel® Xeon® Platinum 8180 (38.5M Cache, 2.5 GHz, 28 cores)
Understand Inbound / Outbound PCIe Bandwidth Metrics
- Inbound PCIe Bandwidth is caused by device transactions targeting the system memory
  - Inbound Reads show device reads from the memory
  - Inbound Writes show device writes to the memory
- Outbound PCIe Bandwidth is caused by CPU transactions targeting the device's MMIO space
  - Outbound Reads show CPU reads from the device's MMIO space
  - Outbound Writes show CPU writes to the device's MMIO space
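The four metrics above differ only in who initiates the transaction and whether it is a read or a write. As a minimal sketch, the mapping can be written as a lookup table; the `classify` helper and the `"device"`/`"cpu"` labels are hypothetical names used only for illustration and are not part of any VTune interface.

```python
# Map (initiator, operation) to the metric name used in the list above.
PCIE_METRICS = {
    ("device", "read"): "Inbound Read",    # device reads system memory
    ("device", "write"): "Inbound Write",  # device writes system memory
    ("cpu", "read"): "Outbound Read",      # CPU reads the device's MMIO space
    ("cpu", "write"): "Outbound Write",    # CPU writes the device's MMIO space
}


def classify(initiator: str, operation: str) -> str:
    """Return the PCIe bandwidth metric a transaction is counted under."""
    return PCIE_METRICS[(initiator.lower(), operation.lower())]


if __name__ == "__main__":
    print(classify("device", "write"))
    print(classify("cpu", "write"))
```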
Configure and Run Input and Output Analysis
amplxe-cl -collect io -knob kernel-stack=false -knob dpdk=true -knob collect-memory-bandwidth=false --target-process my_process
Understand PCIe Transfers Required for Packet Forwarding
- The core prepares the Rx queue and starts polling the Rx queue tail.
- The NIC reads an Rx descriptor in the Rx queue head (Inbound Read).
- The NIC delivers the packet to the address specified in the Rx descriptor (Inbound Write).
- The NIC writes back the Rx descriptor to notify the core that the new packet arrived (Inbound Write).
- The core processes the packet.
- The core frees the Rx descriptor and moves the Rx queue tail pointer (Outbound Write).
- The core updates the Tx descriptor in the Tx queue tail.
- The core moves the Tx queue tail pointer (Outbound Write).
- The NIC reads the Tx descriptor (Inbound Read).
- The NIC reads the packet (Inbound Read).
- The NIC writes back the Tx descriptor to notify the core that the packet is transmitted and the Tx descriptor can be freed (Inbound Write).
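To see how the steps above add up, the following sketch tallies the PCIe transactions that forwarding a single packet needs when no optimization is applied. The step labels paraphrase the list above; `FORWARDING_STEPS` and `tally` are hypothetical names introduced only for this illustration.

```python
from collections import Counter

# Steps from the list above that cross the PCIe link, in order.
# Steps handled entirely by the core (polling, packet processing,
# updating the Tx descriptor in memory) are omitted.
FORWARDING_STEPS = [
    ("NIC reads Rx descriptor", "Inbound Read"),
    ("NIC writes packet to memory", "Inbound Write"),
    ("NIC writes back Rx descriptor", "Inbound Write"),
    ("Core moves Rx queue tail pointer", "Outbound Write"),
    ("Core moves Tx queue tail pointer", "Outbound Write"),
    ("NIC reads Tx descriptor", "Inbound Read"),
    ("NIC reads packet from memory", "Inbound Read"),
    ("NIC writes back Tx descriptor", "Inbound Write"),
]


def tally(steps):
    """Count transactions per PCIe metric category."""
    return Counter(kind for _, kind in steps)


counts = tally(FORWARDING_STEPS)

if __name__ == "__main__":
    for kind in ("Inbound Read", "Inbound Write",
                 "Outbound Read", "Outbound Write"):
        print(f"{kind}: {counts[kind]}")
```

The tally contains no Outbound Reads, because the core learns about queue progress from descriptor write-backs rather than from MMIO reads.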
Understand PCIe Traffic Optimizations
- No Outbound Reads. No expensive Outbound Reads (MMIO Reads) are needed to learn the head position of the Rx and Tx queues. Instead, the NIC writes back Rx and Tx descriptors to notify software that the head position has moved.
- Decreased Inbound Write Bandwidth related to the Tx descriptors. Tx descriptor write-back is required to notify the core where the Tx queue head is and which Tx descriptors can be reused. When receiving packets, it is critical to write back each Rx descriptor so that the core learns about a newly arrived packet as soon as possible. When transmitting packets, there is no need to write back each Tx descriptor. It is sufficient to notify the core about successful packet transmission periodically (for example, on every 32nd packet), which implies that all previous packets were transmitted successfully too. The NIC writes back the Tx descriptor when the RS (Report Status) bit of the Tx descriptor is set. On the DPDK side, there is an RS bit threshold; its value defines how frequently the RS bit is set and thus how frequently the NIC writes back Tx descriptors. This optimization amortizes Inbound Writes related to the Tx descriptors.
- Amortized Outbound Writes. DPDK receives and transmits packets in batches, and the application updates tail pointers after a batch of packets has been processed. Some implementations of rx_burst use the Rx free threshold. This threshold sets the number of Rx descriptors processed before the app updates the Rx queue tail pointer (note that the threshold becomes effective only when it is greater than the batch size). That way, the Outbound Writes are amortized over a number of packets.
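The effect of the two thresholds can be illustrated with a small simulation. This is a simplified model and not DPDK driver code: it assumes the RS bit is set on every `rs_thresh`-th Tx descriptor, and that the Rx tail pointer is written after a burst once at least `rx_free_thresh` descriptors have been processed since the previous write. The function and parameter names are hypothetical.

```python
def count_tx_writebacks(num_packets: int, rs_thresh: int) -> int:
    """Count Tx descriptor write-backs (Inbound Writes) for num_packets."""
    if rs_thresh < 1:
        raise ValueError("rs_thresh must be at least 1")
    writebacks = 0
    since_last_rs = 0
    for _ in range(num_packets):
        since_last_rs += 1
        if since_last_rs >= rs_thresh:  # RS bit set on this descriptor
            writebacks += 1
            since_last_rs = 0
    return writebacks


def count_rx_tail_updates(num_packets: int, batch_size: int,
                          rx_free_thresh: int) -> int:
    """Count Rx tail pointer writes (Outbound Writes) for num_packets."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    updates = 0
    held = 0  # descriptors processed since the last tail write
    remaining = num_packets
    while remaining > 0:
        received = min(batch_size, remaining)  # one rx_burst call
        remaining -= received
        held += received
        if held >= rx_free_thresh:  # simplified threshold check
            updates += 1
            held = 0
    return updates


if __name__ == "__main__":
    print(count_tx_writebacks(1024, rs_thresh=1))
    print(count_tx_writebacks(1024, rs_thresh=32))
    for thresh in (16, 32, 64):
        print(thresh, count_rx_tail_updates(1024, batch_size=32,
                                            rx_free_thresh=thresh))
```

In this model, a threshold at or below the batch size produces one tail write per burst, which matches the note that the Rx free threshold only takes effect when it exceeds the batch size.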
Estimate PCIe Bandwidth Consumption
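A rough per-packet estimate can be derived from the transfers and optimizations described above. The sketch below is an illustration under stated assumptions, not a formula taken from the tool: it counts payload bytes only (no PCIe protocol overhead), assumes a 16-byte Tx descriptor, a 4-byte tail pointer write, a batch size of 32, and an Rx free threshold that is a multiple of the batch size. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ForwardingConfig:
    packet_size: int           # bytes
    rx_desc_size: int          # bytes
    rs_thresh: int             # Tx descriptors per write-back
    rx_free_thresh: int        # Rx descriptors per tail update
    batch_size: int = 32       # assumption: packets per burst
    tx_desc_size: int = 16     # assumption: bytes
    tail_write_size: int = 4   # assumption: bytes per MMIO tail write


def bytes_per_packet(cfg: ForwardingConfig) -> dict:
    """Estimate PCIe payload bytes moved per forwarded packet."""
    # NIC reads the Rx descriptor, the Tx descriptor, and the packet.
    inbound_read = cfg.rx_desc_size + cfg.tx_desc_size + cfg.packet_size
    # NIC writes the packet and the Rx descriptor for every packet;
    # the Tx descriptor is written back once per rs_thresh packets.
    inbound_write = (cfg.packet_size + cfg.rx_desc_size
                     + cfg.tx_desc_size / cfg.rs_thresh)
    # Rx tail is written once per max(batch, threshold) packets,
    # Tx tail once per batch.
    rx_tail_period = max(cfg.batch_size, cfg.rx_free_thresh)
    outbound_write = (cfg.tail_write_size / rx_tail_period
                      + cfg.tail_write_size / cfg.batch_size)
    return {
        "Inbound Read": inbound_read,
        "Inbound Write": inbound_write,
        "Outbound Read": 0.0,  # no MMIO reads on the data path
        "Outbound Write": outbound_write,
    }


def bandwidth_mb_per_s(cfg: ForwardingConfig, packets_per_second: float) -> dict:
    """Scale the per-packet estimate by the packet rate (MB = 10**6 bytes)."""
    return {kind: size * packets_per_second / 1e6
            for kind, size in bytes_per_packet(cfg).items()}


if __name__ == "__main__":
    cfg = ForwardingConfig(packet_size=64, rx_desc_size=32,
                           rs_thresh=32, rx_free_thresh=32)
    for kind, mbps in bandwidth_mb_per_s(cfg, 10e6).items():
        print(f"{kind}: {mbps:.1f} MB/s")
```

The packet rate of 10 million packets per second in the usage example is an arbitrary input chosen for illustration, not a measured value.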
Compare Estimations vs. Analysis
Table columns: Packet Size, B; Rx Descriptor Size, B; RS Bit Threshold; Rx Free Threshold