This recipe applies the General Exploration analysis of the Intel® VTune™ Amplifier to analyze a DPDK-based application for potential misconfiguration problems on a multi-socket system. The recipe can be used for any I/O bound workload.
General Exploration analysis was renamed to Microarchitecture Exploration analysis starting with VTune Amplifier 2019.
The optimization technique used in this recipe relies on the Intel® Data Direct I/O Technology (Intel® DDIO), a feature of the Intel® Xeon® processor E5 family and Intel® Xeon® processor E7 v2 family. Intel DDIO lets an I/O device talk directly to the processor cache without accessing the main memory. This feature is enabled by default and is invisible to software.
Currently, Intel DDIO dramatically increases performance only in the local socket configuration. Hence, the I/O workload must be configured properly to benefit from Intel DDIO.
There is a distinction between two configurations:
- Local socket: I/O device is attached directly to the socket where the I/O is consumed/produced.
- Remote socket: I/O device and the core consuming/producing the data belong to different sockets. I/O data has to traverse the Intel QuickPath Interconnect (Intel QPI) to reach the consuming core.
The figures below illustrate an I/O flow in the local and remote socket topologies:
DPDK rigidly pins each polling process to a specific core. To reduce latency and maximize bandwidth through Intel DDIO, pair only cores and ports that belong to the same socket. However, a complex system with a large number of sockets, cores, and Ethernet devices can easily end up configured non-optimally in terms of Intel DDIO usage.
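The same-socket rule can be checked from Linux sysfs before any profiling. The sketch below is not part of the original recipe: it assumes a Linux system where each PCI device exposes `numa_node` and each logical CPU directory carries a `nodeN` link, and it treats the NUMA node as equivalent to the socket, which may not hold if sub-NUMA modes are enabled. The PCI address and the `SYSFS_ROOT` override are hypothetical and exist only for illustration and testing.

```shell
#!/bin/sh
# Sketch: check whether a NIC and a polling core share a NUMA node.
# SYSFS_ROOT is a hypothetical override (not a standard variable) so the
# functions can be exercised against a fake tree; it defaults to /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Print the NUMA node of a PCI device given its address, e.g. 0000:05:00.0.
# The kernel reports -1 when the node is unknown.
pci_numa_node() {
    cat "$SYSFS_ROOT/bus/pci/devices/$1/numa_node"
}

# Print the NUMA node of a logical CPU, taken from its nodeN entry.
cpu_numa_node() {
    for d in "$SYSFS_ROOT/devices/system/cpu/cpu$1"/node*; do
        [ -e "$d" ] || continue
        echo "${d##*/node}"
        return 0
    done
    return 1
}

# Print "local", "remote", or "unknown" for a (PCI address, CPU) pair.
check_locality() {
    nic_node=$(pci_numa_node "$1") || return 2
    cpu_node=$(cpu_numa_node "$2") || return 2
    if [ "$nic_node" = "-1" ]; then
        echo "unknown"
    elif [ "$nic_node" = "$cpu_node" ]; then
        echo "local"
    else
        echo "remote"
    fi
}

# Example call (the address and CPU number are placeholders):
#   check_locality 0000:05:00.0 19
```

A "remote" answer for the core that polls a given port suggests the configuration that this recipe goes on to detect with VTune Amplifier.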
This recipe demonstrates how to detect remote socket access with the Intel® VTune™ Amplifier.
This section lists the hardware and software tools used for the performance analysis scenario.
- Application: Intel® Data Plane Performance Demonstrators (Intel DPPD) PROX application that performs L2 forwarding to port 1 of packets received on port 0.
PROX is configured in two ways as follows:
- Local socket: DPDK is pinned to a core on socket 0
- Remote socket: DPDK is pinned to a core on socket 1
- Intel® VTune™ Amplifier 2018: General Exploration analysis
For trial VTune Amplifier downloads and product support, visit https://software.intel.com/en-us/vtune.
All the Cookbook recipes are scalable and can be applied to VTune Amplifier 2018 and higher. Slight version-specific configuration changes are possible.
- Operating system: Red Hat* Enterprise Linux Server 7.4
- CPU: 2x Intel Xeon processor E5-2695 v4 with Intel DDIO, which is a dual-socket system consuming data packets from the NIC (System Under Test) and the traffic generator (GEN)
This system configuration is non-deterministic and is used as an example for this particular recipe. For VTune Amplifier software and hardware requirements, see the product Release Notes.
Run General Exploration Analysis
To categorize your microarchitecture performance issues, start with the General Exploration analysis:
Find out the PID of the running PROX:
ps aux | grep prox
Run the General Exploration analysis with the VTune Amplifier command-line interface (amplxe-cl) and attach to the running PROX process:
amplxe-cl -collect general-exploration -knob collect-memory-bandwidth=true -r <result_dir> --duration 25 --target-pid <PID>
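The two steps above can be combined in a small wrapper. This is only a sketch: the `amplxe-cl` options are exactly those shown in the recipe, while the function name, the `VTUNE_CLI` override, the result directory name, and the assumption that the process is named exactly `prox` are illustrative and not taken from the original.

```shell
#!/bin/sh
# Sketch: attach the General Exploration collection to a running process.
# VTUNE_CLI is a hypothetical override so the wrapper can be dry-run;
# it defaults to the amplxe-cl binary used in this recipe.
VTUNE_CLI="${VTUNE_CLI:-amplxe-cl}"

# collect_ge RESULT_DIR PID
# Rejects a missing or non-numeric PID, then runs the recipe's command line.
collect_ge() {
    result_dir=$1
    pid=$2
    case "$pid" in
        ''|*[!0-9]*)
            echo "invalid PID: '$pid'" >&2
            return 1
            ;;
    esac
    "$VTUNE_CLI" -collect general-exploration \
        -knob collect-memory-bandwidth=true \
        -r "$result_dir" --duration 25 --target-pid "$pid"
}

# Example call (assumes the process name is exactly "prox"; pgrep -o picks
# the oldest match and -x requires an exact name match):
#   collect_ge r_prox_remote "$(pgrep -o -x prox)"
```

Checking the PID before the call avoids starting a collection with an empty `--target-pid` when PROX is not running.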
Analyze Remote Cache Usage
By default, the collected result opens in the General Exploration viewpoint. Start with the Summary window and focus on the Remote Cache metric, which is a basic indicator of a potential misconfiguration. This metric shows the percentage of clockticks spent getting data from the remote cache.
In the perfect case (local socket), the Remote Cache metric is equal to zero:
A non-zero Remote Cache metric typically signals that a core was accessing the remote LLC. For the remote socket configuration, the Remote Cache metric value is 100%, and VTune Amplifier flags it as a performance issue.
For further analysis, switch to the Memory Usage viewpoint and explore the Remote Cache Access Count metric that shows how many LLC misses were serviced by the remote cache. A high value of this metric indicates that a core and an I/O device were running on different sockets.
Compare metric values for the remote socket configuration:
And for the local socket configuration:
Identify Cores Accessing Remote Cache
To find out which cores accessed the remote cache, switch to the Bottom-up window in the Memory Usage viewpoint and choose a Core grouping level for the grid:
Note that the Remote Cache column is collapsed by default. Click the ">>" control on the right side of the column name to expand its child columns. The metric hierarchy in the columns is the same as in the Summary window, and in this case it starts with the Memory Bound group.
In this example, core_19 was accessing the remote LLC.
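To relate a core reported by VTune Amplifier to a socket, the Linux CPU topology can be listed from sysfs. The sketch below is an aid rather than part of the recipe: how the `core_19` label maps to a Linux logical CPU number or core ID is not stated here, so treat the lookup as an assumption to verify on your system. `SYSFS_ROOT` is a hypothetical override used only to make the functions testable.

```shell
#!/bin/sh
# Sketch: list the socket (physical package) and core ID of each logical CPU.
# SYSFS_ROOT is a hypothetical override; it defaults to /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Print one line per online logical CPU: "<cpu> <socket> <core_id>",
# sorted numerically by CPU number.
cpu_topology() {
    for d in "$SYSFS_ROOT"/devices/system/cpu/cpu[0-9]*; do
        # CPUs without a readable topology directory are skipped.
        [ -r "$d/topology/physical_package_id" ] || continue
        cpu=${d##*/cpu}
        socket=$(cat "$d/topology/physical_package_id")
        core=$(cat "$d/topology/core_id")
        echo "$cpu $socket $core"
    done | sort -n
}

# Print the socket of a single logical CPU.
cpu_socket() {
    cat "$SYSFS_ROOT/devices/system/cpu/cpu$1/topology/physical_package_id"
}

# Example calls:
#   cpu_topology
#   cpu_socket 19
```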
Re-Configure Your DPDK Application
If you identified remote cache issues with your DPDK application on Intel platforms, follow the configuration instructions provided in the DPDK Getting Started guide > How to get best performance with NICs on Intel platforms.
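When re-pinning, it can help to list the CPUs that are local to the NIC and choose the polling core from that list. The following sketch assumes a Linux system with NUMA information in sysfs and a known PCI address for the port; the function name, the example address, and the `SYSFS_ROOT` override are hypothetical and not part of the DPDK guide or this recipe.

```shell
#!/bin/sh
# Sketch: print the CPUs that share a NUMA node with a given PCI device.
# SYSFS_ROOT is a hypothetical override; it defaults to /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# local_cpus_for_pci PCI_ADDRESS
# Prints the kernel's cpulist for the device's NUMA node, e.g. "0-17,36-53".
local_cpus_for_pci() {
    node=$(cat "$SYSFS_ROOT/bus/pci/devices/$1/numa_node") || return 1
    # The kernel reports -1 when the device's NUMA node is unknown.
    if [ "$node" -lt 0 ]; then
        echo "NUMA node of $1 is not reported" >&2
        return 1
    fi
    cat "$SYSFS_ROOT/devices/system/node/node$node/cpulist"
}

# Example call (the address is a placeholder):
#   local_cpus_for_pci 0000:05:00.0
```

Pinning the polling process to one of the printed CPUs keeps the core and the port on the same node, which is the local socket configuration described at the start of this recipe.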
To discuss this recipe, visit the VTune Amplifier developer forum.