Cookbook

  • 2020
  • 06/18/2020

IO Issues: Remote Socket Accesses

This recipe uses the General Exploration analysis of Intel® VTune™ Amplifier to check a DPDK-based application for potential misconfiguration on a multi-socket system. The same approach applies to any I/O-bound workload.
Content experts: Ilia Kurakin, Roman Khatko
General Exploration analysis was renamed to Microarchitecture Exploration analysis starting with Intel VTune Amplifier 2019.
The optimization technique used in this recipe relies on Intel® Data Direct I/O Technology (Intel® DDIO), a feature of the Intel® Xeon® processor E5 family and Intel® Xeon® processor E7 v2 family. Intel DDIO lets an I/O device talk directly to the processor cache without going through main memory. The feature is enabled by default and is transparent to software.
Currently, Intel DDIO delivers a significant performance gain only in the local socket configuration, so the I/O workload must be configured properly to take advantage of it.
There is a distinction between two configurations:
  • Local socket: The I/O device is attached directly to the socket where the I/O data is consumed or produced.
  • Remote socket: The I/O device and the core consuming or producing the data belong to different sockets. I/O data has to traverse the Intel QuickPath Interconnect (Intel QPI) to reach the consuming core.
The figures below illustrate an I/O flow in the local and remote socket topologies:
[Figure: Local socket]
[Figure: Remote socket]
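As a complement to the two definitions above, the following minimal Python sketch shows one way to tell the configurations apart on Linux. It assumes that each socket is exposed as one NUMA node and that the kernel publishes the node of a PCI device in /sys/bus/pci/devices/<address>/numa_node. The function names and the sysfs_root parameter are hypothetical helpers for illustration; they are not part of VTune Amplifier or DPDK.

```python
from pathlib import Path


def device_numa_node(pci_address, sysfs_root="/sys"):
    """Return the NUMA node that the kernel reports for a PCI device.

    The kernel writes -1 when it does not know the node.
    sysfs_root is a hypothetical parameter that makes the helper testable.
    """
    node_file = (
        Path(sysfs_root) / "bus" / "pci" / "devices" / pci_address / "numa_node"
    )
    return int(node_file.read_text().strip())


def classify_topology(device_node, core_node):
    """Classify an I/O device and a consuming core as local or remote.

    Assumption: one NUMA node corresponds to one socket.
    """
    if device_node < 0:
        return "unknown"
    return "local" if device_node == core_node else "remote"
```

For example, classify_topology(device_numa_node("0000:03:00.0"), core_node=1) would report whether a NIC at the (hypothetical) address 0000:03:00.0 is local to a core on node 1.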
DPDK rigidly pins each polling process to a specific core. It is therefore wise to pair only cores and ports that belong to the same socket, which reduces latency and maximizes bandwidth by making use of Intel DDIO. However, a complex system with many sockets, cores, and Ethernet devices can easily end up configured suboptimally for Intel DDIO.
This recipe demonstrates how to detect remote socket accesses with VTune Amplifier.
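The pinning rule above can also be sanity-checked before profiling. The sketch below is one possible approach, not the method used by VTune Amplifier: it expands kernel cpulist strings (the format found in /sys/devices/system/node/node<N>/cpulist) and reports which polling cores are not on the NIC's NUMA node. The function names are hypothetical, and the CPU numbering in the example is illustrative rather than measured on the system in this recipe.

```python
def parse_cpulist(text):
    """Expand a kernel cpulist string such as "0-17,36-53" into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def remote_cores(polling_cores, nic_node, node_cpulists):
    """Return the polling cores that do not belong to the NIC's NUMA node."""
    local_cpus = parse_cpulist(node_cpulists[nic_node])
    return sorted(core for core in polling_cores if core not in local_cpus)


# Illustrative (assumed) layout of a dual-socket system, one node per socket.
example_nodes = {0: "0-17,36-53", 1: "18-35,54-71"}
print(remote_cores([1, 19], nic_node=0, node_cpulists=example_nodes))  # [19]
```

An empty result means every polling core shares a node with the NIC; any core that is listed is a candidate for re-pinning.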

Ingredients

This section lists the hardware and software tools used for the performance analysis scenario.
  • Application: Intel® Data Plane Performance Demonstrators (Intel DPPD) PROX application that performs L2 forwarding of packets received on port 0 to port 1.
    PROX is configured in two ways:
    • Local socket: DPDK is pinned to a core on socket 0
    • Remote socket: DPDK is pinned to a core on socket 1
  • Tools:
    • Intel VTune Amplifier 2018: General Exploration analysis
    • For VTune Profiler downloads and product support, visit https://software.intel.com/en-us/vtune.
    • All the Cookbook recipes are scalable and can be applied to Intel VTune Amplifier 2018 and higher. Slight version-specific configuration changes are possible.
    • Intel® VTune™ Amplifier has been renamed to Intel® VTune™ Profiler starting with its version for Intel® oneAPI Base Toolkit (Beta). You can still use a standalone version of the VTune Profiler, or its versions integrated into Intel Parallel Studio XE or Intel System Studio.
  • Operating system: Red Hat* Enterprise Linux Server 7.4
  • CPU: 2x Intel Xeon processor E5-2695 v4 with Intel DDIO, which is a dual-socket system consuming data packets from the NIC (System Under Test) and the traffic generator (GEN)
This system configuration is not prescriptive; it is used only as an example for this particular recipe. For VTune Amplifier software and hardware requirements, see the product Release Notes.

Run General Exploration Analysis

To categorize your microarchitecture performance issues, start with the General Exploration analysis:
  1. Find out the PID of the running PROX:
    ps aux | grep prox
  2. Run the General Exploration analysis with the VTune Amplifier command-line interface (amplxe-cl) and attach to the running PROX process:
    amplxe-cl -collect general-exploration -knob collect-memory-bandwidth=true -r <result_dir> --duration 25 --target-pid <PID>
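If the collection needs to be scripted, the command from step 2 can be assembled programmatically. The following Python sketch only builds the argument list; it uses exactly the options shown above, while the function name, the PID, and the result directory name are hypothetical placeholders.

```python
import shlex


def build_collect_command(pid, result_dir, duration_s=25):
    """Build the amplxe-cl invocation from step 2 as an argument list."""
    return [
        "amplxe-cl",
        "-collect", "general-exploration",
        "-knob", "collect-memory-bandwidth=true",
        "-r", result_dir,
        "--duration", str(duration_s),
        "--target-pid", str(pid),
    ]


# 12345 and "r000ge" are placeholder values, not taken from the recipe.
cmd = build_collect_command(pid=12345, result_dir="r000ge")
print(shlex.join(cmd))
```

On a system where VTune Amplifier is installed, the list can be passed to subprocess.run(cmd, check=True) to start the collection.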

Analyze Remote Cache Usage

By default, the collected result opens in the General Exploration viewpoint. Start with the Summary window and focus on the Remote Cache metric, which is a basic indicator of a potential misconfiguration. This metric shows the percentage of clockticks spent fetching data from the remote cache.
In the ideal case (local socket), the Remote Cache metric is equal to zero.
A non-zero Remote Cache metric typically signals that a core was accessing the remote LLC. For the remote socket configuration in this recipe, the Remote Cache metric value is 100%, and VTune Amplifier flags it as a performance issue.
For further analysis, switch to the Memory Usage viewpoint and explore the Remote Cache Access Count metric, which shows how many LLC misses were serviced by the remote cache. A high value of this metric indicates that a core and an I/O device were running on different sockets.
Compare the metric values for the remote socket configuration:
[Figure: Remote Cache Access Count, remote socket configuration]
And for the local socket configuration:
[Figure: Remote Cache Access Count, local socket configuration]

Identify Cores Accessing Remote Cache

To find out which cores accessed the remote cache, switch to the Bottom-up window in the Memory Usage viewpoint and choose the Core grouping level for the grid.
Note that the Remote Cache column is collapsed by default. Click the ">>" control on the right side of the column name to expand the child columns. The metric hierarchy in the columns is the same as the metric hierarchy in the Summary window; in this case, it starts with the Memory Bound group.
In this example, core_19 was accessing the remote LLC.
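When many cores are involved, the same per-core check can be done on exported data. The sketch below assumes that the per-core grid has been saved as CSV text with two columns named Core and Remote Cache. Both column names, the function name, and the sample rows are assumptions for illustration (the rows simply mirror the example in this recipe); the real export layout may differ.

```python
import csv
import io


def cores_with_remote_cache(csv_text, core_col="Core",
                            metric_col="Remote Cache", threshold=0.0):
    """Return (core, value) pairs whose Remote Cache value exceeds threshold.

    The column names are assumed, not taken from a real VTune export.
    """
    flagged = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        value = float(row[metric_col].strip().rstrip("%"))
        if value > threshold:
            flagged.append((row[core_col], value))
    # Show the most affected cores first.
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Made-up sample rows modeled on the example in this recipe.
sample = "Core,Remote Cache\ncore_1,0.0%\ncore_19,100.0%\n"
print(cores_with_remote_cache(sample))  # [('core_19', 100.0)]
```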

Re-Configure Your DPDK Application

If you identified remote cache issues with your DPDK application on Intel platforms, follow the configuration instructions provided in the DPDK Getting Started guide > How to get best performance with NICs on Intel platforms.
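As a rough illustration of what re-pinning involves, the Python sketch below picks polling cores from the NUMA node that the NIC is attached to and formats them as a comma-separated core list. It is a sketch under stated assumptions, not the procedure from the DPDK guide: the function names are hypothetical, the node-to-core map is illustrative, and leaving core 0 to the operating system is an assumption rather than a requirement from this recipe.

```python
def pick_local_cores(node_to_cores, nic_node, count, reserved=(0,)):
    """Pick `count` cores from the NIC's NUMA node, skipping reserved cores.

    reserved defaults to core 0, assumed here to be left to the OS.
    """
    candidates = [
        core for core in sorted(node_to_cores[nic_node]) if core not in reserved
    ]
    if len(candidates) < count:
        raise ValueError("not enough free cores on the NIC's NUMA node")
    return candidates[:count]


def to_corelist(cores):
    """Format cores as a comma-separated list, for example "1,2"."""
    return ",".join(str(core) for core in cores)


# Illustrative (assumed) map of NUMA node to core ids.
example_map = {0: [0, 1, 2, 3], 1: [18, 19, 20, 21]}
print(to_corelist(pick_local_cores(example_map, nic_node=0, count=2)))  # 1,2
```

How the resulting core list is applied depends on the application; for PROX, the pinning is part of its own configuration.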
