Prepare for the Next Generation of Memory

Optimize your Application for Intel® Optane™ DC Persistent Memory

Applications are developing an insatiable appetite for DRAM. It is well known that limited system memory directly constrains the performance of many software programs. To keep up with this demand, platforms have added more and more expensive memory, since alternative solutions were not widely available… until now. Hybrid memory systems are not a new concept, but new technologies are rapidly improving their viability. A hybrid memory system usually consists of multiple levels of heterogeneous system memory between the CPU and disk. Intel® Optane™ DC persistent memory is a new type of nonvolatile memory and storage tier that is faster than SSDs or hard drives. Its latencies are near DRAM speeds, and its capacity is much larger (up to 512 GB per module). It sits between DRAM and storage in the memory hierarchy. Unlike traditional DRAM, it also has persistence capabilities.

With the release of Intel Optane DC persistent memory, these hybrid systems will become widely available, so developers need guidance and tools that help them get the best performance on these platforms.

Figure 1 - The New Memory Hierarchy


This article provides step-by-step instructions on how to use Intel® VTune™ Amplifier to determine whether an application is a good candidate for using Intel Optane DC persistent memory as affordable, high-capacity, volatile memory. It also provides some optimization guidelines to follow. Getting started with VTune Amplifier takes only a few clicks or a single command. See Figure 2 or the VTune Amplifier Getting Started Guide for more details.

Figure 2 - Quick Steps to Get Started

1. Determine the memory footprint of the application

Run a Memory Consumption analysis with VTune Amplifier. This analysis tracks all memory allocations made by the application. The report shows memory consumption over time, and from it you can determine the peak memory footprint (Figure 3): it is the highest value on the Y-axis of the timeline. In Figure 3, for example, it is around 1 GB.

Figure 3 - Memory Consumption Report


To get performance benefits from Intel Optane DC persistent memory, the application must be able to benefit from more physical memory. The memory footprint should be close to the amount of DRAM available on the system (roughly 90% of it or more), or greater than it. Since physical memory is a finite resource, you also need to account for the memory consumed by the operating system and other processes. If the memory footprint plus the expected usage of these other memory consumers is near the available DRAM size, the application cannot fit all its data in DRAM and can make use of the Intel Optane DC persistent memory.
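
The check described above can be sketched in a few lines of Python. This is only an illustration: the function name, its parameters, and the reading of the 90% guideline as "total demand of at least 0.9 times the installed DRAM" are assumptions, and the input values must come from your own measurements.

```python
def is_capacity_candidate(peak_footprint_gb, other_usage_gb, dram_gb, threshold=0.9):
    """Return True when total memory demand is at or above `threshold` of DRAM.

    peak_footprint_gb: peak footprint read from the Memory Consumption timeline.
    other_usage_gb:    estimated usage of the OS and other processes (your estimate).
    dram_gb:           DRAM installed on the system.
    threshold:         assumed reading of the "within 90%" guideline.
    """
    if dram_gb <= 0:
        raise ValueError("dram_gb must be positive")
    demand_gb = peak_footprint_gb + other_usage_gb
    return demand_gb >= threshold * dram_gb


if __name__ == "__main__":
    # A 1 GB footprint (as in Figure 3) on a hypothetical 192 GB system.
    print(is_capacity_candidate(1.0, 8.0, 192.0))
    # A hypothetical 170 GB footprint on the same system.
    print(is_capacity_candidate(170.0, 8.0, 192.0))
```

A result of False does not rule the application out. As the next paragraph notes, an application that could be scaled to use more memory may still be worth investigating.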

Another, less detectable, application characteristic is whether your application can be scaled or modified to take advantage of more memory, if it were available. If you know that your application is currently bound by the amount of available DRAM, investigate the options below, even if it currently fits well into the available memory on your platform.

2. Identify the working set of your application

The memory footprint will tell you how much memory your application is consuming but does not indicate how much of that memory is used often, infrequently, or not at all. To optimize for Intel Optane DC persistent memory, it is important to identify the working set - the objects frequently accessed by your application.

Run a Memory Access analysis in VTune Amplifier and select the option to “Analyze dynamic memory objects”. In the Bottom-up view of the GUI, you will see a grid that lists each memory object allocated by the application, its size in parentheses, and the number of loads and stores that accessed it (Figure 4). Identify the objects with the most accesses (loads and stores) and sum up their sizes (the values in parentheses). This sum is the working set size. It is up to you to determine exactly where to draw the line between what is and is not part of the working set.
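
One possible way to draw that line is sketched below in Python. It assumes you have copied the object sizes and load/store counts out of the Bottom-up grid by hand. The `MemObject` type, the `working_set` function, and the 90% access-coverage cutoff are all hypothetical choices for illustration, not something VTune Amplifier provides.

```python
from dataclasses import dataclass


@dataclass
class MemObject:
    name: str        # label of the memory object, as shown in the grid
    size_bytes: int  # size shown in parentheses
    loads: int
    stores: int

    @property
    def accesses(self):
        return self.loads + self.stores


def working_set(objects, coverage=0.9):
    """Pick the most-accessed objects until they cover `coverage` of all accesses.

    Returns (hot_objects, working_set_size_bytes). The coverage cutoff is an
    assumed heuristic; choose whatever line suits your application.
    """
    total = sum(o.accesses for o in objects)
    hot, covered = [], 0
    for obj in sorted(objects, key=lambda o: o.accesses, reverse=True):
        if total == 0 or covered >= coverage * total:
            break
        hot.append(obj)
        covered += obj.accesses
    return hot, sum(o.size_bytes for o in hot)


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    objs = [
        MemObject("index", 100, 950, 0),
        MemObject("buffer", 200, 50, 40),
        MemObject("archive", 1000, 5, 5),
    ]
    hot, size = working_set(objs)
    print([o.name for o in hot], size)
```

Lowering `coverage` shrinks the working set and raising it grows the set, so it is worth trying a few values to see how sensitive the result is.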

Figure 4 - Memory Access Analysis report with Dynamic Memory Object Analysis

3a. Determine the best memory configuration for your system

From the size of the working set, you can determine the ideal system memory configuration for DRAM and Intel Optane DC persistent memory from a cost versus performance perspective. You want enough DRAM to comfortably cache your working set (hot objects based on loads and stores), and the large Intel Optane DC persistent memory will contain the complete application footprint.
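
As a rough sketch of this sizing step, the Python function below turns a working set size and a footprint into target capacities. The function name and the 25% DRAM headroom factor are assumptions standing in for "comfortably"; real configurations are also constrained by the module sizes and population rules of your platform, which this sketch ignores.

```python
def suggest_capacities(working_set_gb, footprint_gb, dram_headroom=1.25):
    """Return target DRAM and persistent memory capacities in GB.

    dram_headroom is an assumed safety factor so that the working set
    fits in DRAM with room to spare.
    """
    if working_set_gb > footprint_gb:
        raise ValueError("working set cannot exceed the total footprint")
    return {
        "dram_gb": working_set_gb * dram_headroom,  # enough to hold the hot objects
        "pmem_gb": footprint_gb,                    # holds the complete footprint
    }


if __name__ == "__main__":
    # Hypothetical application: 40 GB working set, 300 GB footprint.
    print(suggest_capacities(40, 300))
```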

Once you have determined the configuration for your application, you can try Intel Optane DC persistent memory in Memory Mode. This doesn’t require any software changes, and your application will automatically see the Intel Optane DC persistent memory as the total addressable system memory. The working set should routinely be cached in DRAM and the remaining data will sit in Intel Optane DC persistent memory instead of out on disk.

3b. Optimize your application for a known memory configuration

In addition to Memory Mode, Intel Optane DC persistent memory can be configured in App Direct mode. This mode lets you explicitly define which objects are allocated in DRAM and which are allocated in Intel Optane DC persistent memory. It is important to make educated choices, because a poor allocation could hurt application performance. Allocation is usually handled through specific APIs, for example the allocation APIs available in the Intel Persistent Memory Development Kit (PMDK) and the memkind library.

Identify the objects with the most last-level cache (LLC) misses. Determine approximately how many of these will fit into DRAM and allocate them there. This gives them the lowest access latency, compared to the longer latency of Intel Optane DC persistent memory. For the remaining objects that have fewer LLC misses or are too large to put in DRAM, use allocation APIs to put them in Intel Optane DC persistent memory. These steps ensure that your most accessed objects have the fastest path to the CPU (allocated in DRAM), while the infrequently accessed objects take advantage of the additional Intel Optane DC persistent memory instead of sitting out on disk, which is much slower.
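
The placement procedure above amounts to a greedy fill of a DRAM budget, ordered by LLC misses. The Python sketch below illustrates that idea under stated assumptions: the function name, the tuple layout, and the tier labels are hypothetical, and the miss counts are values you would read from the Memory Access analysis yourself. It only produces a plan. The actual allocations would still be made through APIs such as those in PMDK or memkind.

```python
def place_by_llc_misses(objects, dram_budget_bytes):
    """Greedily assign objects to DRAM in order of LLC misses.

    objects: iterable of (name, size_bytes, llc_misses) tuples.
    Returns a dict mapping each name to "DRAM" or "PMEM".
    """
    placement = {}
    remaining = dram_budget_bytes
    # Consider the objects with the most LLC misses first.
    for name, size, _misses in sorted(objects, key=lambda o: o[2], reverse=True):
        if size <= remaining:
            placement[name] = "DRAM"
            remaining -= size
        else:
            # Too large for the DRAM that is left: place in persistent memory.
            placement[name] = "PMEM"
    return placement


if __name__ == "__main__":
    # Made-up objects: (name, size in GB used as a stand-in for bytes, LLC misses).
    objs = [("index", 4, 900), ("log", 10, 500), ("cache", 3, 300), ("archive", 50, 10)]
    print(place_by_llc_misses(objs, dram_budget_bytes=8))
```

A greedy fill is simple but not optimal. A large object with many misses can be skipped in favor of smaller, cooler ones, so treat the output as a starting point and verify it by profiling.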

Another consideration when optimizing is the load/store ratio of object accesses. With Intel Optane DC persistent memory, loads are generally much faster than stores. Identify objects with high load/store ratios (load-heavy objects) and allocate them in persistent memory. Allocate the store-heavy objects in DRAM.
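
A minimal Python sketch of this rule follows. The function names and the ratio threshold of 4.0 are hypothetical. The article gives no numeric cutoff, so the threshold is purely a placeholder to tune for your application.

```python
def load_store_ratio(loads, stores):
    """Loads per store; objects with no stores are treated as infinitely load heavy."""
    return float("inf") if stores == 0 else loads / stores


def suggest_tier(loads, stores, ratio_threshold=4.0):
    """Suggest "PMEM" for load-heavy objects and "DRAM" for store-heavy ones.

    ratio_threshold is an assumed cutoff, not a documented value.
    """
    return "PMEM" if load_store_ratio(loads, stores) >= ratio_threshold else "DRAM"


if __name__ == "__main__":
    print(suggest_tier(loads=1000, stores=10))  # load heavy
    print(suggest_tier(loads=100, stores=100))  # store heavy relative to the cutoff
```

This heuristic looks only at the ratio, so it should be weighed against the access-frequency guidance above: a hot object may still belong in DRAM even when it is load heavy.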

There is no hard rule for what constitutes a hot/warm/cold object and behaviors will be application dependent, but these guidelines are a starting point for choosing how to allocate objects in persistent memory. After completing this process, start profiling and tuning the application to further improve the performance with persistent memory.

Bonus Topic: Identifying platform-level bottlenecks with Platform Profiler

You now know how to analyze an application using Intel VTune Amplifier to prepare for Intel Optane DC persistent memory. Additionally, you may want to analyze your entire system or get a high-level view of some set of workloads running on a platform. The new Platform Profiler feature in Intel VTune Amplifier is designed to do just that and may show you additional insights related to compute, memory, and disk performance.
