• 2020
  • 06/18/2020
  • Public Content

PMDK Application Overhead

This recipe shows how to detect and fix memory access overhead in a PMDK-based application.
Content expert: Kirill Uhanov
The Persistent Memory Development Kit (PMDK) provides support for transactional and atomic operations to keep data consistent and durable. It is a collection of open source libraries and tools available for both Linux* and Windows* OS. For more information, visit the Persistent Memory Programming web site. PMDK facilitates persistent memory programming adoption with higher-level language support. Currently, C and C++ support is fully validated and delivered on Linux, and available as early access on Windows.
The new generation of persistent memory from Intel has introduced a third memory tier. In addition to the memory and storage tiers, the persistent memory tier offers greater capacity than DRAM and significantly faster performance than storage. Applications can access persistent memory-resident data structures in-place, like they do with traditional memory, eliminating the need to page blocks of data back and forth between memory and storage.
However, using the PMDK libraries may affect your application performance. This recipe is an example of how Intel® VTune™ Amplifier can help detect such issues.


This section lists the hardware and software tools used for the performance analysis scenario:
  • Application:
    a sample application that calculates the element-wise sum of two vectors using PMDK memory allocators.
  • Compiler:
    GNU* compiler with the following compiler/linker options:
    gcc -c -o array.o -O2 -g -fopenmp -I <>/src/include -I <>/src/examples array.c
    gcc -o arrayBefore array.o -fopenmp -L <>/src/nondebug -lpmemobj -lpmem -pthread
  • Performance analysis tools:
    Intel VTune Amplifier 2018: Memory Access / Advanced Hotspots analyses
    • All the Cookbook recipes are scalable and can be applied to Intel VTune Amplifier 2018 and higher. Slight version-specific configuration changes are possible.
    • Intel® VTune™ Amplifier has been renamed to Intel® VTune™ Profiler starting with its version for Intel® oneAPI Base Toolkit (Beta). You can still use a standalone version of the VTune Profiler, or its versions integrated into Intel Parallel Studio XE or Intel System Studio.
  • Operating system:
    Ubuntu* 16.04 LTS
  • CPU:
    Intel® Core™ i7-6700K CPU @ 4.00GHz

Run Memory Access Analysis for Your PMDK App

This recipe starts with a sample application that uses persistent memory. The application runs a triad kernel from the well-known STREAM benchmark and is expected to fully utilize the DRAM bandwidth.
In this sample, the vector sum calculation is repeated in the loop to make compute work more significant and measurable for statistical analysis:
#include <ex_common.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <libpmemobj.h>
#include <omp.h>

#define REPEATS 32

POBJ_LAYOUT_BEGIN(array);
POBJ_LAYOUT_TOID(array, int);
POBJ_LAYOUT_END(array);

int main()
{
    size_t size = 82955000;
    size_t pool_size = 16200000000;
    int i, j;
    int multiplier = 3;
    PMEMobjpool *pop;
    const char *path = "test_file1";

    /* Create the pool if the file does not exist yet; otherwise open it. */
    if (file_exists(path) != 0) {
        if ((pop = pmemobj_create(path, POBJ_LAYOUT_NAME(array),
                pool_size, CREATE_MODE_RW)) == NULL) {
            printf("failed to create pool\n");
            return 1;
        }
    } else {
        if ((pop = pmemobj_open(path, POBJ_LAYOUT_NAME(array))) == NULL) {
            printf("failed to open pool\n");
            return 1;
        }
    }

    /* Allocate the three arrays from the persistent memory pool. */
    TOID(int) a;
    TOID(int) b;
    TOID(int) c;
    POBJ_ALLOC(pop, &a, int, sizeof(int) * size, NULL, NULL);
    POBJ_ALLOC(pop, &b, int, sizeof(int) * size, NULL, NULL);
    POBJ_ALLOC(pop, &c, int, sizeof(int) * size, NULL, NULL);

    /* Initialize the arrays and flush them to persistence. */
    for (i = 0; i < size; i++) {
        D_RW(a)[i] = (int)i;
        D_RW(b)[i] = (int)i + 100;
        D_RW(c)[i] = (int)i + 3;
    }
    pmemobj_persist(pop, D_RW(a), size * sizeof(*D_RW(a)));
    pmemobj_persist(pop, D_RW(b), size * sizeof(*D_RW(b)));
    pmemobj_persist(pop, D_RW(c), size * sizeof(*D_RW(c)));

    /* Triad kernel, repeated to make the compute work measurable. */
    for (j = 0; j < REPEATS; j++) {
        #pragma omp parallel for
        for (i = 0; i < size; i++) {
            D_RW(c)[i] = multiplier * D_RO(a)[i] + D_RO(b)[i];
        }
    }

    POBJ_FREE(&a);
    POBJ_FREE(&b);
    POBJ_FREE(&c);
    pmemobj_close(pop);
    return 0;
}
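As a reference point for what the triad kernel can achieve without PMDK, it may help to time the same computation on ordinary memory. The following is a minimal sketch, not part of the original sample: the function names triad_pass and time_triad are hypothetical, the arrays are assumed to be plain int buffers supplied by the caller, and timing relies on the C11 timespec_get call (wall-clock time, so it can be disturbed by clock adjustments).

```c
#include <stddef.h>
#include <time.h>

/* One triad pass over ordinary memory: c[i] = multiplier * a[i] + b[i].
 * The pragma is ignored unless the file is compiled with -fopenmp. */
void triad_pass(int *c, const int *a, const int *b, size_t n, int multiplier)
{
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++)
        c[i] = multiplier * a[i] + b[i];
}

/* Wall-clock seconds spent in `repeats` triad passes (hypothetical helper). */
double time_triad(int *c, const int *a, const int *b, size_t n,
                  int multiplier, int repeats)
{
    struct timespec start, end;

    timespec_get(&start, TIME_UTC);
    for (int j = 0; j < repeats; j++)
        triad_pass(c, a, b, n, multiplier);
    timespec_get(&end, TIME_UTC);

    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}
```

Calling time_triad on three malloc-allocated arrays of the same size as in the sample gives a baseline kernel time to compare against the PMDK version.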
To identify performance issues in the sample code and estimate the time spent on memory accesses, launch the VTune Amplifier and run the Memory Access analysis:
  1. Click the New Project button on the toolbar and specify a name for the new project.
  2. In the Analysis Target window, select the local host target system for the host-based analysis.
  3. Select the Launch Application target type and specify an application for analysis on the right pane.
  4. Click the Choose Analysis button on the right, select Microarchitecture Analysis > Memory Access on the left pane, and start the analysis.
    VTune Amplifier launches the application, collects data, and finalizes the data collection result, resolving the symbol information required for source analysis.

Identify Hotspots for the PMDK-based App

Start your analysis with the summary view, which provides application-level statistics per hardware metric. Typically, the basic performance baseline is the application Elapsed time, which is ~16 seconds for this sample code.
Despite the expected high DRAM utilization for the PMDK code, the summary metrics do not flag this sample app as DRAM bandwidth bound.
The Bandwidth Utilization Histogram also shows that the application underutilizes the DRAM bandwidth, with an Observed Maximum of about 13 GB/sec, which is much less than expected.
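To put the observed bandwidth in context, the minimum amount of data the triad kernel has to move can be estimated from the array size and repeat count. The sketch below is an illustration only: the helper names triad_bytes and bandwidth_gb_per_sec are hypothetical, and the estimate counts just two arrays read and one written per pass, ignoring write-allocate traffic and cache effects.

```c
#include <stdint.h>

/* Lower bound on bytes moved by `repeats` triad passes over arrays of
 * `n` elements: a and b are read, c is written (three arrays per pass). */
uint64_t triad_bytes(uint64_t n, uint64_t elem_size, uint64_t repeats)
{
    const uint64_t arrays_touched = 3;
    return arrays_touched * n * elem_size * repeats;
}

/* Average bandwidth in GB/sec (1 GB = 1e9 bytes) for a given time. */
double bandwidth_gb_per_sec(uint64_t bytes, double seconds)
{
    return (double)bytes / seconds / 1e9;
}
```

For the sample (size = 82955000, 4-byte int, REPEATS = 32), triad_bytes returns 31,854,720,000 bytes, or about 31.9 GB. Dividing that by the time measured for the kernel loop alone gives an average bandwidth that can be compared with the histogram.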
Evidently, the PMDK has introduced some overhead into the code. For details, group the collected data by the Function / Call Stack grouping level.
The largest hotspot is pmemobj_direct, the function called inside the D_RW and D_RO macros. Double-click the function to view the source code. The macros are defined as follows:
#define DIRECT_RW(o) \
    (reinterpret_cast < __typeof__((o)._type) > (pmemobj_direct((o).oid)))
#define DIRECT_RO(o) \
    (reinterpret_cast < const __typeof__((o)._type) > \
    (pmemobj_direct((o).oid)))
#endif /* (defined(_MSC_VER) || defined(__cplusplus)) */

#define D_RW DIRECT_RW
#define D_RO DIRECT_RO
To better visualize the DRAM bandwidth utilization during the application run, explore the bandwidth timeline, where DRAM Bandwidth shows up in green and blue.

Remove Redundant PMDK Function Calls

Since the memory for each array is allocated as one chunk, it is enough to call D_RW or D_RO only once per array, before the calculation, to get the array start addresses:
int* _c = D_RW(c);
const int* _a = D_RO(a);
const int* _b = D_RO(b);

for (j = 0; j < REPEATS; j++) {
    #pragma omp parallel for
    for (i = 0; i < size; i++) {
        _c[i] = multiplier * _a[i] + _b[i];
    }
}
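The effect of this change can be illustrated without PMDK. In the sketch below, resolve is a hypothetical stand-in for an address lookup such as the one behind D_RW and D_RO; it only counts its invocations and returns the pointer it was given. The names resolve, triad_per_access, and triad_hoisted are assumptions made for this illustration and are not PMDK APIs.

```c
#include <stddef.h>

/* Number of address lookups performed so far. */
static size_t lookup_calls = 0;

/* Hypothetical stand-in for a per-access address lookup. */
int *resolve(int *base)
{
    lookup_calls++;
    return base;
}

size_t lookup_call_count(void) { return lookup_calls; }
void reset_lookup_calls(void) { lookup_calls = 0; }

/* Before: every element access goes through the lookup (3 calls per element). */
void triad_per_access(int *c, int *a, int *b, size_t n, int multiplier)
{
    for (size_t i = 0; i < n; i++)
        resolve(c)[i] = multiplier * resolve(a)[i] + resolve(b)[i];
}

/* After: each base address is resolved once, outside the loop (3 calls total). */
void triad_hoisted(int *c, int *a, int *b, size_t n, int multiplier)
{
    int *pc = resolve(c);
    const int *pa = resolve(a);
    const int *pb = resolve(b);

    for (size_t i = 0; i < n; i++)
        pc[i] = multiplier * pa[i] + pb[i];
}
```

For n elements, the first variant performs 3n lookups per pass and the second performs 3, while both produce the same results. When the lookup is an external library call, the compiler generally cannot prove it is loop-invariant, so the hoisting has to be done by hand.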
Re-compile the application and re-run the Memory Access analysis to see how this change affects the performance.
The Elapsed time of the application has decreased significantly, and the PMDK overhead no longer affects performance.
The Bandwidth Utilization Histogram shows that the application now fully utilizes the DRAM bandwidth, with an Observed Maximum of about 25 GB/sec.
To discuss this recipe, visit the developer forum.
