Get answers to your questions about persistent memory programming.
This Storage Networking Industry Association (SNIA)* specification defines recommended behavior between various user space and operating system kernel components supporting non-volatile memory (NVM). This specification does not describe a specific API. Instead, the intent is to enable common NVM behavior to be exposed by multiple operating system-specific interfaces. Some of the techniques used in this model are memory mapped files, direct access (DAX), and so on. For more information, refer to the SNIA NVM Programming Model.
DAX enables direct access to files stored in persistent memory or on a block device. Without DAX support in a file system, the page cache is generally used to buffer reads and writes to files, and requires an extra copy operation.
DAX removes the extra copy operation by performing reads and writes directly to the storage device. It is also used to provide the pages that are mapped into a user space by a call to mmap. For more information, refer to Direct Access for Files.
The persistent memory file system can detect whether or not there is DAX support in the kernel. If so, when an application opens a memory mapped file on this file system, it has direct access to the persistent region. Examples of persistent memory-aware file systems include EXT4, XFS on Linux*, and NTFS on Microsoft Windows Server*.
To get DAX support, the file system must be mounted with the dax mount option. For example, on the EXT4 file system, you can mount as follows:
mkfs -t ext4 /dev/pmem0
mount -o dax /dev/pmem0 /mnt/pmem
Memory mapping of files is an old technique, and it plays an important role in persistent memory programming.
When you use memory mapping for a file, you are telling the operating system to map the file into memory, and then expose this memory region into the application's virtual address space.
For an application working with block storage, a memory-mapped region is treated as byte-addressable storage by the application, but the underlying device can only talk in blocks. Behind the scenes, the operating system uses the page cache and pauses the application to perform the I/O. So even if a single byte is changed, the entire 4K block must be written to storage, which is not very efficient.
For an application working with persistent memory, the region of the file that uses memory mapping is treated as byte-addressable (cache line) storage, and page caching is eliminated.
In the context of visibility, atomicity describes what other threads can see. In the context of power-fail atomicity, it is the size of a store that cannot be torn by a power failure or other interruption. On x86 processors, a store to memory is guaranteed to be power-fail atomic only up to eight bytes. In real-world applications, data updates often consist of chunks larger than eight bytes; anything larger than eight bytes is not power-fail atomic and may result in a torn write.
The block translation table (BTT) provides atomic sector update semantics for persistent memory devices. It prevents torn writes for applications that rely on sector writes. The BTT manifests itself as a stacked block device and reserves a portion of the underlying storage for its metadata. It is an indirection table that remaps all the blocks on the volume. The BTT can be thought of as an extremely simple file system whose sole purpose is to provide atomic sector updates.
What are the challenges of adapting software for persistent memory?
The main challenges of implementing persistent memory support are:
When an application writes to persistent memory, the write is not guaranteed to be persistent until it reaches a power-fail protected domain. To ensure that writes reach that domain, it is necessary to flush (and fence) after writing.
You can do this in three ways: call msync(2) on the memory-mapped range, call pmem_persist() from libpmem, or issue the processor's cache-flush instructions (for example, CLWB followed by SFENCE) directly.
Transactions can be used to update chunks of data larger than eight bytes. Transactional semantics give the application assurance that an annotated section of code is power-fail atomic: if execution of the transaction is interrupted, the data is restored to a consistent state.
No. As far as the processor is concerned, persistent memory is just memory and the processor can execute any type of instructions on persistent memory. The problem here is atomicity. Intel® TSX is implemented on the cache layer, so any flushes of the cache will naturally have to abort the transaction. If flushing does not occur until after the transaction succeeds, the failure atomicity and visibility atomicity may be out of sync.
The Persistent Memory Development Kit (PMDK), formerly known as the Non-Volatile Memory Library (NVML), is a collection of libraries and tools designed to support development of persistent-memory-aware applications. The open source PMDK project currently supports ten libraries, targeted at various persistent memory use cases, with language support for C, C++, Java*, and Python*. The PMDK also includes tools such as the pmemcheck plug-in for the open source Valgrind toolset, and a growing body of documentation, code examples, tutorials, and blog entries. The libraries are tuned and validated to production quality and are issued with a license that allows their use in both open and closed source products. The project continues to expand as new use cases are identified.
The PMDK is designed to solve persistent memory challenges and facilitate the adoption of persistent memory programming. It offers developers well-tested, production-ready libraries and tools in a comprehensive implementation of the Storage Networking Industry Association Non-Volatile Memory (SNIA NVM) programming model.
The PMDK is designed and optimized for byte-addressable persistent memory. These libraries can be used with non-volatile dual in-line memory modules (NVDIMM) such as NVDIMM-Ns in addition to Intel® Optane™ DC memory modules.
All the libraries are implemented in C, with custom bindings for the libpmemobj library in C++.
Yes. Libpmem is a simple library that detects the types of flush instructions supported by the processor. It uses the best instructions for the platform to create performance-tuned routines for copying ranges of persistent memory.
No. PMDK provides an interface to allocate and manage persistent memory.
The libraries were functionally validated on persistent memory emulated using DRAM. Testing on actual hardware is in progress.
Yes. For example, we added persistent memory support to Redis*, which enables additional configuration options for managing persistence. In particular, when running Redis in Append Only File mode, Redis saves all commands in a persistent-memory-resident log file, implemented with the libpmemlog library, instead of a plain-text append-only file stored on a conventional hard disk drive.
For implementation of Redis and build instructions, see the Libraries.io documentation.
Libpmem provides low-level persistent memory support. Use libpmem if you plan to handle persistent memory allocation and consistency across program interruptions yourself.
Most developers use libpmemobj, which provides a transactional object store on top of persistent memory, including memory allocation, transactions, and locking. Libpmemobj is itself implemented using libpmem.
The difference is that pmem_persist does not copy anything, but only flushes data to persistence (out of the CPU cache). In other words:
pmem_memcpy_persist(dst, src, len) == memcpy(dst, src, len) + pmem_persist(dst, len)
The PMDK is designed and optimized for byte-addressable persistent memory, while SSDs are block based. Running libpmemobj on SSDs requires translating between byte and block addressing, which adds time to every transaction. It also requires moving whole blocks between the SSD and memory for reads and for flushing writes.
Libpmemobj defines memory-mapped regions as pools, identified by a layout. Each pool has a known location called the root, and all data structures are anchored off the root. When an application comes back after a crash, it asks for the root object, from which the rest of the data can be retrieved.
Yes, libpmemobj supports both local and remote replication through the use of the sync option on the pmempool command or the pmempool_sync() API from the libpmempool(3) library.
There is no support for transactions that span multiple memory pools, whether the pools are of the same type or of different types.
Libpmemobj maintains a generation number that is incremented each time a pmemobj pool is opened. When a pmem-aware lock is acquired, such as a PMEM mutex, the generation number stored in the lock is checked against the pool's current generation number to see if this is the lock's first use since the pool was opened. If so, the lock is reinitialized. So, if you have a thousand locks held and the machine crashes, all those locks are effectively dropped: the generation number is incremented on the next open, and each stale lock is reinitialized on first use. This avoids having to find all the locks and iterate through them.
No. Pool management functions are not thread safe because we can't put the shared global state under a lock for runtime performance reasons.
No. The role of pmem_persist is to ensure that the passed memory region gets out of the processor caches, without regard to what is stored in the region. Store and flush are separate operations; to perform them atomically, do the locking around both operations manually.
Libpmemobj uses roughly four kilobytes for each pool plus 512 kilobytes per 16 gigabytes of static metadata. For example, a 100 gigabyte pool would require 3588 kilobytes of static metadata. Additionally, each memory chunk (256 kilobytes) used for small allocations (less than or equal to two megabytes) uses 320 bytes of metadata. Also, each allocated object has a 64-byte header.
One way to ensure that you have persistent memory reserved before you use it is the pmempool create command. For more details, run man pmempool-create.
Create a 110 GB blk pool file.
$ pmempool create blk --size=110G pool.blk
Create the maximum allowed log pool file.
$ pmempool create log -M pool.log
No. Having multiple pools in a single file is not supported. Our libraries support concatenating multiple files to create a single pool.
Persistent memory pools do not grow automatically after creation. You can use a sparse (holey) file to create a large pool, and then rely on the file system to allocate space on demand. However, this is often seen as unsatisfactory because it is contrary to how traditional storage solutions work. For details, see Runtime extensible zones.
PMDK libraries rely on the file system's support for sparse files. This means you can create a file as large as you could possibly want, and the actual storage consumed is only what is actually allocated.
No. The pmemobj_close() function closes the memory pool and does not delete the memory pool handle. The object store itself lives on in the file that contains it and may be reopened later.
To delete a pool, use one of the following options:
Yes. PMDK is platform neutral and vendor neutral, although these libraries are optimized to perform the best on Intel® Optane™ DC persistent memory.
The PMDK is not a requirement but a convenience for adopting persistent memory programming. You can use the PMDK libraries as binaries, or you can choose to reference the code in the libraries if you are implementing persistent memory access code from scratch.
Yes. The PMDK libraries, but not the tools, are included in Linux distributions from SUSE*, Red Hat Enterprise Linux*, and Ubuntu*.
For Microsoft Windows, the PMDK libraries (but not the tools) are included in Windows Server* 2016 and Windows® 10. For details, see the pmem.io blog PMDK for Windows.
To get the complete PMDK, download it from the PMDK GitHub repository.
Currently only 64-bit Linux* and Windows* on x86 are supported.
No. The PMDK is designed and optimized for byte-addressable persistent memory devices only.
PMOF enables replication of data remotely between machines with persistent memory.
Librpmem and rpmemd implement persistent memory over fabric (PMOF). Librpmem is a library in the PMDK that will run on the initiator node and rpmemd is a new remote PMDK daemon that will run on each remote node that data is replicated to. The design makes use of the OpenFabrics Alliance (OFA) libfabric application-level API for the backend Remote Direct Memory Access (RDMA) networking infrastructure.
Link the application using the -lpmemlog option. This links the non-debug version of the library, which is optimized for performance; it skips checks that would impact performance and never logs trace information or performs run-time assertions.
Include the following:
The trace points in the debug version of the library are enabled using the environment variable PMEMLOG_LOG_LEVEL.
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804