- What is the SNIA* NVM Programming Model?
This Storage Networking Industry Association (SNIA)* specification defines recommended behavior between various user space and operating system kernel components supporting non-volatile memory (NVM). This specification does not describe a specific API. Instead, the intent is to enable common NVM behavior to be exposed by multiple operating system-specific interfaces. Techniques used in this model include memory-mapped files and direct access (DAX). For more information, refer to the SNIA NVM Programming Model.
- What is DAX?
DAX enables direct access to files stored in persistent memory or on a block device. Without DAX support in a file system, the page cache is generally used to buffer reads and writes to files, and requires an extra copy operation.
DAX removes the extra copy operation by performing reads and writes directly to the storage device. It is also used to provide the pages that are mapped into user space by a call to mmap. For more information, refer to Direct Access for Files.
- What does a persistent memory-aware file system do?
A persistent memory-aware file system can detect whether the kernel supports DAX. If it does, an application that opens a memory-mapped file on this file system has direct access to the persistent region. Examples of persistent memory-aware file systems include EXT4 and XFS on Linux*, and NTFS on Microsoft Windows Server*.
To get DAX support, the file system must be mounted with the dax mount option. For example, on the EXT4 file system, you can mount as follows:
mkfs -t ext4 /dev/pmem0
mount -o dax /dev/pmem0 /dev/pmem
- How is memory mapping of files different on byte-addressable persistent memory?
Memory mapping of files is an old technique, and it plays an important role in persistent memory programming.
When you use memory mapping for a file, you are telling the operating system to map the file into memory, and then expose this memory region into the application's virtual address space.
For an application working with block storage, the memory-mapped region is treated as byte-addressable storage. Behind the scenes, however, page caching occurs: the operating system pauses the application to perform the I/O operation, because the underlying storage can only talk in blocks. So, even if a single byte is changed, the entire 4K block is moved to storage, which is not very efficient.
For an application working with persistent memory, the memory-mapped region of the file is treated as byte-addressable (cache line) storage, and page caching is eliminated.
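As a rough illustration of the mapping step described above, the following Python sketch maps an ordinary temporary file and updates a few bytes in place. The helper names and the scratch file are made up for this example, and the temporary file only stands in for a file on a DAX-mounted file system. The calls are the same in both cases; what changes is whether the page cache sits behind the mapping.

```python
import mmap
import os
import tempfile


def update_through_mapping(path: str, offset: int, payload: bytes) -> bytes:
    """Map a file, change a few bytes in place, and return them as read back."""
    with open(path, "r+b") as f:
        # Map the whole file into this process's virtual address space.
        with mmap.mmap(f.fileno(), 0) as region:
            # A slice assignment is a plain store into the mapped region.
            region[offset:offset + len(payload)] = payload
            # Ask the OS to write the change back to the file (msync on POSIX).
            region.flush()
            return bytes(region[offset:offset + len(payload)])


def demo_mapping() -> bytes:
    # Hypothetical scratch file, one page long and filled with zeros.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"\0" * mmap.PAGESIZE)
        path = tmp.name
    try:
        return update_through_mapping(path, 10, b"hello")
    finally:
        os.remove(path)
```

Calling demo_mapping() writes five bytes through the mapping and reads them back. On block storage the flush pushes at least a whole page to the device, which is the inefficiency the answer above describes.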
- What is atomicity?
In the context of visibility, atomicity is what other threads can see. In the context of power-fail atomicity, it is the size of the store that cannot be torn by a power failure or other interruption. In x86 processors, any store to memory has an atomicity guarantee of only eight bytes. In a real-world application, data updates may consist of chunks larger than eight bytes. Anything larger than eight bytes is not power-fail atomic and may result in a torn write.
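To make the torn-write case concrete, here is a small Python simulation. It is only a model: a bytes object stands in for persistent memory, the update is applied in eight-byte stores, and the hypothetical stores_completed parameter says how many stores finished before the simulated power failure.

```python
import struct

ATOMIC_STORE_BYTES = 8  # power-fail atomic store size stated in the text


def store_with_crash(old: bytes, new: bytes, stores_completed: int) -> bytes:
    """Overwrite `old` with `new` in 8-byte stores, stopping after N stores."""
    assert len(old) == len(new)
    result = bytearray(old)
    for i in range(stores_completed):
        start = i * ATOMIC_STORE_BYTES
        end = start + ATOMIC_STORE_BYTES
        # Each 8-byte store either happens completely or not at all.
        result[start:end] = new[start:end]
    return bytes(result)


def demo_torn_write() -> tuple:
    old = struct.pack("<QQ", 1, 1)   # a 16-byte record: two 64-bit fields
    new = struct.pack("<QQ", 2, 2)
    # Power fails after the first of the two stores.
    return struct.unpack("<QQ", store_with_crash(old, new, 1))
```

The record returned by demo_torn_write() mixes one new field with one old field, which is exactly the state that a larger-than-eight-byte update needs protection against.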
- Why do we need a block translation table (BTT) to manage sector atomicity?
The block translation table (BTT) provides atomic sector update semantics for persistent memory devices. It prevents torn writes for applications that rely on sector writes. The BTT manifests itself as a stacked block device and reserves a portion of the underlying storage for its metadata. It is an indirection table that remaps all the blocks on the volume. The BTT can be thought of as an extremely simple file system whose sole purpose is to provide atomic sector updates.
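The indirection idea can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the real BTT on-media format: the class name, the single spare block, and the crash_before_commit flag are all invented for illustration. New data goes to a free block first, and only then is one small map entry switched over.

```python
class TinyBtt:
    """Toy block translation table: logical sector -> physical block."""

    def __init__(self, sectors: int, sector_size: int = 512):
        self.sector_size = sector_size
        # One spare physical block, so a write never overwrites live data.
        self.blocks = [bytes(sector_size) for _ in range(sectors + 1)]
        self.map = list(range(sectors))  # logical -> physical
        self.free = sectors              # index of the spare block

    def read(self, lba: int) -> bytes:
        return self.blocks[self.map[lba]]

    def write(self, lba: int, data: bytes,
              crash_before_commit: bool = False) -> None:
        assert len(data) == self.sector_size
        # Step 1: write the new contents into the free block.
        self.blocks[self.free] = data
        if crash_before_commit:
            return  # simulated power failure: the map still points at old data
        # Step 2: commit by switching one map entry; the old block becomes free.
        self.map[lba], self.free = self.free, self.map[lba]
```

A reader sees either the old sector or the new one, never a mix, because the sector contents are complete before the map entry changes.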
- What are the challenges of adapting software for persistent memory?
The main challenges of implementing persistent memory support are:
- Ensuring data persistence and consistency
- Detecting and handling persistent memory errors
- What is the importance of flushing (+fence)?
When an application writes to persistent memory, the data is not guaranteed to be persistent until it reaches a power-failure protected domain. To ensure that writes are in a failure protected domain, it is necessary to flush (+fence) after writing.
- How do I flush CPU caches from user space?
You can do this in three ways:
- Use CLFLUSH to flush one cache line at a time. CLFLUSH is a serialized instruction for historical reasons, so if you have to flush a range of persistent memory, looping through it and using CLFLUSH means flushes happen one after another.
- Use CLFLUSHOPT to flush multiple cache lines in parallel. Follow this instruction with SFENCE, since it is weakly ordered. For more details on the instructions, search for the topic CLFLUSH—Flush Cache Line in the document Intel® 64 and IA-32 Architectures - Software Developer's Manual - Combined Volumes.
- Use CLWB, which behaves like CLFLUSHOPT except that the cache line may remain valid in the cache.
- Why are transactions important?
Transactions can be used to update chunks of data that are larger than a single power-fail atomic store. If the execution of a transaction is interrupted, the transactional semantics guarantee power-fail atomicity for the annotated section of code: after recovery, either all of its updates are in place or none are.
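One common way to get this all-or-nothing behavior is an undo log, sketched below in Python. This is an assumption-laden model rather than any library's implementation: a bytearray stands in for persistent memory, the class and method names are hypothetical, and the log is kept in an ordinary list, whereas a real implementation would keep the log in persistent memory and flush it before modifying the data.

```python
class UndoLogTx:
    """Toy transaction over a bytearray that stands in for persistent memory."""

    def __init__(self, memory: bytearray):
        self.memory = memory
        self.undo = []  # (offset, original bytes) pairs

    def write(self, offset: int, data: bytes) -> None:
        end = offset + len(data)
        # Snapshot the old contents before modifying them.
        self.undo.append((offset, bytes(self.memory[offset:end])))
        self.memory[offset:end] = data

    def commit(self) -> None:
        # Discarding the log makes the changes permanent.
        self.undo.clear()

    def recover(self) -> None:
        # After an interruption, roll back in reverse order.
        for offset, original in reversed(self.undo):
            self.memory[offset:offset + len(original)] = original
        self.undo.clear()
```

If recover() runs before commit(), the memory returns to its state from before the transaction, even though the individual writes were larger than eight bytes.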
- Can Intel® Transactional Synchronization Extensions (Intel® TSX) instructions be used for persistent memory?
No. As far as the processor is concerned, persistent memory is just memory and the processor can execute any type of instructions on persistent memory. The problem here is atomicity. Intel® TSX is implemented on the cache layer, so any flushes of the cache will naturally have to abort the transaction. If flushing does not occur until after the transaction succeeds, the failure atomicity and visibility atomicity may be out of sync.
Persistent Memory Development Kit (PMDK) Basics
- What is the PMDK?
The Persistent Memory Development Kit (PMDK), formerly known as the Non-Volatile Memory Library (NVML), is a collection of libraries and tools designed to support development of persistent-memory-aware applications. The open source PMDK project currently supports ten libraries, which are targeted at various use cases for persistent memory with language support for C, C++, Java*, and Python*. The PMDK also includes tools like the pmemcheck plug-in for the open source toolset, valgrind, and an increasing body of documentation, code examples, tutorials, and blog entries. The libraries are tuned and validated to production quality and are issued with a license that allows their use in both open and closed source products. The project continues to expand as new use cases are identified.
- Why use the PMDK?
The PMDK is designed to solve persistent memory challenges and facilitate the adoption of persistent memory programming. It offers developers well-tested, production-ready libraries and tools in a comprehensive implementation of the Storage Networking Industry Association Non-Volatile Memory (SNIA NVM) programming model.
- What is the difference between the Storage Performance Development Kit (SPDK) and the PMDK?
The PMDK is designed and optimized for byte-addressable persistent memory. These libraries can be used with non-volatile dual in-line memory modules (NVDIMM) such as NVDIMM-Ns in addition to Intel® Optane™ DC memory modules.
- The SPDK is a set of libraries for writing high-performance storage applications that use block I/O.
- The PMDK is focused on persistent memory and SPDK is focused on storage, but the two sets of libraries work fine together if needed.
- What language bindings are provided for PMDK?
All the libraries are implemented in C, with custom bindings for the libpmemobj library in C++.
- Does the PMDK have a library that accesses persistent memory?
Yes. Libpmem is a simple library that detects the types of flush instructions supported by the processor. It uses the best instructions for the platform to create performance-tuned routines for copying ranges of persistent memory.
- Which libraries support transactions?
- Libpmemobj provides a transactional object store, providing memory allocation, transactions, and general facilities for persistent memory programming.
- Libpmemlog provides a pmem-resident log file. This is useful for programs like databases that append frequently to a log file.
- Libpmemblk supports arrays of pmem-resident blocks, all the same size, that are atomically updated. For example, a program keeping a cache of fixed-size objects in pmem might find this library useful.
- Can I use malloc to allocate persistent memory?
No. PMDK provides an interface to allocate and manage persistent memory.
- How are the PMDK libraries tested?
The libraries were functionally validated on persistent memory emulated using DRAM. Testing on actual hardware is in progress.
- Are there examples of real-world applications using the PMDK?
Yes. For example, we added persistent memory support for Redis*, which enables additional configuration options for managing persistence. In particular, when Redis runs in Append Only File mode, it can save all commands in a persistent memory-resident log file instead of a plain-text append-only file stored on a conventional hard disk drive. Persistent memory-resident log files are implemented in the libpmemlog library.
For implementation of Redis and build instructions, see the Libraries.io documentation.
- When should I use libpmem versus libpmemobj?
Libpmem provides low-level persistent memory support. Use libpmem if you plan to handle persistent memory allocation and consistency across program interruptions yourself.
Most developers use libpmemobj, which provides:
- A transactional object store
- Memory allocation
- General facilities for persistent memory programming
Libpmemobj is itself implemented on top of libpmem.
- What is the difference between pmem_memcpy_persist and pmem_persist?
The difference is that pmem_persist does not copy anything, but only flushes data to persistence (out of the CPU cache). In other words:
pmem_memcpy_persist(dst, src, len) == memcpy(dst, src, len) + pmem_persist(dst, len)
- What do the terms object store, memory pool, and layout mean?
- Object store: Treats blobs of persistence as variable-sized objects (as opposed to files or blocks)
- Memory pool: Exposed memory mapped files
- Layout: A string of your choice that identifies a pool
- What causes libpmemobj to run slowly on SSDs?
The PMDK is designed and optimized for byte-addressable persistent memory, while SSDs are block based. Running libpmemobj on SSDs requires translation from block to byte addressing, which adds time to every transaction. It also requires moving whole blocks from the SSD to memory and back for reads and for flushing writes.
- How does an application find objects in the memory mapped file when it restarts after a crash?
Libpmemobj defines memory mapped regions as pools and they are identified by a layout. Each pool has a known location called root, and all the data structures are anchored off of root. When an application comes back from a crash it asks for the root object, from which the rest of the data can be retrieved.
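The restart flow can be pictured with the following Python sketch. The file format here is entirely hypothetical (a 32-byte layout name followed by a root offset) and is not the libpmemobj on-media format; the function names are also invented. It only shows the idea that a restarted program needs nothing but the file and the layout name to reach its data.

```python
import os
import struct
import tempfile

# Hypothetical header: 32-byte layout name, then the 8-byte offset of root.
HEADER = struct.Struct("<32sQ")


def create_pool(path: str, layout: str, root_value: int) -> None:
    with open(path, "wb") as f:
        root_offset = HEADER.size
        f.write(HEADER.pack(layout.encode(), root_offset))
        # The root object in this toy pool is a single 64-bit counter.
        f.write(struct.pack("<Q", root_value))


def open_root(path: str, layout: str) -> int:
    with open(path, "rb") as f:
        stored_layout, root_offset = HEADER.unpack(f.read(HEADER.size))
        # The layout string identifies the pool; refuse a mismatch.
        if stored_layout.rstrip(b"\0") != layout.encode():
            raise ValueError("layout mismatch")
        # Everything else is found by starting from the root.
        f.seek(root_offset)
        return struct.unpack("<Q", f.read(8))[0]


def demo_root() -> int:
    directory = tempfile.mkdtemp()
    path = os.path.join(directory, "example.pool")  # hypothetical file name
    try:
        create_pool(path, "counter_v1", 42)
        # "Restart": nothing is kept in memory, everything comes from the file.
        return open_root(path, "counter_v1")
    finally:
        os.remove(path)
        os.rmdir(directory)
```

In a real application the root would be a structure holding persistent pointers to the other data structures, rather than a single integer.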
- Does libpmemobj support local and remote replication?
Yes, libpmemobj supports both local and remote replication through the use of the sync option on the pmempool command or the pmempool_sync() API from the libpmempool(3) library.
- What support is available for transactions that span multiple different memory pools?
There is no support for transactions that span multiple memory pools, regardless of whether the pools are of the same type or of different types.
- How is concurrency handled in libpmemobj?
Libpmemobj maintains a generation number that is increased each time a pmemobj pool is opened. When a pmem-aware lock, such as a PMEM mutex, is acquired, the lock is checked against the pool's current generation number to see whether this is its first use since the pool was opened. If so, the lock is initialized. So, if a thousand locks are held when the machine crashes, all of them are effectively dropped: the generation number is incremented when the pool is opened again, and each lock carrying the old number is reinitialized on its first use. This avoids having to find all the locks and iterate through them.
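The lazy reinitialization described above can be modeled in a few lines of Python. This is a simplified sketch, not the libpmemobj implementation: the class names, the boolean lock state, and the increment of one per open are assumptions made for the example.

```python
class Pool:
    """Toy pool that only tracks its generation number."""

    def __init__(self):
        self.generation = 0  # stored in the pool, so it survives restarts

    def open(self) -> None:
        self.generation += 1  # every open starts a new generation


class PmemMutex:
    """Toy lock whose state lives in the pool next to the data it protects."""

    def __init__(self):
        self.generation = 0
        self.locked = False

    def acquire(self, pool: Pool) -> bool:
        if self.generation != pool.generation:
            # First use since the pool was opened: reinitialize the lock.
            self.generation = pool.generation
            self.locked = False
        if self.locked:
            return False  # held by someone in the current generation
        self.locked = True
        return True
```

A lock that was held at the time of a crash still has locked set to True in the pool, but the next open() changes the generation, so the first acquire() after the restart succeeds without any recovery pass over the locks.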
- Are pool management functions thread safe?
No. Pool management functions are not thread safe because we can't put the shared global state under a lock for runtime performance reasons.
- Is pmem_persist thread safe?
No. The role of pmem_persist is to ensure that the passed memory region gets out of the processor caches without regard to what is stored in the region. Store and flush are separate operations. To store and persist atomically, perform the locking around both operations manually.
- How do I determine the size of a pmemobj pool if I want to allocate N objects of specific size?
Libpmemobj uses roughly four kilobytes for each pool plus 512 kilobytes per 16 gigabytes of static metadata. For example, a 100 gigabyte pool would require 3588 kilobytes of static metadata. Additionally, each memory chunk (256 kilobytes) used for small allocations (less than or equal to two megabytes) uses 320 bytes of metadata. Also, each allocated object has a 64-byte header.
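The figures above can be turned into a quick estimate. The Python sketch below is a back-of-the-envelope calculation only: the function names are made up, the rounding up to whole 16-gigabyte units is inferred from the 100 gigabyte example, and allocator details such as size-class rounding are ignored.

```python
import math

KIB = 1024


def static_metadata_kib(pool_gib: float) -> int:
    """4 KiB per pool plus 512 KiB for each started 16 GiB of pool size."""
    return 4 + 512 * math.ceil(pool_gib / 16)


def small_alloc_overhead_bytes(count: int, object_size: int) -> int:
    """Header and chunk metadata for `count` objects of `object_size` bytes."""
    header = 64                       # per-object header
    chunk_size = 256 * KIB            # chunk used for small allocations
    chunk_metadata = 320              # metadata per chunk
    total = count * (object_size + header)
    chunks = (total + chunk_size - 1) // chunk_size  # integer ceiling
    return count * header + chunks * chunk_metadata
```

For a 100 gigabyte pool, static_metadata_kib(100) gives the 3588 kilobytes quoted above. Adding the object data, the per-object overhead, and the static metadata gives a lower bound for the pool size, so leave some headroom on top of it.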
- How do I use pmempool when creating a large pool (over 100 GB)?
One way to ensure that persistent memory is reserved before you use the pool is to create the pool with the pmempool create command. For more details, type the command man pmempool-create.
Create a 110 GB blk pool file.
$ pmempool create blk --size=110G pool.blk
Create the maximum allowed log pool file.
$ pmempool create log -M pool.log
- Is there support for multiple pools within a single file?
No. Having multiple pools in a single file is not supported. Our libraries support concatenating multiple files to create a single pool.
- How do I expand a persistent memory pool?
Persistent memory pools do not grow automatically after creation. You can use a holey (sparse) file to create a large pool, and then rely on the file system to do everything else. However, this is often seen as unsatisfactory as it is contrary to how traditional storage solutions work. For details, see Runtime extensible zones.
- What is a good way to expand a libpmemobj pool?
PMDK libraries rely on the file system's support for sparse files. This means that you can create a file as large as you could possibly want, and the storage actually consumed is only what has been allocated.
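The sparse-file behavior itself is easy to observe. The following Python sketch creates a file with a large logical size without writing any data. The file name and helper names are hypothetical, and whether blocks are really left unallocated depends on the file system.

```python
import os
import tempfile


def create_sparse_file(path: str, size: int) -> os.stat_result:
    """Create a file with logical length `size` without writing any data."""
    with open(path, "wb") as f:
        # Extending by truncate leaves a hole on sparse-capable file systems.
        f.truncate(size)
    return os.stat(path)


def demo_sparse() -> tuple:
    directory = tempfile.mkdtemp()
    path = os.path.join(directory, "example.pool")  # hypothetical file name
    try:
        st = create_sparse_file(path, 64 * 1024 * 1024)  # 64 MiB logical size
    finally:
        os.remove(path)
        os.rmdir(directory)
    # st_blocks (POSIX only) counts the 512-byte units actually allocated.
    return st.st_size, getattr(st, "st_blocks", None)
```

The first value returned is the logical size of the file. On a file system that supports sparse files, the second value stays far below the number of blocks the logical size would need.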
- Can I delete a memory pool using libpmemobj?
No. The pmemobj_close() function closes the memory pool and deletes the memory pool handle, but it does not delete the pool itself. The object store lives on in the file that contains it and may be reopened later.
To delete a pool, use one of the following options:
- Delete the file from the file system that you memory mapped (object pool).
- Use the pmempool rm command.
PMDK - Hardware and Software
- Does PMDK work with other NVDIMMs?
Yes. PMDK is platform neutral and vendor neutral, although these libraries are optimized to perform the best on Intel® Optane™ DC persistent memory.
- Is the PMDK required to access NVDIMMs?
The PMDK is not a requirement but a convenience for adopting persistent memory programming. You can use the PMDK libraries as binaries, or you can choose to reference the code in the libraries if you are implementing persistent memory access code from scratch.
- Is PMDK part of any Linux* or Microsoft Windows* distributions?
Yes. The PMDK libraries, but not the tools, are included in Linux distributions from SUSE*, Red Hat Enterprise Linux*, and Ubuntu*.
For Microsoft Windows, the PMDK libraries (but not the tools) are included in Windows Server* 2016 and Windows® 10. For details, see the pmem.io blog PMDK for Windows.
To get the complete PMDK, download it from the PMDK GitHub repository.
- Does PMDK support ARM64*?
Currently only 64-bit Linux* and Windows* on x86 are supported.
- Do block storage devices based on Intel® Optane™ technology (like Intel® Optane™ memory and Intel® Optane™ SSDs) support libpmem?
No. The PMDK is designed and optimized for byte-addressable persistent memory devices only.
Persistent Memory Over Fabric (PMOF)
- What is PMOF?
PMOF enables replication of data remotely between machines with persistent memory.
- What is the purpose of librpmem and rpmemd?
Librpmem and rpmemd implement persistent memory over fabric (PMOF). Librpmem is a library in the PMDK that will run on the initiator node and rpmemd is a new remote PMDK daemon that will run on each remote node that data is replicated to. The design makes use of the OpenFabrics Alliance (OFA) libfabric application-level API for the backend Remote Direct Memory Access (RDMA) networking infrastructure.
- How do I enable libpmemlog debug logs for my application?
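Linking the application with the -lpmemlog option selects the non-debug version of the library. That version is optimized for performance, skips checks that impact performance, and never logs any trace information or performs any run-time assertions.
To get debug logs, use the debug version of the library instead. This requires the following: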
- Libraries under /usr/lib/pmdk_debug that contain run-time assertions and trace points
- The environment variable LD_LIBRARY_PATH, which is set to /usr/lib/pmdk_debug or /usr/lib64/pmdk_debug, depending on the debug libraries installed on the system.
The trace points in the debug version of the library are enabled using the environment variable PMEMLOG_LOG_LEVEL.