Recovery and Fault-Tolerance for Persistent Memory Pools Using Persistent Memory Development Kit (PMDK)

Introduction

Application reliability means protecting against both downtime and data loss. In this article, I present pmempool, a tool available in the Persistent Memory Development Kit (PMDK) that helps prevent, diagnose, and recover from unplanned data corruption caused by hardware issues. I also show how errors can be injected to test the redundancy of your system. Everything shown in this article can be tried without real persistent memory hardware by emulating it with DRAM.

I assume that you have a basic understanding of persistent memory concepts and are familiar with general PMDK features. If not, please visit the Intel® Developer Zone (Intel® DZ) Persistent Memory Programming site, where you will find the information you need to get started.

HDDs and SSDs protect against hardware failures using RAID (Redundant Array of Independent Disks) to maintain data integrity. The Intel® Optane™ DC persistent memory module is a DIMM form factor device that sits on the memory bus and can be configured in interleaved sets for better performance. Integrated memory controllers (IMCs) do not implement any type of RAID to protect against data integrity issues. Instead, servers and workstations rely on registered error-correcting code (ECC) memory to protect against certain hardware issues.

Since persistent memory pools expose the hardware directly to applications using DAX (Direct Access), applications must be designed to handle data integrity issues caused by underlying memory hardware errors. Applications can access the media using device DAX (DevDax) or file system DAX (FSDax). DevDax exposes a character device, whereas FSDax uses a DAX-capable file system such as NTFS, XFS, or ext4. This article focuses on the FSDax use case.

Validating Persistent Memory Pools

A good application assumes that hardware will fail at some point, corrupting or losing data. Error handling is an important part of overall application reliability. Fortunately for applications built using PMDK, there are tools available for offline analysis and error recovery, although recovery is not guaranteed, as we will see next.

The pool used in the next two sections is a BLK (libpmemblk) pool with a block size of 1024 bytes, created using pmempool as shown here:

$ pmempool create -M blk 1024 /mnt/pmem/poolfile

The pmempool utility with the check command can be used to validate the integrity of metadata within a persistent memory pool or pool set (more about pool sets later) that is not in use by a running application. User data cannot be validated. The verbosity of the output can be increased using the -v option:

Listing 1. Executing pmempool check on a healthy persistent memory pool

$ pmempool check -v /mnt/pmem/poolfile
checking shutdown state
shutdown state correct
checking pool header
pool header correct
checking pmemblk header
pmemblk header correct
checking BTT Info headers
arena 0: BTT Info header checksum correct
checking BTT Map and Flog
arena 0: checking BTT Map and Flog
/mnt/pmem/poolfile: consistent

Listing 1 shows the result of checking a healthy pool. Since, in this example, /mnt/pmem/poolfile is a BLK pool type, the pmemblk header and block translation table (BTT) info headers are also checked. The pmempool check utility supports LOG (libpmemlog) and OBJ (libpmemobj) pool types and executes the appropriate checks.
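
Both the pool header checksum and the BTT Info checksum that these checks validate are Fletcher64-style checksums computed over the data as a sequence of 32-bit words. Below is a minimal, self-contained sketch of the core algorithm; it is illustrative only, since PMDK's real implementation also deals with the embedded checksum field, which must be skipped or zeroed while computing:

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of a Fletcher64-style checksum over a buffer of 32-bit words.
 * len must be a multiple of 4; overflow wraps naturally. */
uint64_t fletcher64(const void *addr, size_t len)
{
        const uint32_t *p = addr;
        uint32_t lo = 0, hi = 0;

        for (size_t i = 0; i < len / sizeof(uint32_t); i++) {
                lo += p[i];     /* running sum of the words */
                hi += lo;       /* running sum of the running sums */
        }
        return ((uint64_t)hi << 32) | lo;
}
```

The two accumulators make the checksum sensitive not only to the values of the words but also to their order, which is why the header repairs shown later require regenerating it.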

Let's imagine now that some blocks get corrupted in the memory media where our pool resides. These corrupted blocks are called bad blocks. For experimentation purposes, we can inject some bad blocks and see what the result would be. We can do this in two ways: (1) using the ndctl tool or (2) directly writing to sysfs. In this article, we will use the latter. For information regarding ndctl, please refer to the NDCTL Users Guide.

We start by getting the starting block offset (inside the persistent memory device) of our pool file:

# filefrag -v -b512 /mnt/pmem/poolfile | grep -E "^[ ]+[0-9]+.*" | head -1 | awk '{ print $4 }' | cut -d. -f1
278528
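
filefrag reports extent offsets in units of the block size we pass to it, so converting the offset above to a byte offset on the device is a single multiplication. A minimal sketch (block_to_byte_offset is an illustrative helper, and the constant 278528 is specific to this session):

```c
#include <stdint.h>

/* Sketch: convert an extent offset reported by filefrag -b512
 * (in 512-byte blocks) into a byte offset on the backing /dev/pmem device. */
uint64_t block_to_byte_offset(uint64_t block, uint64_t block_size)
{
        return block * block_size;
}
```

For this pool, 278528 × 512 = 142,606,336 bytes; notice that this is exactly the number of bytes dd manages to copy later before hitting the injected bad block.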

The block size passed to filefrag is 512 bytes (-b512), which is the default block size used in FSDax. We can inject a bad block at the above offset, which corresponds to the first block of the pool (since persistent memory is emulated here using DRAM, the device is /dev/pmem0):

# echo 278528 1 > /sys/block/pmem0/badblocks

The expected format of the badblocks file is as follows: for every consecutive run of bad blocks reported by the device, there is a line of the form "offset size", where offset is the first block affected and size is the number of blocks in the run.
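
That line format can be illustrated with a small parser sketch (parse_badblocks_line is a name made up for illustration; it is not part of ndctl or the kernel API):

```c
#include <inttypes.h>
#include <stdio.h>

/* Sketch: parse one line of the sysfs badblocks format, "offset size",
 * where offset is the first 512-byte block of a run of bad blocks and
 * size is the number of blocks affected. Returns 1 on success, 0 otherwise. */
int parse_badblocks_line(const char *line, uint64_t *offset, uint64_t *count)
{
        return sscanf(line, "%" SCNu64 " %" SCNu64, offset, count) == 2;
}
```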

Note that our hardware does not actually have bad blocks; these changes affect only the kernel's runtime state. Nevertheless, I/O touching the bad block will fail until we clear it:

# pmempool check -v /mnt/pmem/poolfile
Bus error (core dumped)
#
# cp /mnt/pmem/poolfile ./
cp: error reading '/mnt/pmem/poolfile': Input/output error
#
# dd if=/dev/pmem0 of=/dev/null
dd: error reading '/dev/pmem0': Input/output error
278528+0 records in
278528+0 records out
142606336 bytes (143 MB, 136 MiB) copied, 0.233683 s, 610 MB/s

We can clear bad blocks by writing to them. How this cleanup is performed by the underlying hardware is up to each manufacturer and outside the scope of this article.

Let's clean our injected bad block:

# cat /sys/block/pmem0/badblocks
278528 1
#
# dd conv=notrunc if=/dev/zero of=/dev/pmem0 oflag=direct bs=512 seek=278528 count=1
1+0 records in
1+0 records out
512 bytes copied, 0.000285411 s, 1.8 MB/s
#
# cat /sys/block/pmem0/badblocks
#

We have cleaned all the bad blocks. To do that, however, we have also overwritten our pool's first block with zeros. We should then recheck the health of our pool:

# pmempool check -v /mnt/pmem/poolfile
checking pool header
incorrect pool header
/mnt/pmem/poolfile: not consistent

pmempool check reports that the pool header is corrupted. To make matters worse, we do not have a backup copy. Can we fix this?

Recovery

An unsuccessful recovery

To see if pmempool can recover our corrupted pool, we can pass -r (repair) and -N (dry run) to see what would be done to the pool without actually changing it. The current version of pmempool-check (PMDK v1.4) only supports repair for libpmemblk and libpmemlog pools; libpmemobj pools are not yet supported.

# pmempool check -v -r -N /mnt/pmem/poolfile
checking pool header
incorrect pool header
pool_hdr.signature is not valid. Do you want to set it to PMEMBLK? [Y/n] Y
pool_hdr.major is not valid. Do you want to set it to default value 0x1? [Y/n] Y
setting pool_hdr.signature to PMEMBLK
setting pool_hdr.major to 0x1
invalid pool_hdr.poolset_uuid. Do you want to set it to 2a3b402a-2be0-46f0-a86d-7afef54b258a from BTT Info? [Y/n] Y
setting pool_hdr.poolset_uuid to 2a3b402a-2be0-46f0-a86d-7afef54b258a
the following error can be fixed using PMEMPOOL_CHECK_ADVANCED flag
invalid pool_hdr.checksum
/mnt/pmem/poolfile: cannot repair

Since the information identifying the type of the pool was destroyed, pmempool asks if we want to set it to PMEMBLK, which is correct for our pool, so we say yes. We also say yes to the other questions related to the rest of the pool metadata. In the end, however, the tool fails due to an invalid header checksum. It also gives us a hint that we can fix this problem by setting the PMEMPOOL_CHECK_ADVANCED flag, which we do by passing the -a option:

# pmempool check -v -r -N -a /mnt/pmem/poolfile
checking pool header
incorrect pool header
pool_hdr.signature is not valid. Do you want to set it to PMEMBLK? [Y/n] Y
pool_hdr.major is not valid. Do you want to set it to default value 0x1? [Y/n] Y
setting pool_hdr.signature to PMEMBLK
setting pool_hdr.major to 0x1
invalid pool_hdr.poolset_uuid. Do you want to set it to 2a3b402a-2be0-46f0-a86d-7afef54b258a from BTT Info? [Y/n] Y
setting pool_hdr.poolset_uuid to 2a3b402a-2be0-46f0-a86d-7afef54b258a
invalid pool_hdr.checksum. Do you want to regenerate checksum? [Y/n] Y
setting pool_hdr.checksum to 0xb199cec3475bbf3a
checking pmemblk header
pmemblk header correct
checking BTT Info headers
arena 0: BTT Info header checksum correct
checking BTT Map and Flog
arena 0: checking BTT Map and Flog
/mnt/pmem/poolfile: repaired

It seems that the tool can repair the pool metadata after all, so we rerun the above command without the -N option:

# pmempool check -v -r -a /mnt/pmem/poolfile
...
/mnt/pmem/poolfile: repaired

There is one more thing to test: the consistency of the data stored in the pool. This is application-dependent and can't be tested by a general tool. For the case presented here, I created the pool by filling every block with consecutive integers (0, 1, 2, and so on). The following listing shows the main loop of the program:

...
for (int i = 0; i < nelements; i++) {
        /* fill the buffer with the next belements consecutive integers */
        for (int j = 0; j < belements; j++) {
                buf[j] = i * belements + j;
        }
        /* write the buffer to block i of the pool */
        pmemblk_write(pbp, buf, i);
}
...

I check the consistency of the data with a similar program that reads the integers back and verifies that the values match the ones I wrote:

...
for (int i = 0; i < nelements; i++) {
        /* read block i of the pool into the buffer */
        pmemblk_read(pbp, buf, i);
        /* verify that every integer matches the value originally written */
        for (int j = 0; j < belements; j++) {
                if (buf[j] != i * belements + j) {
                        printf("content error for element %d\n", i * belements + j);
                        pmemblk_close(pbp);
                        return 1;
                }
        }
}
...

Unfortunately, the fixes done to our pool header by pmempool are not good enough this time. We can't even open the pool:

/mnt/pmem/poolfile: Invalid argument

A successful recovery

Let's now look at an example of a successful recovery from block corruption. In this case, I inject the corruption into block number 7 (278528 + 7 = 278535). Furthermore, since clearing a bad block leaves it zeroed anyway, I skip the sysfs injection step and simply zero the targeted block directly:

# dd conv=notrunc if=/dev/zero of=/dev/pmem0 oflag=direct bs=512 seek=278535 count=1
...
# pmempool check -v /mnt/pmem/poolfile
checking shutdown state
shutdown state correct
checking pool header
incorrect pool header
/mnt/pmem/poolfile: not consistent

Now that the pool is corrupted, let's see if the repair will fix it:

# pmempool check -v -r -a /mnt/pmem/poolfile
checking shutdown state
shutdown state correct
checking pool header
incorrect pool header
invalid pool_hdr.checksum. Do you want to regenerate checksum? [Y/n] Y
setting pool_hdr.checksum to 0xadbac6304e1ae81a
checking pmemblk header
pmemblk header correct
checking BTT Info headers
arena 0: BTT Info header checksum correct
checking BTT Map and Flog
arena 0: checking BTT Map and Flog
/mnt/pmem/poolfile: repaired

In this case, it seems that only the header checksum was incorrect. After the repair concludes, we check the data consistency by running the program that reads back the integers. Everything works fine, which means that block 7 did not hold anything critical to the integrity of the pool's metadata or data:

# ./checkpoolcontent.sh

****  read data from /mnt/pmem/poolfile...

file holds 3805282 elements
#

Here, "elements" means blocks of 1024 bytes (256 integers each). checkpoolcontent.sh is a very simple script that calls the reading code:

#!/bin/bash
PMEMDEV=/dev/pmem0
PMEMMNT=/mnt/pmem
GREEN='\033[0;32m'
NC='\033[0m'

printf "\n${GREEN}****  read data from $PMEMMNT/poolfile...${NC}\n\n"
./readfromblkpool $PMEMMNT/poolfile
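
The element arithmetic above (1024-byte blocks holding 256 integers each) can be sketched as follows; expected_value is an illustrative helper, not part of the programs shown:

```c
/* Sketch (assuming 4-byte int): each 1024-byte block holds 256 integers,
 * and element j of block i was written with the value i * 256 + j. */
enum { BLOCK_SIZE = 1024, ELEMS_PER_BLOCK = BLOCK_SIZE / sizeof(int) };

/* the value the check program expects at position elem of block blk */
int expected_value(int blk, int elem)
{
        return blk * ELEMS_PER_BLOCK + elem;
}
```

Under this numbering, a "content error for element 4096" corresponds to the first integer of logical data block 16 (4096 / 256 = 16).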

An impossible recovery (corrupting user data)

Let's finish this section with an example of an impossible recovery. It is impossible because only user data is damaged and, as noted earlier, pmempool validates only metadata, so it cannot even recognize the injected corruption. In this case, I target block number 24 (278528 + 24 = 278552), again by directly zeroing the block:

# dd conv=notrunc if=/dev/zero of=/dev/pmem0 oflag=direct bs=512 seek=278552 count=1
...
# pmempool check -v /mnt/pmem/poolfile
checking shutdown state
shutdown state correct
checking pool header
pool header correct
checking pmemblk header
pmemblk header correct
checking BTT Info headers
arena 0: BTT Info header checksum correct
checking BTT Map and Flog
arena 0: checking BTT Map and Flog
/mnt/pmem/poolfile: consistent

Since no metadata was affected, this time pmempool thinks everything is correct. If we check the contents of the pool, however, we discover that this is not the case:

# ./checkpoolcontent.sh

****  read data from /mnt/pmem/poolfile...

file holds 3805282 elements
content error for element 4096
#

Prevention Using Fault Tolerance

The common-sense approach to protecting data is to keep a backup. The challenge with backups, however, is keeping all copies synchronized. If your pool is large, creating a copy (local or remote) may take a long time, making frequent backups difficult; yet if we do not create copies frequently enough, we risk losing a lot of precious data.

Fortunately, PMDK supports data replication using pool sets. From an application perspective, a pool set is indistinguishable from a pool. Under the hood, a pool set is composed of multiple pool files. Pool sets are designed to serve two purposes: (1) extending the size of a pool when running out of space and (2) creating replicas to provide fault tolerance. I briefly show how you can create a pool set of each type here. For more information about pool sets, please read the official man page for poolset(5).

To create a pool set spanning multiple pool files, our mypool.set file would look something like this:

PMEMPOOLSET
100G /mountpoint0/myfile.part0
200G /mountpoint1/myfile.part1
400G /mountpoint2/myfile.part2

To create local replicas (remote replicas are also possible), we need to add a replica section:

PMEMPOOLSET
1GiB /mnt/pmem/poolfile
REPLICA
1GiB /mnt/pmem/poolreplica
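
The poolset grammar is simple: a mandatory PMEMPOOLSET header line, one "size path" line per part, and a REPLICA marker introducing each replica. As a toy illustration (poolset_count is a made-up helper, not a PMDK function), here is a sketch that counts part lines and replicas in a description like the ones above:

```c
#include <string.h>

/* Toy sketch of parsing the poolset layout shown above: count part lines
 * and REPLICA markers. PMDK's real parser also understands comments,
 * options, and remote replicas; none of that is handled here. */
int poolset_count(const char *text, int *nparts, int *nreplicas)
{
        char buf[1024];
        char *line;

        /* strtok modifies its argument, so work on a private copy */
        strncpy(buf, text, sizeof(buf) - 1);
        buf[sizeof(buf) - 1] = '\0';

        line = strtok(buf, "\n");
        if (line == NULL || strcmp(line, "PMEMPOOLSET") != 0)
                return -1; /* missing mandatory header line */

        *nparts = 0;
        *nreplicas = 0;
        while ((line = strtok(NULL, "\n")) != NULL) {
                if (strcmp(line, "REPLICA") == 0)
                        (*nreplicas)++;
                else
                        (*nparts)++; /* a "size path" part line */
        }
        return 0;
}
```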

The above replica pool set is the one I use for this example. It is important to mention that replication happens only when data is flushed explicitly. Even if our program does not crash and finishes correctly (and the data in the main pool is correct), the data in the replica may not be. Explicit flushing happens, for example, when calling pmemobj_persist() in libpmemobj or when a transaction block commits.

Due to the importance of flushing to ensure the correctness of all replicas, it is always a good idea to use debugging tools, such as Intel® Inspector, to make sure that we are flushing appropriately. Intel Inspector is available as part of Intel® Parallel Studio XE and Intel® System Studio. If you are interested in Intel Inspector, please refer to How to Detect Persistent Memory Programming Errors Using Intel® Inspector.

Now that we are familiar with pool sets, let's corrupt one of the replicas and use the other to recover our set. The pool files corresponding to mypool.set are created using pmempool too:

# ls /mnt/pmem/
lost+found
# pmempool create obj --layout=my_layout mypool.set
# ls /mnt/pmem/
lost+found  poolfile  poolreplica
#

The pool set is again filled with consecutive integers, similar to what we did above. This time, however, I create a libpmemobj pool (using the C++ bindings) instead of a libpmemblk pool:

...
transaction::exec_tx(pop, [&] {
        /* allocate a persistent array and fill it with consecutive integers */
        proot->array = make_persistent<int[]>(size);
        for (int i = 0; i < size; i++) {
                proot->array[i] = i;
        }
});
...

At this point, we can inject a bad block on block 0 of pool file /mnt/pmem/poolfile:

# filefrag -v -b512 /mnt/pmem/poolfile | grep -E "^[ ]+[0-9]+.*" | head -1 | awk '{ print $4 }' | cut -d. -f1
278528
#
# dd conv=notrunc if=/dev/zero of=/dev/pmem0 oflag=direct bs=512 seek=278528 count=1
1+0 records in
1+0 records out
512 bytes copied, 0.000229725 s, 2.2 MB/s
#
# pmempool check -v mypool.set
replica 0 part 0: checking pool header
replica 0 part 0: incorrect pool header
mypool.set: not consistent
# 

To fix it, we use pmempool sync, which recovers damaged parts of a pool set from a healthy replica. Before attempting a recovery, make sure that the pool set is not in use by a running application. The current version of pmempool-sync (PMDK v1.4) supports only libpmemobj pools:

# pmempool sync -v mypool.set
mypool.set: synchronized
#
# pmempool check -v mypool.set
replica 0: checking shutdown state
replica 0: shutdown state correct
replica 1: checking shutdown state
replica 1: shutdown state correct
replica 0 part 0: checking pool header
replica 0 part 0: pool header correct
replica 1 part 0: checking pool header
replica 1 part 0: pool header correct
mypool.set: consistent
#
# ./checkpoolsetcontent.sh

****  read data from ./mypool.set...

#

After syncing, we run pmempool check to confirm that the pool set is consistent again, and then run the data consistency check to make sure the data is also fine. It turns out that it is: no output means no errors.

Summary

In this article, I've reviewed pmempool, a tool available in PMDK to prevent, diagnose, and recover from unplanned data corruption caused by hardware issues, which can undermine application reliability. I also demonstrated how to inject errors to test the resiliency of your system without waiting for real errors to manifest. The article covered validating and repairing pools using pmempool check, and prevention and recovery using pool set replication and pmempool sync.

About the Author

Eduardo Berrocal joined Intel as a cloud software engineer in July 2017 after receiving his PhD in computer science from the Illinois Institute of Technology (IIT) in Chicago, Illinois. His doctoral research focused on (but was not limited to) data analytics and fault tolerance for high-performance computing. In the past he worked as a summer intern at Bell Labs (Nokia), as a research aide at Argonne National Laboratory, as a scientific programmer and web developer at the University of Chicago, and as an intern in the CESVIMA laboratory in Spain.

Resources

  1. Persistent Memory Programming at Intel Developer Zone (IDZ)
  2. NDCTL Users Guide
  3. How to Emulate Persistent Memory
  4. Official manpage for poolset(5)
  5. Intel® Parallel Studio XE
  6. Intel® System Studio
  7. How to Detect Persistent Memory Programming Errors Using Intel® Inspector